{
  "status": "ok",
  "message-type": "work",
  "message-version": "1.0.0",
  "message": {
    "indexed": {"date-parts": [[2023, 12, 1]], "date-time": "2023-12-01T00:05:52Z", "timestamp": 1701389152274},
    "publisher-location": "New York, NY, USA",
    "reference-count": 24,
    "publisher": "ACM",
    "license": [
      {
        "start": {"date-parts": [[2019, 10, 13]], "date-time": "2019-10-13T00:00:00Z", "timestamp": 1570924800000},
        "content-version": "vor",
        "delay-in-days": 0,
        "URL": "http:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"
      }
    ],
    "content-domain": {"domain": ["dl.acm.org"], "crossmark-restriction": true},
    "short-container-title": [],
    "published-print": {"date-parts": [[2019, 10, 13]]},
    "DOI": "10.1145\/3356464.3357704",
    "type": "proceedings-article",
    "created": {"date-parts": [[2019, 10, 31]], "date-time": "2019-10-31T12:20:52Z", "timestamp": 1572524452000},
    "update-policy": "http:\/\/dx.doi.org\/10.1145\/crossmark-policy",
    "source": "Crossref",
    "is-referenced-by-count": 0,
    "title": ["An Efficient Reinforcement Learning Algorithm for Learning Deterministic Policies in Continuous Domains"],
    "prefix": "10.1145",
    "author": [
      {"given": "Matthieu", "family": "Zimmer", "sequence": "first", "affiliation": [{"name": "Shanghai Jiao Tong University, Shanghai, China"}]},
      {"given": "Paul", "family": "Weng", "sequence": "additional", "affiliation": [{"name": "Shanghai Jiao Tong University, China"}]}
    ],
    "member": "320",
    "published-online": {"date-parts": [[2019, 10, 13]]},
    "reference": [
      {"key": "e_1_3_2_1_1_1", "unstructured": "Mart\u00edn Abadi Ashish Agarwal Paul Barham Eugene Brevdo Zhifeng Chen Craig Citro Greg S. Corrado Andy Davis Jeffrey Dean Matthieu Devin Sanjay Ghemawat Ian Goodfellow Andrew Harp Geoffrey Irving Michael Isard Yangqing Jia Rafal Jozefowicz Lukasz Kaiser Manjunath Kudlur Josh Levenberg Dan Man\u00e9 Rajat Monga Sherry Moore Derek Murray Chris Olah Mike Schuster Jonathon Shlens Benoit Steiner Ilya Sutskever Kunal Talwar Paul Tucker Vincent Vanhoucke Vijay Vasudevan Fernanda Vi\u00e9gas Oriol Vinyals Pete Warden Martin Wattenberg Martin Wicke Yuan Yu and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. http:\/\/tensorflow.org\/ Software available from tensorflow.org."},
      {"key": "e_1_3_2_1_2_1", "unstructured": "Prafulla Dhariwal Christopher Hesse Oleg Klimov Alex Nichol Matthias Plappert Alec Radford John Schulman Szymon Sidor Yuhuai Wu and Peter Zhokhov. 2017. OpenAI Baselines. https:\/\/github.com\/openai\/baselines."},
      {"key": "e_1_3_2_1_3_1", "volume-title": "Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561", "author": "Espeholt Lasse", "year": "2018", "unstructured": "Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. 2018. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561 (2018)."},
      {"key": "e_1_3_2_1_4_1", "volume-title": "Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477", "author": "Fujimoto Scott", "year": "2018", "unstructured": "Scott Fujimoto, Herke van Hoof, and David Meger. 2018. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477 (2018)."},
      {"key": "e_1_3_2_1_5_1", "volume-title": "Q-Prop: Sample-efficient policy gradient with an off-policy critic. arXiv preprint arXiv:1611.02247", "author": "Gu Shixiang", "year": "2016", "unstructured": "Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E. Turner, and Sergey Levine. 2016. Q-Prop: Sample-efficient policy gradient with an off-policy critic. arXiv preprint arXiv:1611.02247 (2016)."},
      {"key": "e_1_3_2_1_6_1", "volume-title": "Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. arXiv preprint arXiv:1706.00387", "author": "Gu Shixiang", "year": "2017", "unstructured": "Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E. Turner, Bernhard Sch\u00f6lkopf, and Sergey Levine. 2017. Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. arXiv preprint arXiv:1706.00387 (2017)."},
      {"key": "e_1_3_2_1_7_1", "volume-title": "Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093", "author": "Jia Yangqing", "year": "2014", "unstructured": "Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 (2014)."},
      {"key": "e_1_3_2_1_8_1", "doi-asserted-by": "publisher", "DOI": "10.1137\/S0363012901385691"},
      {"key": "e_1_3_2_1_9_1", "volume-title": "Continuous control with deep reinforcement learning. ICLR", "author": "Lillicrap Timothy P", "year": "2016", "unstructured": "Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2016. Continuous control with deep reinforcement learning. ICLR (2016). arXiv:1509.02971"},
      {"key": "e_1_3_2_1_10_1", "doi-asserted-by": "publisher", "DOI": "10.1038\/nature14236"},
      {"key": "e_1_3_2_1_11_1", "volume-title": "Bellemare", "author": "Munos Remi", "year": "2016", "unstructured": "Remi Munos, Tom Stepleton, Anna Harutyunyan, and Marc G. Bellemare. 2016. Safe and efficient off-policy reinforcement learning. arXiv preprint arXiv:1606.02647 (2016)."},
      {"key": "e_1_3_2_1_12_1", "unstructured": "Art B. Owen. 2013. Monte Carlo theory, methods and examples."},
      {"key": "e_1_3_2_1_13_1", "volume-title": "Eligibility traces for off-policy policy evaluation", "author": "Precup Doina", "year": "2000", "unstructured": "Doina Precup. 2000. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series (2000), 80."},
      {"key": "e_1_3_2_1_14_1", "volume-title": "High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438", "author": "Schulman John", "year": "2015", "unstructured": "John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. 2015. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438 (2015)."},
      {"key": "e_1_3_2_1_15_1", "volume-title": "Proximal policy optimization algorithms. CoRR abs\/1707.06347", "author": "Schulman John", "year": "2017", "unstructured": "John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. CoRR abs\/1707.06347 (2017). arXiv:1707.06347 http:\/\/arxiv.org\/abs\/1707.06347"},
      {"key": "e_1_3_2_1_16_1", "volume-title": "Proceedings of the 31st International Conference on Machine Learning", "author": "Silver David", "year": "2014", "unstructured": "David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. 2014. Deterministic policy gradient algorithms. Proceedings of the 31st International Conference on Machine Learning (2014), 387--395."},
      {"key": "e_1_3_2_1_17_1", "volume-title": "Barto", "author": "Sutton Richard S.", "year": "1998", "unstructured": "Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning). A Bradford Book."},
      {"key": "e_1_3_2_1_18_1", "first-page": "1057", "article-title": "Policy gradient methods for reinforcement learning with function approximation", "volume": "12", "author": "Sutton Richard S.", "year": "1999", "unstructured": "Richard S. Sutton, David Mcallester, Satinder Singh, and Yishay Mansour. 1999. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems 12 (1999), 1057--1063. https:\/\/doi.org\/10.1.37.9714", "journal-title": "Advances in Neural Information Processing Systems"},
      {"key": "e_1_3_2_1_19_1", "doi-asserted-by": "publisher", "DOI": "10.1109\/ADPRL.2007.368199"},
      {"key": "e_1_3_2_1_20_1", "volume-title": "Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224", "author": "Wang Ziyu", "year": "2016", "unstructured": "Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. 2016. Sample Efficient Actor-Critic with Experience Replay. arXiv preprint arXiv:1611.01224 (2016)."},
      {"key": "e_1_3_2_1_21_1", "volume-title": "Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8, 3-4", "author": "Williams Ronald J", "year": "1992", "unstructured": "Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8, 3-4 (1992), 229--256."},
      {"key": "e_1_3_2_1_22_1", "volume-title": "Neural Fitted Actor-Critic. In European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning", "author": "Zimmer Matthieu", "year": "2016", "unstructured": "Matthieu Zimmer, Yann Boniface, and Alain Dutech. 2016. Neural Fitted Actor-Critic. In European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning."},
      {"key": "e_1_3_2_1_23_1", "doi-asserted-by": "publisher", "DOI": "10.1109\/DEVLRN.2018.8761021"},
      {"key": "e_1_3_2_1_24_1", "doi-asserted-by": "publisher", "DOI": "10.24963\/ijcai.2019\/625"}
    ],
    "event": {"name": "DAI'19: First International Conference on Distributed Artificial Intelligence", "location": "Beijing China", "acronym": "DAI'19"},
    "container-title": ["Proceedings of the First International Conference on Distributed Artificial Intelligence"],
    "original-title": [],
    "link": [
      {"URL": "https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3356464.3357704", "content-type": "unspecified", "content-version": "vor", "intended-application": "similarity-checking"}
    ],
    "deposited": {"date-parts": [[2023, 11, 30]], "date-time": "2023-11-30T20:47:15Z", "timestamp": 1701377235000},
    "score": 1,
    "resource": {"primary": {"URL": "https:\/\/dl.acm.org\/doi\/10.1145\/3356464.3357704"}},
    "subtitle": [],
    "short-title": [],
    "issued": {"date-parts": [[2019, 10, 13]]},
    "references-count": 24,
    "alternative-id": ["10.1145\/3356464.3357704", "10.1145\/3356464"],
    "URL": "http:\/\/dx.doi.org\/10.1145\/3356464.3357704",
    "relation": {},
    "subject": [],
    "published": {"date-parts": [[2019, 10, 13]]},
    "assertion": [
      {"value": "2019-10-13", "order": 2, "name": "published", "label": "Published", "group": {"name": "publication_history", "label": "Publication History"}}
    ]
  }
}