{"status":"ok","message-type":"work","message-version":"1.0.0","message":{
"indexed":{"date-parts":[[2024,6,15]],"date-time":"2024-06-15T04:40:24Z","timestamp":1718426424580},
"reference-count":39,
"publisher":"Proceedings of the National Academy of Sciences",
"issue":"48",
"license":[{"start":{"date-parts":[[2021,2,17]],"date-time":"2021-02-17T00:00:00Z","timestamp":1613520000000},"content-version":"vor","delay-in-days":184,"URL":"https:\/\/www.pnas.org\/site\/aboutpnas\/licenses.xhtml"}],
"content-domain":{"domain":["www.pnas.org"],"crossmark-restriction":true},
"short-container-title":["Proc. Natl. Acad. Sci. U.S.A."],
"published-print":{"date-parts":[[2020,12]]},
"abstract":"<jats:p>The combination of reinforcement learning with deep learning is a promising approach to tackle important sequential decision-making problems that are currently intractable. One obstacle to overcome is the amount of data needed by learning systems of this type. In this article, we propose to address this issue through a divide-and-conquer approach. We argue that complex decision problems can be naturally decomposed into multiple tasks that unfold in sequence or in parallel. By associating each task with a reward function, this problem decomposition can be seamlessly accommodated within the standard reinforcement-learning formalism. The specific way we do so is through a generalization of two fundamental operations in reinforcement learning: policy improvement and policy evaluation. The generalized versions of these operations allow one to leverage the solution of some tasks to speed up the solution of others. If the reward function of a task can be well approximated as a linear combination of the reward functions of tasks previously solved, we can reduce a reinforcement-learning problem to a simpler linear regression. When this is not the case, the agent can still exploit the task solutions by using them to interact with and learn about the environment. Both strategies considerably reduce the amount of data needed to solve a reinforcement-learning problem.<\/jats:p>",
"DOI":"10.1073\/pnas.1907370117",
"type":"journal-article",
"created":{"date-parts":[[2020,8,18]],"date-time":"2020-08-18T00:54:01Z","timestamp":1597712041000},
"is-referenced-by-count":37,
"title":["Fast reinforcement learning with generalized policy updates"],
"prefix":"10.1073",
"volume":"117",
"author":[
{"ORCID":"http:\/\/orcid.org\/0000-0001-6168-6972","authenticated-orcid":false,"given":"Andr\u00e9","family":"Barreto","sequence":"first","affiliation":[{"name":"DeepMind, London EC4A 3TW, United Kingdom;"}]},
{"ORCID":"http:\/\/orcid.org\/0000-0002-6181-5452","authenticated-orcid":false,"given":"Shaobo","family":"Hou","sequence":"additional","affiliation":[{"name":"DeepMind, London EC4A 3TW, United Kingdom;"}]},
{"given":"Diana","family":"Borsa","sequence":"additional"},
{"given":"David","family":"Silver","sequence":"additional","affiliation":[{"name":"DeepMind, London EC4A 3TW, United Kingdom;"}]},
{"given":"Doina","family":"Precup","sequence":"additional","affiliation":[{"name":"DeepMind, London EC4A 3TW, United Kingdom;"},{"name":"School of Computer Science, McGill University, Montreal, QC H3A 0E9, Canada"}]}],
"member":"341",
"published-online":{"date-parts":[[2020,8,17]]},
"reference":[
{"key":"e_1_3_4_1_2","volume-title":"Reinforcement Learning: An Introduction","author":"Sutton R. S.","year":"2018","unstructured":"R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction (MIT Press, 2018)."},
{"key":"e_1_3_4_2_2","DOI":"10.1177\/0278364913495721"},
{"key":"e_1_3_4_3_2","doi-asserted-by":"publisher","DOI":"10.1126\/science.275.5306.1593"},
{"key":"e_1_3_4_4_2","doi-asserted-by":"publisher","DOI":"10.1038\/nn1309"},
{"key":"e_1_3_4_5_2","volume-title":"IEEE International Conference on Rehabilitation Robotics","author":"Pilarski P. M.","year":"2011","unstructured":"P. M. Pilarski, \u201cOnline human training of a myoelectric prosthesis controller via actor-critic reinforcement learning\u201d in IEEE International Conference on Rehabilitation Robotics (IEEE, 2011), pp. 1\u20137."},
{"key":"e_1_3_4_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/203330.203343"},
{"key":"e_1_3_4_7_2","doi-asserted-by":"publisher","DOI":"10.1126\/science.1259433"},
{"key":"e_1_3_4_8_2","first-page":"463","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Randl\u00f8v J.","year":"1998","unstructured":"J. Randl\u00f8v, P. Alstr\u00f8m, \u201cLearning to drive a bicycle using reinforcement learning and shaping\u201d in Proceedings of the International Conference on Machine Learning (Morgan Kaufmann Publishers, Inc., 1998), pp. 463\u2013471."},
{"key":"e_1_3_4_9_2","first-page":"799","volume-title":"Advances in Neural Information Processing Systems (NIPS)","author":"Ng A. Y.","year":"2003","unstructured":"A. Y. Ng, H. J. Kim, M. I. Jordan, S. Sastry, \u201cAutonomous helicopter flight via reinforcement learning\u201d in Advances in Neural Information Processing Systems (NIPS) (MIT Press, 2003), pp. 799\u2013806."},
{"key":"e_1_3_4_10_2","doi-asserted-by":"publisher","DOI":"10.1038\/nature14236"},
{"key":"e_1_3_4_11_2","doi-asserted-by":"publisher","DOI":"10.1038\/nature16961"},
{"key":"e_1_3_4_12_2","doi-asserted-by":"publisher","DOI":"10.1038\/nature24270"},
{"key":"e_1_3_4_13_2"},
{"key":"e_1_3_4_14_2","volume-title":"AAAI Spring Symposium Series","author":"Tsividis P.","year":"2017","unstructured":"P. Tsividis, T. Pouncy, J. Xu, J. Tenenbaum, S. Gershman, \u201cHuman learning in Atari\u201d in AAAI Spring Symposium Series (AAAI Press, 2017)."},
{"key":"e_1_3_4_15_2","volume-title":"The Behavior of Organisms: An Experimental Analysis","author":"Skinner B. F.","year":"1938","unstructured":"B. F. Skinner, The Behavior of Organisms: An Experimental Analysis (Appleton-Century, 1938)."},
{"key":"e_1_3_4_16_2","unstructured":"C. L. Hull, Principles of Behavior (Appleton-Century, New York, NY, 1943)."},
{"key":"e_1_3_4_17_2","first-page":"663","unstructured":"(Morgan Kaufmann Publishers, Inc., 2000), pp. 663\u2013670."},
{"key":"e_1_3_4_18_2","doi-asserted-by":"publisher","DOI":"10.1002\/9780470316887"},
{"key":"e_1_3_4_19_2","volume-title":"Dynamic Programming","author":"Bellman R. E.","year":"1957","unstructured":"R. E. Bellman, Dynamic Programming (Princeton University Press, 1957)."},
{"key":"e_1_3_4_20_2","volume-title":"Neuro-Dynamic Programming","author":"Bertsekas D. P.","year":"1996","unstructured":"D. P. Bertsekas, J. N. Tsitsiklis, Neuro-Dynamic Programming (Athena Scientific, 1996)."},
{"key":"e_1_3_4_21_2","first-page":"560","volume-title":"Proceedings of the International Conference on Machine Learning (ICML)","author":"Munos R.","year":"2003","unstructured":"R. Munos, \u201cError bounds for approximate policy iteration\u201d in Proceedings of the International Conference on Machine Learning (ICML) (AAAI Press, 2003), pp. 560\u2013567."},
{"key":"e_1_3_4_22_2","doi-asserted-by":"publisher","DOI":"10.1007\/BF00115009"},
{"key":"e_1_3_4_23_2","doi-asserted-by":"publisher","DOI":"10.1007\/BF00992698"},
{"key":"e_1_3_4_24_2","year":"1960","unstructured":"R. Howard, Dynamic Programming and Markov Processes (MIT Press, 1960)."},
{"key":"e_1_3_4_25_2","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1994.6.6.1185"},
{"key":"e_1_3_4_26_2","first-page":"761","volume-title":"International Joint Conference on Autonomous Agents and Multiagent Systems","author":"Sutton R. S.","year":"2011","unstructured":"R. S. Sutton, \u201cHorde:\u201d in International Joint Conference on Autonomous Agents and Multiagent Systems (International Foundation for Autonomous Agents and Multiagent Systems, 2011), pp. 761\u2013768."},
{"key":"e_1_3_4_27_2","first-page":"1312","volume-title":"International Conference on Machine Learning (ICML)","author":"Schaul T.","year":"2015","unstructured":"T. Schaul, D. Horgan, K. Gregor, D. Silver, \u201cUniversal value function approximators\u201d in International Conference on Machine Learning (ICML) (PMLR, 2015), vol. 37, pp. 1312\u20131320."},
{"key":"e_1_3_4_28_2","first-page":"4055","volume-title":"Advances in Neural Information Processing Systems (NIPS)","author":"Barreto A.","year":"2017","unstructured":"A. Barreto, \u201cSuccessor features for transfer in reinforcement learning\u201d in Advances in Neural Information Processing Systems (NIPS) (Curran Associates, Inc., 2017), pp. 4055\u20134065."},
{"key":"e_1_3_4_29_2","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1993.5.4.613"},
{"key":"e_1_3_4_30_2","first-page":"501","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Barreto A.","year":"2018","unstructured":"A. Barreto, \u201cTransfer in deep reinforcement learning using successor features and generalised policy improvement\u201d in Proceedings of the International Conference on Machine Learning (PMLR, 2018), vol. 80, pp. 501\u2013510."},
{"key":"e_1_3_4_31_2","first-page":"13052","volume-title":"Advances in Neural Information Processing Systems (NeurIPS)","author":"Barreto A.","year":"2019","unstructured":"A. Barreto, \u201cThe option keyboard:\u201d in Advances in Neural Information Processing Systems (NeurIPS) (Curran Associates, Inc., 2019), pp. 13052\u201313062."},
{"key":"e_1_3_4_32_2","doi-asserted-by":"publisher","DOI":"10.1023\/A:1025696116075"},
{"key":"e_1_3_4_33_2","volume-title":"Deep Learning","author":"Goodfellow I.","year":"2016","unstructured":"I. Goodfellow, Y. Bengio, A. Courville, Deep Learning (MIT Press, 2016)."},
{"key":"e_1_3_4_34_2","doi-asserted-by":"publisher","DOI":"10.1023\/A:1007379606734"},
{"key":"e_1_3_4_35_2","doi-asserted-by":"publisher","DOI":"10.1016\/S0004-3702(99)00052-1"},
{"key":"e_1_3_4_36_2","first-page":"640","volume-title":"Advances in Neural Information Processing Systems (NIPS)","author":"Thrun S.","year":"1996","unstructured":"S. Thrun, \u201cIs learning the n-th thing any easier than learning the first?\u201d in Advances in Neural Information Processing Systems (NIPS) (MIT Press, 1996), pp. 640\u2013646."},
{"key":"e_1_3_4_37_2","unstructured":"?id=S1VWjiRcKX. Accessed 5 August 2020."},
{"key":"e_1_3_4_38_2","first-page":"2911","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Hunt J.","year":"2019","unstructured":"J. Hunt, A. Barreto, T. Lillicrap, N. Heess, \u201cComposing entropic policies using divergence correction\u201d in Proceedings of the International Conference on Machine Learning (PMLR, 2019), vol. 97, pp. 2911\u20132920."},
{"key":"e_1_3_4_39_2","unstructured":"S. Hansen, \u201cFast task inference with variational intrinsic successor features\u201d in International Conference on Learning Representations (ICLR) (2020). https:\/\/openreview.net\/forum?id=BJeAHkrYDS. Accessed 5 August 2020."}],
"container-title":["Proceedings of the National Academy of Sciences"],
"original-title":[],
"language":"en",
"link":[{"URL":"https:\/\/pnas.org\/doi\/pdf\/10.1073\/pnas.1907370117","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],
"deposited":{"date-parts":[[2022,4,13]],"date-time":"2022-04-13T09:22:37Z","timestamp":1649841757000},
"score":1,
"resource":{"primary":{"URL":"https:\/\/pnas.org\/doi\/full\/10.1073\/pnas.1907370117"}},
"subtitle":[],
"short-title":[],
"issued":{"date-parts":[[2020,8,17]]},
"references-count":39,
"journal-issue":{"issue":"48","published-print":{"date-parts":[[2020,12]]}},
"alternative-id":["10.1073\/pnas.1907370117"],
"URL":"http:\/\/dx.doi.org\/10.1073\/pnas.1907370117",
"relation":{},
"ISSN":["0027-8424","1091-6490"],
"issn-type":[{"value":"0027-8424","type":"print"},{"value":"1091-6490","type":"electronic"}],
"subject":[],
"published":{"date-parts":[[2020,8,17]]},
"assertion":[{"value":"2020-08-17","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}