Accidental exploration through value predictors
Keywords: reinforcement learning, value predictors, exploration
Article DOI: 10.4467/20838476SI.18.009.10414 (not active)
Infinite length of trajectories is an almost universal assumption in the theoretical foundations of reinforcement learning; in practice, learning occurs on finite trajectories. In this paper we examine a specific consequence of this disparity, namely a strong bias of the time-bounded Every-visit Monte Carlo value estimator. This bias manifests as vastly different learning dynamics for algorithms that use value predictors, including encouraging or discouraging exploration. We investigate these claims theoretically for a one-dimensional random walk, and empirically on a number of simple environments. We use GAE as an algorithm involving a value predictor, and evolution strategies as a reference point.
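To make the claimed bias concrete, here is a minimal sketch of a time-bounded Every-visit Monte Carlo estimator on a symmetric one-dimensional random walk with absorbing boundaries. This is not the authors' code: the horizon, state count, and episode count are illustrative choices, and the handling of truncation (an unfinished episode contributes a return of zero) is an assumption made for illustration. Under that assumption, truncation biases the estimates below the true, unbounded-horizon values:

```python
import random

HORIZON = 20          # time bound that truncates every trajectory (illustrative)
N_STATES = 9          # interior states 1..9; absorbing boundaries at 0 and 10
N_EPISODES = 20000    # Monte Carlo sample size (illustrative)

def run_episode(start):
    """Walk from `start` until absorption or truncation.

    Returns the list of visited interior states and the terminal reward
    (1 at the right boundary, 0 at the left boundary or on truncation).
    """
    visited, s = [], start
    for _ in range(HORIZON):
        visited.append(s)
        s += random.choice((-1, 1))
        if s == 0:
            return visited, 0.0
        if s == N_STATES + 1:
            return visited, 1.0
    return visited, 0.0  # truncated: the eventual reward is never observed

# Every-visit Monte Carlo: every occurrence of a state in a trajectory
# contributes the return that follows it. The walk is undiscounted and
# only a terminal reward exists, so that return is the episode's reward.
totals = [0.0] * (N_STATES + 2)
counts = [0] * (N_STATES + 2)
for _ in range(N_EPISODES):
    visited, reward = run_episode(random.randint(1, N_STATES))
    for s in visited:
        totals[s] += reward
        counts[s] += 1

for s in range(1, N_STATES + 1):
    estimate = totals[s] / max(counts[s], 1)
    true_value = s / (N_STATES + 1)  # gambler's-ruin value, no time bound
    print(f"state {s}: time-bounded estimate {estimate:.3f} vs. true {true_value:.3f}")
```

Raising `HORIZON` shrinks the gap between the two printed columns; that gap is the disparity between infinite-trajectory theory and finite-trajectory practice that the paper examines.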
| Field | Value | Language |
| --- | --- | --- |
| dc.abstract.en | Infinite length of trajectories is an almost universal assumption in the theoretical foundations of reinforcement learning; in practice, learning occurs on finite trajectories. In this paper we examine a specific consequence of this disparity, namely a strong bias of the time-bounded Every-visit Monte Carlo value estimator. This bias manifests as vastly different learning dynamics for algorithms that use value predictors, including encouraging or discouraging exploration. We investigate these claims theoretically for a one-dimensional random walk, and empirically on a number of simple environments. We use GAE as an algorithm involving a value predictor, and evolution strategies as a reference point. | pl |
| dc.affiliation | Wydział Matematyki i Informatyki (Faculty of Mathematics and Computer Science) | pl |
| dc.contributor.author | Leśniak, Damian - 165389 | pl |
| dc.contributor.author | Kisielewski, Tomasz - 175553 | pl |
| dc.date.accession | 2024-02-19 | pl |
| dc.date.accessioned | 2021-10-04T16:34:34Z | |
| dc.date.available | 2021-10-04T16:34:34Z | |
| dc.date.issued | 2018 | pl |
| dc.date.openaccess | 0 | |
| dc.description.accesstime | at the time of publication | |
| dc.description.additional | Article DOI: 10.4467/20838476SI.18.009.10414 (not active) | pl |
| dc.description.physical | 107-127 | pl |
| dc.description.version | final publisher version | |
| dc.description.volume | 27 | pl |
| dc.identifier.doi | 10.4467/20838476SI.18.009.10414 | pl |
| dc.identifier.eissn | 2083-8476 | pl |
| dc.identifier.issn | 1732-3916 | pl |
| dc.identifier.project | ROD UJ / O | pl |
| dc.identifier.uri | https://ruj.uj.edu.pl/xmlui/handle/item/279475 | |
| dc.identifier.weblink | https://www.ejournals.eu/Schedae-Informaticae/2018/Volume-27/art/13932/ | pl |
| dc.language | eng | pl |
| dc.language.container | eng | pl |
| dc.rights | Adding only the bibliographic description | * |
| dc.rights.licence | CC-BY-NC-ND | |
| dc.rights.uri | * | |
| dc.share.type | open journal | |
| dc.subject.en | reinforcement learning | pl |
| dc.subject.en | value predictors | pl |
| dc.subject.en | exploration | pl |
| dc.subtype | Article | pl |
| dc.title | Accidental exploration through value predictors | pl |
| dc.title.journal | Schedae Informaticae | pl |
| dc.type | JournalArticle | pl |
| dspace.entity.type | Publication |