Inverse Reinforcement Learning Toolkit
Sergey Levine, 2011
1. Introduction
2. Installation
3. Usage
3.1 Running a single test
3.2 Running a transfer test
3.3 Running a series of tests
3.4 Using human generated examples
4. Directory overview
4.1 IRL algorithms
4.2 MDP solvers
4.3 Example domains
4.4 General
5. Calling conventions
5.1 IRL algorithms
5.2 MDP solvers
5.3 Example domains
6. License
1. Introduction
This MATLAB package contains a collection of inverse reinforcement learning
algorithms and a framework for evaluating these algorithms on a variety of
simple Markov decision processes. The package is distributed primarily as the
reference implementation for the Gaussian Process Inverse Reinforcement Learning
algorithm described in the paper "Nonlinear Inverse Reinforcement Learning with
Gaussian Processes" (Levine, Popovic, Koltun, NIPS 2011), but is also intended
to provide a general-purpose framework for researchers interested in inverse
reinforcement learning. This file describes the contents of the package,
provides instructions for using it and for extending it with new algorithms,
solvers, and example domains, and finally contains the license under which the
package may be used and distributed.
2. Installation
The IRL toolkit was built with MATLAB R2009b. Earlier versions of MATLAB may be
sufficient, but were not tested. In addition to MATLAB R2009b and various
toolboxes (which may include the Statistics and Parallelism toolboxes), the
package requires the following external MATLAB scripts:
- minFunc, by Mark Schmidt, included with this package (Creative Commons by-nc,
see <http://www.cs.ubc.ca/~schmidtm/Software/minFunc.html> for details)
- CVX, by Michael C. Grant and Stephen P. Boyd, included with this package (GPL)
- plot2svg, by Juerg Schwizer, included with this package (FreeBSD)
With the exception of CVX, none of the scripts require installation.
Installation consists of two easy steps:
1. Extract all files in irl_toolkit.zip into a desired directory.
2. Run Utilities/cvx_setup to install CVX.
3. Usage
This section briefly describes how to use the IRL toolkit. The toolkit is
designed to be modular and extendable. Most sessions will begin by running
"addpaths" in the main directory, to add all necessary subdirectories to the
path. Optionally, you may save your MATLAB path to avoid needing to do this, but
this is left up to you, since some users may prefer not to clutter up their
path.
3.1 Running a single test
An example script for running a test is provided under the name singletest.m in
the main directory. This script calls "addpaths" to add the necessary paths, and
then runs a single test using GPIRL on a 32x32 objectworld environment. The
function "runtest" creates a random environment of a specified type, creates
synthetic examples (or, optionally, uses provided human generated examples), and
calls the specified IRL algorithm to determine the reward function. The result
of the IRL algorithm is evaluated according to a set of provided metrics, and
the results are returned in a single structure. The arguments of runtest are the
following:
1. Algorithm name. This always corresponds to the name of the directory
containing the algorithm, which is also the prefix of every function for that
algorithm.
2. Algorithm parameters. This is a struct containing the desired parameters for
this algorithm. For more information, see <algorithm name>defaultparams.m in
the desired algorithm's directory.
3. MDP type. This is the MDP solver used to generate synthetic examples. By
default, two options are available: standardmdp, which uses the standard
Bellman equations, and linearmdp, which uses the LMDP framework.
4. Example name. This is the directory name/prefix of the environment to be
tested.
5. Example parameters. These are the parameters of the test environment. See
<example name>defaultparams.m in the desired example's directory.
6. Test parameters. These are parameters of the test itself, including the
number of examples. See General/setdefaultparams.m for details.
A quick way to view the results of the test is provided by the printresults
function. The visualize function renders the results of the test to the screen,
with a side by side comparison of the true and learned reward functions.
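A minimal session, modeled on singletest.m, might look like the sketch below.
The empty structs rely on the default parameter mechanisms described above, the
objectworld parameters "n" and "continuous" follow the conventions in
objectworlddefaultparams.m, and the exact printresults/visualize call signatures
are assumptions based on their descriptions, so treat singletest.m as the
authoritative version:

  addpaths;
  % GPIRL on a 32x32 objectworld with continuous features; the empty structs
  % are filled in by gpirldefaultparams.m and General/setdefaultparams.m.
  test_result = runtest('gpirl', struct(), 'linearmdp', 'objectworld', ...
      struct('n', 32, 'continuous', 1), struct());
  printresults(test_result);  % text summary of the evaluation metrics
  visualize(test_result);     % side-by-side true and learned reward
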
3.2 Running a transfer test
To run a transfer test -- that is, to test how well a learned reward function
generalizes to another state space -- use the runtransfertest function. The
arguments of this function are the following:
1. IRL results returned by runtest. If runtest returns test_result, this is
test_result.irl_result.
2. Algorithm name, as before.
3. MDP type.
4. Example name. Should be the same as the example in the original test.
5. Example parameters. These may be different from the original test (in
fact, they should be to test any interesting transfer). However, some
parameters may be incompatible -- for example, in the objectworld, the
number of features depends on "n", so testing transfer with a different
value of "n" will not work when using discrete features.
6. Transfer test parameters. See General/settransferdefaultparams.m for
details.
The results of the transfer test can similarly be printed and viewed using
printresults and visualize.
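Continuing the sketch from section 3.1, a transfer test might look as follows.
The smaller "n" is illustrative, and continuous features are assumed so that
changing "n" remains compatible (see point 5 above):

  % Transfer the reward learned above to a fresh 16x16 objectworld.
  transfer_result = runtransfertest(test_result.irl_result, 'gpirl', ...
      'linearmdp', 'objectworld', struct('n', 16, 'continuous', 1), struct());
  printresults(transfer_result);
  visualize(transfer_result);
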
3.3 Running a series of tests
The functions runtestseries and runtransferseries can be used to run a large
series of tests, using a variety of algorithms and parameter settings. The
package contains scripts for graphing and saving the results of such tests.
Examples of how to use these functions can be found in the NIPS11_Tests
directory, which contains the scripts that were used to produce the graphs in
the 2011 NIPS paper (note that, due to code changes since publication and
randomized initialization, the results produced by these scripts may differ
slightly from those presented in the paper). All of these scripts use the
Parallelism toolbox to run multiple tests in parallel for efficiency, and call
the saveresults function at the end to save the results from the test into a
directory with a name corresponding to the test and the current time and date.
Once this directory is created, the results can be graphed using the
renderresults function, which takes the directory name as the first argument.
The second argument specifies whether to only graph the results (1), or whether
to also render the reward function of each test (0), which takes significantly
longer.
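For instance, assuming a series run has already saved its results (the
directory name below is purely hypothetical), the graphs can be regenerated
with:

  % Graph the results only (second argument 1); pass 0 to also re-render the
  % learned reward of every individual test, which takes much longer.
  renderresults('Objectworld_Transfer_2011_10_01', 1);
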
3.4 Using human generated examples
The directory Human_Demos contains some demonstrations of highway policies by a
human user. To pass these examples to a test, set the "true_examples" field of
the test parameters struct (last argument to runtest) to
human_result.example_samples after loading the desired demo. Note that the test
should be done using the same example parameters, which can be obtained from
human_result.mdp_params.
To generate your own human demonstrations, use the tools in HumanControl.
Specifically, the runhumantrial function will bring up an interface for
generating a human demonstration. This interface is currently "turn based",
although a real time interface may be added in the future. The arguments for
this function are the following:
1. Example name.
2. Example parameters.
3. Human trial parameters specifying the number and length of examples. See
HumanControl/humantrialdefaultparams.m
The function returns a human_result struct, which can be saved and used for
future tests.
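A sketch of both directions, first recording a demonstration and then using a
saved one in a test, is shown below. The .mat file name is hypothetical, the
empty structs rely on the default parameter mechanisms, and the runtest call
follows section 3.1:

  % Record a new set of highway demonstrations (turn-based interface).
  human_result = runhumantrial('highway', struct(), struct());
  save('my_highway_demo.mat', 'human_result');

  % Later: use the saved demonstrations in place of synthetic examples.
  load('my_highway_demo.mat');  % provides human_result
  test_params = struct();
  test_params.true_examples = human_result.example_samples;
  test_result = runtest('gpirl', struct(), 'linearmdp', 'highway', ...
      human_result.mdp_params, test_params);
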
4. Directory overview
This section presents an overview of the directories in the package, briefly
describing each included IRL algorithm, MDP solver, and example domain.
4.1 IRL algorithms
Note that, with the exception of FIRL and GPIRL, I cannot vouch for the accuracy
of my implementations of prior algorithms. While I made a good-faith effort to
implement these methods correctly based on the authors' descriptions, these are
my own implementations and should not be viewed as the definitive reference
implementations of the prior algorithms.
- AN - Abbeel & Ng's projection algorithm for inverse reinforcement learning.
Note that, since Abbeel & Ng's algorithm does not return a single reward, this
implementation is somewhat less accurate than the original algorithm, because a
single reward function must be selected to return for evaluation. For details,
see "Apprenticeship Learning via Inverse Reinforcement Learning" (Abbeel and Ng,
ICML 2004)
- FIRL - The Feature Construction for Inverse Reinforcement Learning algorithm.
Only compatible with discrete features (the "continuous" parameter must be set
to 0 for the current example environment). For details, see "Feature
Construction for Inverse Reinforcement Learning" (Levine, Popovic, Koltun, NIPS
2010)
- GPIRL - The Gaussian Process Inverse Reinforcement Learning algorithm. For
continuous features, it is recommended to use the warped kernel by setting the
"warp_x" parameter to 1. For details, see "Nonlinear Inverse Reinforcement
Learning with Gaussian Processes" (Levine, Popovic, Koltun, NIPS 2011)
- LEARCH - An implementation of the "Learning to Search" algorithm. This is a
"best effort" implementation, since the algorithm is quite complicated and has
many free parameters. The current version can run in either log-linear or
nonlinear mode, using decision trees or logistic regression to create nonlinear
features. For details, see "Learning to Search: Functional Gradient Techniques
for Imitation Learning" (Ratliff, Silver, Bagnell, Autonomous Robots 27 (1)
2009)
- MaxEnt - the maximum entropy IRL algorithm. For details, see "Maximum Entropy
Inverse Reinforcement Learning" (Ziebart, Mass, Bagnell, Dey, AAAI 2008)
- MMP - the maximum margin planning algorithm. Note that this implementation
uses the QP formulation of MMP rather than subgradient methods, since the QP
version is already very fast on the examples in this package. For details, see
"Maximum Margin Planning" (Ratliff, Bagnell, Zinkevich, ICML 2006)
- MMPBoost - MMP with feature boosting, using decision trees of a configurable
depth. As with LEARCH, this is a "best effort" implementation, since the
algorithm has many free parameters. For details, see "Boosting Structured
Prediction for Imitation Learning" (Ratliff, Bradley, Bagnell, Chestnutt, NIPS
2007)
- MWAL - The game-theoretic MWAL algorithm. For details, see "A Game-Theoretic
Approach to Apprenticeship Learning" (Syed, Schapire, NIPS 2008)
- OptV - A provisional implementation of the OptV algorithm. Since OptV learns
value functions rather than reward functions, this algorithm cannot use the
provided features, and instead simply learns a value for each state. This makes
it unable to perform transfer, and produces (unfairly) poor results compared to
the other methods. It should be noted that this is a symptom of the framework
rather than the algorithm, since the toolkit is designed for testing algorithms
that learn rewards rather than value functions. For details, see "Inverse
Optimal Control with Linearly-Solvable MDPs" (Dvijotham, Todorov, ICML 2010)
4.2 MDP solvers
The toolkit comes with two MDP solvers. The standard solver (StandardMDP) uses
value iteration to recover a value function, which in turn specifies a
deterministic optimal policy for the MDP. The linear MDP solver (LinearMDP) uses
"soft" value iteration as described in Ziebart's PhD thesis, and corresponds to
linearly-solvable MDPs (see Dvijotham & Todorov 2010). This model produces
stochastic examples, which may be viewed as "suboptimal" under the standard MDP
model.
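As a rough illustration of the difference, the "soft" backup replaces the
maximum over actions with a log-sum-exp. The schematic below (not the toolkit's
LinearMDP code) contrasts the two backups, assuming a states x actions matrix Q
that already contains the reward plus the discounted expected value of the
successor states:

  V_hard = max(Q, [], 2);                           % standard Bellman backup
  V_soft = log(sum(exp(Q), 2));                     % "soft" (log-sum-exp) backup
  policy = exp(Q - repmat(V_soft, 1, size(Q, 2)));  % resulting stochastic policy
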
4.3 Example domains
Three example domains are included in this toolkit. They are described below:
- Gridworld - This is an NxN gridworld, with 5 actions per state corresponding
to moving in each direction and staying in place. The "determinism" parameter
specifies the probability that each action will "succeed." If the action
"fails," a different random action is taken instead. The reward function of the
gridworld consists of BxB blocks of cells (forming a "super grid"). The features
are indicators for x and y values being below various integer values. Note that
this environment does not support transfer experiments.
- Objectworld - This is a gridworld populated with objects. Each object has one
of C inner and outer colors, and the objects are placed at random. There are 2C
continuous features, each giving the Euclidean distance to the nearest object
with a specific inner or outer color (a sketch of this computation appears
after this list). In the discrete feature case, there are 2CN binary features,
each an indicator for one of the continuous features being less than one of N
threshold distances. The true reward is positive in states that
are both within 3 cells of outer color 1 and 2 cells of outer color 2, negative
within 3 cells of outer color 1, and zero otherwise. Inner colors and all other
outer colors are distractors.
- Highway - This is the highway environment described in "Nonlinear Inverse
Reinforcement Learning with Gaussian Processes." Note that the number of cars on
the highway must be specified manually. There is a somewhat hacky algorithm for
placing cars that avoids creating roadblocks. This algorithm will stall if it
cannot place the specified number of cars, so don't specify too many. This
may be fixed in a future release. It is also possible to specify more than 2
categories or classes of cars, though this functionality is currently untested.
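For concreteness, here is a sketch of how the 2C continuous objectworld
features of a single cell could be computed. The variable names are
hypothetical and do not correspond to the ones used in the package:

  % obj_x, obj_y: column vectors of object coordinates; obj_outer, obj_inner:
  % their colors in 1..C.  (x, y) is the current cell, C the number of colors.
  dists = sqrt((obj_x - x).^2 + (obj_y - y).^2);
  f = zeros(1, 2 * C);
  for c = 1:C,
      f(c)     = min([dists(obj_outer == c); Inf]);  % nearest outer color c
      f(C + c) = min([dists(obj_inner == c); Inf]);  % nearest inner color c
  end;
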
4.4 General
The following are miscellaneous directories in the toolkit:
- Evaluation - This directory contains implementations of the various metrics
that can be used to evaluate the IRL result. Each function is a particular
metric.
- General - These are general testing scripts, including runtest,
runtransfertest, etc.
- HumanControl - Interface for generating human demonstrations.
- NIPS11_Tests - Scripts for reproducing test results from "Nonlinear Inverse
Reinforcement Learning with Gaussian Processes."
- Testing - Scripts for graphing, saving, and visualizing test results from
series tests.
- Utilities - External utilities, including minFunc and CVX.
5. Calling conventions
The IRL toolkit is intended to be extendable and modular. This section describes
the calling conventions for the various modules, so that additional components
can be added. So long as a new component implements the specified functions, it
will be usable with the toolkit.
5.1 IRL algorithms
IRL algorithms must implement two functions: <name>run and
<name>transfer, where <name> is the name of the algorithm that will be used to
identify it when calling runtest. Additionally, all current algorithms implement
<name>defaultparams, but this is only a convenience. The run function must accept the
following arguments:
1. algorithm_params - The parameters of the algorithm. Some parameters may be
missing, so it is advisable to implement a <name>defaultparams function to fill
them in (for example, see gpirldefaultparams.m).
2. mdp_data - A specification of the example domain. mdp_data includes the
number of states, the number of actions, the transition function (specified by
sa_s and sa_p), and so forth. For details, see (for example) gridworldbuild.m
3. mdp_model - The name of the current MDP model (standardmdp or linearmdp).
The final returned policy must be computed from the reward function using this
model, but most algorithms will not use this parameter anywhere else.
4. feature_data - Information about the features. The main field of this
struct is feature_data.splittable (forgive the "legacy" name), which is a matrix
containing the value of each feature at each state.
5. example_samples - A cell array of the examples, where example_samples{i,t}
is time step t along trajectory i. Each entry in the cell array has two numbers:
example_samples{i,t}(1) is the state, and example_samples{i,t}(2) is the
action.
6. true_features - The true features that form a linear basis for the reward.
Most algorithms (that are not cheating) will ignore this, but it may be useful
for establishing a baseline comparison.
7. verbosity - The desired amount of output. 0 means no output, 1 means
"medium" output, and 2 means "verbose" output.
Additionally, IRL algorithms must return the struct irl_result containing the
result of the computation. This struct has the following fields:
1. r - the learned reward function, with mdp_data.states rows and
mdp_data.actions columns
2. v - the resulting value function, returned by <mdp_model>solve
3. p - the corresponding policy (as above)
4. q - the corresponding q function (as above)
5. r_itr - the reward at each iteration of the algorithm, useful for debugging;
if unused, set to {{r}}
6. model_itr - the model at each iteration of the algorithm, to be used for
transfer
7. model_r_itr - a second optional reward at each iteration of the algorithm,
useful for debugging; if unused, set to r_itr
8. p_itr - the policy at each iteration; if unused, set to {{p}}
9. model_p_itr - a second optional policy at each iteration of the algorithm,
useful for debugging; if unused, set to p_itr
10. time - the time the algorithm spent computing the reward function
(optional, set to 0 if unused)
The transfer function must use the learned model to transfer the reward function
to a different state space. The output of this function is identical to
<name>run. The arguments are the following:
1. prev_result - The irl_result structure returned by a <name>run call.
2. mdp_data - MDP definition for the new state space.
3. mdp_model - MDP model, as before.
4. feature_data - Feature data for the new state space.
5. true_feature_map - True features for the new state space (again, not used
except when "cheating").
6. verbosity - As before.
If both <name>run and <name>transfer are implemented, the algorithm should be
usable in all tests without further modification.
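A schematic skeleton for a hypothetical new algorithm, here called "myirl" (so
the functions would live in their own directory as myirlrun.m and
myirltransfer.m), is sketched below. The "learning" step is a deliberately
trivial placeholder -- a least-squares fit of a linear reward to the empirical
state visitation counts -- included only to show how the inputs and the
irl_result fields fit together; it is not one of the bundled algorithms:

  function irl_result = myirlrun(algorithm_params, mdp_data, mdp_model, ...
      feature_data, example_samples, true_features, verbosity)
  % Count state visitations in the demonstrations.
  counts = zeros(mdp_data.states, 1);
  [N, T] = size(example_samples);
  for i = 1:N,
      for t = 1:T,
          s = example_samples{i, t}(1);
          counts(s) = counts(s) + 1;
      end;
  end;
  % Placeholder "learning": least-squares fit of a linear reward to the counts.
  F = feature_data.splittable;                 % states x features
  w = F \ counts;                              % hypothetical weight vector
  r = repmat(F * w, 1, mdp_data.actions);      % states x actions reward
  % Solve the MDP under the learned reward using the current MDP model.
  soln = feval(strcat(mdp_model, 'solve'), mdp_data, r);
  irl_result = struct('r', r, 'v', soln.v, 'p', soln.p, 'q', soln.q, ...
      'r_itr', {{r}}, 'model_itr', {{w}}, 'model_r_itr', {{r}}, ...
      'p_itr', {{soln.p}}, 'model_p_itr', {{soln.p}}, 'time', 0);

  function irl_result = myirltransfer(prev_result, mdp_data, mdp_model, ...
      feature_data, true_feature_map, verbosity)
  % Apply the stored weights to the features of the new state space.
  w = prev_result.model_itr{end};
  r = repmat(feature_data.splittable * w, 1, mdp_data.actions);
  soln = feval(strcat(mdp_model, 'solve'), mdp_data, r);
  irl_result = struct('r', r, 'v', soln.v, 'p', soln.p, 'q', soln.q, ...
      'r_itr', {{r}}, 'model_itr', {{w}}, 'model_r_itr', {{r}}, ...
      'p_itr', {{soln.p}}, 'model_p_itr', {{soln.p}}, 'time', 0);
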
5.2 MDP solvers
Additional MDP solvers can also be added to the framework, for example for
solving problems with continuous state spaces. The MDP solver must implement 5
functions: <name>solve, <name>frequency, <name>action, <name>step, and
<name>compare.
The function <name>solve must find the solution to the specified MDP. It has two
arguments: mdp_data, the MDP definition, and the states x actions reward r. The
definition mdp_data is constructed by the example building function (see 5.3).
The output is a struct "mdp_solution", which has fields "v", "q", and "p",
corresponding to the value function, Q function, and policy. Note that these
fields will only be used by other functions relating to your MDP solver, so they
can contain whatever information you choose.
The function <name>frequency finds the visitation frequency of all states in the
MDP. This function is used by many metrics. The inputs are mdp_data and
mdp_solution (from a previous call to <name>solve), and the output is a column
vector with 1 entry for each state, giving that state's expected visitation
count under the policy in mdp_solution.
The functions <name>action and <name>step sample an action at a particular state
and execute that action, respectively. <name>action takes as input mdp_data,
mdp_solution, and a state s. The output is an action a, which is either the
optimal action under a deterministic policy or a sampled action under a
stochastic policy. The function <name>step takes mdp_data, mdp_solution, a state
s, and an action a, and returns the resulting state after taking action a in
state s, which may be sampled if the MDP is nondeterministic.
Finally, the function <name>compare compares two policies, given as arguments p1
and p2, and returns the amount of discrepancy between them. In the case of a
standard MDP, this is currently the number of states in which the policies
disagree. In the case of the linear MDP, this is the sum over all states of the
probability that p1 and p2 will take different actions. This function is only
used when evaluating some metrics.
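To make the interface concrete, here is a schematic solver named "mymdp" that
implements the five functions for a standard, deterministic-policy MDP. It
illustrates the calling convention only and is not a copy of the bundled
StandardMDP or LinearMDP code; in particular, the fixed iteration count and the
uniform initial distribution in mymdpfrequency are simplifying assumptions:

  function mdp_solution = mymdpsolve(mdp_data, r)
  % Plain value iteration with a fixed iteration count (a real solver would
  % test for convergence).  sa_s and sa_p are states x actions x K.
  v = zeros(mdp_data.states, 1);
  for itr = 1:1000,
      vp = v(mdp_data.sa_s);                                % successor values
      q = r + mdp_data.discount * sum(mdp_data.sa_p .* vp, 3);
      v = max(q, [], 2);
  end;
  [v, p] = max(q, [], 2);
  mdp_solution = struct('v', v, 'q', q, 'p', p);

  function svf = mymdpfrequency(mdp_data, mdp_solution)
  % Expected discounted visitation counts under the stored policy, assuming
  % a uniform initial state distribution.
  states = mdp_data.states;
  T = sparse(states, states);
  for s = 1:states,
      a = mdp_solution.p(s);
      for k = 1:size(mdp_data.sa_s, 3),
          sp = mdp_data.sa_s(s, a, k);
          T(s, sp) = T(s, sp) + mdp_data.sa_p(s, a, k);
      end;
  end;
  svf = (speye(states) - mdp_data.discount * T') \ (ones(states, 1) / states);

  function a = mymdpaction(mdp_data, mdp_solution, s)
  % Deterministic policy: look up the optimal action.
  a = mdp_solution.p(s);

  function s2 = mymdpstep(mdp_data, mdp_solution, s, a)
  % Sample a successor state from the sa_s/sa_p transition model.
  probs = squeeze(mdp_data.sa_p(s, a, :));
  k = find(cumsum(probs) >= rand(), 1);
  if isempty(k), k = length(probs); end;     % guard against rounding error
  s2 = mdp_data.sa_s(s, a, k);

  function d = mymdpcompare(p1, p2)
  % Deterministic policies: count the states on which they disagree.
  d = sum(p1 ~= p2);
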
5.3 Example domains
Example domains are only required to implement two functions: <name>build and
<name>draw. The building function has a single argument - the mdp_params
structure passed to runtest that specifies the parameters of the desired
example. The outputs are the following:
1. mdp_data - A structure containing the definition of the MDP. It must specify
the following fields:
- states - the number of states
- actions - the number of actions in each state
- discount - the discount factor of the MDP (currently all examples are
infinite horizon discounted reward)
- sa_s - 3D matrix of size states x actions x K, where K is the number of
possible successor states per state-action pair. The entry sa_s(s,a,k) specifies
the state reached when taking action a in state s and sampling destination k
(see below)
- sa_p - 3D matrix as above, where sa_p(s,a,k) gives the probability of
transitioning to sa_s(s,a,k) on action a in state s
2. r - A states x actions matrix specifying the true reward function of the
example environment.
3. feature_data - A struct containing two fields regarding the features of the
example:
- splittable - a states x features matrix containing the value of each feature
at each state
- stateadjacency - A sparse states x states matrix with "1" for each pair of
states that are "adjacent" (have an action to transition from one to the other);
this is currently only used by FIRL.
4. true_feature_map - A (sparse) states x features matrix that gives the values
of the "true" features that can be linearly combined to obtain r.
The drawing function has no output, and takes the following input arguments:
1. r - The reward function to draw.
2. p - The policy to draw (may be deterministic or stochastic).
3. mdp_params - The mdp_params struct passed to <name>build.
4. mdp_data - The mdp_data struct returned by <name>build.
5. feature_data - The feature_data struct returned by <name>build.
6. model - The MDP model (standardmdp or linearmdp).
The drawing function must draw some visualization of the specified example to
the current axes.
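A matching drawing function for the hypothetical chain domain above could be
as simple as plotting the reward into the current axes (the toolkit's real
drawing functions are more elaborate, but the signature is the same):

  function chainworlddraw(r, p, mdp_params, mdp_data, feature_data, model)
  % Plot the state component of the reward; both actions share the same
  % reward in this toy domain, so column 1 suffices.
  bar(r(:, 1));
  xlabel('state');
  ylabel('reward');
  title('chainworld reward');
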
6. License
The license included below governs the terms of use for this software. Please
direct any correspondence regarding the software to [email protected].
The IRL toolkit was created by Sergey Levine, copyright 2011. If you are using
the software in an academic publication, you may cite it as "IRL Toolkit" with a
reference to the webpage from which it was obtained.
Copyright (c) 2011, Sergey Levine
All rights reserved.
This software is made available under the Creative Commons
Attribution-Noncommercial License, viewable at
http://creativecommons.org/licenses/by-nc/3.0/. You are free to use, copy,
modify, and re-distribute the work. However, you must attribute any
re-distribution or adaptation in the manner specified below, and you may not
use this work for commercial purposes without the permission of the author.
Any re-distribution or adaptation of this work must contain the author's name
(Sergey Levine) and a link to the software's original webpage.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.