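There are many reinforcement learning libraries. The most popular is OpenAI's Gym, which provides single-agent environments. Other useful environment libraries also exist, such as PettingZoo.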
Gym#
🔗 Source: https://zhuanlan.zhihu.com/p/482821112
⏰ Clipped: 2024-04-24 13:28:37 (UTC+8)
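When writing code around a reinforcement learning model, an important part is the code that interacts with the environment. Gym is an open-source library from OpenAI for developing and comparing reinforcement learning algorithms. It makes no assumptions about the structure of the agent and is compatible with any numerical computation library, such as TensorFlow or Theano. Users can also use Gym to define custom environments suited to their own models.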
Spaces#
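When building a real reinforcement learning model, many parameters are needed to describe the environment, and they differ in data type, value range, and default value. These parameters have to be grouped by kind before they can be handled cleanly, and Gym supports the different kinds through its Space classes.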
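A CartPole example#
Below is the classic CartPole problem written with Gym. A cart moves left and right along a track to keep the pole mounted on it from falling over.

    import gym

    # Written against the classic Gym API (before 0.26), where reset()
    # returns only the observation and step() returns four values.
    env = gym.make('CartPole-v0')
    for i_episode in range(20):
        observation = env.reset()
        for t in range(100):
            env.render()
            print(observation)
            # Pick a random action from the action space.
            action = env.action_space.sample()
            observation, reward, done, info = env.step(action)
            if done:
                print("Episode finished after {} timesteps".format(t+1))
                break
    env.close()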
Output:
[-0.061586 -0.75893141 0.05793238 1.15547541]
[-0.07676463 -0.95475889 0.08104189 1.46574644]
[-0.0958598 -1.15077434 0.11035682 1.78260485]
[-0.11887529 -0.95705275 0.14600892 1.5261692 ]
[-0.13801635 -0.7639636 0.1765323 1.28239155]
[-0.15329562 -0.57147373 0.20218013 1.04977545]
Episode finished after 14 timesteps
[-0.02786724 0.00361763 -0.03938967 -0.01611184]
[-0.02779488 -0.19091794 -0.03971191 0.26388759]
[-0.03161324 0.00474768 -0.03443415 -0.04105167]
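The reset/step loop used above does not depend on CartPole itself. As a minimal sketch, the hypothetical `ToyCorridorEnv` below (an illustration, not part of Gym) exposes the same classic interface the example assumes: `reset()` returns an observation and `step(action)` returns `(observation, reward, done, info)`. The names `ToyCorridorEnv`, `sample_action`, and `run_episode` are made up for this sketch.

```python
import random


class ToyCorridorEnv:
    """Hypothetical stand-in for a Gym environment (not part of Gym).

    The agent starts in the middle of a corridor of `length` cells, and an
    episode ends when it reaches either end.
    """

    def __init__(self, length=7, seed=None):
        self.length = length
        self.rng = random.Random(seed)
        self.position = length // 2

    def sample_action(self):
        # Plays the role of env.action_space.sample(): 0 = left, 1 = right.
        return self.rng.randrange(2)

    def reset(self):
        # Start every episode from the middle cell and return the observation.
        self.position = self.length // 2
        return self.position

    def step(self, action):
        # Move one cell right for action 1, one cell left otherwise.
        self.position += 1 if action == 1 else -1
        done = self.position in (0, self.length - 1)
        # Only reaching the right end is rewarded.
        reward = 1.0 if self.position == self.length - 1 else 0.0
        info = {}
        return self.position, reward, done, info


def run_episode(env, max_steps=100):
    """Run one episode with random actions; return (steps_taken, total_reward)."""
    env.reset()
    total_reward = 0.0
    for t in range(max_steps):
        _, reward, done, _ = env.step(env.sample_action())
        total_reward += reward
        if done:
            return t + 1, total_reward
    return max_steps, total_reward


env = ToyCorridorEnv(length=7, seed=0)
for i_episode in range(3):
    steps, total = run_episode(env)
    print(f"Episode {i_episode} finished after {steps} timesteps, return {total}")
```

Because the loop only touches `reset`, `step`, and action sampling, the same driver code could in principle be pointed at any environment with this interface.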
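Using Spaces#
In the example above, we kept sampling random actions from the environment's action_space. But what exactly are these actions? Every environment comes with Space objects of the types it needs, which describe the format of valid actions and observations: action_space and observation_space. For example:

    import gym

    env = gym.make('CartPole-v0')
    print(env.action_space)
    #> Discrete(2)
    print(env.observation_space)
    #> Box(4,)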
By default, Discrete covers a fixed range of non-negative integers, so the valid actions here are 0 and 1.
The class is documented as follows:

    class Discrete(Space[int]):
        r"""A discrete space in :math:`\{ 0, 1, \\dots, n-1 \}`.

        A start value can be optionally specified to shift the range
        to :math:`\{ a, a+1, \\dots, a+n-1 \}`.

        Example::

            >>> Discrete(2)            # {0, 1}
            >>> Discrete(3, start=-1)  # {-1, 0, 1}
        """
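To make the docstring concrete, here is a minimal sketch of the idea behind Discrete using only the standard library. `ToyDiscrete` and its `values` helper are hypothetical names for this illustration and are not Gym's implementation; only the `n` and `start` parameters mirror the docstring above.

```python
import random


class ToyDiscrete:
    """Hypothetical re-implementation of the idea behind Discrete (not Gym code)."""

    def __init__(self, n, start=0, seed=None):
        if n <= 0:
            raise ValueError("n must be a positive integer")
        self.n = n
        self.start = start
        self.rng = random.Random(seed)

    def sample(self):
        # Uniformly pick one of the n integers start, ..., start + n - 1.
        return self.start + self.rng.randrange(self.n)

    def contains(self, x):
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(x, bool) or not isinstance(x, int):
            return False
        return self.start <= x < self.start + self.n

    def values(self):
        # Enumerate every element of the space.
        return list(range(self.start, self.start + self.n))


cartpole_actions = ToyDiscrete(2)
shifted = ToyDiscrete(3, start=-1)
print(cartpole_actions.values())  # [0, 1]
print(shifted.values())           # [-1, 0, 1]
```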
Box describes an n-dimensional real space $\mathbb{R}^n$ in which each dimension may or may not have lower and upper bounds. The class is documented as follows:

    class Box(Space[np.ndarray]):
        """
        A (possibly unbounded) box in R^n. Specifically, a Box represents the
        Cartesian product of n closed intervals. Each interval has the form of one
        of [a, b], (-oo, b], [a, oo), or (-oo, oo).

        There are two common use cases:

        * Identical bound for each dimension::

            >>> Box(low=-1.0, high=2.0, shape=(3, 4), dtype=np.float32)
            Box(3, 4)

        * Independent bound for each dimension::

            >>> Box(low=np.array([-1.0, -2.0]), high=np.array([2.0, 4.0]), dtype=np.float32)
            Box(2,)
        """

        def __init__(
            self,
            low: Union[SupportsFloat, np.ndarray],
            high: Union[SupportsFloat, np.ndarray],
            shape: Optional[Sequence[int]] = None,
            dtype: Type = np.float32,
            seed: Optional[int] = None,
        ):
            ...  # constructor body omitted in this excerpt
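The two use cases in the Box docstring (one shared bound versus one bound per dimension) and the possibly unbounded intervals can be sketched with plain lists. `ToyBox` is a hypothetical, one-dimensional-only illustration; its `size` parameter and its refusal to sample from unbounded boxes are simplifications of this sketch, not Gym behavior.

```python
import math
import random


def _expand(bound, size):
    # A scalar bound is repeated for every dimension; a sequence is copied.
    if isinstance(bound, (int, float)):
        return [float(bound)] * size
    return [float(b) for b in bound]


class ToyBox:
    """Hypothetical 1-D analogue of Box built on plain lists (not Gym code)."""

    def __init__(self, low, high, size=None, seed=None):
        if size is None:
            # Infer the size from whichever bound was given as a sequence.
            sized = [b for b in (low, high) if not isinstance(b, (int, float))]
            if not sized:
                raise ValueError("size is required when both bounds are scalars")
            size = len(sized[0])
        self.low = _expand(low, size)
        self.high = _expand(high, size)
        if len(self.low) != size or len(self.high) != size:
            raise ValueError("low and high must both have length `size`")
        if any(lo > hi for lo, hi in zip(self.low, self.high)):
            raise ValueError("every low bound must be <= its high bound")
        self.rng = random.Random(seed)

    def contains(self, point):
        # Each coordinate must lie in its own closed interval [low, high].
        return len(point) == len(self.low) and all(
            lo <= x <= hi for lo, x, hi in zip(self.low, point, self.high)
        )

    def sample(self):
        # Simplification: only fully bounded boxes are sampled (uniformly).
        if any(math.isinf(b) for b in self.low + self.high):
            raise ValueError("this sketch only samples from fully bounded boxes")
        return [self.rng.uniform(lo, hi) for lo, hi in zip(self.low, self.high)]


shared = ToyBox(-1.0, 2.0, size=3, seed=0)           # identical bound per dimension
per_dim = ToyBox([-1.0, -2.0], [2.0, 4.0])           # independent bounds
half_open = ToyBox([-math.inf, 0.0], [math.inf, 1.0])  # first dimension unbounded
print(shared.contains(shared.sample()))   # True
print(per_dim.contains([2.5, 0.0]))       # False
print(half_open.contains([1e9, 0.5]))     # True
```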
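Summary: Box and Discrete are the two classes used most often in custom environments. The spaces module contains several other classes as well, which are covered in the next subsection.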
Other types of Spaces#
Besides Box and Discrete, Spaces provides other types. The module's full export list, which also includes a few flattening helper functions, is:

    __all__ = [
        "Space",
        "Box",
        "Discrete",
        "MultiDiscrete",
        "MultiBinary",
        "Tuple",
        "Dict",
        "flatdim",
        "flatten_space",
        "flatten",
        "unflatten",
    ]
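The export list also names helpers such as `flatdim`, which reports how long a vector is needed to hold a flattened element of a space. The sketch below illustrates the idea on a made-up tuple-based description of spaces. It assumes discrete values are flattened as one-hot vectors, which should be verified against the installed Gym version; `toy_flatdim` and the `(kind, argument)` encoding are hypothetical.

```python
from math import prod


def toy_flatdim(spec):
    """Length of a flat vector for a space described as (kind, argument).

    This encoding is hypothetical and exists only for this sketch.
    """
    kind, arg = spec
    if kind == "discrete":       # assumed one-hot vector of length n
        return arg
    if kind == "box":            # one entry per array element; shape () gives 1
        return prod(arg)
    if kind == "multibinary":    # one entry per bit
        return arg
    if kind == "multidiscrete":  # assumed one one-hot block per sub-action
        return sum(arg)
    if kind == "tuple":          # sub-spaces are concatenated
        return sum(toy_flatdim(sub) for sub in arg)
    if kind == "dict":           # values are concatenated
        return sum(toy_flatdim(sub) for sub in arg.values())
    raise ValueError(f"unknown space kind: {kind!r}")


observation_spec = ("dict", {
    "position": ("discrete", 2),
    "velocity": ("discrete", 3),
})
print(toy_flatdim(observation_spec))  # 5
```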
Dict is a dictionary-style space that can embed other spaces, including nested ones. The class is documented as follows:

    class Dict(Space[TypingDict[str, Space]], Mapping):
        """
        A dictionary of simpler spaces.

        Example usage:
        self.observation_space = spaces.Dict({"position": spaces.Discrete(2), "velocity": spaces.Discrete(3)})

        Example usage [nested]:
        self.nested_observation_space = spaces.Dict({
            'sensors': spaces.Dict({
                'position': spaces.Box(low=-100, high=100, shape=(3,)),
                'velocity': spaces.Box(low=-1, high=1, shape=(3,)),
                'front_cam': spaces.Tuple((
                    spaces.Box(low=0, high=1, shape=(10, 10, 3)),
                    spaces.Box(low=0, high=1, shape=(10, 10, 3))
                )),
                'rear_cam': spaces.Box(low=0, high=1, shape=(10, 10, 3)),
            }),
            'ext_controller': spaces.MultiDiscrete((5, 2, 2)),
            'inner_state': spaces.Dict({
                'charge': spaces.Discrete(100),
                'system_checks': spaces.MultiBinary(10),
                'job_status': spaces.Dict({
                    'task': spaces.Discrete(5),
                    'progress': spaces.Box(low=0, high=100, shape=()),
                })
            })
        })
        """
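A nested space like the one in the docstring is sampled and checked recursively: a dictionary of spaces yields a dictionary of samples, and a tuple of spaces yields a tuple of samples. The sketch below shows that recursion with plain Python `dict` and `tuple` containers. `ToyLeaf`, `sample_nested`, and `contains_nested` are hypothetical names, and `ToyLeaf` only stands in for a Discrete-like leaf.

```python
import random


class ToyLeaf:
    """Hypothetical leaf space holding the integers 0..n-1 (not Gym code)."""

    def __init__(self, n):
        self.n = n

    def sample(self, rng):
        return rng.randrange(self.n)

    def contains(self, x):
        return isinstance(x, int) and 0 <= x < self.n


def sample_nested(space, rng):
    # Dictionaries and tuples are containers; anything else is a leaf space.
    if isinstance(space, dict):
        return {key: sample_nested(sub, rng) for key, sub in space.items()}
    if isinstance(space, tuple):
        return tuple(sample_nested(sub, rng) for sub in space)
    return space.sample(rng)


def contains_nested(space, value):
    if isinstance(space, dict):
        return (
            isinstance(value, dict)
            and value.keys() == space.keys()
            and all(contains_nested(space[k], value[k]) for k in space)
        )
    if isinstance(space, tuple):
        return (
            isinstance(value, tuple)
            and len(value) == len(space)
            and all(contains_nested(s, v) for s, v in zip(space, value))
        )
    return space.contains(value)


observation_space = {
    "position": ToyLeaf(2),
    "velocity": ToyLeaf(3),
    "cams": (ToyLeaf(4), ToyLeaf(4)),  # plays the role of a Tuple space
}
rng = random.Random(0)
observation = sample_nested(observation_space, rng)
print(observation)
print(contains_nested(observation_space, observation))  # True
```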
MultiBinary is a multi-dimensional space whose entries are all 0 or 1. The class is documented as follows:

    class MultiBinary(Space[np.ndarray]):
        """
        An n-shape binary space.

        The argument to MultiBinary defines n, which could be a number or a `list` of numbers.

        Example Usage:

        >> self.observation_space = spaces.MultiBinary(5)
        >> self.observation_space.sample()
            array([0, 1, 0, 1, 0], dtype=int8)

        >> self.observation_space = spaces.MultiBinary([3, 2])
        >> self.observation_space.sample()
            array([[0, 0],
                   [0, 1],
                   [1, 1]], dtype=int8)
        """
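The two argument forms in the docstring (a single number or a list of numbers) can be illustrated with nested Python lists in place of NumPy arrays. `sample_multibinary` and `is_multibinary` are hypothetical helpers for this sketch only.

```python
import random


def sample_multibinary(shape, rng):
    """Return nested lists of random bits with the given shape.

    `shape` is an int (flat vector) or a non-empty list of ints.
    """
    if isinstance(shape, int):
        shape = [shape]
    head, *rest = shape
    if not rest:
        # Innermost dimension: a flat list of 0/1 values.
        return [rng.randrange(2) for _ in range(head)]
    # Outer dimension: build `head` independent sub-arrays.
    return [sample_multibinary(rest, rng) for _ in range(head)]


def is_multibinary(value):
    # True when every leaf of the nested list is 0 or 1.
    if isinstance(value, list):
        return all(is_multibinary(item) for item in value)
    return value in (0, 1)


rng = random.Random(0)
flat = sample_multibinary(5, rng)       # like MultiBinary(5)
grid = sample_multibinary([3, 2], rng)  # like MultiBinary([3, 2])
print(flat)
print(grid)
```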
MultiDiscrete is similar to MultiBinary, except that each entry can range over more than two integer values. The class is documented as follows:

    class MultiDiscrete(Space[np.ndarray]):
        """
        - The multi-discrete action space consists of a series of discrete action spaces with different number of actions in each
        - It is useful to represent game controllers or keyboards where each key can be represented as a discrete action space
        - It is parametrized by passing an array of positive integers specifying number of actions for each discrete action space

        Note: Some environment wrappers assume a value of 0 always represents the NOOP action.

        e.g. Nintendo Game Controller
        - Can be conceptualized as 3 discrete action spaces:

            1) Arrow Keys: Discrete 5  - NOOP[0], UP[1], RIGHT[2], DOWN[3], LEFT[4]  - params: min: 0, max: 4
            2) Button A:   Discrete 2  - NOOP[0], Pressed[1] - params: min: 0, max: 1
            3) Button B:   Discrete 2  - NOOP[0], Pressed[1] - params: min: 0, max: 1

        - Can be initialized as

            MultiDiscrete([ 5, 2, 2 ])
        """
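The game-controller example from the docstring can be sketched directly: each action is a list of three integers, one per sub-space, with sizes `[5, 2, 2]`. The helper names below are hypothetical, and the label tables simply restate the docstring's mapping.

```python
import random

# Label tables copied from the docstring's controller example.
ARROWS = ["NOOP", "UP", "RIGHT", "DOWN", "LEFT"]
BUTTON = ["NOOP", "Pressed"]
CONTROLLER_NVEC = [5, 2, 2]


def sample_multidiscrete(nvec, rng):
    # Draw one integer in [0, n) independently for every sub-space.
    return [rng.randrange(n) for n in nvec]


def contains_multidiscrete(nvec, action):
    return len(action) == len(nvec) and all(
        isinstance(a, int) and 0 <= a < n for a, n in zip(action, nvec)
    )


def describe(action):
    # Translate the three integers into readable controller state.
    arrow, button_a, button_b = action
    return {
        "arrows": ARROWS[arrow],
        "button_a": BUTTON[button_a],
        "button_b": BUTTON[button_b],
    }


rng = random.Random(0)
action = sample_multidiscrete(CONTROLLER_NVEC, rng)
print(action, describe(action))
print(describe([2, 0, 1]))  # {'arrows': 'RIGHT', 'button_a': 'NOOP', 'button_b': 'Pressed'}
```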
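Tuple is similar to Dict, except that the sub-spaces are combined by position instead of by key. The class is documented as follows:
```python
class Tuple(Space[tuple], Sequence):
    """
    A tuple (i.e., product) of simpler spaces

    Example usage:
    self.observation_space = spaces.Tuple((spaces.Discrete(2), spaces.Discrete(3)))
    """
```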