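Environments in Reinforcement Learning

There are many reinforcement learning libraries. The most popular is OpenAI's Gym, which provides single-agent environments. Other useful environment libraries exist as well, such as PettingZoo.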

🔗 Original link: https://zhuanlan.zhihu.com/p/482821112
⏰ Clipped at: 2024-04-24 13:28:37 (UTC+8)

Gym#

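When writing code for a reinforcement learning model, an important part of the work is the code that interacts with the environment. Gym is an open-source library from OpenAI for developing and comparing reinforcement learning algorithms. It makes no assumptions about the structure of the agent and is compatible with any numerical computation library, such as TensorFlow or Theano. Users can also use Gym to define a custom Gym Environment suited to their own model.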

Spaces#

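A real reinforcement learning model needs many parameters to describe its environment, and these parameters differ in data type, value range, and default value. They need to be grouped into categories before they can be handled cleanly, and Gym provides the Spaces classes to support these different kinds of data.

A CartPole Example#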

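The classic CartPole problem looks like this in Gym. A cart moves left and right along a track and tries to keep the pole mounted on it from falling over.

```python
import gym

# Written against the classic Gym API (releases before 0.26), where
# reset() returns only the observation and step() returns a 4-tuple.
env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()  # start a new episode
    for t in range(100):
        env.render()  # draw the current state
        print(observation)
        action = env.action_space.sample()  # pick a random action
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
env.close()
```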

Output:

```
[-0.061586   -0.75893141  0.05793238  1.15547541]
[-0.07676463 -0.95475889  0.08104189  1.46574644]
[-0.0958598  -1.15077434  0.11035682  1.78260485]
[-0.11887529 -0.95705275  0.14600892  1.5261692 ]
[-0.13801635 -0.7639636   0.1765323   1.28239155]
[-0.15329562 -0.57147373  0.20218013  1.04977545]
Episode finished after 14 timesteps
[-0.02786724  0.00361763 -0.03938967 -0.01611184]
[-0.02779488 -0.19091794 -0.03971191  0.26388759]
[-0.03161324  0.00474768 -0.03443415 -0.04105167]
```
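
The loop above relies on only two calls, `reset()` and `step()`. To make that contract concrete, here is a minimal stand-in environment written with the standard library alone. `ToyBalanceEnv`, its integer `offset` state, the `limit` parameter, and `sample_action()` are hypothetical names invented for this sketch and are not part of Gym. It follows the same 4-tuple `step()` convention as the CartPole example.

```python
import random


class ToyBalanceEnv:
    """Hypothetical stand-in environment (not part of Gym).

    The state is a single integer offset. The episode ends once the
    offset drifts outside [-limit, limit].
    """

    def __init__(self, limit=3, seed=None):
        self.limit = limit
        self.rng = random.Random(seed)
        self.offset = 0

    def reset(self):
        # Start every episode from the centre position.
        self.offset = 0
        return self.offset

    def step(self, action):
        # Action 0 pushes left, action 1 pushes right.
        self.offset += 1 if action == 1 else -1
        done = abs(self.offset) > self.limit
        # Reward 1.0 for every step that stays within the limit.
        reward = 0.0 if done else 1.0
        return self.offset, reward, done, {}

    def sample_action(self):
        # Plays the role of env.action_space.sample() in the loop above.
        return self.rng.randint(0, 1)


def run_episode(env, max_steps=100):
    """Run one episode with random actions and return the number of steps taken."""
    env.reset()
    for t in range(max_steps):
        _, _, done, _ = env.step(env.sample_action())
        if done:
            return t + 1
    return max_steps
```

Calling `run_episode(ToyBalanceEnv(seed=0))` plays one random episode. A real custom environment would expose `action_space` and `observation_space` objects instead of the ad hoc `sample_action()` helper.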

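Using Spaces#

In the example above, we kept sampling random actions from the environment's action_space. But what exactly are these actions? Every environment comes with Spaces that match the types it needs. They describe the format of valid actions and observations: action_space and observation_space. For example:

```python
import gym
env = gym.make('CartPole-v0')
print(env.action_space)
#> Discrete(2)
print(env.observation_space)
#> Box(4,)
```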

Discrete describes a fixed range of integers, by default the non-negative integers 0 to n-1, so in this case the valid actions are 0 and 1.
The class is documented as follows:

```python
class Discrete(Space[int]):
    r"""A discrete space in :math:`\{ 0, 1, \\dots, n-1 \}`.
    A start value can be optionally specified to shift the range
    to :math:`\{ a, a+1, \\dots, a+n-1 \}`.
    Example::
        >>> Discrete(2)            # {0, 1}
        >>> Discrete(3, start=-1)  # {-1, 0, 1}
    """
```
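
The semantics in this docstring can be sketched in a few lines of plain Python. `ToyDiscrete` is a hypothetical class written for illustration only. It is not Gym's implementation and just mirrors the set {start, ..., start + n - 1} with a `sample()` and a `contains()` method.

```python
import random


class ToyDiscrete:
    """Illustrative re-implementation of the semantics in the docstring above.

    This is not Gym's class. It only mirrors the set {start, ..., start + n - 1}.
    """

    def __init__(self, n, start=0, seed=None):
        if n <= 0:
            raise ValueError("n must be positive")
        self.n = n
        self.start = start
        self._rng = random.Random(seed)

    def sample(self):
        # Draw uniformly from the n valid integers.
        return self.start + self._rng.randrange(self.n)

    def contains(self, x):
        # Only integers inside the shifted range are members.
        return isinstance(x, int) and self.start <= x < self.start + self.n
```

With `ToyDiscrete(2)` the members are 0 and 1, matching the CartPole action space. `ToyDiscrete(3, start=-1)` shifts the range so that -1 becomes a member and 2 does not.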

Box describes an n-dimensional real space $\mathbb{R}^n$, with or without upper and lower bounds. It is documented as follows:

```python
class Box(Space[np.ndarray]):
    """
    A (possibly unbounded) box in R^n. Specifically, a Box represents the
    Cartesian product of n closed intervals. Each interval has the form of one
    of [a, b], (-oo, b], [a, oo), or (-oo, oo).
    There are two common use cases:
    * Identical bound for each dimension::
        >>> Box(low=-1.0, high=2.0, shape=(3, 4), dtype=np.float32)
        Box(3, 4)
    * Independent bound for each dimension::
        >>> Box(low=np.array([-1.0, -2.0]), high=np.array([2.0, 4.0]), dtype=np.float32)
        Box(2,)
    """
    def __init__(
        self,
        low: Union[SupportsFloat, np.ndarray],
        high: Union[SupportsFloat, np.ndarray],
        shape: Optional[Sequence[int]] = None,
        dtype: Type = np.float32,
        seed: Optional[int] = None,
    ):
        ...  # body omitted in this excerpt
```
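
The "Cartesian product of closed intervals" idea can be illustrated without NumPy. The sketch below uses plain Python lists. `ToyBox` is a hypothetical name, and refusing to sample from unbounded intervals is a simplification chosen for this example, not a description of how Gym behaves.

```python
import math
import random


class ToyBox:
    """Illustrative box over plain Python lists (not Gym's Box, which uses NumPy arrays)."""

    def __init__(self, low, high, seed=None):
        if len(low) != len(high):
            raise ValueError("low and high must have the same length")
        if any(lo > hi for lo, hi in zip(low, high)):
            raise ValueError("each lower bound must not exceed its upper bound")
        self.low = list(low)
        self.high = list(high)
        self._rng = random.Random(seed)

    def contains(self, point):
        # A point is inside when every coordinate lies in its closed interval.
        return len(point) == len(self.low) and all(
            lo <= x <= hi for lo, x, hi in zip(self.low, point, self.high)
        )

    def sample(self):
        # Uniform sampling is only defined here for fully bounded boxes.
        if any(math.isinf(b) for b in self.low + self.high):
            raise ValueError("cannot sample uniformly from an unbounded interval")
        return [self._rng.uniform(lo, hi) for lo, hi in zip(self.low, self.high)]
```

`ToyBox([-1.0, -2.0], [2.0, 4.0])` corresponds to the "independent bound for each dimension" case in the docstring. Passing `float("inf")` as a bound gives one of the half-open forms, for which only `contains()` is supported in this sketch.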

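Summary: Box and Discrete are the two classes used most often in custom environments. The Spaces module contains several other classes as well, which are covered in the next subsection.

Other Types of Spaces#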

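Besides Box and Discrete, Spaces provides several other types. The module's full export list, which also includes helper functions such as flatten, is:

```python
__all__ = [
    "Space",
    "Box",
    "Discrete",
    "MultiDiscrete",
    "MultiBinary",
    "Tuple",
    "Dict",
    "flatdim",
    "flatten_space",
    "flatten",
    "unflatten",
]
```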

Dict is a dictionary-style space that can nest other spaces, including further Dict spaces, under string keys. It is documented as follows:

```python
class Dict(Space[TypingDict[str, Space]], Mapping):
    """
    A dictionary of simpler spaces.
    Example usage:
    self.observation_space = spaces.Dict({"position": spaces.Discrete(2), "velocity": spaces.Discrete(3)})
    Example usage [nested]:
    self.nested_observation_space = spaces.Dict({
        'sensors':  spaces.Dict({
            'position': spaces.Box(low=-100, high=100, shape=(3,)),
            'velocity': spaces.Box(low=-1, high=1, shape=(3,)),
            'front_cam': spaces.Tuple((
                spaces.Box(low=0, high=1, shape=(10, 10, 3)),
                spaces.Box(low=0, high=1, shape=(10, 10, 3))
            )),
            'rear_cam': spaces.Box(low=0, high=1, shape=(10, 10, 3)),
        }),
        'ext_controller': spaces.MultiDiscrete((5, 2, 2)),
        'inner_state':spaces.Dict({
            'charge': spaces.Discrete(100),
            'system_checks': spaces.MultiBinary(10),
            'job_status': spaces.Dict({
                'task': spaces.Discrete(5),
                'progress': spaces.Box(low=0, high=100, shape=()),
            })
        })
    })
    """
```
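
A sample drawn from a nested space has the same shape as the space itself: dictionaries stay dictionaries and tuples stay tuples. The sketch below shows that recursion with the standard library only. `sample_nested` and the `inner_state` spec are hypothetical, loosely modelled on the 'inner_state' entry in the docstring, and the leaves are plain zero-argument functions standing in for real space objects.

```python
import random

rng = random.Random(0)


def sample_nested(spec):
    """Recursively sample from a nested spec.

    A dict maps names to sub-specs, a tuple holds sub-specs by position,
    and any other value is assumed to be a zero-argument sampler.
    """
    if isinstance(spec, dict):
        return {key: sample_nested(sub) for key, sub in spec.items()}
    if isinstance(spec, tuple):
        return tuple(sample_nested(sub) for sub in spec)
    return spec()


# Hypothetical spec loosely modelled on the 'inner_state' entry above.
inner_state = {
    "charge": lambda: rng.randrange(100),
    "system_checks": lambda: [rng.randint(0, 1) for _ in range(10)],
    "job_status": {
        "task": lambda: rng.randrange(5),
        "progress": lambda: rng.uniform(0.0, 100.0),
    },
}

observation = sample_nested(inner_state)
```

The resulting `observation` is a dictionary with the keys `charge`, `system_checks`, and `job_status`, where `job_status` is itself a dictionary.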

MultiBinary is a multi-dimensional space whose entries can only be 0 or 1. It is documented as follows:

```python
class MultiBinary(Space[np.ndarray]):
    """
    An n-shape binary space.
    The argument to MultiBinary defines n, which could be a number or a `list` of numbers.
    Example Usage:
    >> self.observation_space = spaces.MultiBinary(5)
    >> self.observation_space.sample()
        array([0, 1, 0, 1, 0], dtype=int8)
    >> self.observation_space = spaces.MultiBinary([3, 2])
    >> self.observation_space.sample()
        array([[0, 0],
               [0, 1],
               [1, 1]], dtype=int8)
    """
```
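
The two constructor forms (a single number or a list of numbers) differ only in how many dimensions the sample has. A minimal sketch with nested Python lists instead of NumPy arrays follows. `sample_multibinary` is a hypothetical helper, not a Gym function.

```python
import random


def sample_multibinary(shape, rng=None):
    """Return nested lists of 0/1 values with the given shape.

    `shape` may be an int (one dimension) or a sequence of ints, mirroring
    the two constructor forms in the docstring above.
    """
    rng = rng or random.Random()
    dims = [shape] if isinstance(shape, int) else list(shape)

    def build(remaining):
        # No dimensions left: emit a single binary entry.
        if not remaining:
            return rng.randint(0, 1)
        # Otherwise build one sub-structure per index of the first dimension.
        return [build(remaining[1:]) for _ in range(remaining[0])]

    return build(dims)
```

`sample_multibinary(5)` yields a flat list of five bits, and `sample_multibinary([3, 2])` yields three rows of two bits each.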

MultiDiscrete is similar to MultiBinary, except that each component can take more than two integer values, and each component may have its own number of values. It is documented as follows:

```python
class MultiDiscrete(Space[np.ndarray]):
    """
    - The multi-discrete action space consists of a series of discrete action spaces with different number of actions in each
    - It is useful to represent game controllers or keyboards where each key can be represented as a discrete action space
    - It is parametrized by passing an array of positive integers specifying number of actions for each discrete action space
    Note: Some environment wrappers assume a value of 0 always represents the NOOP action.
    e.g. Nintendo Game Controller
    - Can be conceptualized as 3 discrete action spaces:
        1) Arrow Keys: Discrete 5  - NOOP[0], UP[1], RIGHT[2], DOWN[3], LEFT[4]  - params: min: 0, max: 4
        2) Button A:   Discrete 2  - NOOP[0], Pressed[1] - params: min: 0, max: 1
        3) Button B:   Discrete 2  - NOOP[0], Pressed[1] - params: min: 0, max: 1
    - Can be initialized as
        MultiDiscrete([ 5, 2, 2 ])
    """
```
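
The game controller example from the docstring can be written out as plain data to show what an action vector means. In this sketch `NVEC`, `sample_action`, `is_valid`, and `describe` are hypothetical names chosen for illustration. Only the labels and the sizes [5, 2, 2] come from the docstring.

```python
import random

# Number of choices per component, as in MultiDiscrete([5, 2, 2]).
NVEC = [5, 2, 2]
ARROWS = ["NOOP", "UP", "RIGHT", "DOWN", "LEFT"]
BUTTON = ["NOOP", "Pressed"]


def sample_action(rng):
    # Each component is drawn independently from {0, ..., n - 1}.
    return [rng.randrange(n) for n in NVEC]


def is_valid(action):
    # Valid actions have one in-range integer per component.
    return len(action) == len(NVEC) and all(
        isinstance(a, int) and 0 <= a < n for a, n in zip(action, NVEC)
    )


def describe(action):
    """Translate an action vector into the labels used in the docstring."""
    arrow, a, b = action
    return {"arrows": ARROWS[arrow], "button_a": BUTTON[a], "button_b": BUTTON[b]}
```

For instance, the action `[2, 1, 0]` means "RIGHT on the arrow keys, button A pressed, button B untouched", while `[5, 0, 0]` is invalid because the arrow component only has five values, 0 through 4.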
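
Tuple is similar to Dict in that it combines simpler spaces, but its components are indexed by position rather than by key. It is documented as follows:

```python
class Tuple(Space[tuple], Sequence):
    """
    A tuple (i.e., product) of simpler spaces
    Example usage:
    self.observation_space = spaces.Tuple((spaces.Discrete(2), spaces.Discrete(3)))
    """
```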

References#

https://zhuanlan.zhihu.com/p/482821112