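Environments in Reinforcement Learning

There are many libraries for reinforcement learning. The most popular is OpenAI's gym, which provides single-agent environments for reinforcement learning. Other useful environment libraries exist as well, such as PettingZoo, which targets multi-agent settings.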

Gym#

🔗 Original link: https://zhuanlan.zhihu.com/p/482821112
⏰ Clipped at: 2024-04-24 13:28:37 (UTC+8)

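When writing code for a reinforcement learning model, one of the key steps is writing the code that interacts with the environment. Gym is an open-source library that OpenAI provides for reinforcement learning practitioners, used for developing and comparing reinforcement learning algorithms. Gym makes no assumptions about the structure of the agent and is compatible with any numerical computation library, such as TensorFlow or Theano. Users can also use Gym to define custom environments suited to their own models.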
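
As a rough illustration of what defining your own environment involves, the sketch below mimics the classic `reset()` / `step()` interface in plain Python. `CorridorEnv`, its corridor length, and its reward values are hypothetical and chosen only for illustration. It does not import Gym; a real custom environment would subclass `gym.Env` and declare its `action_space` and `observation_space`.

```python
import random


class CorridorEnv:
    """Toy 1-D corridor: start at cell 0, reach the last cell to finish.

    Hypothetical stand-in that only mirrors the classic Gym interface:
    reset() -> observation, step(action) -> (observation, reward, done, info).
    """

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        # Put the agent back at the left end and return the first observation.
        self.position = 0
        return self.position

    def step(self, action):
        # Action 0 moves left, action 1 moves right; walls clamp the position.
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 1.0 if done else -0.1  # small step cost, bonus at the goal
        return self.position, reward, done, {}


def run_episode(env, max_steps=100, seed=0):
    """Run one episode with uniformly random actions; return the steps taken."""
    rng = random.Random(seed)
    env.reset()
    for t in range(max_steps):
        _, _, done, _ = env.step(rng.choice([0, 1]))
        if done:
            return t + 1
    return max_steps
```

Calling `run_episode(CorridorEnv())` plays one random episode and returns how many steps it took, capped at `max_steps`.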

Spaces#

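When actually building a reinforcement learning model, many parameters are needed to characterize the environment, and these parameters differ in data type, value range, default value, and so on. To handle them cleanly they need to be categorized, and Gym supports these different data types through its Spaces classes.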

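CartPole Example#

The classic CartPole problem, written with Gym, looks like the code below. It describes a cart that moves left and right along a track to keep the pole mounted on it from falling over.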

import gym

# Classic (pre-0.26) Gym API: reset() returns the observation and
# step() returns (observation, reward, done, info).
env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()  # start a new episode
    for t in range(100):
        env.render()  # draw the current frame
        print(observation)
        action = env.action_space.sample()  # pick a random action
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
env.close()

Output:

[-0.061586   -0.75893141  0.05793238  1.15547541]
[-0.07676463 -0.95475889  0.08104189  1.46574644]
[-0.0958598  -1.15077434  0.11035682  1.78260485]
[-0.11887529 -0.95705275  0.14600892  1.5261692 ]
[-0.13801635 -0.7639636   0.1765323   1.28239155]
[-0.15329562 -0.57147373  0.20218013  1.04977545]
Episode finished after 14 timesteps
[-0.02786724  0.00361763 -0.03938967 -0.01611184]
[-0.02779488 -0.19091794 -0.03971191  0.26388759]
[-0.03161324  0.00474768 -0.03443415 -0.04105167]
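
The loop above picks every action with `env.action_space.sample()`. As a hedged illustration of swapping in a hand-written policy, the function below pushes the cart toward the side the pole is leaning. It assumes the usual CartPole observation layout (cart position, cart velocity, pole angle, pole angular velocity) and action encoding (0 = push left, 1 = push right). `choose_action` is a hypothetical helper, not part of Gym.

```python
def choose_action(observation):
    """Return 1 (push right) if the pole leans right, else 0 (push left)."""
    _cart_pos, _cart_vel, pole_angle, _pole_vel = observation
    return 1 if pole_angle > 0 else 0


# Two observations copied from the printed output above.
leaning_right = [-0.061586, -0.75893141, 0.05793238, 1.15547541]
leaning_left = [-0.02786724, 0.00361763, -0.03938967, -0.01611184]

print(choose_action(leaning_right))  # 1
print(choose_action(leaning_left))   # 0
```

Inside the loop, `action = choose_action(observation)` would take the place of the random sample.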

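Using Spaces#

In the example above, we sampled random actions from the environment's action_space. But what exactly are these actions? Every environment comes with Space objects that match the types it needs, and they describe the format of valid actions and observations: action_space and observation_space. For example: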

import gym
env = gym.make('CartPole-v0')
print(env.action_space)
#> Discrete(2)
print(env.observation_space)
#> Box(4,)

Here, Discrete allows a fixed range of non-negative integers, so in this case the valid actions are 0 and 1.
The class's docstring shows how it is used:

class Discrete(Space[int]):
    r"""A discrete space in :math:`\{ 0, 1, \\dots, n-1 \}`.
    A start value can be optionally specified to shift the range
    to :math:`\{ a, a+1, \\dots, a+n-1 \}`.
    Example::
        >>> Discrete(2)            # {0, 1}
        >>> Discrete(3, start=-1)  # {-1, 0, 1}
    """

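To make the membership rule in the docstring concrete, here is a minimal plain-Python stand-in. `DiscreteSketch` is a hypothetical class written for this note, not Gym's implementation; it only mirrors the idea of sampling from, and testing membership in, the set {start, ..., start + n - 1}.

```python
import random


class DiscreteSketch:
    """Plain-Python stand-in for the set {start, ..., start + n - 1}."""

    def __init__(self, n, start=0):
        if n <= 0:
            raise ValueError("n must be positive")
        self.n = n
        self.start = start

    def sample(self, rng=random):
        # Uniformly pick one of the n integers in the range.
        return self.start + rng.randrange(self.n)

    def contains(self, x):
        # Only integers inside [start, start + n) are members.
        return isinstance(x, int) and self.start <= x < self.start + self.n


actions = DiscreteSketch(2)            # {0, 1}, as in CartPole
shifted = DiscreteSketch(3, start=-1)  # {-1, 0, 1}
print(actions.contains(1), actions.contains(2))   # True False
print(shifted.contains(-1), shifted.contains(2))  # True False
```
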
Box, in turn, describes an n-dimensional real space R^n, where each dimension may or may not have lower and upper bounds. Its docstring and constructor signature are as follows:

class Box(Space[np.ndarray]):
    """
    A (possibly unbounded) box in R^n. Specifically, a Box represents the
    Cartesian product of n closed intervals. Each interval has the form of one
    of [a, b], (-oo, b], [a, oo), or (-oo, oo).
    There are two common use cases:
    * Identical bound for each dimension::
        >>> Box(low=-1.0, high=2.0, shape=(3, 4), dtype=np.float32)
        Box(3, 4)
    * Independent bound for each dimension::
        >>> Box(low=np.array([-1.0, -2.0]), high=np.array([2.0, 4.0]), dtype=np.float32)
        Box(2,)
    """
    def __init__(
        self,
        low: Union[SupportsFloat, np.ndarray],
        high: Union[SupportsFloat, np.ndarray],
        shape: Optional[Sequence[int]] = None,
        dtype: Type = np.float32,
        seed: Optional[int] = None,
    )
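
The two use cases in the docstring (one bound shared by all dimensions, or a separate bound per dimension) can be sketched in plain Python as follows. This is a simplification under stated assumptions: it covers only bounded, one-dimensional boxes and uses lists, whereas Gym's Box works on NumPy arrays and also allows unbounded intervals. The helpers `make_bounds`, `box_contains`, and `box_sample` are hypothetical names.

```python
import random


def make_bounds(low, high, size=None):
    """Return per-dimension (low, high) lists for a 1-D bounded box.

    Scalars are repeated `size` times (identical bound for each dimension);
    sequences are used as given (independent bound for each dimension).
    """
    if isinstance(low, (int, float)):
        low = [float(low)] * size
    if isinstance(high, (int, float)):
        high = [float(high)] * size
    if len(low) != len(high):
        raise ValueError("low and high must have the same length")
    return list(low), list(high)


def box_contains(point, low, high):
    """True if every coordinate lies in its closed interval [low_i, high_i]."""
    return len(point) == len(low) and all(
        lo <= x <= hi for x, lo, hi in zip(point, low, high)
    )


def box_sample(low, high, rng=random):
    """Draw each coordinate uniformly from its own interval."""
    return [rng.uniform(lo, hi) for lo, hi in zip(low, high)]


same_low, same_high = make_bounds(-1.0, 2.0, size=3)
own_low, own_high = make_bounds([-1.0, -2.0], [2.0, 4.0])

print(box_contains([0.0, 3.5], own_low, own_high))  # True
print(box_contains([2.5, 0.0], own_low, own_high))  # False
```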

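Summary: Box and Discrete are the two classes used most often in custom environments. The Spaces module contains several other classes as well, which are described in the next subsection.

Other Types of Spaces#

Besides Box and Discrete, Spaces provides other types of data structures. The module's exported names, which include the space classes along with a few flattening utilities, are as follows: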

__all__ = [
    "Space",
    "Box",
    "Discrete",
    "MultiDiscrete",
    "MultiBinary",
    "Tuple",
    "Dict",
    "flatdim",
    "flatten_space",
    "flatten",
    "unflatten",
]

Dict is a dictionary-style space in which other spaces can be nested. Its usage is shown in the class docstring:

class Dict(Space[TypingDict[str, Space]], Mapping):
    """
    A dictionary of simpler spaces.
    Example usage:
    self.observation_space = spaces.Dict({"position": spaces.Discrete(2), "velocity": spaces.Discrete(3)})
    Example usage [nested]:
    self.nested_observation_space = spaces.Dict({
        'sensors':  spaces.Dict({
            'position': spaces.Box(low=-100, high=100, shape=(3,)),
            'velocity': spaces.Box(low=-1, high=1, shape=(3,)),
            'front_cam': spaces.Tuple((
                spaces.Box(low=0, high=1, shape=(10, 10, 3)),
                spaces.Box(low=0, high=1, shape=(10, 10, 3))
            )),
            'rear_cam': spaces.Box(low=0, high=1, shape=(10, 10, 3)),
        }),
        'ext_controller': spaces.MultiDiscrete((5, 2, 2)),
        'inner_state':spaces.Dict({
            'charge': spaces.Discrete(100),
            'system_checks': spaces.MultiBinary(10),
            'job_status': spaces.Dict({
                'task': spaces.Discrete(5),
                'progress': spaces.Box(low=0, high=100, shape=()),
            })
        })
    })
    """

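A sample drawn from a composite space mirrors the nesting of the space itself: dictionaries stay dictionaries and tuples stay tuples. The sketch below shows that idea with a small recursive sampler over plain Python containers. The layout is a trimmed-down version of the nested example in the docstring, and `sample_nested`, `discrete`, and `interval` are hypothetical helpers rather than Gym functions.

```python
import random


def sample_nested(space, rng):
    """Recursively sample a nested structure of dicts, tuples, and leaves.

    A leaf is any callable that takes `rng` and returns one sampled value.
    """
    if isinstance(space, dict):
        return {key: sample_nested(sub, rng) for key, sub in space.items()}
    if isinstance(space, tuple):
        return tuple(sample_nested(sub, rng) for sub in space)
    return space(rng)


def discrete(n):
    # Leaf sampler for the integers 0 .. n-1.
    return lambda rng: rng.randrange(n)


def interval(low, high):
    # Leaf sampler for one real number in [low, high].
    return lambda rng: rng.uniform(low, high)


observation_layout = {
    "sensors": {
        "position": (interval(-100, 100),) * 3,
        "velocity": (interval(-1, 1),) * 3,
    },
    "inner_state": {
        "charge": discrete(100),
        "job_status": {"task": discrete(5), "progress": interval(0, 100)},
    },
}

sample = sample_nested(observation_layout, random.Random(0))
print(sorted(sample))                      # ['inner_state', 'sensors']
print(len(sample["sensors"]["position"]))  # 3
```
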
MultiBinary is a multi-dimensional space whose elements are all either 0 or 1. Its usage is shown in the class docstring:

class MultiBinary(Space[np.ndarray]):
    """
    An n-shape binary space.
    The argument to MultiBinary defines n, which could be a number or a `list` of numbers.
    Example Usage:
    >> self.observation_space = spaces.MultiBinary(5)
    >> self.observation_space.sample()
        array([0, 1, 0, 1, 0], dtype=int8)
    >> self.observation_space = spaces.MultiBinary([3, 2])
    >> self.observation_space.sample()
        array([[0, 0],
               [0, 1],
               [1, 1]], dtype=int8)
    """

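The two forms of the argument (a single number, or a list of numbers) can be illustrated with a small plain-Python sampler. This is only a sketch: `sample_multibinary` is a hypothetical helper that returns nested lists, whereas the docstring above shows Gym returning int8 NumPy arrays.

```python
import random


def sample_multibinary(shape, rng):
    """Sample nested lists of 0/1 values.

    `shape` is an int (flat vector of that length) or a list of ints
    (one nesting level per entry), mirroring the two forms in the docstring.
    """
    if isinstance(shape, int):
        shape = [shape]
    if len(shape) == 1:
        return [rng.randint(0, 1) for _ in range(shape[0])]
    return [sample_multibinary(shape[1:], rng) for _ in range(shape[0])]


rng = random.Random(0)
flat = sample_multibinary(5, rng)       # a list of five 0/1 flags
grid = sample_multibinary([3, 2], rng)  # 3 rows of 2 flags each
print(len(flat), len(grid), len(grid[0]))  # 5 3 2
```
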
MultiDiscrete is similar to MultiBinary, except that each dimension may take more than two integer values. Its usage is shown in the class docstring:

class MultiDiscrete(Space[np.ndarray]):
    """
    - The multi-discrete action space consists of a series of discrete action spaces with different number of actions in each
    - It is useful to represent game controllers or keyboards where each key can be represented as a discrete action space
    - It is parametrized by passing an array of positive integers specifying number of actions for each discrete action space
    Note: Some environment wrappers assume a value of 0 always represents the NOOP action.
    e.g. Nintendo Game Controller
    - Can be conceptualized as 3 discrete action spaces:
        1) Arrow Keys: Discrete 5  - NOOP[0], UP[1], RIGHT[2], DOWN[3], LEFT[4]  - params: min: 0, max: 4
        2) Button A:   Discrete 2  - NOOP[0], Pressed[1] - params: min: 0, max: 1
        3) Button B:   Discrete 2  - NOOP[0], Pressed[1] - params: min: 0, max: 1
    - Can be initialized as
        MultiDiscrete([ 5, 2, 2 ])
    """
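
The game-controller example in the docstring can be made concrete with a short plain-Python sketch that validates and decodes one action vector for `MultiDiscrete([5, 2, 2])`. The helper names `is_valid` and `decode` are hypothetical; the component meanings follow the docstring above.

```python
ARROWS = ["NOOP", "UP", "RIGHT", "DOWN", "LEFT"]  # Discrete 5
NVEC = [5, 2, 2]  # sizes passed to MultiDiscrete([5, 2, 2])


def is_valid(action, nvec=NVEC):
    """True if each component i is an integer in 0 .. nvec[i] - 1."""
    return len(action) == len(nvec) and all(
        isinstance(a, int) and 0 <= a < n for a, n in zip(action, nvec)
    )


def decode(action):
    """Turn [arrow, button_a, button_b] into a readable description."""
    if not is_valid(action):
        raise ValueError(f"action {action} is outside MultiDiscrete({NVEC})")
    arrow, button_a, button_b = action
    return {"arrow": ARROWS[arrow], "A": bool(button_a), "B": bool(button_b)}


print(decode([2, 1, 0]))    # {'arrow': 'RIGHT', 'A': True, 'B': False}
print(is_valid([5, 0, 0]))  # False: arrow index 5 is out of range
```
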
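
Tuple is similar to Dict, except that its sub-spaces are identified by position rather than by key. Its usage is shown in the class docstring:

```python
class Tuple(Space[tuple], Sequence):
    """
    A tuple (i.e., product) of simpler spaces
    Example usage:
    self.observation_space = spaces.Tuple((spaces.Discrete(2), spaces.Discrete(3)))
    """
```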

References#

https://zhuanlan.zhihu.com/p/482821112