Environments in reinforcement learning

There are many libraries for reinforcement learning. The most popular is Gym by OpenAI, which provides single-agent environments for RL. There are other useful environment libraries as well, such as PettingZoo.

Gym#


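When writing code for a reinforcement learning model, an important step is writing the code that interacts with the environment. Gym is an open-source library from OpenAI for developing and comparing reinforcement learning algorithms. It makes no assumptions about the agent and is compatible with any numerical computation library, such as TensorFlow or Theano. Users can use Gym to build a custom Gym Environment suited to their own model.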

Spaces#

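Building a real reinforcement learning model takes many parameters to describe the environment, and these parameters differ in data type, value range, default value, and so on. They need to be grouped into categories to be handled well, and Gym provides the Spaces classes to support these different kinds of data.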

CartPole example#

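The classic CartPole problem can be coded with Gym as follows. It describes a cart that moves left and right on a flat track to keep the pole mounted on it from falling over.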

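import gym

# Written against the classic Gym API: reset() returns only the observation
# and step() returns a 4-tuple (Gym 0.26 and later changed both signatures).
env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()  # start a new episode
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()  # pick a random action
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
env.close()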

Output

[-0.061586   -0.75893141  0.05793238  1.15547541]
[-0.07676463 -0.95475889  0.08104189  1.46574644]
[-0.0958598  -1.15077434  0.11035682  1.78260485]
[-0.11887529 -0.95705275  0.14600892  1.5261692 ]
[-0.13801635 -0.7639636   0.1765323   1.28239155]
[-0.15329562 -0.57147373  0.20218013  1.04977545]
Episode finished after 14 timesteps
[-0.02786724  0.00361763 -0.03938967 -0.01611184]
[-0.02779488 -0.19091794 -0.03971191  0.26388759]
[-0.03161324  0.00474768 -0.03443415 -0.04105167]
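
Each printed row is one observation: cart position, cart velocity, pole angle in radians, and pole angular velocity. The sketch below is a stand-alone check of when such an observation leaves the allowed range. It is not Gym's own code; the two thresholds (2.4 for the cart position, 12 degrees for the pole angle) are assumed from the defaults in the CartPole source and should be verified for the installed version, and `out_of_bounds` is a hypothetical helper name.

```python
import math

# Assumed CartPole limits (taken from the CartPole defaults; verify for your version).
X_THRESHOLD = 2.4                      # cart position limit
THETA_THRESHOLD = 12 * math.pi / 180   # pole angle limit, about 0.2094 rad


def out_of_bounds(observation):
    """Return True if the cart or the pole has left the allowed range."""
    cart_position, _cart_velocity, pole_angle, _pole_velocity = observation
    return abs(cart_position) > X_THRESHOLD or abs(pole_angle) > THETA_THRESHOLD


# Last observation printed before "Episode finished" in the output above.
last_printed = [-0.15329562, -0.57147373, 0.20218013, 1.04977545]
print(out_of_bounds(last_printed))
```

The pole angle of about 0.202 rad in that row is still just inside the assumed limit, which fits the loop in the example: the observation is printed before the step that ends the episode.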

Using Spaces#

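In the example above, we kept sampling random actions from the environment's action_space. But what exactly are these actions? Every environment comes with Spaces that match the types it requires. They describe the format of valid actions and observations: action_space and observation_space. For example: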

import gym
env = gym.make('CartPole-v0')
print(env.action_space)
#> Discrete(2)
print(env.observation_space)
#> Box(4,)

Here Discrete allows a fixed range of non-negative integers, so in this case the valid actions are 0 and 1.
The class is documented as follows:

class Discrete(Space[int]):
    r"""A discrete space in :math:`\{ 0, 1, \\dots, n-1 \}`.
    A start value can be optionally specified to shift the range
    to :math:`\{ a, a+1, \\dots, a+n-1 \}`.
    Example::
        >>> Discrete(2)            # {0, 1}
        >>> Discrete(3, start=-1)  # {-1, 0, 1}
    """

Box describes an n-dimensional real space $\mathbb{R}^n$. Upper and lower bounds can be specified or left out. It is used as follows:

class Box(Space[np.ndarray]):
    """
    A (possibly unbounded) box in R^n. Specifically, a Box represents the
    Cartesian product of n closed intervals. Each interval has the form of one
    of [a, b], (-oo, b], [a, oo), or (-oo, oo).
    There are two common use cases:
    * Identical bound for each dimension::
        >>> Box(low=-1.0, high=2.0, shape=(3, 4), dtype=np.float32)
        Box(3, 4)
    * Independent bound for each dimension::
        >>> Box(low=np.array([-1.0, -2.0]), high=np.array([2.0, 4.0]), dtype=np.float32)
        Box(2,)
    """
    def __init__(
        self,
        low: Union[SupportsFloat, np.ndarray],
        high: Union[SupportsFloat, np.ndarray],
        shape: Optional[Sequence[int]] = None,
        dtype: Type = np.float32,
        seed: Optional[int] = None,
    )
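
To make the two docstrings above concrete, here is a minimal sketch of what "sampling from" and "belonging to" a space mean. `SimpleDiscrete` and `SimpleBox` are hypothetical stand-ins written with the standard library only; they are not Gym's classes. The stand-in box ignores shapes and dtypes and, as a simplifying assumption, refuses to sample when a bound is infinite.

```python
import math
import random


class SimpleDiscrete:
    """Hypothetical stand-in for the set {start, ..., start + n - 1}."""

    def __init__(self, n, start=0):
        if n <= 0:
            raise ValueError("n must be positive")
        self.n = n
        self.start = start

    def sample(self, rng=random):
        # Pick one of the n integers uniformly at random.
        return self.start + rng.randrange(self.n)

    def contains(self, x):
        # bool is excluded on purpose: True and False are ints in Python.
        is_int = isinstance(x, int) and not isinstance(x, bool)
        return is_int and self.start <= x < self.start + self.n


class SimpleBox:
    """Hypothetical stand-in: a product of closed intervals [low[i], high[i]]."""

    def __init__(self, low, high):
        if len(low) != len(high) or any(lo > hi for lo, hi in zip(low, high)):
            raise ValueError("low and high must have equal length with low <= high")
        self.low = [float(v) for v in low]
        self.high = [float(v) for v in high]

    def sample(self, rng=random):
        # Simplification of this sketch: only bounded boxes can be sampled.
        if any(math.isinf(v) for v in self.low + self.high):
            raise ValueError("this sketch only samples from bounded boxes")
        return [rng.uniform(lo, hi) for lo, hi in zip(self.low, self.high)]

    def contains(self, point):
        return len(point) == len(self.low) and all(
            lo <= x <= hi for lo, x, hi in zip(self.low, point, self.high)
        )


actions = SimpleDiscrete(2)                    # {0, 1}, like CartPole's actions
shifted = SimpleDiscrete(3, start=-1)          # {-1, 0, 1}
box = SimpleBox([-1.0, -2.0], [2.0, 4.0])      # independent bound per dimension
half_open = SimpleBox([0.0], [math.inf])       # the interval [0, oo)

print(actions.contains(1), actions.contains(2), shifted.contains(-1))
print(box.contains([0.0, 3.0]), box.contains([3.0, 0.0]), half_open.contains([1e9]))
```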

Summary: Box and Discrete are the two classes used most often in custom environments. Spaces contains many other classes as well, which are covered in the next section.
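
As an illustration of where these spaces show up in a custom environment, here is a minimal sketch of a toy environment that follows the same reset/step convention as the CartPole example. `CorridorEnv` is a hypothetical name and the class deliberately does not import Gym. In a real Gym environment it would subclass `gym.Env` and declare its action and observation spaces, presumably `Discrete(2)` and `Discrete(size)` here.

```python
class CorridorEnv:
    """Hypothetical toy environment: walk along a corridor to reach the last cell.

    Actions: 0 = move left, 1 = move right (would be Discrete(2) in Gym).
    Observations: current cell index in 0..size-1 (would be Discrete(size)).
    """

    def __init__(self, size=5):
        if size < 2:
            raise ValueError("size must be at least 2")
        self.size = size
        self.position = 0

    def reset(self):
        # Start every episode in the leftmost cell.
        self.position = 0
        return self.position

    def step(self, action):
        if action not in (0, 1):
            raise ValueError("action must be 0 (left) or 1 (right)")
        move = 1 if action == 1 else -1
        # Clamp the position so the agent cannot leave the corridor.
        self.position = min(max(self.position + move, 0), self.size - 1)
        done = self.position == self.size - 1
        reward = 1.0 if done else 0.0
        return self.position, reward, done, {}


env = CorridorEnv(size=5)
observation = env.reset()
done = False
steps = 0
while not done:
    observation, reward, done, info = env.step(1)  # always move right
    steps += 1
print(steps, observation, reward)
```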

Other types of Spaces#

Besides Box and Discrete, Spaces provides other types of data structures. The full list is:

__all__ = [
    "Space",
    "Box",
    "Discrete",
    "MultiDiscrete",
    "MultiBinary",
    "Tuple",
    "Dict",
    "flatdim",
    "flatten_space",
    "flatten",
    "unflatten",
]
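
The last four names in this list are helper functions rather than spaces: they deal with turning a structured space into a flat vector and back. The sketch below only illustrates the size bookkeeping behind that idea, under the assumption that a discrete value is flattened to a one-hot vector and a box to its entries laid out in a row. The tagged-tuple encoding and the name `flat_size` are hypothetical and not part of Gym.

```python
from math import prod


def flat_size(space):
    """Length of the flat vector for a toy space description (illustrative only)."""
    kind, spec = space
    if kind == "discrete":    # ("discrete", n): one-hot vector of length n
        return spec
    if kind == "box":         # ("box", shape): one entry per array element
        return prod(spec)
    if kind == "tuple":       # ("tuple", [subspaces]): sub-vectors concatenated
        return sum(flat_size(sub) for sub in spec)
    if kind == "dict":        # ("dict", {key: subspace}): sub-vectors concatenated
        return sum(flat_size(sub) for sub in spec.values())
    raise ValueError("unknown space kind: {!r}".format(kind))


state = ("dict", {"position": ("discrete", 2), "velocity": ("discrete", 3)})
camera = ("box", (10, 10, 3))
print(flat_size(state), flat_size(camera))
```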

Dict is a dictionary-style data structure that can embed other spaces under named keys. It is used as follows:

class Dict(Space[TypingDict[str, Space]], Mapping):
    """
    A dictionary of simpler spaces.
    Example usage:
    self.observation_space = spaces.Dict({"position": spaces.Discrete(2), "velocity": spaces.Discrete(3)})
    Example usage [nested]:
    self.nested_observation_space = spaces.Dict({
        'sensors':  spaces.Dict({
            'position': spaces.Box(low=-100, high=100, shape=(3,)),
            'velocity': spaces.Box(low=-1, high=1, shape=(3,)),
            'front_cam': spaces.Tuple((
                spaces.Box(low=0, high=1, shape=(10, 10, 3)),
                spaces.Box(low=0, high=1, shape=(10, 10, 3))
            )),
            'rear_cam': spaces.Box(low=0, high=1, shape=(10, 10, 3)),
        }),
        'ext_controller': spaces.MultiDiscrete((5, 2, 2)),
        'inner_state':spaces.Dict({
            'charge': spaces.Discrete(100),
            'system_checks': spaces.MultiBinary(10),
            'job_status': spaces.Dict({
                'task': spaces.Discrete(5),
                'progress': spaces.Box(low=0, high=100, shape=()),
            })
        })
    })
    """

MultiBinary is a multi-dimensional data structure that contains only 0s and 1s. It is used as follows:

class MultiBinary(Space[np.ndarray]):
    """
    An n-shape binary space.
    The argument to MultiBinary defines n, which could be a number or a `list` of numbers.
    Example Usage:
    >> self.observation_space = spaces.MultiBinary(5)
    >> self.observation_space.sample()
        array([0, 1, 0, 1, 0], dtype=int8)
    >> self.observation_space = spaces.MultiBinary([3, 2])
    >> self.observation_space.sample()
        array([[0, 0],
               [0, 1],
               [1, 1]], dtype=int8)
    """

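The two shapes in the docstring above, a single number and a list of numbers, can be mimicked with nested Python lists. This is a minimal sketch under that assumption; `sample_multibinary` is a hypothetical helper and returns plain lists, not the int8 arrays Gym produces.

```python
import random


def sample_multibinary(shape, rng=random):
    """Return nested lists of random 0/1 values with the given shape."""
    if isinstance(shape, int):
        shape = [shape]
    if len(shape) == 1:
        return [rng.randint(0, 1) for _ in range(shape[0])]
    # Build the first axis, recursing into the remaining axes.
    return [sample_multibinary(shape[1:], rng) for _ in range(shape[0])]


flat = sample_multibinary(5)        # five bits, like MultiBinary(5)
grid = sample_multibinary([3, 2])   # three rows of two bits, like MultiBinary([3, 2])
print(flat)
print(grid)
```
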
MultiDiscrete is similar to MultiBinary, except that each component may take more integer values than just 0 and 1. It is used as follows:

class MultiDiscrete(Space[np.ndarray]):
    """
    - The multi-discrete action space consists of a series of discrete action spaces with different number of actions in each
    - It is useful to represent game controllers or keyboards where each key can be represented as a discrete action space
    - It is parametrized by passing an array of positive integers specifying number of actions for each discrete action space
    Note: Some environment wrappers assume a value of 0 always represents the NOOP action.
    e.g. Nintendo Game Controller
    - Can be conceptualized as 3 discrete action spaces:
        1) Arrow Keys: Discrete 5  - NOOP[0], UP[1], RIGHT[2], DOWN[3], LEFT[4]  - params: min: 0, max: 4
        2) Button A:   Discrete 2  - NOOP[0], Pressed[1] - params: min: 0, max: 1
        3) Button B:   Discrete 2  - NOOP[0], Pressed[1] - params: min: 0, max: 1
    - Can be initialized as
        MultiDiscrete([ 5, 2, 2 ])
    """

Tuple is similar to Dict, except that its sub-spaces are indexed by position rather than by key. It is used as follows:

class Tuple(Space[tuple], Sequence):
    """
    A tuple (i.e., product) of simpler spaces
    Example usage:
    self.observation_space = spaces.Tuple((spaces.Discrete(2), spaces.Discrete(3)))
    """

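Dict and Tuple are containers: drawing a sample from one means drawing a sample from each sub-space and collecting the results under the same keys or positions. The recursive sketch below shows that idea with plain Python values standing in for spaces. The encoding is an assumption made for this example only (a dict for Dict, a tuple for Tuple, a list of sizes for MultiDiscrete, an int n for Discrete(n)), and `sample` is a hypothetical helper, not Gym's method.

```python
import random


def sample(space, rng=random):
    """Draw one random element from a toy space description."""
    if isinstance(space, dict):    # stands in for spaces.Dict
        return {key: sample(sub, rng) for key, sub in space.items()}
    if isinstance(space, tuple):   # stands in for spaces.Tuple
        return tuple(sample(sub, rng) for sub in space)
    if isinstance(space, list):    # stands in for spaces.MultiDiscrete(nvec)
        return [rng.randrange(n) for n in space]
    return rng.randrange(space)    # an int n stands in for spaces.Discrete(n)


observation_space = {
    "position": 2,
    "velocity": 3,
    "ext_controller": [5, 2, 2],   # arrow keys, button A, button B
    "pair": (2, 3),
}
observation = sample(observation_space)
print(observation)
```

The sample has the same nesting as the space description, which is what lets an agent address each part of a structured observation by name or index.
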
References#

https://zhuanlan.zhihu.com/p/482821112