Compositions¶
This module promotes reward components and termination conditions to first-class objects. These building blocks can be plugged onto an existing pipeline by composition, keeping everything modular, from the task definition down to the low-level observers and controllers.
This modular approach allows for the standardization of common metrics. Overall, it greatly reduces code duplication and bugs.
- class gym_jiminy.common.bases.compositions.AbstractReward(env, name)[source]¶
Bases:
ABC
Abstract class from which all reward components must derive.
The goal of the agent is to maximize the expectation of the cumulative sum of discounted rewards over complete episodes. This holds true no matter if its sign is always negative (aka. cost), always positive (aka. reward) or indefinite (aka. objective).
Defining a cost is allowed but not recommended. Although it encourages the agent to achieve the task at hand as quickly as possible if success is the only termination condition, it has the side effect of giving the agent the opportunity to maximize the return by killing itself whenever this is an option, which is rarely the desired behavior. No restriction is enforced, as that could be limiting in some relevant cases, so it is up to the user to make sure that the overall design makes sense.
- Parameters:
env (InterfaceJiminyEnv) – Base or wrapped jiminy environment.
name (str) – Desired name of the reward.
- property name: str¶
Name uniquely identifying every reward.
It will be used as key not only for storing reward-specific monitoring and debugging information in ‘info’, but also for adding the underlying quantity to the ones already managed by the environment.
- property is_terminal: bool | None¶
Whether the reward is terminal, non-terminal, or indefinite.
A reward is said to be “terminal” if only evaluated for the terminal state of the MDP, “non-terminal” if evaluated for all states except the terminal one, or indefinite if systematically evaluated no matter what.
All rewards are assumed to be indefinite unless stated otherwise by overloading this method. The responsibility of evaluating the reward only when necessary is delegated to compute. This allows for complex evaluation logic beyond terminal or non-terminal, without restriction.
Note
Truncation is not considered the same as termination. The reward is not evaluated in case of truncation, which means that a terminal reward will never be evaluated for such episodes.
- abstract property is_normalized: bool¶
Whether the reward is guaranteed to be normalized, i.e. it is in range [0.0, 1.0].
- abstract compute(terminated, info)[source]¶
Compute the reward.
Note
Return value can be set to None to indicate that evaluation was skipped for some reason, and therefore the reward must not be taken into account when computing the total reward. This is useful when the reward is undefined or simply inappropriate in the current state of the environment.
Warning
It is the responsibility of the practitioner overloading this method to honor the flags 'is_terminal' (if not indefinite) and 'is_normalized'. Failing this, an exception will be raised.
- Parameters:
terminated (bool) – Whether a termination condition has been triggered at the current step of the episode.
info (Dict[str, Any]) – Dictionary of extra information for monitoring.
- Returns:
Scalar value if the reward was evaluated, None otherwise.
- Return type:
float | None
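To make this interface concrete, here is a minimal sketch of a custom reward component. The class name AliveReward and the constant value it returns are illustrative assumptions; the overridden members follow the constructor arguments and abstract members documented above.

```python
from typing import Any, Dict, Optional

from gym_jiminy.common.bases.compositions import AbstractReward


class AliveReward(AbstractReward):
    """Hypothetical reward returning 1.0 for every non-terminal step."""

    def __init__(self, env) -> None:
        # `env` is any base or wrapped jiminy environment (InterfaceJiminyEnv).
        super().__init__(env, "reward_alive")

    @property
    def is_terminal(self) -> Optional[bool]:
        # Non-terminal: never evaluated for the terminal state of the MDP.
        return False

    @property
    def is_normalized(self) -> bool:
        # The value is constant and equal to 1.0, hence within [0.0, 1.0].
        return True

    def compute(self, terminated: bool, info: Dict[str, Any]) -> Optional[float]:
        # Honor `is_terminal`: skip evaluation for terminal states by
        # returning None, so that it is not taken into account.
        if terminated:
            return None
        return 1.0
```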
- class gym_jiminy.common.bases.compositions.QuantityReward(env, name, quantity, transform_fn, is_normalized, is_terminal)[source]¶
Bases:
AbstractReward, Generic[ValueT]
Convenience class making it easy to derive reward components from generic quantities.
All this class does is apply some user-specified post-processing to the value of a given multi-variate quantity to return a floating-point scalar value, optionally normalized between 0.0 and 1.0.
- Parameters:
env (InterfaceJiminyEnv) – Base or wrapped jiminy environment.
name (str) – Desired name of the reward. This name will be used as key for storing current value of the reward in ‘info’, and to add the underlying quantity to the set of already managed quantities by the environment. As a result, it must be unique otherwise an exception will be raised.
quantity (Tuple[Type[InterfaceQuantity[ValueT]], Dict[str, Any]]) – Tuple gathering the class of the underlying quantity to use as reward after some post-processing, plus any keyword-arguments of its constructor except ‘env’, and ‘parent’.
transform_fn (Callable[[ValueT], float] | None) – Transform function responsible for aggregating a multi-variate quantity into a floating-point scalar value to maximize. Typical examples are np.min, np.max, lambda x: np.linalg.norm(x, ord=N). This function is also responsible for rescaling the transformed quantity to the range [0.0, 1.0] if the reward is advertised as normalized. The Radial Basis Function (RBF) kernel is the most common choice to derive a reward to maximize from errors based on distance metrics (see radial_basis_function for details). None to skip the transform entirely if unnecessary.
is_normalized (bool) – Whether the reward is guaranteed to be normalized after applying transform function transform_fn.
is_terminal (bool | None) – Whether the reward is terminal, non-terminal or indefinite. A terminal reward will be evaluated at most once, at the end of each episode for which a termination condition has been triggered. In contrast, a non-terminal reward will be evaluated systematically except at the end of the episode. Finally, an indefinite reward will be evaluated systematically. The value 0.0 is returned and no 'info' is stored whenever the reward evaluation is skipped.
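As an illustration of how these arguments fit together, the sketch below builds a normalized tracking reward from a hypothetical quantity class. `BaseHeightQuantity`, its import path, the target height and the RBF length scale are all assumptions, and `env` stands for any jiminy environment that has already been instantiated.

```python
import numpy as np

from gym_jiminy.common.bases.compositions import QuantityReward
# Hypothetical quantity measuring the height of the floating base: replace it
# with any `InterfaceQuantity` subclass available in your pipeline.
from my_quantities import BaseHeightQuantity  # hypothetical import

TARGET_HEIGHT = 1.0   # [m] hypothetical target value
LENGTH_SCALE = 0.1    # RBF kernel length scale (assumption)

def rbf_transform(value: np.ndarray) -> float:
    """Map the tracking error to a reward in [0.0, 1.0] via an RBF kernel."""
    error = np.sum((np.asarray(value) - TARGET_HEIGHT) ** 2)
    return float(np.exp(- error / LENGTH_SCALE ** 2))

reward_base_height = QuantityReward(
    env,                          # base or wrapped jiminy environment
    "reward_base_height",
    (BaseHeightQuantity, {}),     # quantity class plus constructor kwargs
    rbf_transform,
    is_normalized=True,           # `rbf_transform` maps into [0.0, 1.0]
    is_terminal=None)             # indefinite: evaluated at every step
```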
- property is_terminal: bool | None¶
Whether the reward is terminal, non-terminal, or indefinite.
A reward is said to be “terminal” if only evaluated for the terminal state of the MDP, “non-terminal” if evaluated for all states except the terminal one, or indefinite if systematically evaluated no matter what.
All rewards are assumed to be indefinite unless stated otherwise by overloading this method. The responsibility of evaluating the reward only when necessary is delegated to compute. This allows for complex evaluation logic beyond terminal or non-terminal, without restriction.
Note
Truncation is not considered the same as termination. The reward is not evaluated in case of truncation, which means that a terminal reward will never be evaluated for such episodes.
- property is_normalized: bool¶
Whether the reward is guaranteed to be normalized, i.e. it is in range [0.0, 1.0].
- class gym_jiminy.common.bases.compositions.MixtureReward(env, name, components, reduce_fn, is_normalized)[source]¶
Bases:
AbstractReward
Base class for aggregating multiple independent reward components as a single one.
- Parameters:
env (InterfaceJiminyEnv) – Base or wrapped jiminy environment.
name (str) – Desired name of the total reward.
components (Tuple[AbstractReward, ...]) – Sequence of reward components to aggregate.
reduce_fn (Callable[[Tuple[float | None, ...]], float | None]) – Transform function responsible for aggregating all the reward components that were evaluated. Typical examples are cumulative product and weighted sum.
is_normalized (bool) – Whether the reward is guaranteed to be normalized after applying reduction function reduce_fn.
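A possible reduction is sketched below, assuming two normalized components `reward_base_height` and `reward_effort` have already been constructed; it computes a weighted sum and skips the total whenever any component was skipped. The weights are placeholders.

```python
from typing import Optional, Tuple

from gym_jiminy.common.bases.compositions import MixtureReward

def weighted_sum(values: Tuple[Optional[float], ...]) -> Optional[float]:
    """Weighted sum of the components, skipped if any component was skipped."""
    if any(value is None for value in values):
        return None
    return 0.7 * values[0] + 0.3 * values[1]

reward_total = MixtureReward(
    env,                                   # base or wrapped jiminy environment
    "reward_total",
    (reward_base_height, reward_effort),   # previously constructed components
    weighted_sum,
    is_normalized=True)  # weights sum to 1.0 and both components are normalized
```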
- components: Tuple[AbstractReward, ...]¶
List of all the reward components that must be aggregated together.
- property is_terminal: bool | None¶
Whether the reward is terminal, i.e. only evaluated at the end of an episode if a termination condition has been triggered.
The cumulative reward is considered terminal if and only if all its individual reward components are terminal.
- class gym_jiminy.common.bases.compositions.EpisodeState(value)[source]¶
Bases:
IntEnum
Specify the current state of the ongoing episode.
- CONTINUED = 0¶
No termination condition has been triggered this step.
- TERMINATED = 1¶
The terminal state has been reached.
- TRUNCATED = 2¶
A truncation condition has been triggered.
- class gym_jiminy.common.bases.compositions.AbstractTerminationCondition(env, name, grace_period=0.0, *, is_truncation=False, is_training_only=False)[source]¶
Bases:
ABC
Abstract class from which all termination conditions must derive.
A termination condition requests the ongoing episode to stop immediately as soon as it is triggered.
There are two cases: truncating the episode or reaching the terminal state. In the former case, the agent is instructed to stop collecting samples from the ongoing episode and move to the next one, without considering this a failure. As such, the reward-to-go that has not been observed will be estimated via a value function estimator. This is already what happens when collecting sample batches in the infinite-horizon RL framework, except that the episode is not resumed to collect the rest of it in the following sample batch. In the latter case, the agent is likewise instructed to move to the next episode, but also to consider it an actual failure. This means that, unlike truncation conditions, the reward-to-go is known to be exactly zero. This is usually dramatic for the agent from the perspective of an infinite-horizon return, all the more so as the maximum discounted return grows larger when the discount factor gets closer to one. As a result, the agent will avoid triggering termination conditions at all costs, to the point of becoming risk-averse and taking extra safety margins that lower the average reward if necessary.
- Parameters:
env (InterfaceJiminyEnv) – Base or wrapped jiminy environment.
name (str) – Desired name of the termination condition. This name will be used as key for storing the current episode state from the perspective of this specific condition in ‘info’, and to add the underlying quantity to the set of already managed quantities by the environment. As a result, it must be unique otherwise an exception will be raised.
grace_period (float) – Grace period effective only at the very beginning of the episode, during which the episode is guaranteed to continue no matter what. Optional: 0.0 by default.
is_truncation (bool) – Whether the episode should be considered terminated or truncated whenever the termination condition is triggered. Optional: False by default.
is_training_only (bool) – Whether the termination condition should be completely bypassed if the environment is in evaluation mode. Optional: False by default.
- class gym_jiminy.common.bases.compositions.QuantityTermination(env, name, quantity, low, high, grace_period=0.0, *, is_truncation=False, is_training_only=False)[source]¶
Bases:
AbstractTerminationCondition, Generic[ValueT]
Convenience class making it easy to derive termination conditions from generic quantities.
All this class does is check that all elements of a given quantity are within bounds. If so, the episode continues; otherwise it is either truncated or terminated according to the 'is_truncation' constructor argument. This only applies after the end of the grace period; before that, the episode continues no matter what.
- Parameters:
env (InterfaceJiminyEnv) – Base or wrapped jiminy environment.
name (str) – Desired name of the termination condition. This name will be used as key for storing the current episode state from the perspective of this specific condition in ‘info’, and to add the underlying quantity to the set of already managed quantities by the environment. As a result, it must be unique otherwise an exception will be raised.
quantity (Tuple[Type[InterfaceQuantity[ndarray | number | float | int | bool | complex | None]], Dict[str, Any]]) – Tuple gathering the class of the underlying quantity to use as termination condition, plus any keyword-arguments of its constructor except ‘env’, and ‘parent’.
low (ndarray | number | float | int | bool | complex | Sequence[float | int | bool | complex | number] | None) – Lower bound below which termination is triggered.
high (ndarray | number | float | int | bool | complex | Sequence[float | int | bool | complex | number] | None) – Upper bound above which termination is triggered.
grace_period (float) – Grace period effective only at the very beginning of the episode, during which the episode is guaranteed to continue no matter what. Optional: 0.0 by default.
is_truncation (bool) – Whether the episode should be considered terminated or truncated whenever the termination condition is triggered. Optional: False by default.
is_training_only (bool) – Whether the termination condition should be completely bypassed if the environment is in evaluation mode. Optional: False by default.
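The sketch below shows how such a condition might be instantiated, reusing the hypothetical `BaseHeightQuantity` from the reward example above. The bounds, the grace period (assumed to be expressed in seconds) and the quantity class are illustrative assumptions.

```python
from gym_jiminy.common.bases.compositions import QuantityTermination
# Same hypothetical quantity class as in the `QuantityReward` example.
from my_quantities import BaseHeightQuantity  # hypothetical import

termination_fall = QuantityTermination(
    env,                        # base or wrapped jiminy environment
    "termination_base_height",
    (BaseHeightQuantity, {}),   # quantity class plus constructor kwargs
    low=0.5,                    # trigger if the base drops below 0.5 (fall)
    high=None,                  # no upper bound
    grace_period=0.1,           # never trigger during the first 0.1s
    is_truncation=False,        # treat it as a genuine failure, not a truncation
    is_training_only=False)     # also active in evaluation mode
```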
- compute(info)[source]¶
Evaluate the termination condition.
The underlying quantity is first evaluated. The episode continues if all the elements of its value are within bounds; otherwise, it is either truncated or terminated according to 'is_truncation'.
Warning
This method is not meant to be overloaded.