

Action Space


Observation Space

Dict(‘bottles_carrying’: Discrete(3), ‘bottles_delivered’: Discrete(3), ‘bottles_dropped’: MultiBinary(3), ‘location’: Discrete(5))

Reward Shape


Reward High

[ 0. 50. 0.]

Reward Low

[-inf 0. -1.]




This environment implements the problems UnbreakableBottles and BreakableBottles defined in Section 4.1.2 of the paper Potential-based multiobjective reinforcement learning approaches to low-impact agents for AI safety.

Action Space

The action space is a discrete space with 3 actions:

  • 0: move left

  • 1: move right

  • 2: pick up a bottle

Observation Space

The observation space is a dictionary with 4 keys:

  • location: the current location of the agent

  • bottles_carrying: the number of bottles the agent is currently carrying (0, 1 or 2)

  • bottles_delivered: the number of bottles the agent has delivered (0, 1 or 2)

  • bottles_dropped: for each location, a boolean flag indicating if that location currently contains a bottle

Note that this observation space is different from that listed in the paper above. In the paper, bottles_delivered’s possible values are listed as (0 or 1), rather than (0, 1 or 2). This is because the paper did not take the terminal state, in which 2 bottles have been delivered, into account when calculating the observation space. As such, the observation space of this implementation is larger than specified in the paper, having 360 possible states instead of 240.

Reward Space

The reward space has 3 dimensions:

  • time penalty: -1 for each time step

  • bottle reward: bottle_reward for each bottle delivered

  • potential: While carrying multiple bottles there is a small probability of dropping them. A potential-based penalty is applied for bottles left on the ground.

Starting State

The agent starts at location 0, carrying no bottles, having delivered no bottles and having dropped no bottles.

Episode Termination

The episode terminates when the agent has delivered 2 bottles.


  • size: the number of locations in the environment

  • prob_drop: the probability of dropping a bottle while carrying 2 bottles

  • time_penalty: the time penalty for each time step

  • bottle_reward: the reward for delivering a bottle

  • unbreakable_bottles: if True, a bottle which is dropped in a location can be picked up again (so the outcome of dropping a bottle is reversible), otherwise a dropped bottle cannot be picked up.


This environment was originally a contribution of Robert Klassert The home asset is from The gold, enemy and gem assets are from The bottles pixel art was created with the assistance of DALL·E 2.