Bandits in TF-Agents

The Multi-Armed Bandit (MAB) problem is a special case of Reinforcement Learning (RL): an agent collects rewards in an environment by taking actions after observing some state of the environment. The main difference between general RL and MAB is that in MAB we assume that the action taken by the agent does not influence the next state of the environment. Therefore, agents do not model state transitions, credit rewards to past actions, or "plan ahead" to reach reward-rich states. For the same reason, the notion of episodes is not used in MAB, unlike in general RL.

In many bandits use cases, the state of the environment is observed. These are known as contextual bandits problems, and can be thought of as a generalization of multi-armed bandits where the agent has access to additional context in each round.

To get started with Bandits in TF-Agents, we recommend checking our bandits tutorial.
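The tutorial builds a small end-to-end training loop. Below is a rough sketch of what that looks like, assuming the `LinearUCBAgent` and `StationaryStochasticPyEnvironment` classes together with the standard TF-Agents driver and replay buffer utilities, as they appear in the tutorial at the time of writing; constructor arguments and sampling-function conventions may differ between versions, so treat this as a sketch rather than a drop-in script.

```python
import numpy as np
import tensorflow as tf

from tf_agents.bandits.agents import lin_ucb_agent
from tf_agents.bandits.environments import stationary_stochastic_py_environment as sspe
from tf_agents.drivers import dynamic_step_driver
from tf_agents.environments import tf_py_environment
from tf_agents.replay_buffers import tf_uniform_replay_buffer

BATCH_SIZE = 2
CONTEXT_DIM = 3


def context_sampling_fn():
  """Samples a batch of observations (contexts), one per parallel round."""
  return np.random.uniform(
      -1.0, 1.0, size=[BATCH_SIZE, CONTEXT_DIM]).astype(np.float32)


class LinearReward(object):
  """Deterministic linear reward function for a single arm."""

  def __init__(self, theta):
    self._theta = np.asarray(theta, dtype=np.float32)

  def __call__(self, context):
    return np.dot(context, self._theta)


# Three arms, each with its own (hidden) reward parameter.
reward_fns = [LinearReward(t)
              for t in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0])]

environment = tf_py_environment.TFPyEnvironment(
    sspe.StationaryStochasticPyEnvironment(
        context_sampling_fn, reward_fns, batch_size=BATCH_SIZE))

agent = lin_ucb_agent.LinearUCBAgent(
    time_step_spec=environment.time_step_spec(),
    action_spec=environment.action_spec())

replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.policy.trajectory_spec,
    batch_size=BATCH_SIZE,
    max_length=1)

driver = dynamic_step_driver.DynamicStepDriver(
    env=environment,
    policy=agent.collect_policy,
    num_steps=BATCH_SIZE,  # one decision per batch member per loop iteration
    observers=[replay_buffer.add_batch])

for _ in range(100):
  driver.run()                             # collect a batch of (context, action, reward)
  agent.train(replay_buffer.gather_all())  # update the agent on the collected rounds
  replay_buffer.clear()
```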

Agents

Currently the following algorithms are available:

Environments

In bandits, the environment is responsible for (i) outputting information about the current state (aka observation or context), and (ii) outputting a reward when receiving an action as input.
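A toy environment implementing these two responsibilities might look roughly as follows. This sketch assumes the `BanditPyEnvironment` base class in `tf_agents/bandits/environments/bandit_py_environment.py`, whose subclasses implement `_observe()` (responsibility (i)) and `_apply_action()` (responsibility (ii)); the exact spec and batching conventions may differ between versions.

```python
import numpy as np

from tf_agents.bandits.environments import bandit_py_environment
from tf_agents.specs import array_spec


class TwoArmedSignEnvironment(bandit_py_environment.BanditPyEnvironment):
  """Toy bandit: reward is 1 if the chosen arm matches the sign of the context."""

  def __init__(self):
    observation_spec = array_spec.ArraySpec(
        shape=(1,), dtype=np.float32, name='observation')
    action_spec = array_spec.BoundedArraySpec(
        shape=(), dtype=np.int32, minimum=0, maximum=1, name='action')
    super().__init__(observation_spec, action_spec)

  def _observe(self):
    # (i) Output the current state (observation/context).
    self._context = np.random.uniform(-1.0, 1.0, size=(1,)).astype(np.float32)
    return self._context

  def _apply_action(self, action):
    # (ii) Output a reward for the action received in this round.
    optimal_action = int(self._context[0] > 0)
    return np.float32(1.0 if int(action) == optimal_action else 0.0)
```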

In order to test the performance of existing and new bandit algorithms, the library provides several environments spanning various setups, such as linear or non-linear reward functions and stationary or non-stationary environment dynamics. More specifically, the following environments are available (a construction sketch for the classification suite follows the list):

  • Stationary: This environment assumes stationary functions for generating observations and rewards.
  • Non-stationary: This environment has non-stationary dynamics.
  • Piecewise stationary: This environment is non-stationary, consisting of stationary pieces.
  • Drifting: In this case, the environment is also non-stationary and its dynamics are slowly drifting.
  • Wheel: This is a non-linear environment with a scalar parameter that directly controls the difficulty of the problem.
  • Classification suite: Given any classification dataset wrapped as a tf.data.Dataset, this environment converts it into a bandit problem.
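As an illustration, the classification suite can be instantiated roughly as follows. This is a sketch that assumes the `ClassificationBanditEnvironment` class in `tf_agents/bandits/environments/classification_environment.py` takes a `tf.data.Dataset` of `(context, label)` pairs, a reward distribution with one entry per (class, action) pair, and a batch size; argument names and shape conventions may differ between versions.

```python
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp

from tf_agents.bandits.environments import classification_environment

tfd = tfp.distributions

NUM_CLASSES = 3
CONTEXT_DIM = 4

# Any classification dataset of (context, label) pairs works here; this one is random.
contexts = np.random.uniform(size=(1000, CONTEXT_DIM)).astype(np.float32)
labels = np.random.randint(0, NUM_CLASSES, size=(1000,)).astype(np.int32)
dataset = tf.data.Dataset.from_tensor_slices((contexts, labels))

# Deterministic reward table: entry [i, j] is the reward for playing arm j when
# the true class is i (here 1 for the correct arm, 0 otherwise).
reward_distribution = tfd.Independent(
    tfd.Deterministic(loc=tf.eye(NUM_CLASSES)), reinterpreted_batch_ndims=2)

environment = classification_environment.ClassificationBanditEnvironment(
    dataset=dataset, reward_distribution=reward_distribution, batch_size=8)
```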

Regret metrics

The library also provides TF metrics for regret computation. Regret is a central notion in the bandits literature: informally, it is the difference between the total expected reward achievable by the optimal policy and the total expected reward actually collected by the agent. Most of the environments listed above come with utilities for computing metrics such as the regret, the percentage of suboptimal arm plays, and so on.
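For example, a regret metric can be plugged into data collection as an extra observer. The sketch below assumes the `RegretMetric` class in `tf_agents.bandits.metrics.tf_metrics`, which (as used in the bandits tutorial) is constructed from a function mapping an observation to the expected reward of the optimal action; the environment, agent, and driver setup is elided.

```python
import tensorflow as tf

from tf_agents.bandits.metrics import tf_metrics as bandit_metrics


def optimal_reward_fn(observation):
  # Ground-truth best achievable expected reward for this context. In the toy
  # linear setup sketched earlier, each arm's expected reward equals one
  # coordinate of the context, so the optimum is simply the maximum coordinate.
  return tf.reduce_max(observation, axis=-1)


regret_metric = bandit_metrics.RegretMetric(optimal_reward_fn)

# Typically passed as an additional observer when collecting data, e.g.:
#   driver = dynamic_step_driver.DynamicStepDriver(
#       env=environment, policy=agent.collect_policy,
#       observers=[replay_buffer.add_batch, regret_metric], num_steps=...)
# After driver.run(), regret_metric.result() reports the average regret of the
# most recently collected batch of decisions.
```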

Examples

The library provides ready-to-use end-to-end examples for training and evaluating various bandit agents in the tf_agents/bandits/agents/examples/v2/ directory. A few examples:

  • Stationary linear: tests different bandit agents against stationary linear environments.
  • Wheel: tests different bandit agents against the wheel bandit environment.
  • Drifting linear: tests different bandit agents against drifting (i.e., non-stationary) linear environments.

Advanced functionality

Arm features

In some bandits use cases, each arm has its own features. For example, in movie recommendation problems, the user features play the role of the context and the movies play the role of the arms (aka actions). Each movie has its own features, such as text description, metadata, trailer content features and so on. We refer to such problems as arm features problems.

An example of bandit training with arm features can be found here.
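The sketch below illustrates how such per-arm observations are typically structured: a single observation bundles the global (user) features together with one feature vector per candidate arm. The dictionary keys `'global'` and `'per_arm'` mirror the library's per-arm convention but are written out here as assumptions; agents that support this setting consume the arm features when scoring actions.

```python
import tensorflow as tf

from tf_agents.specs import tensor_spec

GLOBAL_DIM = 8     # e.g., user features (the context)
PER_ARM_DIM = 6    # e.g., features of one movie (one arm)
NUM_ACTIONS = 20   # number of candidate movies in each round

# Observation spec: global features plus a feature matrix with one row per arm.
observation_spec = {
    'global': tensor_spec.TensorSpec([GLOBAL_DIM], tf.float32),
    'per_arm': tensor_spec.TensorSpec([NUM_ACTIONS, PER_ARM_DIM], tf.float32),
}

# One (unbatched) observation matching the spec: a single user context and a
# feature vector for each of the candidate movies offered in this round.
observation = {
    'global': tf.random.uniform([GLOBAL_DIM]),
    'per_arm': tf.random.uniform([NUM_ACTIONS, PER_ARM_DIM]),
}
```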

Multi-metric bandits

In some bandits use cases, the "goodness" of the decisions made by the agent can be measured via multiple metrics. For example, when we recommend a certain movie to a user, we can measure several metrics about this decision, such as whether the user clicked it, watched it, liked it, shared it, and so on. For such use cases, the library provides the following solutions.

  • Multi-objective optimization: In the case of several reward signals, a common technique is scalarization. The main idea is to combine all the input reward signals into a single one, which can then be optimized by vanilla bandit algorithms. The library offers several options for scalarization; a minimal sketch of the scalarization idea follows this list.
  • Constrained optimization: In use cases where one metric clearly plays the role of the reward metric and the other metrics can be understood as auxiliary constraint metrics, constrained optimization may be a good fit. In this case, one can introduce the notion of action feasibility, which may be context-dependent and indicates whether an action is eligible to be selected in the current round (given the current context). In the general case, the action feasibility is inferred by evaluating expressions involving one or more of the auxiliary constraint metrics. The [Constraints API](https://github.com/tensorflow/agents/tree/master/tf_agents/bandits/policies/constraints.py) unifies how all constraints are evaluated for computing action feasibility. A single constraint may or may not be trainable, depending on whether the action feasibility computation is informed by a model predicting the value of the corresponding constraint metric.
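A minimal sketch of the scalarization idea mentioned above: several per-decision reward signals are combined into one scalar reward that a standard bandit agent can then optimize. The weighted (linear) combination below is written in plain TensorFlow purely for illustration; it is not a reference implementation of the library's scalarizer classes.

```python
import tensorflow as tf


def linear_scalarize(rewards, weights):
  """Combines a [batch, num_objectives] reward tensor into a [batch] reward."""
  weights = tf.constant(weights, dtype=rewards.dtype)
  return tf.reduce_sum(rewards * weights, axis=-1)


# Example: click, watch, and like signals for a batch of two recommendations.
multi_metric_rewards = tf.constant([[1.0, 0.0, 0.0],
                                    [1.0, 1.0, 1.0]])
scalar_rewards = linear_scalarize(multi_metric_rewards, weights=[0.2, 0.5, 0.3])
# scalar_rewards == [0.2, 1.0]
```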