This article explores the implementation of a reinforcement learning algorithm called the Upper Confidence Bound (UCB) algorithm. Reinforcement learning, a subset of artificial intelligence, involves an agent interacting with an environment through a series of episodes or rounds. In each round, the agent makes a decision that may yield a reward. The agent's ultimate objective is to learn a strategy that maximizes its cumulative reward over time.
To demonstrate the implementation of the UCB algorithm, I will use the multi-armed bandit problem as an example. Section I of this article introduces the multi-armed bandit problem, Section II delves into the details of the Upper Confidence Bound (UCB) algorithm, and Section III presents its implementation.
I- The Multi-armed Bandit Problem
The multi-armed bandit problem can be described as follows: a player is presented with $K$ slot machines, each equipped with an arm. Pulling the arm of a machine may yield a reward, making each machine effectively a random variable. However, the player does not know the underlying probability distributions of these rewards. In each round, the player's goal is to decide which arm to pull to maximize their total rewards over time.
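To make this concrete, here is a minimal sketch of such an environment, where each arm is modeled as a Bernoulli random variable whose success probability is hidden from the player (the pull() function and the win_probs list are purely illustrative and are not part of the implementation discussed later):
import random

# Hidden reward probabilities: the player never observes this list directly.
win_probs = [0.2, 0.5, 0.7]

def pull(arm: int) -> int:
    # Pulling an arm pays 1 with the arm's hidden probability, 0 otherwise.
    return 1 if random.random() < win_probs[arm] else 0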
II- The Upper Confidence Bound Algorithm
The multi-armed bandit game can be approached using a simple naive strategy: the player begins by randomly selecting an arm to pull for the first $N$ rounds, updating the average reward for each arm after every round. This learning phase helps the player identify the arm with the highest average reward. For the remainder of the game, the player consistently pulls the arm of the perceived best bandit; a short sketch of this strategy follows the list below. While straightforward, this strategy has two significant drawbacks:
- Since the decisions during the learning phase are random, the total reward accumulated during this phase might be very low.
- The reward distributions of the bandits can change over time, meaning the best bandit may no longer remain optimal. Relying on a single bandit after the learning phase can lead to suboptimal rewards.
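As an illustration, a minimal sketch of this explore-then-commit strategy might look as follows, reusing the hypothetical pull() function and win_probs list from the example above:
import random

def naive_strategy(n_learning_rounds: int, n_total_rounds: int) -> int:
    n_arms = len(win_probs)
    counts = [0] * n_arms   # how many times each arm was pulled
    sums = [0] * n_arms     # cumulative reward of each arm
    total = 0
    # Learning phase: pull arms at random and track their average rewards.
    for _ in range(n_learning_rounds):
        arm = random.randrange(n_arms)
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    # Exploitation phase: commit to the arm with the best observed average.
    best = max(range(n_arms), key=lambda a: sums[a] / counts[a] if counts[a] else 0.0)
    for _ in range(n_total_rounds - n_learning_rounds):
        total += pull(best)
    return total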
A more effective strategy involves using confidence intervals for the average rewards. This approach forms the basis of the Upper Confidence Bound (UCB) algorithm. Consider the following notations:
- $N_i(n)$: the number of times arm $i$ has been pulled up to round $n$.
- $R_i(n)$: the sum of rewards yielded by arm $i$ up to round $n$.
The average reward of arm $i$ up to round $n$ is then given by: $$ \bar{r}_i(n) = \frac{R_i(n)}{N_i(n)}$$
Finally, the confidence interval of each bandit's average reward is: $$ \bar{r}_i(n) \pm \Delta_i(n)$$ where $$ \Delta_i(n) = \sqrt{\frac{3}{2}\frac{\log(n)}{N_i(n)}}$$
The UCB algorithm dictates that the player pull the arm $\hat{i}$ such that: $$\hat{i}(n) = \arg\max_i \{\bar{r}_i(n) + \Delta_i(n)\}$$ Unlike the naive strategy described earlier, which relies solely on the average reward to select the best arm at round $n$, the UCB algorithm adds a boosting term, $\Delta_i(n)$, to the average reward. Arms that have been tried fewer times (lower $N_i(n)$) receive a significant boost to their selection score, and this boost also grows as the game progresses into later rounds, with higher values of $n$.
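To see the effect of this boosting term, consider a hypothetical situation at round $n = 100$, with $\log$ taken as the natural logarithm (as in the implementation below). Arm $A$ has $\bar{r}_A = 0.6$ after $N_A = 80$ pulls, so $\Delta_A = \sqrt{1.5 \log(100)/80} \approx 0.29$ and its score is about $0.89$. Arm $B$ has a lower average $\bar{r}_B = 0.5$ but only $N_B = 20$ pulls, so $\Delta_B = \sqrt{1.5 \log(100)/20} \approx 0.59$ and its score is about $1.09$. UCB therefore picks the less-explored arm $B$ despite its lower observed average.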
Bootstrapping UCB
The equation for $\Delta_i(n)$ includes a division by $N_i(n)$, which necessitates that each arm be tried at least once before UCB scores can be calculated. To prevent division by zero, all arms must therefore be tested, either sequentially or randomly, before applying the UCB selection rule.
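An alternative, not used in the implementation below, is to give any arm that has never been pulled an infinite score, so that it is necessarily tried before the others. A minimal sketch of such a score function (ucb_score() is illustrative only) could be:
import math

def ucb_score(sum_reward: float, n_selected: int, n: int) -> float:
    # An arm that has never been pulled gets an infinite score, so it is tried
    # before any comparison of finite UCB values takes place.
    if n_selected == 0:
        return float('inf')
    return sum_reward / n_selected + math.sqrt(1.5 * math.log(n) / n_selected)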
III- An Implementation of UCB
The implementation of the Upper Confidence Bound algorithm is provided by the class UpperConfidenceBound below:
from typing import List, Optional
import array
import random
import math


class UpperConfidenceBound:
    def __init__(self, rewards_by_round: List[List[int]]):
        # rewards_by_round is assumed to expose a .shape attribute (e.g. a NumPy array).
        self.__rewards_by_round = rewards_by_round
        n_rounds, n_candidates = rewards_by_round.shape
        self.__n_rounds = n_rounds
        self.__n_candidates = n_candidates
        if self.__n_rounds < self.__n_candidates:
            raise ValueError('Number of rounds should be greater or equal to the number of candidates!')
        self.__reset_state()

    def __reset_state(self):
        # N_i(n): number of times each arm has been pulled so far.
        self.__n_selected: List[int] = array.array('i', [0] * self.__n_candidates)
        # R_i(n): sum of the rewards yielded by each arm so far.
        self.__sum_rewards: List[int] = array.array('i', [0] * self.__n_candidates)
        # Index of the arm selected at each round.
        self.__selected_by_round: List[int] = array.array('i')
        self.__total_reward: int = 0

    def learn(self):
        pass
The constructor of this class takes a single argument, rewards_by_round, which represents the environment for the player. It is structured as a matrix where each row corresponds to a round and each column represents a bandit. Specifically, rewards_by_round[i][j] denotes the potential reward obtained by pulling arm j during round i. Importantly, this matrix is not known to the player. The implementation assumes that rewards_by_round is either a NumPy array or a data type with a shape attribute.
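For example, under that NumPy assumption, an instance could be created from a small synthetic rewards matrix (the values below are made up purely for illustration):
import numpy as np

# 5 rounds, 3 arms: a 1 means pulling that arm in that round would pay off.
rewards_by_round = np.array([
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, 1, 1],
])
ucb = UpperConfidenceBound(rewards_by_round)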
The state of an UpperConfidenceBound instance consists of:
- __n_rounds and __n_candidates: the number of rounds to play and the number of arms, respectively.
- __n_selected and __sum_rewards: $N_i(n)$ and $R_i(n)$, respectively.
- __selected_by_round: the index of the arm selected at round i.
- __total_reward: the total reward earned so far by the player.
Now, let's implement the main method of this class, method learn():
def learn(self):
    self.__reset_state()
    self.__random_selection()
    self.__ucb_selection()
This method begins by resetting the game state, then bootstraps the UCB algorithm by invoking the method __random_selection(), and finally applies the core UCB strategy.
The code snippet below shows the implementation of __random_selection():
def __random_selection(self):
    random_candidates = list(range(self.__n_candidates))
    random.shuffle(random_candidates)
    for r in range(self.__n_candidates):
        self.__select_candidate(random_candidates[r], r)
The arms are first shuffled, then tried in sequence. The internal state of the game is updated after each trial by the method __select_candidate():
def __select_candidate(self, candidate_idx: int, round_idx: int):
    self.__n_selected[candidate_idx] += 1
    reward = self.__rewards_by_round[round_idx][candidate_idx]
    self.__sum_rewards[candidate_idx] += reward
    self.__total_reward += reward
    self.__selected_by_round.append(candidate_idx)
This method increments the number of times the selected arm has been tried, updates that arm's reward sum and the total reward, and records the decision made in this round.
The top-level implementation of UCB is quite simple: the method __ucb_selection() selects the arm with the maximum UCB score, then updates the internal state of the game:
def __ucb_selection(self):
    for r in range(self.__n_candidates, self.__n_rounds):
        best_candidate_idx = self.__find_best_candidate(r)
        self.__select_candidate(best_candidate_idx, r)
The method __find_best_candidate() selects the first arm that maximizes $\bar{r}_i(n) + \Delta_i(n)$:
def __find_best_candidate(self, round_idx: int) -> int:
    best_idx: Optional[int] = None
    # Start from -inf so the first candidate is always retained as the initial best.
    max_ucb = float('-inf')
    for cnd_idx in range(self.__n_candidates):
        average_reward = float(self.__sum_rewards[cnd_idx]) / self.__n_selected[cnd_idx]
        # Confidence half-width (exploration bonus).
        delta = math.sqrt(1.5 * math.log(round_idx) / self.__n_selected[cnd_idx])
        ucb = average_reward + delta
        if ucb > max_ucb:
            max_ucb = ucb
            best_idx = cnd_idx
    return best_idx
Lastly, two additional accessor methods are needed for testing and evaluation: one to retrieve the total reward and another to access the decisions made by the player throughout the game.
def total_reward(self) -> int:
    return self.__total_reward

def selected_candidates(self) -> List[int]:
    return self.__selected_by_round
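Putting everything together, an end-to-end example might look like the following, where a synthetic Bernoulli rewards matrix with made-up arm probabilities stands in for the environment (this assumes rewards_by_round is a NumPy array, as discussed above):
import numpy as np

rng = np.random.default_rng(42)
win_probs = [0.2, 0.5, 0.7]  # hidden from the player
# 1000 rounds, one column of potential rewards per arm.
rewards = rng.binomial(1, win_probs, size=(1000, len(win_probs)))

player = UpperConfidenceBound(rewards)
player.learn()

print('Total reward:', player.total_reward())
selections = player.selected_candidates()
# The arm with the highest hidden probability should dominate the selections.
print('Selections per arm:', [selections.count(i) for i in range(len(win_probs))])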