This article explores the implementation of a reinforcement learning algorithm called the Upper Confidence Bound (UCB) algorithm. Reinforcement learning, a subset of artificial intelligence, involves an agent interacting with an environment through a series of episodes or rounds. In each round, the agent makes a decision that may yield a reward. The agent's ultimate objective is to learn a strategy that maximizes its cumulative reward over time.
To demonstrate the implementation of the UCB algorithm, I will use the multi-armed bandit problem as an example. Section I of this article introduces the multi-armed bandit problem, Section II delves into the details of the Upper Confidence Bound (UCB) algorithm, and Section III presents its implementation.
I- The Multi-armed Bandit Problem
The multi-armed bandit problem can be described as follows: a player is presented with $K$ slot machines, each equipped with an arm. Pulling the arm of a machine may yield a reward, making each machine effectively a random variable. However, the player does not know the underlying probability distributions of these rewards. In each round, the player's goal is to decide which arm to pull to maximize their total rewards over time.
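To make this concrete, here is a minimal sketch of such an environment, where each arm is modeled as a Bernoulli random variable whose success probability is hidden from the player (the pull() function and the win_probs list are purely illustrative and are not part of the implementation discussed later):
import random

# Hidden reward probabilities: the player never observes this list directly.
win_probs = [0.2, 0.5, 0.7]

def pull(arm: int) -> int:
    # Pulling an arm pays 1 with the arm's hidden probability, 0 otherwise.
    return 1 if random.random() < win_probs[arm] else 0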
II- The Upper Confidence Bound Algorithm
The multi-armed bandit game can be approached using a simple naive strategy: the player begins by randomly selecting an arm to pull for the first $N$ rounds, updating the average reward for each arm after every round. This learning phase helps the player identify the arm with the highest average reward. For the remainder of the game, the player consistently pulls the arm of the perceived best bandit; a short sketch of this strategy follows the list below. While straightforward, this strategy has two significant drawbacks:
- Since the decisions during the learning phase are random, the total reward accumulated during this phase might be very low.
- The reward distributions of the bandits can change over time, meaning the best bandit may no longer remain optimal. Relying on a single bandit after the learning phase can lead to suboptimal rewards.
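As an illustration, a minimal sketch of this explore-then-commit strategy might look as follows, reusing the hypothetical pull() function and win_probs list from the example above:
import random

def naive_strategy(n_learning_rounds: int, n_total_rounds: int) -> int:
    n_arms = len(win_probs)
    counts = [0] * n_arms   # how many times each arm was pulled
    sums = [0] * n_arms     # cumulative reward of each arm
    total = 0
    # Learning phase: pull arms at random and track their average rewards.
    for _ in range(n_learning_rounds):
        arm = random.randrange(n_arms)
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    # Exploitation phase: commit to the arm with the best observed average.
    best = max(range(n_arms), key=lambda a: sums[a] / counts[a] if counts[a] else 0.0)
    for _ in range(n_total_rounds - n_learning_rounds):
        total += pull(best)
    return total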
A more effective strategy involves using confidence intervals for the average rewards. This approach forms the basis of the Upper Confidence Bound (UCB) algorithm. Consider the following notations:
- $N_i(n)$: the number of times arm $i$ has been pulled up to round $n$.
- $R_i(n)$: the sum of rewards yielded by arm $i$ up to round $n$.
The average reward of arm $i$ up to round $n$ is then given by: $$ \bar{r}_i(n) = \frac{R_i(n)}{N_i(n)}$$
Finally, the confidence interval of each bandit's average reward is: $$ \bar{r}_i(n) \pm \Delta_i(n)$$ where $$ \Delta_i(n) = \sqrt{\frac{3}{2}\frac{\log(n)}{N_i(n)}}$$
The UCB algorithm dictates that the player pull the arm $\hat{i}$ such that: $$\hat{i}(n) = \arg\max_i \{\bar{r}_i(n) + \Delta_i(n)\}$$ Unlike the naive strategy described earlier, which relies solely on the average reward to select the best arm at round $n$, the UCB algorithm adds a boosting term, $\Delta_i(n)$, to the average reward. Arms that have been tried fewer times (lower $N_i(n)$) receive a significant boost to their selection score, and this boost also grows as the game progresses into later rounds, with higher values of $n$.
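To see the effect of this boosting term, consider a hypothetical situation at round $n = 100$, with $\log$ taken as the natural logarithm (as in the implementation below). Arm $A$ has $\bar{r}_A = 0.6$ after $N_A = 80$ pulls, so $\Delta_A = \sqrt{1.5 \log(100)/80} \approx 0.29$ and its score is about $0.89$. Arm $B$ has a lower average $\bar{r}_B = 0.5$ but only $N_B = 20$ pulls, so $\Delta_B = \sqrt{1.5 \log(100)/20} \approx 0.59$ and its score is about $1.09$. UCB therefore picks the less-explored arm $B$ despite its lower observed average.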
Bootstrapping UCB
The equation for $\Delta_i(n)$ includes a division by $N_i(n)$, which necessitates that each arm be tried at least once before UCB scores can be calculated. To prevent division by zero, all arms must therefore be tested, either sequentially or randomly, before applying the UCB selection rule.
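An alternative, not used in the implementation below, is to give any arm that has never been pulled an infinite score, so that it is necessarily tried before the others. A minimal sketch of such a score function (ucb_score() is illustrative only) could be:
import math

def ucb_score(sum_reward: float, n_selected: int, n: int) -> float:
    # An arm that has never been pulled gets an infinite score, so it is tried
    # before any comparison of finite UCB values takes place.
    if n_selected == 0:
        return float('inf')
    return sum_reward / n_selected + math.sqrt(1.5 * math.log(n) / n_selected)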
III- An Implementation of UCB
The implementation of the Upper Confidence Bound algorithm is provided by the class UpperConfidenceBound below:
from typing import List, Optional
import array
import random
import math


class UpperConfidenceBound:
    def __init__(self, rewards_by_round: List[List[int]]):
        # rewards_by_round is assumed to expose a .shape attribute (e.g. a NumPy array).
        self.__rewards_by_round = rewards_by_round
        n_rounds, n_candidates = rewards_by_round.shape
        self.__n_rounds = n_rounds
        self.__n_candidates = n_candidates
        if self.__n_rounds < self.__n_candidates:
            raise ValueError('Number of rounds should be greater or equal to the number of candidates!')
        self.__reset_state()

    def __reset_state(self):
        # N_i(n): number of times each arm has been pulled so far.
        self.__n_selected: List[int] = array.array('i', [0] * self.__n_candidates)
        # R_i(n): sum of the rewards yielded by each arm so far.
        self.__sum_rewards: List[int] = array.array('i', [0] * self.__n_candidates)
        # Index of the arm selected at each round.
        self.__selected_by_round: List[int] = array.array('i')
        self.__total_reward: int = 0

    def learn(self):
        pass
The constructor of this class takes a single argument, rewards_by_round, which represents the environment for the player. It is structured as a matrix where each row corresponds to a round and each column represents a bandit. Specifically, rewards_by_round[i][j] denotes the potential reward obtained by pulling arm j during round i. Importantly, this matrix is not known to the player. The implementation assumes that rewards_by_round is either a NumPy array or a data type with a shape attribute.
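For example, under that NumPy assumption, an instance could be created from a small synthetic rewards matrix (the values below are made up purely for illustration):
import numpy as np

# 5 rounds, 3 arms: a 1 means pulling that arm in that round would pay off.
rewards_by_round = np.array([
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, 1, 1],
])
ucb = UpperConfidenceBound(rewards_by_round)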
The state of an UpperConfidenceBound instance consists of:
- __n_rounds and __n_candidates: the number of rounds to play and the number of arms, respectively.
- __n_selected and __sum_rewards: $N_i(n)$ and $R_i(n)$, respectively.
- __selected_by_round: the index of the arm selected at round i.
- __total_reward: the total reward earned so far by the player.
Now, let's implement the main method of this class, method learn():
def learn(self):
    self.__reset_state()
    self.__random_selection()
    self.__ucb_selection()
This method begins by resetting the game state, then bootstraps the UCB algorithm by invoking the method __random_selection(), and finally applies the core UCB strategy.
The code snippet below shows the implementation of __random_selection():
def __random_selection(self):
    random_candidates = list(range(self.__n_candidates))
    random.shuffle(random_candidates)
    for r in range(self.__n_candidates):
        self.__select_candidate(random_candidates[r], r)
The arms are first shuffled, then tried in sequence. The internal state of the game is updated after each trial by the method __select_candidate():
def __select_candidate(self, candidate_idx: int, round_idx: int):
    self.__n_selected[candidate_idx] += 1
    reward = self.__rewards_by_round[round_idx][candidate_idx]
    self.__sum_rewards[candidate_idx] += reward
    self.__total_reward += reward
    self.__selected_by_round.append(candidate_idx)
This method increments the number of times the selected arm has been tried, updates that arm's reward sum and the total reward, and records the decision made in this round.
The top-level implementation of UCB is quite simple: the method __ucb_selection() selects the arm with the maximum UCB score, then updates the internal state of the game:
def __ucb_selection(self):
    for r in range(self.__n_candidates, self.__n_rounds):
        best_candidate_idx = self.__find_best_candidate(r)
        self.__select_candidate(best_candidate_idx, r)
The method __find_best_candidate() selects the first arm that maximizes $\bar{r}_i(n) + \Delta_i(n)$:
def __find_best_candidate(self, round_idx: int) -> int:
    best_idx: Optional[int] = None
    # Start from -inf so the first candidate is always retained as the initial best.
    max_ucb = float('-inf')
    for cnd_idx in range(self.__n_candidates):
        average_reward = float(self.__sum_rewards[cnd_idx]) / self.__n_selected[cnd_idx]
        # Confidence half-width (exploration bonus).
        delta = math.sqrt(1.5 * math.log(round_idx) / self.__n_selected[cnd_idx])
        ucb = average_reward + delta
        if ucb > max_ucb:
            max_ucb = ucb
            best_idx = cnd_idx
    return best_idx
Lastly, two additional accessor methods are needed for testing and evaluation: one to retrieve the total reward and another to access the decisions made by the player throughout the game.
def total_reward(self) -> int:
    return self.__total_reward

def selected_candidates(self) -> List[int]:
    return self.__selected_by_round
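Putting everything together, an end-to-end example might look like the following, where a synthetic Bernoulli rewards matrix with made-up arm probabilities stands in for the environment (this assumes rewards_by_round is a NumPy array, as discussed above):
import numpy as np

rng = np.random.default_rng(42)
win_probs = [0.2, 0.5, 0.7]  # hidden from the player
# 1000 rounds, one column of potential rewards per arm.
rewards = rng.binomial(1, win_probs, size=(1000, len(win_probs)))

player = UpperConfidenceBound(rewards)
player.learn()

print('Total reward:', player.total_reward())
selections = player.selected_candidates()
# The arm with the highest hidden probability should dominate the selections.
print('Selections per arm:', [selections.count(i) for i in range(len(win_probs))])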