This document describes the multiarmed bandit problem and its application in the Frosmo Platform.
What is the multiarmed bandit?
The multiarmed bandit is a modification case for automatically managing and optimizing variation distribution. Multiarmed bandit modifications continuously adjust their variation distribution based on how well the variations perform over time.
Performance is measured as either the clickthrough rate (CTR) or conversion rate (CR) of a variation. The higher the rate relative to the other variations of the modification, the more successful the variation, and the greater the likelihood that the variation is displayed to visitors: its distribution percentage increases relative to the percentages of the other variations. If the variation underperforms, its distribution percentage decreases, resulting in fewer visitors seeing the variation.
No single variation, no matter how successful, will ever permanently dominate the distribution. Even the least successful variation will always be displayed to some fraction of visitors within a given time span.
The multiarmed bandit automatically recalculates the variation distribution every two hours and every time the modification configuration changes. (The update interval is fixed, so you cannot change it.)
Multiarmed bandit problem
The multiarmed bandit problem is probability theory problem, which Wikipedia defines as:
a problem in which a fixed limited set of resources must be allocated between competing (alternative) choices in a way that maximizes their expected gain. … The name comes from imagining a gambler at a row of slot machines (sometimes known as "onearmed bandits"), who has to decide which machines to play, how many times to play each machine and in which order to play them, and whether to continue with the current machine or try a different machine.
In the problem, each machine provides a random reward from a probability distribution specific to that machine. The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls. The crucial tradeoff the gambler faces at each trial is between "exploitation" of the machine that has the highest expected payoff and "exploration" to get more information about the expected payoffs of the other machines.
In the Frosmo Platform, the multiarmed bandit modification is the row of slot machines, the variations of the modification are the slot machines (or simply "arms"), and the rewards are either clicks or conversions generated by the variations. The Frosmo Platform is the agent that continuously tracks how well the variations perform (exploration) and, based on this data, dynamically adjusts the relative distribution of the variations to maximize the number of clicks or conversions for the modification (exploitation). To maintain exploration, the platform does not allow any variation, no matter how successful, to ever permanently dominate the distribution. However, if one variation is clearly more successful than all the others, the most succesful variation is delivered practically all the time.
For more information about the multiarmed bandit problem, see Further reading.
Multiarmed bandit configuration
To configure a multiarmed bandit modification, you must first create two or more content variations for the modification and activate them. You can then select the reward (click, conversion after click, or conversion after display) to define how variation performace is measured. You also select the algorithm for optimizing the variation distribution.
The multiarmed bandit algorithm adjusts the variation distribution percentages for each variation over time. If the modification includes a comparison group, the multiarmed bandit has no effect on it, meaning 10% of visitors continue to be excluded from the variations, limiting the multiarmed bandit to the remaining 90% of visitors.
For more information about configuring a multiarmed bandit modification, see Defining the content.
Multiarmed bandit and test mode
If you use test mode to test a multiarmed bandit modification, the testing produces usage data (displays, clicks, and conversions) for the multiarmed bandit algorithm. In other words, the testing affects the variation distribution calculated by the multiarmed bandit. If you do not want testing to impact variation distribution, you can duplicate the modification for public production use, or use a workspace.
For more information about test mode, see Test mode.
Multiarmed bandit algorithms
The Frosmo Platform supports a number of different multiarmed bandit algorithms. Each algorithm represents a specific solution to the multiarmed bandit problem, that is, a specific mathematical strategy for maximizing rewards over time.
By default, modifications use the Thompson sampling algorithm. You can change the algorithm separately for each modification from the modification settings. You can change algorithms at any time. The change takes effect immediately, and the variation distribution for the modification is recalculated with the newly selected algorithm.
A modification can only use a single algorithm at a time.
For more information about the algorithms, see Supported multiarmed bandit algorithms and Further reading.
Multiarmed bandit use cases
Multiarmed bandit modifications are only useful when they have two or more active variations.
Use multiarmed bandit modifications to:
 Ensure that, in the long run, the most successful variations are displayed most often. In other words, you want the variations to compete for display time.
 Test how well different variations perform over time, and put the test results to use while the test is still ongoing (rather than wait until testing is complete before adjusting the distribution percentages).
 Optimize the variation distribution of a modification for maximum gain without spending time or resources to run A/B testing. For example, if you're planning a short campaign on a tight schedule, running an A/B test in preparation for the campaign may not be feasible, but you can use the multiarmed bandit to optimize onthefly during the campaign.
 React dynamically to visitors' changing interests or needs over time without manually adjusting the variation distribution.
Do not use the multiarmed bandit modification to:
 Create solutions that require static variation distribution. For example, if you have fixed ratios for sharing display time between different products or product categories, you need to use static distribution percentages.
 Use the most successful modification from a set of alternatives. The multiarmed bandit operates within a modification, on the variations of the same modification, not between modifications, so you cannot use it to compare modifications.
Here are some examples of potential multiarmed bandit modifications:
 Banner with alternative onsale products (one product per variation)
 Banner with alternative visual styles, such as alternative colors or shapes (one visual style per variation)
 Product recommendation with one variation for the most purchased products and another variation for the most viewed products
Supported multiarmed bandit algorithms
The Frosmo Platform supports the following multiarmed bandit algorithms:
By default, modifications use the Thompson sampling algorithm. For instructions on how to change the algorithm for a modification, see Defining the content.
Which algorithm to use?
The multiarmed bandit problem has been notoriously difficult to solve. The algorithms all represent approximate solutions to the problem, and there is no single algorithm that always works better than the others in all situations. Which algorithm to use therefore depends on your modification and what you expect from it. Moreover, whether a given algorithm is the optimal choice for your modification is ultimately determined by realworld usage.
The following table gives you a starting point for choosing which algorithms to use in which use cases.
Table: Use cases for multiarmed bandit algorithms
Use case  Optimal algorithms 

You expect the variations to perform consistently over time. The most successful variations will continue to perform the best, and the least successful variations will remain the least popular among visitors. 

You expect variation performance to fluctuate slowly over time. Variations will continuously lose and gain popularity among visitors. 

You want each variation to always be displayed to at least some portion of visitors, irrespective of how well the variation performs. In other words, the distribution percentage of a variation can never reach 0.  Epsilongreedy or Softmax 
Epsilongreedy
In the Frosmo Platform, the epsilongreedy algorithm assigns 90% distribution to the most successful variation (exploitation) and splits the remaining 10% equally between the other variations (exploration). If the most successful variation starts to underperform (its rewardstodisplays ratio drops), or if one the other variations in the 10% distribution starts to perform really well (its rewardstodisplays ratio increases), a new winner may emerge and get the 90% distribution.
If the modification includes a comparison group, the algorithm assigns 81% distribution to the most successful variation and 9% distribution to the remaining variations since the comparison group reserves 10% of the total distribution.
Table: Epsilongreedy pros and cons
Pros  Cons 

Easy to understand  Fixed distribution percentage for variations other than the bestperforming one 
Softmax
In mathematical terms, the Softmax algorithm assigns a probability to each arm using a Boltzmann distribution and, based on the probabilities, randomly selects the arm to exploit. The probability assigned to each arm is proportional to its average reward: the more successful the arm, the higher its probability.
In Frosmo terms, the algorithm assigns a distribution percentage to each variation based on how well the variation has performed in the past, that is, how its rewardstodisplays ratio ranks against the other variations. With each recalculation, the most successful variation up to that point gets the highest distribution percentage, while the least successful variation gets the lowest distribution percentage. Each variation is therefore continuously exploited (displayed), but the more successful ones are exploited more.
If the modification includes a comparison group, the total distribution available to the algorithm is 90%.
Table: Softmax pros and cons
Pros  Cons 

Good shortterm performance: finds the bestperforming variation quickly Relatively easy to understand  Poor longterm performance: accumulates much more regret than more complex algorithms Regret measures the difference between (a) the total reward you would've collected had you used the best arm throughout and (b) the total reward you actually collected. The smaller the latter compared to the former, the greater the regret, and the less successful the algorithm. Not suitable for modifications whose variations have similar performance (as the algorithm ends up spending a lot of time exploring nonoptimal variations) 
Thompson sampling
In the Frosmo Platform, the Thompson sampling algorithm calculates a beta distribution for each variation, randomly selects a number from that distribution for the variation, and assigns 100% distribution to the variation with the highest number. This variation is therefore the only variation displayed until the distribution is recalculated.
The beta distribution for a variation reflects the probability of that variation being the most successful variation.
If the modification includes a comparison group, the selected variation gets 90% distribution.
Table: Thompson sampling pros and cons
Pros  Cons 

Finds the bestperforming variation really quickly Once there's enough accumulated data, the algorithm exploits (as opposed to explores) almost all of the time  If the bestperforming variation starts to underperform and its rewardstodisplays ratio drops below that of another variation, it can take a long time for the algorithm to catch up and start displaying the new bestperforming variation 
UCB1 and UCB1tuned
The upper confidence bound (UCB) algorithms are based on the principle of optimism in the face of uncertainty. In short, a UCB algorithm makes the most plausibly optimistic guess about the expected average reward of each arm and then selects the arm that promises the highest reward. Either the optimism is justified, in which case the algorithm continues to exploit the selected arm, or the optimism is not justified, in which case the algorithm eventually switches to explore other arms. The "optimism" is embodied by the upper confidence bound the algorithm continuously calculates for each arm: the optimistic guess about how well the arm will perform.
In the Frosmo Platform, the UCB algorithms assign 100% distribution to the variation with the highest upper confidence bound. This variation is therefore the only variation displayed until the distribution is recalculated.
If the modification includes a comparison group, the selected variation gets 90% distribution.
The Frosmo Platform supports two UCB algorithms: UCB1 and UCB1tuned. The algorithms differ only in that UCB1 calculates the upper confidence bound based on a variation's current rewardstodisplays ratio, whereas UCB1tuned also considers how the ratio has varied in the past. UCB1tuned can perform better in practice, but this cannot be theoretically guaranteed.
Table: UCB1 pros and cons
Pros  Cons 

Finds the bestperforming variation really fast Once there's enough accumulated data, the algorithm exploits (as opposed to explores) almost all of the time May react to changes in variation performance faster than Thompson sampling  May lose to Thompson sampling in overall performance 
Further reading
General
 Multiarmed bandit (Wikipedia)
Algorithms
Algorithms: Softmax
 Boltzmann distribution (Wikipedia)
 Softmax function (Wikipedia)
Algorithms: Thompson sampling
 Analysis of Thompson Sampling for the multiarmed bandit problem (arXiv)
 Beta distribution (Wikipedia)
 Probability matching (Wikipedia)
 Thompson sampling (Wikipedia)
 Thompson Sampling: An Asymptotically Optimal Finite Time Analysis (arXiv)
Algorithms: UCB
 Confidence Bounds (ReliaWiki)
 Confidence interval (Wikipedia)
 Optimism in the Face of Uncertainty: the UCB1 Algorithm (Math n Programming)
 The Upper Confidence Bound Algorithm (Bandit Algorithms)