This document describes the multi-armed bandit feature for modifications and shows you how to use it in the Frosmo Control Panel.
What is the multi-armed bandit?
The multi-armed bandit is a feature for automatically managing and optimizing variation distribution for modifications. When enabled for a given modification, the multi-armed bandit continuously adjusts variation distribution for that modification based on how well the variations perform over time.
Performance is measured as either the click-through rate (CTR) or conversion rate (CR) of a variation. The higher the rate relative to the other variations of the modification, the more successful the variation, and the greater the likelihood that the variation is displayed to visitors: its distribution percentage increases relative to the percentages of the other variations. If the variation underperforms, its distribution percentage decreases, resulting in fewer visitors seeing the variation.
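The performance measure described above can be sketched as a small helper function. This is an illustrative sketch based on the definitions in the text, not an actual platform API:

```python
def performance_rate(displays, clicks, conversions, metric="CTR"):
    """Click-through rate (CTR) or conversion rate (CR) of a variation.

    Both rates are relative to the number of times the variation was
    displayed; a variation with no displays scores 0.
    """
    if displays == 0:
        return 0.0
    events = clicks if metric == "CTR" else conversions
    return events / displays
```

For example, a variation clicked 30 times over 200 displays has a CTR of 0.15; if 5 of those visits converted, its CR is 0.025.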
No single variation, no matter how successful, will ever permanently dominate the distribution. Even the least successful variation will always be displayed to some fraction of visitors within a given time span.
The multi-armed bandit automatically recalculates the variation distribution every few hours and every time the modification configuration changes. (The update interval is fixed, so you cannot change it.)
Multi-armed bandit problem
The multi-armed bandit problem is a problem in probability theory, which Wikipedia defines as:
a problem in which a gambler at a row of slot machines (sometimes known as "one-armed bandits") has to decide which machines to play, how many times to play each machine and in which order to play them. When played, each machine provides a random reward from a probability distribution specific to that machine. The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls.
In more general terms:
The multi-armed bandit problem models an agent that simultaneously attempts to acquire new knowledge (called "exploration") and optimize his or her decisions based on existing knowledge (called "exploitation"). The agent attempts to balance these competing tasks in order to maximize his or her total value over the period of time considered.
In the Frosmo Platform, the modification is the row of slot machines, the variations of the modification are the slot machines (or simply "arms"), and the rewards are either clicks or conversions generated by the variations. The Frosmo Platform is the agent that continuously tracks how well the variations perform (exploration) and, based on this data, dynamically adjusts the relative distribution of the variations to maximize the number of clicks or conversions for the modification (exploitation). To maintain exploration, the platform does not allow any variation, no matter how successful, to ever permanently dominate the distribution.
For more information about the multi-armed bandit problem, see Further reading.
Multi-armed bandit configuration
You enable and configure the multi-armed bandit separately for each modification.
If enabled, the multi-armed bandit overrides the static variation distribution percentages for the modification. If the modification includes a comparison group, the multi-armed bandit has no effect on it, meaning 10% of visitors continue to be excluded from the variations, limiting the multi-armed bandit to the remaining 90% of visitors.
You can use the multi-armed bandit with any modification type.
Multi-armed bandit and test mode
If you use test mode to test a multi-armed bandit modification, the testing produces usage data (displays, clicks, and conversions) for the multi-armed bandit algorithm. In other words, the testing affects the variation distribution calculated by the multi-armed bandit. If you do not want testing to impact variation distribution, duplicate the modification for public production use.
For more information about test mode, see Testing modifications with test mode.
Multi-armed bandit and variation lock
Enabling the multi-armed bandit for a modification automatically disables variation lock for that modification. With the multi-armed bandit enabled, a visitor is not locked to the variation they see on their first page load. Instead, they are randomly assigned a new variation on each page load, with the most successful variations at that time having the highest probabilities of getting displayed.
For more information about variation lock, see the Palmdale release notes.
Multi-armed bandit algorithms
The Frosmo Platform supports a number of different multi-armed bandit algorithms. Each algorithm represents a specific solution to the multi-armed bandit problem, that is, a specific mathematical strategy for maximizing rewards over time.
By default, modifications use the Thompson sampling algorithm. You can change the algorithm separately for each modification from the modification settings. You can change algorithms at any time. The change takes effect immediately, and the variation distribution for the modification is recalculated with the newly selected algorithm.
For more information about the algorithms, see Supported multi-armed bandit algorithms and Further reading.
Multi-armed bandit use cases
The multi-armed bandit is only useful with modifications that have two or more active variations.
Use the multi-armed bandit when:
 You want to ensure that, in the long run, the most successful variations are displayed most often. In other words, you want the variations to compete for display time.
 You want to both test how well different variations perform over time and put the test results to use while the test is still ongoing (rather than wait until testing is complete before adjusting the distribution percentages).
 You don't have the time or resources to run A/B testing on a modification, but you nonetheless want to optimize its variation distribution for maximum gain. For example, if you're planning a short campaign on a tight schedule, running an A/B test in preparation for the campaign may not be feasible, but you can use the multi-armed bandit to optimize on the fly during the campaign.
 You want a modification to dynamically react to visitors' changing interests or needs over time.
Do not use the multi-armed bandit when:
 You need static variation distribution. For example, if you have fixed ratios for sharing display time between different products or product categories, you need to use static distribution percentages.
 You want to use the most successful modification from a set of alternatives. The multi-armed bandit operates within a modification, on the variations of the same modification, not between modifications, so you cannot use it to compare modifications.
Here are some examples of modifications that can benefit from the multi-armed bandit:
 Banner with alternative onsale products (one product per variation)
 Banner with alternative visual styles, such as alternative colors or shapes (one visual style per variation)
 Product recommendation with one variation for the most purchased products and another variation for the most viewed products
Enabling the multi-armed bandit for a modification
To enable the multi-armed bandit for a modification:
 In the Control Panel, select Modifications > Overview.
 Find the modification, and click the modification name to open the modification settings.
 Select Content.
 If the modification has only one active variation, create or activate additional variations. For more information, see Working with modifications.
 Click Set variation distribution in any of the variations.
 Select Multi-armed bandit. The variation distribution percentages are grayed out to indicate that they are now automatically calculated. If it was selected, Lock variations for visitors is automatically deselected.
 To customize how the multi-armed bandit works, click Multi-armed bandit settings.
By default, the modification uses clicks for rewards and Thompson sampling as the multi-armed bandit algorithm.
 From the Multi-armed bandit reward options, select how you want to measure variation performance:
 Click: Number of times the variation is clicked
 Conversion (after click): Number of times clicking the variation leads to a conversion
 Conversion (after display): Number of times displaying the variation leads to a conversion
 From the Multi-armed bandit algorithm options, select the algorithm you want to use for this modification. If you're unsure which algorithm to use, select Thompson sampling. For more information about the algorithms, see Supported multi-armed bandit algorithms.
 Click Set. The modification summary information now shows Multi-armed bandit: On.
You can view the current variation distribution percentages next to the Set variation distribution buttons.
 To save the modification, click Save all.
You have enabled the multi-armed bandit for the modification. The Frosmo Platform now calculates variation distribution for the modification dynamically based on how well the variations perform over time.
When you first enable the multi-armed bandit for a modification, the variations all get equal distribution. There's no data available for the algorithm to work on, so there's no way to rank the variations and adjust their distribution. This is the case irrespective of the active algorithm. The multi-armed bandit starts adjusting the variation distribution once data starts to accumulate for all variations.
Viewing the current variation distribution for a modification
To view the current variation distribution percentages for a modification:
 In the Control Panel, select Modifications > Overview.
 Find the modification, and click the modification name to open the modification settings.
 Select Content.
 Note the current percentages next to the Set variation distribution buttons. The distribution is recalculated every few hours.
Disabling the multi-armed bandit for a modification
To disable the multi-armed bandit for a modification:
 In the Control Panel, select Modifications > Overview.
 Find the modification, and click the modification name to open the modification settings.
 Select Content.
 Click Set variation distribution in any of the variations.
 Deselect Multi-armed bandit. The fixed variation distribution percentages are restored to active status, meaning you can now edit them manually.
Disabling the multi-armed bandit leaves the distribution percentages at the latest values calculated by the bandit. If one or more variations are at 0% distribution, you must change their percentages to a positive value and adjust the other variations so that the combined distribution of all the variations totals 100%. If you want to enable variation lock for the modification, select Lock variations for visitors. For more information, see the Palmdale release notes.
 Click Set. The Multi-armed bandit: On status is removed from the modification summary information.
 To save the modification, click Save all.
You have disabled the multi-armed bandit for the modification. The Frosmo Platform now uses the static variation distribution percentages to randomize which variation is displayed to a visitor.
Supported multi-armed bandit algorithms
The Frosmo Platform supports the following multi-armed bandit algorithms:
 Epsilon-greedy
 Softmax
 Thompson sampling
 UCB1
 UCB1-tuned
By default, modifications use the Thompson sampling algorithm. For instructions on how to change the algorithm for a modification, see Enabling the multi-armed bandit for a modification.
Which algorithm to use?
The multi-armed bandit problem has been notoriously difficult to solve. The algorithms all represent approximate solutions to the problem, and there is no single algorithm that always works better than the others in all situations. Which algorithm to use therefore depends on your modification and what you expect from it. Moreover, whether a given algorithm is the optimal choice for your modification is ultimately determined by real-world usage.
The following table gives you a starting point for choosing which algorithms to use in which use cases.
Table: Use cases for multi-armed bandit algorithms
| Use case | Optimal algorithms |
| --- | --- |
| You expect the variations to perform consistently over time. The most successful variations will continue to perform the best, and the least successful variations will remain the least popular among visitors. | |
| You expect variation performance to fluctuate slowly over time. Variations will continuously lose and gain popularity among visitors. | |
| You want each variation to always be displayed to at least some portion of visitors, irrespective of how well the variation performs. In other words, the distribution percentage of a variation can never reach 0. | Epsilon-greedy or Softmax |
Epsilon-greedy
In the Frosmo Platform, the epsilon-greedy algorithm assigns 90% distribution to the most successful variation (exploitation) and splits the remaining 10% equally between the other variations (exploration). If the most successful variation starts to underperform (its rewards-to-displays ratio drops), or if one of the other variations in the 10% distribution starts to perform really well (its rewards-to-displays ratio increases), a new winner may emerge and get the 90% distribution.
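The 90/10 split described above can be sketched as follows. This is an illustration of the described behavior, not the platform's actual code; here epsilon stands for the 10% exploration share:

```python
def epsilon_greedy_distribution(rewards, displays, epsilon=0.1):
    """Give 1 - epsilon of the traffic to the variation with the best
    rewards-to-displays ratio and split epsilon equally among the rest."""
    ratios = [r / d if d > 0 else 0.0 for r, d in zip(rewards, displays)]
    best = ratios.index(max(ratios))
    n_others = len(ratios) - 1
    return [
        1.0 - epsilon if i == best else epsilon / n_others
        for i in range(len(ratios))
    ]
```

With three variations where the first has the best ratio, this yields a 90% / 5% / 5% split; if another variation's ratio overtakes the leader, the next recalculation hands it the 90% share.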
Table: Epsilon-greedy pros and cons
| Pros | Cons |
| --- | --- |
| Easy to understand | Fixed distribution percentage for variations other than the best-performing one |
Softmax
In mathematical terms, the Softmax algorithm assigns a probability to each arm using a Boltzmann distribution and, based on the probabilities, randomly selects the arm to exploit. The probability assigned to each arm is proportional to its average reward: the more successful the arm, the higher its probability.
In Frosmo terms, the algorithm assigns a distribution percentage to each variation based on how well the variation has performed in the past, that is, how its rewards-to-displays ratio ranks against the other variations. With each recalculation, the most successful variation up to that point gets the highest distribution percentage, while the least successful variation gets the lowest distribution percentage. Each variation is therefore continuously exploited (displayed), but the more successful ones are exploited more.
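As a sketch, the Boltzmann weighting might look like this. The temperature parameter is an assumption for illustration (lower values favor the leader more strongly); the platform's actual parameterization is not documented here:

```python
import math

def softmax_distribution(rewards, displays, temperature=0.1):
    """Assign each variation a share proportional to
    exp(average_reward / temperature) (a Boltzmann distribution)."""
    averages = [r / d if d > 0 else 0.0 for r, d in zip(rewards, displays)]
    weights = [math.exp(a / temperature) for a in averages]
    total = sum(weights)
    return [w / total for w in weights]
```

Note that every variation keeps a nonzero share, which is why Softmax suits the "never reach 0%" use case in the table above.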
Table: Softmax pros and cons
| Pros | Cons |
| --- | --- |
| Good short-term performance: finds the best-performing variation quickly. Relatively easy to understand. | Poor long-term performance: accumulates much more regret than more complex algorithms. (Regret measures the difference between (a) the total reward you would have collected had you used the best arm throughout and (b) the total reward you actually collected. The smaller the latter compared to the former, the greater the regret, and the less successful the algorithm.) Not suitable for modifications whose variations have similar performance, as the algorithm ends up spending a lot of time exploring non-optimal variations. |
Thompson sampling
In the Frosmo Platform, the Thompson sampling algorithm calculates a beta distribution for each variation, randomly selects a number from that distribution for the variation, and assigns 100% distribution to the variation with the highest number. This variation is therefore the only variation displayed until the distribution is recalculated.
The beta distribution for a variation reflects the probability of that variation being the most successful variation.
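One common way to implement this selection is sketched below. Parameter choices such as the uniform Beta(1, 1) prior are textbook assumptions for illustration, not confirmed platform details:

```python
import random

def thompson_pick(rewards, displays, rng=random):
    """Draw one sample per variation from Beta(rewards + 1, failures + 1)
    and return the index of the variation with the highest draw."""
    samples = [
        rng.betavariate(r + 1, (d - r) + 1)
        for r, d in zip(rewards, displays)
    ]
    return samples.index(max(samples))
```

With lopsided data (for example, 90 rewards out of 100 displays versus 5 out of 100), the stronger variation wins nearly every draw, which matches the "100% distribution until the next recalculation" behavior described above.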
Table: Thompson sampling pros and cons
| Pros | Cons |
| --- | --- |
| Finds the best-performing variation really quickly. Once there's enough accumulated data, the algorithm exploits (as opposed to explores) almost all of the time. | If the best-performing variation starts to underperform and its rewards-to-displays ratio drops below that of another variation, it can take a long time for the algorithm to catch up and start displaying the new best-performing variation. |
UCB1 and UCB1-tuned
The upper confidence bound (UCB) algorithms are based on the principle of optimism in the face of uncertainty. In short, a UCB algorithm makes the most plausibly optimistic guess about the expected average reward of each arm and then selects the arm that promises the highest reward. Either the optimism is justified, in which case the algorithm continues to exploit the selected arm, or the optimism is not justified, in which case the algorithm eventually switches to explore other arms. The "optimism" is embodied by the upper confidence bound the algorithm continuously calculates for each arm: the optimistic guess about how well the arm will perform.
In the Frosmo Platform, the UCB algorithms assign 100% distribution to the variation with the highest upper confidence bound. This variation is therefore the only variation displayed until the distribution is recalculated.
The Frosmo Platform supports two UCB algorithms: UCB1 and UCB1-tuned. The algorithms differ only in that UCB1 calculates the upper confidence bound based on a variation's current rewards-to-displays ratio, whereas UCB1-tuned also considers how the ratio has varied in the past. UCB1-tuned can perform better in practice, but this cannot be theoretically guaranteed.
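The textbook UCB1 score adds an exploration bonus to each variation's rewards-to-displays ratio; the sketch below uses that standard formula, which may differ in detail from the platform's implementation:

```python
import math

def ucb1_pick(rewards, displays):
    """Return the index of the variation with the highest upper confidence
    bound: mean reward plus a bonus that grows with the total display count
    but shrinks as the variation itself accumulates displays."""
    total = sum(displays)
    scores = [
        (r / d) + math.sqrt(2 * math.log(total) / d)
        for r, d in zip(rewards, displays)
    ]
    return scores.index(max(scores))
```

For two variations with the same 0.5 rewards-to-displays ratio, the one with fewer displays carries more uncertainty, so its optimistic bound is higher and it gets picked. This is the "optimism in the face of uncertainty" principle in action.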
Table: UCB1 pros and cons
| Pros | Cons |
| --- | --- |
| Finds the best-performing variation really fast. Once there's enough accumulated data, the algorithm exploits (as opposed to explores) almost all of the time. May react to changes in variation performance faster than Thompson sampling. | May lose to Thompson sampling in overall performance. |
Further reading
General
 Multi-armed bandit (Wikipedia)
 Multi-armed bandit experiments (Google Analytics Help Center)
Algorithms
Algorithms: Softmax
 Boltzmann distribution (Wikipedia)
 Softmax function (Wikipedia)
Algorithms: Thompson sampling
 Analysis of Thompson Sampling for the multi-armed bandit problem (arXiv)
 Beta distribution (Wikipedia)
 Probability matching (Wikipedia)
 Thompson sampling (Wikipedia)
 Thompson Sampling: An Asymptotically Optimal Finite Time Analysis (arXiv)
Algorithms: UCB
 Confidence Bounds (ReliaWiki)
 Confidence interval (Wikipedia)
 Optimism in the Face of Uncertainty: the UCB1 Algorithm (Math n Programming)
 The Upper Confidence Bound Algorithm (Bandit Algorithms)