Multi-armed bandit optimization

This document describes the multi-armed bandit problem and its application in the Frosmo Platform.

What is the multi-armed bandit?

The multi-armed bandit is a modification case for automatically managing and optimizing variation distribution. Multi-armed bandit modifications continuously adjust their variation distribution based on how well the variations perform over time.

Performance is measured as either the click-through rate (CTR) or conversion rate (CR) of a variation. The higher the rate relative to the other variations of the modification, the more successful the variation, and the greater the likelihood that the variation is displayed to visitors: its distribution percentage increases relative to the percentages of the other variations. If the variation underperforms, its distribution percentage decreases, resulting in fewer visitors seeing the variation.

The multi-armed bandit automatically recalculates the variation distribution every two hours and every time the modification configuration changes. (The update interval is fixed, so you cannot change it.)

Multi-armed bandit problem

The multi-armed bandit problem is a problem in probability theory, which Wikipedia defines as follows:

a problem in which a fixed limited set of resources must be allocated between competing (alternative) choices in a way that maximizes their expected gain … The name comes from imagining a gambler at a row of slot machines (sometimes known as "one-armed bandits"), who has to decide which machines to play, how many times to play each machine and in which order to play them, and whether to continue with the current machine or try a different machine.

In the problem, each machine provides a random reward from a probability distribution specific to that machine. The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls. The crucial tradeoff the gambler faces at each trial is between "exploitation" of the machine that has the highest expected payoff and "exploration" to get more information about the expected payoffs of the other machines.

In the Frosmo Platform, the multi-armed bandit modification is the row of slot machines, the variations of the modification are the slot machines (or simply "arms"), and the rewards are either clicks or conversions generated by the variations. The Frosmo Platform is the agent that continuously tracks how well the variations perform (exploration) and, based on this data, dynamically adjusts the relative distribution of the variations to maximize the number of clicks or conversions for the modification (exploitation). To maintain exploration, the platform does not allow any variation, no matter how successful, to ever permanently dominate the distribution. However, if one variation is clearly more successful than all the others, the most successful variation is delivered practically all the time.

For more information about the multi-armed bandit problem, see Further reading.

Multi-armed bandit configuration

To configure a multi-armed bandit modification, you must first create two or more content variations for the modification and activate them. You can then select the reward (click, conversion after click, or conversion after display) to define how variation performance is measured. You also select the algorithm for optimizing the variation distribution.

The multi-armed bandit algorithm adjusts the variation distribution percentages for each variation over time. If the modification has an active comparison group variation, the multi-armed bandit has no effect on it, meaning a fixed percentage of all visitors continue to be excluded from seeing the modification, limiting the multi-armed bandit to the remaining portion of visitors (100% minus the comparison group size for the site).


When you activate a multi-armed bandit modification for the first time, each variation receives an equal distribution percentage until the algorithm has collected enough performance data to start optimizing the distribution.

For more information about configuring a multi-armed bandit modification, see Defining the content for a modification.

Multi-armed bandit and test mode

If you use test mode to test a multi-armed bandit modification, the testing produces usage data (displays, clicks, and conversions) for the multi-armed bandit algorithm. In other words, the testing affects the variation distribution calculated by the multi-armed bandit. If you do not want testing to impact variation distribution, you can duplicate the modification for public production use, or use a workspace.

Multi-armed bandit algorithms

The Frosmo Platform supports a number of different multi-armed bandit algorithms. Each algorithm represents a specific solution to the multi-armed bandit problem, that is, a specific mathematical strategy for maximizing rewards over time.

By default, modifications use the Thompson sampling algorithm. You can change the algorithm separately for each modification from the modification settings. You can change algorithms at any time. The change takes effect immediately, and the variation distribution for the modification is recalculated with the newly selected algorithm.


A modification can only use a single algorithm at a time.

For more information about the algorithms, see Supported multi-armed bandit algorithms and Further reading.

Multi-armed bandit use cases

Multi-armed bandit modifications are only useful when they have two or more active variations.

Use multi-armed bandit modifications to:

  • Ensure that, in the long run, the most successful variations are displayed most often. In other words, you want the variations to compete for display time.

  • Test how well different variations perform over time, and put the test results to use while the test is still ongoing (rather than wait until testing is complete before adjusting the distribution percentages).

  • Optimize the variation distribution of a modification for maximum gain without spending time or resources to run A/B testing. For example, if you're planning a short campaign on a tight schedule, running an A/B test in preparation for the campaign may not be feasible, but you can use the multi-armed bandit to optimize on-the-fly during the campaign.

  • React dynamically to visitors' changing interests or needs over time without manually adjusting the variation distribution.

Do not use multi-armed bandit modifications to:

  • Create solutions that require static variation distribution. For example, if you have fixed ratios for sharing display time between different products or product categories, you need to use static distribution percentages.

  • Use the most successful modification from a set of alternatives. The multi-armed bandit operates within a modification, on the variations of the same modification, not between modifications, so you cannot use it to compare modifications.

Here are some examples of potential multi-armed bandit modifications:

  • Banner with alternative on-sale products (one product per variation)

  • Banner with alternative visual styles, such as alternative colors or shapes (one visual style per variation)

  • Product recommendation with one variation for the most purchased products and another variation for the most viewed products

Supported multi-armed bandit algorithms

The Frosmo Platform supports the following multi-armed bandit algorithms:

  • Epsilon-greedy

  • Softmax

  • Thompson sampling

  • UCB1

  • UCB1-tuned

By default, modifications use the Thompson sampling algorithm. For instructions on how to change the algorithm for a modification, see Defining the content for a modification.

Which algorithm to use?

The multi-armed bandit problem has been notoriously difficult to solve. The algorithms all represent approximate solutions to the problem, and there is no single algorithm that always works better than the others in all situations. Which algorithm to use therefore depends on your modification and what you expect from it. Moreover, whether a given algorithm is the optimal choice for your modification is ultimately determined by real-world usage.

The following table gives you a starting point for choosing which algorithms to use in which use cases.

Use cases for multi-armed bandit algorithms (optimal algorithms listed in order of preference):

You expect the variations to perform consistently over time. The most successful variations will continue to perform the best, and the least successful variations will remain the least popular among visitors.

  1. Thompson sampling

  2. UCB1 or UCB1-tuned

You expect variation performance to fluctuate slowly over time. Variations will continuously lose and gain popularity among visitors.

  1. UCB1 or UCB1-tuned

  2. Epsilon-greedy or Softmax

You want each variation to always be displayed to at least some portion of visitors, irrespective of how well the variation performs. In other words, the distribution percentage of a variation can never reach 0.

  1. Epsilon-greedy or Softmax


Epsilon-greedy

In the Frosmo Platform, the epsilon-greedy algorithm assigns 90% of the distribution to the most successful variation (exploitation) and splits the remaining 10% equally between the other variations (exploration). If the most successful variation starts to underperform (its rewards-to-displays ratio drops), or if one of the other variations in the 10% distribution starts to perform really well (its rewards-to-displays ratio increases), a new winner may emerge and get the 90% distribution.


If the modification has an active comparison group variation, the algorithm works on the percentage of visitors outside the comparison group. For example, if the comparison group size is 10%, the algorithm assigns 81% distribution to the most successful variation and 9% distribution to the remaining variations since the comparison group reserves 10% of the total distribution.
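The epsilon-greedy behavior described above can be sketched as follows. This is a minimal illustration, not the platform's implementation: the function name and the rewards/displays inputs are hypothetical, and the 90/10 split and comparison-group scaling follow the figures given in this section.

```python
def epsilon_greedy_distribution(rewards, displays, epsilon=0.1,
                                comparison_group=0.0):
    """Give the variation with the best rewards-to-displays ratio
    (1 - epsilon) of the available share, and split epsilon equally
    between the other variations."""
    available = 1.0 - comparison_group  # share not reserved for the comparison group
    # Rewards-to-displays ratio per variation; undisplayed variations count as 0.
    ratios = [r / d if d > 0 else 0.0 for r, d in zip(rewards, displays)]
    best = max(range(len(ratios)), key=lambda i: ratios[i])
    others = len(ratios) - 1
    dist = [available * epsilon / others] * len(ratios)
    dist[best] = available * (1.0 - epsilon)
    return dist

# Example: 3 variations, 10% comparison group. Variation 0 has the best
# ratio, so it gets 81% of visitors; the other two split the remaining 9%.
distribution = epsilon_greedy_distribution([30, 10, 5], [100, 100, 100],
                                           comparison_group=0.10)
```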

Epsilon-greedy pros and cons

Pros:

  • Easy to understand

Cons:

  • Fixed distribution percentage for variations other than the best-performing one


Softmax

In mathematical terms, the Softmax algorithm assigns a probability to each arm using a Boltzmann distribution and, based on the probabilities, randomly selects the arm to exploit. The probability assigned to each arm is proportional to its average reward: the more successful the arm, the higher its probability.

In Frosmo terms, the algorithm assigns a distribution percentage to each variation based on how well the variation has performed in the past, that is, how its rewards-to-displays ratio ranks against the other variations. With each recalculation, the most successful variation up to that point gets the highest distribution percentage, while the least successful variation gets the lowest distribution percentage. Each variation is therefore continuously exploited (displayed), but the more successful ones are exploited more.


If the modification has an active comparison group variation, the total distribution available to the algorithm is 100% minus the comparison group size for the site.
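The Boltzmann weighting described above can be sketched like this. The function name is hypothetical, and the `temperature` parameter, which controls how strongly the shares favor the best-performing variation, is an assumption: the document does not state what value the platform uses.

```python
import math

def softmax_distribution(rewards, displays, temperature=0.1,
                         comparison_group=0.0):
    """Weight each variation by exp(average_reward / temperature) -- a
    Boltzmann distribution over the rewards-to-displays ratios -- and
    normalize the weights into distribution shares."""
    available = 1.0 - comparison_group
    averages = [r / d if d > 0 else 0.0 for r, d in zip(rewards, displays)]
    weights = [math.exp(a / temperature) for a in averages]
    total = sum(weights)
    return [available * w / total for w in weights]

# Every variation keeps a non-zero share, but the better the ratio,
# the larger the share.
distribution = softmax_distribution([30, 10, 5], [100, 100, 100])
```

Note that, unlike epsilon-greedy's fixed 90/10 split, every variation's share here varies continuously with its past performance.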

Softmax pros and cons

Pros:

  • Good short-term performance: finds the best-performing variation quickly

  • Relatively easy to understand

Cons:

  • Poor long-term performance: accumulates much more regret than more complex algorithms

    Regret measures the difference between (a) the total reward you would've collected had you used the best arm throughout and (b) the total reward you actually collected. The smaller the latter compared to the former, the greater the regret, and the less successful the algorithm.

  • Not suitable for modifications whose variations have similar performance, as the algorithm ends up spending a lot of time exploring non-optimal variations

Thompson sampling

In the Frosmo Platform, the Thompson sampling algorithm calculates a beta distribution for each variation, randomly selects a number from that distribution for the variation, and assigns 100% distribution to the variation with the highest number. This variation is therefore the only variation displayed until the distribution is recalculated.

The beta distribution for a variation reflects the probability of that variation being the most successful variation.


If the modification has an active comparison group variation, the selected variation gets a distribution equal to 100% minus the comparison group size for the site.
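The sampling step described above can be sketched as follows. The function name is hypothetical, and the Beta(successes + 1, failures + 1) parameterization assumes a uniform prior; the document does not specify the prior the platform uses.

```python
import random

def thompson_sampling_winner(rewards, displays):
    """Draw one sample from each variation's beta distribution and
    return the index of the variation with the highest sample; that
    variation gets the whole available distribution until the next
    recalculation."""
    samples = []
    for r, d in zip(rewards, displays):
        # Beta(successes + 1, failures + 1): a uniform prior (assumption).
        samples.append(random.betavariate(r + 1, (d - r) + 1))
    return max(range(len(samples)), key=lambda i: samples[i])

# With this data, variation 0 (ratio 0.30) wins almost every draw.
winner = thompson_sampling_winner([30, 10, 5], [100, 100, 100])
```

Because the draw is random, a less successful variation occasionally wins a recalculation, which is how the algorithm keeps exploring.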

Thompson sampling pros and cons

Pros:

  • Finds the best-performing variation really quickly

  • Once there's enough accumulated data, the algorithm exploits (as opposed to explores) almost all of the time

Cons:

  • If the best-performing variation starts to underperform and its rewards-to-displays ratio drops below that of another variation, it can take a long time for the algorithm to catch up and start displaying the new best-performing variation

UCB1 and UCB1-tuned

The upper confidence bound (UCB) algorithms are based on the principle of optimism in the face of uncertainty. In short, a UCB algorithm makes the most plausibly optimistic guess about the expected average reward of each arm and then selects the arm that promises the highest reward. Either the optimism is justified, in which case the algorithm continues to exploit the selected arm, or the optimism is not justified, in which case the algorithm eventually switches to explore other arms. The "optimism" is embodied by the upper confidence bound the algorithm continuously calculates for each arm: the optimistic guess about how well the arm will perform.

In the Frosmo Platform, the UCB algorithms assign 100% distribution to the variation with the highest upper confidence bound. This variation is therefore the only variation displayed until the distribution is recalculated.


If the modification has an active comparison group variation, the selected variation gets a distribution equal to 100% minus the comparison group size for the site.

The Frosmo Platform supports two UCB algorithms: UCB1 and UCB1-tuned. The algorithms differ only in that UCB1-tuned selects high-variance variations more often in the early stage when the algorithm does not yet have much data about how well the variations are performing. UCB1-tuned can perform better in practice, but this cannot be theoretically guaranteed.
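The selection step can be sketched with the standard UCB1 bound from the literature (average reward plus an exploration bonus). The function name is hypothetical, and the exact bonus formula the platform uses, as well as the UCB1-tuned variance term, is not given in this document.

```python
import math

def ucb1_winner(rewards, displays):
    """Compute the UCB1 upper confidence bound for each variation --
    average reward plus an exploration bonus that shrinks as the
    variation accumulates displays -- and return the index of the
    variation with the highest bound."""
    total = sum(displays)
    bounds = []
    for r, d in zip(rewards, displays):
        if d == 0:
            # Never-displayed variations get an unbounded score,
            # so each variation is tried at least once.
            bounds.append(float("inf"))
            continue
        bounds.append(r / d + math.sqrt(2.0 * math.log(total) / d))
    return max(range(len(bounds)), key=lambda i: bounds[i])

# With plenty of data the bonus is small, so the best average wins.
winner = ucb1_winner([300, 100, 50], [1000, 1000, 1000])
```

The bonus term embodies the "optimism in the face of uncertainty" principle: rarely displayed variations get a larger bonus, which forces the algorithm to revisit them.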

UCB1 pros and cons

Pros:

  • Finds the best-performing variation really fast

  • Once there's enough accumulated data, the algorithm exploits (as opposed to explores) almost all of the time

  • May react to changes in variation performance faster than Thompson sampling

Cons:

  • May lose to Thompson sampling in overall performance

Further reading

  • Algorithms: Softmax

  • Algorithms: Thompson sampling

  • Algorithms: UCB