This document describes the multi-armed bandit feature for modifications and shows you how to use it in the Frosmo Control Panel.

What is the multi-armed bandit?

The multi-armed bandit is a feature for automatically managing and optimizing variation distribution for modifications. When enabled for a given modification, the multi-armed bandit continuously adjusts variation distribution for that modification based on how well the variations perform over time.

Performance is measured as either the click-through rate (CTR) or conversion rate (CR) of a variation. The higher the rate relative to the other variations of the modification, the more successful the variation, and the greater the likelihood that the variation is displayed to visitors: its distribution percentage increases relative to the percentages of the other variations. If the variation underperforms, its distribution percentage decreases, resulting in fewer visitors seeing the variation.

No single variation, no matter how successful, will ever permanently dominate the distribution. Even the least successful variation will always be displayed to some fraction of visitors within a given time span.

The multi-armed bandit automatically recalculates the variation distribution every few hours and every time the modification configuration changes. (The update interval is fixed, so you cannot change it.)

Multi-armed bandit problem

The multi-armed bandit problem is a classic problem in probability theory, which Wikipedia defines as:

a problem in which a gambler at a row of slot machines (sometimes known as "one-armed bandits") has to decide which machines to play, how many times to play each machine and in which order to play them. When played, each machine provides a random reward from a probability distribution specific to that machine. The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls.

In more general terms:

The multi-armed bandit problem models an agent that simultaneously attempts to acquire new knowledge (called "exploration") and optimize their decisions based on existing knowledge (called "exploitation"). The agent attempts to balance these competing tasks in order to maximize their total value over the period of time considered.

In the Frosmo Platform, the modification is the row of slot machines, the variations of the modification are the slot machines (or simply "arms"), and the rewards are either clicks or conversions generated by the variations. The Frosmo Platform is the agent that continuously tracks how well the variations perform (exploration) and, based on this data, dynamically adjusts the relative distribution of the variations to maximize the number of clicks or conversions for the modification (exploitation). To maintain exploration, the platform does not allow any variation, no matter how successful, to ever permanently dominate the distribution.
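To make this concrete, here is a minimal Python sketch of the kind of per-variation statistics the algorithms described later in this document work from. The names and numbers are hypothetical and do not reflect the platform's actual data model:

# Hypothetical per-variation counters. Rewards are clicks or conversions,
# depending on the modification's reward setting.
variations = {
    "A": {"displays": 1000, "rewards": 120},
    "B": {"displays": 1000, "rewards": 90},
}

# The rewards-to-displays ratio measures performance, for example
# the click-through rate when rewards are clicks.
for name, stats in variations.items():
    print(name, stats["rewards"] / stats["displays"])
# A 0.12
# B 0.09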

For more information about the multi-armed bandit problem, see Further reading.

Multi-armed bandit configuration

You enable and configure the multi-armed bandit separately for each modification.

If enabled, the multi-armed bandit overrides the static variation distribution percentages for the modification. If the modification includes a comparison group, the multi-armed bandit has no effect on it, meaning 10% of visitors continue to be excluded from the variations, limiting the multi-armed bandit to the remaining 90% of visitors.

You can use the multi-armed bandit with any modification type.

Multi-armed bandit and test mode

If you use test mode to test a multi-armed bandit modification, the testing produces usage data (displays, clicks, and conversions) for the multi-armed bandit algorithm. In other words, the testing affects the variation distribution calculated by the multi-armed bandit. If you do not want testing to impact variation distribution, duplicate the modification for public production use.

For more information about test mode, see Testing modifications with test mode.

Multi-armed bandit and variation lock

Enabling the multi-armed bandit for a modification automatically disables variation lock for that modification. With the multi-armed bandit enabled, a visitor is not locked to the variation they see on their first page load. Instead, they are randomly assigned a new variation on each page load, with the most successful variations at that time having the highest probabilities of getting displayed.
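As an illustration, the following Python sketch shows what a weighted random pick on each page load might look like. The distribution percentages are hypothetical, and this is not the platform's actual selection code:

import random

def pick_variation(distribution):
    """Randomly pick a variation, weighted by its distribution percentage."""
    names = list(distribution)
    weights = list(distribution.values())
    return random.choices(names, weights=weights, k=1)[0]

# Example: the bandit currently favors variation A, but B and C
# still get displayed some of the time.
print(pick_variation({"A": 70, "B": 20, "C": 10}))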

For more information about variation lock, see the Palmdale release notes.

Multi-armed bandit algorithms

The Frosmo Platform supports a number of different multi-armed bandit algorithms. Each algorithm represents a specific solution to the multi-armed bandit problem, that is, a specific mathematical strategy for maximizing rewards over time.

By default, modifications use the Thompson sampling algorithm. You can change the algorithm for each modification separately in the modification settings, at any time. The change takes effect immediately, and the variation distribution for the modification is recalculated with the newly selected algorithm.

A modification can only use a single algorithm at a time.

For more information about the algorithms, see Supported multi-armed bandit algorithms and Further reading.

Multi-armed bandit use cases

The multi-armed bandit is only useful with modifications that have two or more active variations.

Use the multi-armed bandit when:

  • You want to ensure that, in the long run, the most successful variations are displayed most often. In other words, you want the variations to compete for display time.
  • You want to both test how well different variations perform over time and put the test results to use while the test is still ongoing (rather than wait until testing is complete before adjusting the distribution percentages).
  • You don't have the time or resources to run A/B testing on a modification, but you nonetheless want to optimize its variation distribution for maximum gain. For example, if you're planning a short campaign on a tight schedule, running an A/B test in preparation for the campaign may not be feasible, but you can use the multi-armed bandit to optimize on-the-fly during the campaign.
  • You want a modification to dynamically react to visitors' changing interests or needs over time.

Do not use the multi-armed bandit when:

  • You need static variation distribution. For example, if you have fixed ratios for sharing display time between different products or product categories, you need to use static distribution percentages.
  • You want to use the most successful modification from a set of alternatives. The multi-armed bandit operates within a modification, on the variations of the same modification, not between modifications, so you cannot use it to compare modifications.

Here are some examples of modifications that can benefit from the multi-armed bandit:

  • Banner with alternative on-sale products (one product per variation)
  • Banner with alternative visual styles, such as alternative colors or shapes (one visual style per variation)
  • Product recommendation with one variation for the most purchased products and another variation for the most viewed products

Enabling the multi-armed bandit for a modification

If you want to use the multi-armed bandit with multiple modifications, you must enable it separately for each modification.

To enable the multi-armed bandit for a modification:

  1. In the Control Panel, select Modifications > Overview.
  2. Find the modification, and click the modification name to open the modification settings.
  3. Select Content.
  4. If the modification has only one active variation, create or activate additional variations. For more information, see Working with modifications.
  5. Click Set variation distribution in any of the variations.
  6. Select Multi-armed bandit. The variation distribution percentages are grayed out to indicate that they are now automatically calculated. If Lock variations for visitors was selected, it is automatically deselected.

    Enabling the multi-armed bandit

  7. To customize how the multi-armed bandit works, click Multi-armed bandit settings.

    By default, the modification uses clicks for rewards and Thompson sampling as the multi-armed bandit algorithm.

  8. From the Multi-armed bandit reward options, select how you want to measure variation performance:
    • Click: Number of times the variation is clicked
    • Conversion (after click): Number of times clicking the variation leads to a conversion
    • Conversion (after display): Number of times displaying the variation leads to a conversion
  9. From the Multi-armed bandit algorithm options, select the algorithm you want to use for this modification. For more information about the algorithms, see Supported multi-armed bandit algorithms.

    If you're unsure which algorithm to use, select Thompson sampling.

    Configuring the multi-armed bandit

  10. Click Set. The modification summary information now shows Multi-armed bandit: On.

    Multi-armed bandit enabled

    You can view the current variation distribution percentages next to the Set variation distribution buttons.

    Initial variation distribution percentages

  11. To save the modification, click Save all.

You have enabled the multi-armed bandit for the modification. The Frosmo Platform now calculates variation distribution for the modification dynamically based on how well the variations perform over time.

When you first enable the multi-armed bandit for a modification, the variations all get equal distribution. There's no data available for the algorithm to work on, so there's no way to rank the variations and adjust their distribution. This is the case irrespective of the active algorithm. The multi-armed bandit starts adjusting the variation distribution once data starts to accumulate for all variations.
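For illustration, the initial equal split amounts to the following sketch, assuming three hypothetical variations:

def initial_distribution(names):
    """With no usage data yet, every variation gets an equal share."""
    share = 100 / len(names)
    return {name: share for name in names}

print(initial_distribution(["A", "B", "C"]))  # each variation gets ~33.3%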

Viewing the current variation distribution for a modification

To view the current variation distribution percentages for a modification:

  1. In the Control Panel, select Modifications > Overview.
  2. Find the modification, and click the modification name to open the modification settings.
  3. Select Content.
  4. Note the current percentages next to the Set variation distribution buttons. The distribution is recalculated every few hours.

    Current variation distribution percentages

Disabling the multi-armed bandit for a modification

To disable the multi-armed bandit for a modification:

  1. In the Control Panel, select Modifications > Overview.
  2. Find the modification, and click the modification name to open the modification settings.
  3. Select Content.
  4. Click Set variation distribution in any of the variations.
  5. Deselect Multi-armed bandit. The fixed variation distribution percentages are restored to active status, meaning you can now edit them manually.

    Disabling the multi-armed bandit leaves the distribution percentages at the latest values calculated by the multi-armed bandit. If one or more variations are at 0% distribution, you must change their percentages to a positive value and adjust the other variations so that the combined distribution of all the variations totals 100%.

    Disabling the multi-armed bandit

  6. If you want to enable variation lock for the modification, select Lock variations for visitors. For more information, see the Palmdale release notes.
  7. Click Set. The Multi-armed bandit: On status is removed from the modification summary information.
  8. To save the modification, click Save all.

You have disabled the multi-armed bandit for the modification. The Frosmo Platform now uses the static variation distribution percentages to randomize which variation is displayed to a visitor.

Supported multi-armed bandit algorithms

The Frosmo Platform supports the following multi-armed bandit algorithms:

  • Epsilon-greedy
  • Softmax
  • Thompson sampling
  • UCB1 and UCB1-tuned

By default, modifications use the Thompson sampling algorithm. For instructions on how to change the algorithm for a modification, see Enabling the multi-armed bandit for a modification.

Which algorithm to use?

The multi-armed bandit problem is notoriously difficult to solve. The supported algorithms all represent approximate solutions, and no single algorithm works best in all situations. Which algorithm to use therefore depends on your modification and what you expect from it. Moreover, whether a given algorithm is the optimal choice for your modification is ultimately determined by real-world usage.

The following table gives you a starting point for choosing which algorithms to use in which use cases.

Table: Use cases for multi-armed bandit algorithms

Use case: You expect the variations to perform consistently over time. The most successful variations will continue to perform the best, and the least successful variations will remain the least popular among visitors.
Optimal algorithms:
  1. Thompson sampling
  2. UCB1 or UCB1-tuned

Use case: You expect variation performance to fluctuate slowly over time. Variations will continuously lose and gain popularity among visitors.
Optimal algorithms:
  1. UCB1 or UCB1-tuned
  2. Epsilon-greedy or Softmax

Use case: You want each variation to always be displayed to at least some portion of visitors, irrespective of how well the variation performs. In other words, the distribution percentage of a variation can never reach 0.
Optimal algorithms: Epsilon-greedy or Softmax

Epsilon-greedy

In the Frosmo Platform, the epsilon-greedy algorithm assigns 90% distribution to the most successful variation (exploitation) and splits the remaining 10% equally between the other variations (exploration). If the most successful variation starts to underperform (its rewards-to-displays ratio drops), or if one of the other variations in the 10% distribution starts to perform really well (its rewards-to-displays ratio increases), a new winner may emerge and get the 90% distribution.

If the modification includes a comparison group, the algorithm instead assigns 81% distribution to the most successful variation and splits the remaining 9% equally between the other variations, since the comparison group reserves 10% of the total distribution.
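The following Python sketch illustrates the calculation described above, without a comparison group. The reward ratios are hypothetical, and the platform's actual implementation may differ:

def epsilon_greedy_distribution(ratios, epsilon=0.10):
    """Assign 1 - epsilon (90%) to the current best variation and split
    epsilon (10%) equally between the rest. ratios maps variation name
    to its rewards-to-displays ratio."""
    best = max(ratios, key=ratios.get)
    share = epsilon / (len(ratios) - 1)
    return {name: 1 - epsilon if name == best else share for name in ratios}

print(epsilon_greedy_distribution({"A": 0.12, "B": 0.08, "C": 0.05}))
# {'A': 0.9, 'B': 0.05, 'C': 0.05}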

Table: Epsilon-greedy pros and cons

Pros:
  • Easy to understand
Cons:
  • Fixed distribution percentage for variations other than the best-performing one

Softmax

In mathematical terms, the Softmax algorithm assigns a probability to each arm using a Boltzmann distribution and, based on the probabilities, randomly selects the arm to exploit. The probability assigned to each arm is proportional to its average reward: the more successful the arm, the higher its probability.

In Frosmo terms, the algorithm assigns a distribution percentage to each variation based on how well the variation has performed in the past, that is, how its rewards-to-displays ratio ranks against the other variations. With each recalculation, the most successful variation up to that point gets the highest distribution percentage, while the least successful variation gets the lowest distribution percentage. Each variation is therefore continuously exploited (displayed), but the more successful ones are exploited more.

If the modification includes a comparison group, the total distribution available to the algorithm is 90%.
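A minimal Python sketch of the idea follows. The temperature parameter and the reward ratios are assumptions for illustration; the platform's actual parameters are not documented here:

import math

def softmax_distribution(ratios, temperature=0.1):
    """Boltzmann distribution over reward ratios: every variation gets
    a non-zero share, and better performers get larger shares. A lower
    temperature concentrates the distribution on the best variation."""
    weights = {name: math.exp(r / temperature) for name, r in ratios.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# "A" gets the largest share, but "B" and "C" keep non-zero shares.
print(softmax_distribution({"A": 0.12, "B": 0.09, "C": 0.05}))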

Table: Softmax pros and cons

Pros:
  • Good short-term performance: finds the best-performing variation quickly
  • Relatively easy to understand
Cons:
  • Poor long-term performance: accumulates much more regret than more complex algorithms. Regret measures the difference between (a) the total reward you would've collected had you used the best arm throughout and (b) the total reward you actually collected. The smaller the latter compared to the former, the greater the regret, and the less successful the algorithm.
  • Not suitable for modifications whose variations have similar performance, as the algorithm ends up spending a lot of time exploring non-optimal variations

Thompson sampling

In the Frosmo Platform, the Thompson sampling algorithm calculates a beta distribution for each variation, randomly selects a number from that distribution for the variation, and assigns 100% distribution to the variation with the highest number. This variation is therefore the only variation displayed until the distribution is recalculated.

The beta distribution for a variation reflects the probability of that variation being the most successful variation.

If the modification includes a comparison group, the selected variation gets 90% distribution.
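A minimal Python sketch of one recalculation follows, assuming a uniform Beta(1, 1) prior (an assumption; the platform's actual prior is not documented here):

import random

def thompson_pick(stats):
    """stats maps variation name to (rewards, displays). Sample each
    variation's beta distribution and return the variation with the
    highest sample; it gets 100% distribution until the next round."""
    samples = {
        name: random.betavariate(1 + rewards, 1 + displays - rewards)
        for name, (rewards, displays) in stats.items()
    }
    return max(samples, key=samples.get)

# With this much data, "A" wins nearly every recalculation.
print(thompson_pick({"A": (120, 1000), "B": (90, 1000)}))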

Table: Thompson sampling pros and cons

Pros:
  • Finds the best-performing variation really quickly
  • Once there's enough accumulated data, the algorithm exploits (as opposed to explores) almost all of the time
Cons:
  • If the best-performing variation starts to underperform and its rewards-to-displays ratio drops below that of another variation, it can take a long time for the algorithm to catch up and start displaying the new best-performing variation

UCB1 and UCB1-tuned

The upper confidence bound (UCB) algorithms are based on the principle of optimism in the face of uncertainty. In short, a UCB algorithm makes the most plausibly optimistic guess about the expected average reward of each arm and then selects the arm that promises the highest reward. Either the optimism is justified, in which case the algorithm continues to exploit the selected arm, or the optimism is not justified, in which case the algorithm eventually switches to explore other arms. The "optimism" is embodied by the upper confidence bound the algorithm continuously calculates for each arm: the optimistic guess about how well the arm will perform.

In the Frosmo Platform, the UCB algorithms assign 100% distribution to the variation with the highest upper confidence bound. This variation is therefore the only variation displayed until the distribution is recalculated.

If the modification includes a comparison group, the selected variation gets 90% distribution.

The Frosmo Platform supports two UCB algorithms: UCB1 and UCB1-tuned. The algorithms differ only in that UCB1 calculates the upper confidence bound based on a variation's current rewards-to-displays ratio, whereas UCB1-tuned also considers how the ratio has varied in the past. UCB1-tuned can perform better in practice, but this cannot be theoretically guaranteed.
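The following Python sketch shows the standard UCB1 and UCB1-tuned bound calculations from the bandit literature. The display and reward counts are hypothetical, and the platform's exact formulas may differ:

import math

def ucb1(rewards, displays, total_displays):
    """UCB1: reward ratio plus an exploration bonus that shrinks as
    the variation accumulates displays."""
    mean = rewards / displays
    return mean + math.sqrt(2 * math.log(total_displays) / displays)

def ucb1_tuned(rewards, displays, total_displays):
    """UCB1-tuned: the bonus also reflects the observed variance of the
    reward ratio (mean * (1 - mean) for 0/1 rewards)."""
    mean = rewards / displays
    log_term = math.log(total_displays) / displays
    variance = mean * (1 - mean) + math.sqrt(2 * log_term)
    return mean + math.sqrt(log_term * min(0.25, variance))

# The variation with the highest bound gets 100% distribution.
stats = {"A": (120, 1000), "B": (90, 1000)}
total = sum(displays for _, displays in stats.values())
print(max(stats, key=lambda name: ucb1(*stats[name], total)))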

Table: UCB1 and UCB1-tuned pros and cons

Pros:
  • Finds the best-performing variation really quickly
  • Once there's enough accumulated data, the algorithm exploits (as opposed to explores) almost all of the time
  • May react to changes in variation performance faster than Thompson sampling
Cons:
  • May lose to Thompson sampling in overall performance

Further reading

General

Algorithms

Algorithms: Softmax

Algorithms: Thompson sampling

Algorithms: UCB
