Updated May 15, 2026
TL;DR: If your CMO wants proof that free-to-play (F2P) games drive revenue, engagement metrics will not save your budget. You need a rigorous holdout testing framework to measure incremental lift and multi-touch attribution connecting game interactions to First-Time Deposits (FTDs). Disconnected martech stacks create data lag that ruins test accuracy. Proving GGR contribution requires your CRM, CDP, and F2P mechanics on one data layer. Xtremepush connects them, reducing reconciliation work and improving attribution accuracy.
High engagement rates on your daily spin wheel might not translate to incremental revenue. Those players spinning the wheel might have deposited anyway. Without proper measurement, you cannot distinguish between engagement and actual revenue impact.
This framework gives you the holdout test designs, sample size calculations, and attribution models to connect F2P mechanics to actual GGR.
Engagement metrics like open rates and click-throughs tell you what players did, not whether your game caused them to deposit. You need metrics that capture causation, not correlation, and that trace the full player journey from first game interaction to GGR contribution.
Operators typically target LTV:CAC ratios that sustain profitable acquisition, with a common internal benchmark of 3:1 or higher, though actual targets vary by market and acquisition channel. If your F2P cohort delivers above that benchmark while your paid cohort sits below it, you have evidence that XP Gamify produces higher-quality players, not just more players. Stop reconciling exports across separate vendors. Xtremepush connects game events, campaign touches, and revenue outcomes on one data layer.
F2P games often convert players across multiple sessions. A player finds a spin wheel via a paid social ad, registers to claim their spin, receives a welcome email three days later, then deposits after a push notification during a live match. Last-touch attribution gives all the credit to that push notification, potentially undervaluing earlier touchpoints in the journey.
F2P mechanics can act as an assist channel: they may create early engagement events (registration), warm the player through gameplay, and reduce the friction of an eventual deposit. To measure this fairly, you need multi-touch attribution. It distributes credit across every touchpoint from the first game interaction through to the deposit, calculating each channel's fair share based on actual conversion contribution. See which touchpoints contributed to each FTD. Xtremepush's campaign attribution tools track every interaction across the player journey.
A weekend deposit match running alongside your prediction game launch creates an attribution mystery. Did the player deposit because of the prediction game, the deposit match, or both? Without structured isolation, you cannot defend either budget line.
A campaign-specific holdout group provides one solution. You exclude a segment from the F2P game entirely, giving you a baseline to measure the game's independent effect. Compare outcomes across groups: F2P only, promotion only, and neither. For multiple concurrent campaigns, assign separate holdout groups to each one rather than relying on a single universal control. A universal holdout exposed to other concurrent promotions cannot isolate individual campaign effects reliably. You do not need a separate analytics tool to manage group assignments. Xtremepush's split testing link supports variant testing structures.
VIP player cohorts are typically small, which creates a real statistical problem. Traditional A/B testing at 95% confidence requires thousands of data points to detect small effect sizes reliably. Three adjustments help you work within those constraints.
First, accept a larger minimum detectable effect (MDE), for example a 5% lift in GGR rather than 1%, because small cohorts can only reliably signal large effects. Second, consider lowering your confidence threshold from 95% to 90% for VIP tests and document that adjustment explicitly in your report. Third, consider a Bayesian approach: Bayesian models can give reliable results with small samples by incorporating prior behaviour data into the analysis, and they let you update conclusions incrementally as new data arrives rather than waiting for a fixed end date.
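The Bayesian approach can be sketched with a simple Beta-Binomial model: draw from each group's posterior and estimate the probability that the treatment rate genuinely exceeds the control rate. The cohort sizes and conversion counts below are invented for illustration.

```python
import random

def prob_treatment_beats_control(conv_t, n_t, conv_c, n_c,
                                 prior_a=1, prior_b=1, draws=100_000, seed=42):
    """Monte Carlo estimate of P(treatment rate > control rate) under
    Beta-Binomial posteriors. prior_a/prior_b can encode prior behaviour
    data; (1, 1) is the uninformative default."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        t = rng.betavariate(prior_a + conv_t, prior_b + n_t - conv_t)
        c = rng.betavariate(prior_a + conv_c, prior_b + n_c - conv_c)
        if t > c:
            wins += 1
    return wins / draws

# Hypothetical small VIP cohort: 12/150 treatment FTDs vs 6/150 control FTDs
p = prob_treatment_beats_control(12, 150, 6, 150)
print(f"P(treatment > control) = {p:.2f}")
```

Unlike a fixed-horizon z-test, this probability can be recomputed as each new conversion arrives, which is what makes the approach workable for small VIP cohorts.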
A holdout group is a segment of your audience completely excluded from a marketing treatment, in this case your F2P game. The difference in outcomes between the exposed group and the holdout group is your incremental lift: the revenue or conversion caused by the game, not by baseline player behaviour.
F2P holdout test setup checklist: choose your holdout structure, keep the holdout experience consistent, and preserve responsible gaming access for both groups.
You have two structural options. A campaign-specific holdout excludes a portion of your segment from a single game mechanic, letting you measure that mechanic's impact while controlling for other marketing activity. A universal holdout excludes a fixed segment from all marketing interventions for a defined period, giving you a clean baseline to measure the cumulative effect of your entire F2P programme.
Use the campaign-specific holdout to measure the impact of a specific spin wheel or scratch card launch. Use the universal holdout to prove the aggregate value of your F2P programme to the CFO. The in-game click tracking documentation covers event capture implementation across both group types.
Players in your holdout group should not experience a degraded product. If your spin wheel appears in the main navigation for treatment players but is conspicuously absent for holdout players, you introduce bias through an inconsistent experience. Where possible, design your holdout so absent mechanics are not obviously missing.
On the responsible gaming dimension, both groups must retain full access to spending limits, reality check reminders, and time-out features regardless of F2P exposure.
The incremental lift formula is:
Lift = (Treatment Conversion Rate - Control Conversion Rate) / Control Conversion Rate
Say your control group converts to FTD at 2% and your treatment group converts at 3.5%. Your lift is (3.5% - 2%) / 2% = 75%. In this hypothetical, the F2P mechanic made players 75% more likely to deposit than they would have been without it.
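The calculation from the worked example can be expressed as a one-line helper:

```python
def incremental_lift(treatment_rate, control_rate):
    """Lift = (treatment - control) / control, as a fraction."""
    return (treatment_rate - control_rate) / control_rate

# 3.5% treatment FTD conversion vs 2.0% control
lift = incremental_lift(0.035, 0.02)
print(f"{lift:.0%}")  # 75%
```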
Apply the same formula to GGR per player across both groups to calculate revenue-level lift, which is the number your CFO actually wants. Your lift calculation draws from one source of truth instead of reconciling outputs from a game vendor and a separate analytics platform. Xtremepush records all game events on the same data layer as your campaign and revenue events.
The required sample size depends on four inputs: your baseline conversion rate, your minimum detectable effect (MDE), your desired confidence level (typically 95%), and your statistical power (typically 80%).
At a 2% baseline conversion rate, you will typically need thousands of players per group depending on your MDE and confidence level. Use a sample size calculator to confirm the right number for your specific test parameters. The smaller the MDE you want to detect, the larger the sample you need. Choosing an MDE that is too small relative to your available segment size is one of the most common reasons F2P tests are inconclusive.
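To see why a 2% baseline demands thousands of players per group, here is the standard two-proportion z-test sample size formula in code; the 2% baseline and 1 percentage point MDE are illustrative inputs, not recommendations.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Players needed per group to detect a shift from p1 to p2
    with a two-sided two-proportion z-test."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 at alpha = 0.05
    z_b = NormalDist().inv_cdf(power)           # ~0.84 at power = 0.80
    p_bar = (p1 + p2) / 2
    num = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
           + z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return math.ceil(num / (p1 - p2) ** 2)

# 2% baseline conversion, MDE of 1 percentage point (to 3%)
n = sample_size_per_group(0.02, 0.03)
print(n)
```

Halving the MDE roughly quadruples the required sample, which is why an over-ambitious MDE against a fixed segment size so often produces an inconclusive test.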
A p-value of 0.05 is the widely accepted significance threshold in statistical testing. With an alpha level of 0.05, there is a 5% chance of making a Type I error (concluding the game is effective when it actually has no effect). This threshold is the standard starting point when presenting results to a CFO who will make budget decisions based on your findings.
Define your MDE in revenue terms before the test begins. A 0.5 percentage point improvement in FTD conversion sounds small, but at scale, small percentage improvements can represent meaningful revenue gains from a single mechanic launch. Anchoring your threshold to a business outcome prevents post-test rationalisation.
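As a rough sketch of that revenue anchoring (every figure below is an illustrative placeholder, not a benchmark):

```python
# Illustrative inputs; substitute your own operator figures.
monthly_registrations = 20_000
mde_pp = 0.005                 # 0.5 percentage point FTD conversion lift
avg_90d_ggr_per_ftd = 160      # hypothetical 90-day GGR per converted player

extra_ftds_per_month = monthly_registrations * mde_pp
annual_ggr_impact = extra_ftds_per_month * 12 * avg_90d_ggr_per_ftd
print(extra_ftds_per_month, annual_ggr_impact)  # 100.0 192000.0
```

Framed this way, the question for the test becomes "is a six-figure annual GGR impact worth detecting?", not "is 0.5 percentage points big enough to care about?".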
When your VIP segment contains fewer than 1,000 players, adjust these parameters: accept a larger MDE (for example 5% rather than 1%), lower your confidence threshold from 95% to 90% and document the change, and consider Bayesian methods that incorporate prior behaviour data.
The choice of attribution model determines whether F2P games look like heroes or irrelevant noise in your CMO report:
| Model | How it works | Pros | Cons | Best F2P use case |
|---|---|---|---|---|
| First-touch | 100% credit to first interaction | Shows top-funnel value clearly | Ignores all nurture touches | Games used primarily for acquisition with minimal nurture |
| Last-touch | 100% credit to final pre-FTD interaction | Easy to explain | Undervalues earlier touchpoints | Single-channel journeys where only one touchpoint exists before FTD |
| Linear | Equal credit across all touches | Balanced, avoids single-point bias | Treats all touches as equally important | Long, complex journeys with sustained engagement |
| Shapley value | Credit based on each touchpoint's actual contribution across all journey combinations | Mathematically fair, accounts for interaction effects | Requires more data and analytical resource | Complex multi-channel journeys where touchpoint order matters |
Last-touch attribution is the most commonly used model but often unsuitable for F2P mechanics. It gives the final push notification all the credit, leaving your spin wheel looking like a cost with no return.
A typical F2P acquisition journey looks like this:
Paid social ad click → Spin wheel play (registration event) → Welcome email → Push notification during live event → FTD
Each interaction leaves a data event. Xtremepush captures all of them on the same data layer, from ad campaign attribution through to the FTD event triggered via the PAM backend. When you run a Shapley model, every touchpoint draws from one source with no reconciliation lag.
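A Shapley model assigns each touchpoint its average marginal contribution across all journey combinations. A minimal sketch with two channels follows; the conversion rates per channel combination are invented for illustration, and a production model would estimate them from journey-level data.

```python
from itertools import combinations
from math import factorial

def shapley_credit(channels, value):
    """Exact Shapley value per channel. `value` maps a frozenset of
    channels to the observed conversion rate for journeys exposed to
    exactly that set of touchpoints."""
    n = len(channels)
    credit = {}
    for ch in channels:
        others = [c for c in channels if c != ch]
        total = 0.0
        for r in range(len(others) + 1):
            for subset in combinations(others, r):
                s = frozenset(subset)
                # Standard Shapley weight: |S|! (n - |S| - 1)! / n!
                weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
                total += weight * (value[s | {ch}] - value[s])
        credit[ch] = total
    return credit

# Illustrative conversion rates per exposed-channel combination
v = {
    frozenset(): 0.00,
    frozenset({"spin_wheel"}): 0.02,
    frozenset({"push"}): 0.01,
    frozenset({"spin_wheel", "push"}): 0.05,
}
credit = shapley_credit(["spin_wheel", "push"], v)
print(credit)
```

Note the interaction effect: the two channels together convert at 5%, more than the 3% their solo rates would suggest, and the Shapley split shares that uplift fairly instead of handing it all to the last touch.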
When F2P mechanics run through a third-party game vendor integrated via API, data sync delays can create gaps in this journey. Any player who plays the spin wheel and deposits within the same session may have their journey attributed incorrectly if the game event arrives in your CDP after the deposit event.
Xtremepush's campaign attribution dashboard connects game interaction events to campaign touches and revenue outcomes in one view. Rather than exporting from a game vendor, reconciling with email platform data, and then matching to your analytics tool, you see the full journey attribution report in one place.
See how Xtremepush's unified attribution dashboard connects F2P game interactions to FTDs and GGR. Book a demo to walk through your player journey data.
Major sporting events like the Cheltenham Festival, Super Bowl, and World Cup can create natural spikes in player acquisition and deposit behaviour. If your F2P game launches during one of these events and FTDs rise 40%, you need to control for the seasonal effect to isolate the game's contribution.
Your universal holdout group can help address this. It experiences the same seasonal environment as your treatment group but without the game, so the FTD rate difference between the two groups represents the game's incremental contribution with seasonality factored out. If you do not have a holdout in place before the event begins, isolating the game's impact from the seasonal effect becomes significantly more difficult.
For acquisition-focused mechanics like prediction games and spin wheels, FTD rate is a key conversion metric. Consider measuring two conversion events: anonymous-to-known (game play triggers registration) and registered-to-FTD. Both events should be captured at the player level with timestamps so you can attribute them to the game session that initiated the journey.
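A minimal shape for those player-level event records might look like the following; the field names are illustrative, not an Xtremepush schema.

```python
from datetime import datetime, timezone

# Hypothetical event records linking both conversion events back to the
# game session that initiated the journey.
events = [
    {"player_id": "p_123", "event": "registration",
     "source_session": "spinwheel_0417",
     "ts": datetime(2026, 4, 17, 19, 2, tzinfo=timezone.utc)},
    {"player_id": "p_123", "event": "ftd",
     "source_session": "spinwheel_0417",
     "ts": datetime(2026, 4, 20, 21, 45, tzinfo=timezone.utc)},
]

# Time from initiating game session to FTD
delta = events[1]["ts"] - events[0]["ts"]
print(delta.days)  # 3
```

Carrying the same `source_session` on both events is what lets you attribute the FTD to the game interaction rather than to whichever touch happened to land last.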
Track whether your game participants show higher Day-7 and Day-30 retention than bonus-driven paid cohorts. Higher retention can indicate the player's initial motivation was genuine interest in the game rather than a deposit incentive, which may produce more durable long-term engagement.
Your holdout group captures baseline organic conversion, the conversion rate you would see without any F2P intervention. Any conversion above that baseline in your treatment group is incremental lift attributable to the game. This is the critical distinction between proving causation and reporting correlation.
A player who deposits during your spin wheel promotion might have deposited anyway. Your holdout group tells you exactly how many players would have deposited without the game, and only the excess is yours to claim as F2P impact.
Put these metrics on your CMO's desk, in this order of priority: incremental lift in FTD conversion and GGR from your holdout test, LTV:CAC ratio for F2P-acquired players, and retention lift versus the holdout group.
Xtremepush's unified data layer aggregates game events, campaign interactions, and PAM backend revenue data into one reporting view. This reduces the reconciliation work across separate systems and improves attribution accuracy, though some data validation between sources remains necessary. Funstage (Greentube-Novomatic) increased customer LTV by 199.4% after consolidating their CRM and engagement onto Xtremepush.
Calculate your LTV:CAC ratio for F2P-acquired players using this structure: divide the 90-day LTV of players acquired through the game by the cost per FTD of acquiring them. A ratio of 3:1 or higher generally supports profitable acquisition, though your threshold will depend on market, channel, and margin expectations.
If your F2P CAC is £40 per FTD and the 90-day LTV of those players is £160, you achieve a 4:1 ratio. The gamification trends in LatAm session covers how operators in emerging markets are using this framing to justify F2P investment internally.
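The arithmetic from that example, codified:

```python
def ltv_cac_ratio(cac_per_ftd, ltv_90d):
    """LTV:CAC expressed as a multiple (4.0 means 4:1)."""
    return ltv_90d / cac_per_ftd

# £40 CAC per FTD, £160 90-day LTV
ratio = ltv_cac_ratio(40, 160)
print(f"{ratio:.0f}:1")  # 4:1
```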
Structure your retention comparison as a table in your CMO presentation:
| Cohort | Day-1 retention | Day-7 retention | Day-30 retention |
|---|---|---|---|
| Treatment group (F2P exposed) | % | % | % |
| Holdout group (no F2P exposure) | % | % | % |
| Incremental lift | pp difference | pp difference | pp difference |
Populate this table from your holdout test data. Meaningful improvements in Day-30 retention for the F2P cohort can translate to reduced reactivation spend and improved LTV:CAC, closing the loop between game investment and business outcome.
Batch processing can create timing issues. If your game events sync to your CDP in nightly batch updates, any player who plays the spin wheel and deposits within the same session may have their journey logged incorrectly. The game event may arrive in your CDP hours after the deposit event, making the deposit appear to have no prior game interaction and potentially disrupting your attribution model.
Real-time event processing can eliminate this contamination. The player journey is captured correctly regardless of how fast a player converts. Xtremepush processes game events and deposit events on one data layer in milliseconds. Kwiff doubled user numbers and increased retention while reducing manual campaign work from 100% to 50% of daily tasks after automating journey streams with Xtremepush. Real-time processing requires more complex infrastructure and operational overhead than batch systems, but it eliminates the data timing issues that corrupt attribution.
The most common F2P testing mistake is calling a winner too early because initial results look positive. Early variance in small samples can produce extreme-looking results that regress to the true mean as data accumulates, so findings that look decisive in week one often do not hold up at scale.
Set your test end date before the test begins and do not review results until you reach it. A Type I error, also known as a false positive, occurs when your test concludes a mechanic works when it actually does not. The risk increases when you review interim results early and act on them without adjusting your significance threshold accordingly. If operational pressure forces an early look, apply a more conservative significance threshold to account for the additional analysis.
Simpson's paradox occurs when a trend in aggregated data reverses or disappears when you segment by a confounding variable. In F2P testing, this happens when you mix VIP and casual players in one test cohort without segmentation. A game might show better aggregate conversion simply because more high-converting VIPs were randomly assigned to it, not because the mechanic is superior.
To prevent this, stratify your randomisation. Assign players to treatment and control groups separately within each player value tier (VIP, mid-value, casual), then combine results with appropriate statistical weighting. This ensures each tier is balanced across both groups and eliminates the confounding that creates misleading aggregate conclusions. The US sports market panel highlights how player segment differences across markets make this stratification particularly important for operators in multiple jurisdictions.
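The stratification step can be sketched as follows: randomise within each tier separately, then pool. Tier labels and cohort sizes are invented for illustration.

```python
import random

def stratified_assignment(players, tier_of, seed=7):
    """Randomise treatment/control separately within each value tier so
    VIP, mid-value, and casual players are balanced across both groups."""
    rng = random.Random(seed)
    groups = {"treatment": [], "control": []}
    tiers = {}
    for p in players:
        tiers.setdefault(tier_of(p), []).append(p)
    for tier_players in tiers.values():
        rng.shuffle(tier_players)          # randomise within the tier
        half = len(tier_players) // 2
        groups["treatment"] += tier_players[:half]
        groups["control"] += tier_players[half:]
    return groups

# Illustrative cohort: 10 VIPs among 100 players
players = [{"id": i, "tier": "vip" if i % 10 == 0 else "casual"}
           for i in range(100)]
g = stratified_assignment(players, lambda p: p["tier"])
```

Because each tier is split independently, neither group can end up VIP-heavy by chance, which removes the confound behind Simpson's paradox in the aggregate comparison.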
Calculate your total cost of ownership savings from replacing a standalone game vendor with XP Gamify. Book a demo to walk through the attribution numbers with our team.
Run for at least two full conversion cycles. For most iGaming operators, if your typical FTD window is seven days post-registration, a 14-day test captures two full cycles and accounts for day-of-week variance. Longer tests provide more reliable results.
There is no hard minimum, but when working with very small VIP cohorts consider using Bayesian methods rather than traditional hypothesis testing, accept a larger MDE, and consider lowering your confidence threshold to 90%. Document every adjustment explicitly in your CMO report to maintain credibility.
Incremental lift requires a concurrent holdout group, not a before-and-after comparison. Use the formula (Treatment Conversion Rate - Control Conversion Rate) / Control Conversion Rate, where the control group was never exposed to the game, to isolate the game's causal impact from organic behaviour and concurrent promotions.
Yes, but each parallel test increases the risk of interaction effects where a player in the spin wheel test also receives the scratch card, contaminating both results. Where possible, assign players to only one active test at a time, or accept that parallel tests measure the combined effect of both mechanics and plan your analysis accordingly.
Incremental lift: The additional conversion or revenue generated by a marketing treatment above what the control group produced organically, calculated as (Treatment Rate - Control Rate) / Control Rate and expressed as a percentage.
Statistical power: The probability that a test will detect a real effect when one exists, typically set at 80% in iGaming testing. Low power means you will miss real improvements in your F2P mechanics and incorrectly conclude they do not work.
Multi-touch attribution: A model that distributes conversion credit across every touchpoint in a player's journey rather than assigning it all to one interaction, particularly valuable for proving F2P game value when players touch multiple channels before depositing.
Holdout group: A segment of players excluded from a marketing treatment entirely, used as a baseline to measure the incremental effect of the treatment on the exposed group and the foundational mechanism for proving causation rather than correlation.
Type I error: A false positive in statistical testing, where your test concludes that an F2P mechanic works when it actually has no real effect. The risk is controlled by your significance threshold (p-value) and increases when tests are stopped early or interim results are reviewed without adjusting for multiple analyses.