Powering real-time insights from game data
SEGA, a global video game publisher, collects about 40,000 events per second, from over 450 event types, on Databricks across its 40M+ users. This massive set of data streams helps SEGA better understand player behavior, optimize gameplay experiences, and quickly identify issues that may arise during live operations. To efficiently process and transform this data, SEGA uses Lakeflow Spark Declarative Pipelines, ensuring that insights are delivered rapidly and at scale to support both current and future gaming innovations.
Managing complex data pipelines at scale
SEGA has been working with Databricks since 2019, using the Data Intelligence Platform to collect data from over 140 different games and 40 million players. Today, the company has petabytes of data stored in its data lakehouse. This data supports hundreds of business use cases, from tracking daily active users and retention rates to building machine learning models for game performance forecasting and recommendation engines. For example, the company uses game data to track key business metrics and balance game difficulty for a better player experience.
Development teams measure player engagement with new features to determine ROI and identify improvements. Game telemetry can also be enriched with partner data, market intelligence, and social media feedback to support analysis across the business. Meanwhile, the company’s CRM system manages millions of customer profiles with over 70 attributes each, enabling personalized communication based on game ownership, genre preferences and purchase history.
The challenge was that SEGA's existing data pipeline for game data couldn't keep pace with demand. Built in 2019 using Scala and Spark Structured Streaming, the pipeline had become difficult to maintain and modify. "Our team had limited familiarity with the data pipeline that was implemented several years ago," explains Felix Baker, Head of Data Services at SEGA Europe. "When modifications were required or issues arose, investigation could be time-consuming due to the complexity of the codebase."
The game data pipeline had a number of technical limitations. First, it lacked automation. When game studios added new events or changed event schemas, the team had to manually turn off streams, align schemas between bronze and silver layers, and coordinate updates, a time-consuming process. Second, monitoring was limited to basic pipeline health checks without visibility into game-specific usage patterns. Finally, the infrastructure ran at maximum capacity all the time to handle potential spikes during weekends or holidays, creating unnecessary costs. SEGA needed a more automated, cost-efficient streaming solution that could deliver real-time insights for faster business decisions.
Automating real-time game data with Lakeflow Spark Declarative Pipelines
SEGA's data pipeline transformation began after a conversation with the Databricks account team, which saw SEGA's streaming requirements as an ideal fit for Lakeflow Spark Declarative Pipelines (SDP) and a way to simplify operations while reducing costs. SEGA and Databricks worked together to build a proof of concept showing that Lakeflow SDP would yield significant cost savings and that its SQL interface would make development easier for the team.
Working with implementation partner Advancing Analytics, SEGA built on the initial proof of concept to create a production streaming pipeline. The system ingests data from Amazon Kinesis into a bronze layer, filtering out events from older games that don't need monitoring and encrypting data on the fly. With 40,000 events per second flowing through the system, the team implemented stream grouping to balance compute loads efficiently.
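As a rough sketch of what that ingestion step can look like, the example below uses the Python interface of Lakeflow Declarative Pipelines (the dlt module). SEGA's production pipeline is written against the SQL interface, and the stream name, AWS region, retired-game list, and JSON field names here are hypothetical placeholders, not details from SEGA's implementation.

```python
import dlt
from pyspark.sql.functions import col, get_json_object

# Hypothetical list of retired titles whose events are filtered out at ingestion.
RETIRED_GAME_IDS = ["legacy_title_01", "legacy_title_02"]

@dlt.table(
    name="bronze_game_events",
    comment="Raw game telemetry ingested from Amazon Kinesis.",
)
def bronze_game_events():
    raw = (
        spark.readStream.format("kinesis")
        .option("streamName", "game-telemetry")  # placeholder stream name
        .option("region", "eu-west-1")           # placeholder region
        .option("initialPosition", "latest")
        .load()
    )
    # Kinesis delivers binary payloads: decode to JSON strings, pull out the
    # game identifier, and drop events from retired games. Field-level
    # encryption of sensitive attributes would also happen at this stage.
    return (
        raw.selectExpr(
            "CAST(data AS STRING) AS payload",
            "approximateArrivalTimestamp AS arrival_ts",
        )
        .withColumn("game_id", get_json_object(col("payload"), "$.game_id"))
        .filter(~col("game_id").isin(RETIRED_GAME_IDS))
    )
```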
The pipeline merges legacy data from multiple sources: the existing platform, game studios, and historical data whose schemas changed across game updates. Lakeflow SDP then parses the raw JSON in the bronze layer into individual event tables in the silver layer while checking data quality. Poor-quality data goes to a quarantine zone instead of breaking the entire pipeline. "One thing we struggled with is if we have bad data, how do we handle it gracefully? Fortunately, Lakeflow SDP just does this for you," explains Craig Porteous, Associate Head of Data Engineering at Advancing Analytics. "We're no longer looking at this in a binary aspect of either failing the pipeline or allowing everything through. We've got much more control."
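The graceful handling Porteous describes corresponds to declarative data quality expectations paired with a quarantine table. The sketch below shows that general pattern in the Python dlt interface; the table names, JSON fields, and the specific rules (every event must carry an event type and a user ID) are illustrative assumptions rather than SEGA's actual checks.

```python
import dlt

# Hypothetical quality rules; a real pipeline would define rules per event type.
QUALITY_RULES = {
    "has_event_type": "event_type IS NOT NULL",
    "has_user_id": "user_id IS NOT NULL",
}

@dlt.view(name="parsed_events")
def parsed_events():
    # Pull fields out of the raw bronze payload (field names are placeholders).
    return dlt.read_stream("bronze_game_events").selectExpr(
        "get_json_object(payload, '$.event_type') AS event_type",
        "get_json_object(payload, '$.user_id') AS user_id",
        "payload",
        "arrival_ts",
    )

@dlt.table(name="silver_player_events", comment="Parsed, validated game events.")
@dlt.expect_all_or_drop(QUALITY_RULES)  # rows failing any rule are dropped here...
def silver_player_events():
    return dlt.read_stream("parsed_events")

@dlt.table(name="quarantine_player_events", comment="Events that failed validation.")
def quarantine_player_events():
    # ...and routed here instead, so one bad record never fails the whole pipeline.
    failed_any_rule = " OR ".join(f"NOT ({rule})" for rule in QUALITY_RULES.values())
    return dlt.read_stream("parsed_events").where(failed_any_rule)
```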
The stream grouping approach handles changes automatically. Large streams go in one group, smaller events in another. When new games launch with high event volumes, they're automatically categorized and balanced. As games age and player activity drops, events shift to smaller stream groups without requiring manual work. This continuous optimization means SEGA can better manage compute costs.
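One common way to express that kind of grouping in a declarative pipeline is to generate one table definition per stream group in a loop, so events can move between groups by changing configuration rather than code. The sketch below (Python dlt interface, with invented group names and event types) shows the general shape of that pattern, not SEGA's actual grouping logic.

```python
import dlt
from pyspark.sql.functions import col

# Hypothetical grouping: in practice, which event types land in the
# high-volume group would be derived from observed event rates, not hard-coded.
STREAM_GROUPS = {
    "high_volume": ["session_start", "match_result"],
    "low_volume": ["settings_changed", "achievement_unlocked"],
}

def make_group_table(group_name, event_types):
    @dlt.table(
        name=f"silver_events_{group_name}",
        comment=f"Events routed to the {group_name} stream group.",
    )
    def group_table():
        return (
            dlt.read_stream("silver_player_events")
            .where(col("event_type").isin(event_types))
        )
    return group_table

# One declarative table per group; the pipeline works out execution order itself.
for group, events in STREAM_GROUPS.items():
    make_group_table(group, events)
```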
In addition to eliminating several manual processes, Databricks Lakeflow delivered other advantages. Data lineage through Unity Catalog lets the team easily see where data comes from and where it goes. Automation handles scheduling, dependency resolution, retries, and scaling without custom code. SEGA defines the transformations, and Lakeflow SDP automatically determines the execution order and keeps tables updated.
Real-time insights transform game development and player engagement
SEGA now runs a complete Lakeflow SDP pipeline for all data streams from its game titles, fundamentally changing how the company manages game data. The system can be orchestrated and developed in SQL, making it far easier for the team to build and troubleshoot than the original Scala-based pipeline.
The automation has eliminated much of the team’s ongoing maintenance overhead. “We literally don’t really think about the pipelines anymore. If the studio adds a new event or series of events, Lakeflow SDP just handles it.”
The pipeline scales up during high-traffic evening periods and scales down overnight when fewer players are active, cutting costs by roughly 4x. “It’s been a real game changer,” says Baker.
Perhaps most critically, data latency has dropped to under five minutes from the moment data leaves the game to the moment it becomes actionable, enabling highly responsive decision-making. These real-time capabilities have transformed how SEGA understands player behavior and develops games. “We can track player activity across all platforms and measure returning players from previous game iterations, enabling comparison of retention rates across different titles,” explains Baker. “Developers get the granular data they need for game balancing to ensure players have an optimal gaming experience, while marketing teams leverage player cohort analysis to gain deeper customer insights and tailor content or offers based on gameplay behavior or purchase patterns.”
Ultimately, faster access to quality data enables quicker business decisions, allowing SEGA to react rapidly to changes in player sentiment or game balancing issues. This responsiveness leads to better gaming experiences. “We know our players better now because we understand how they play and what games they like,” says Baker. “That leads to more personalized communication and a better relationship between the studio and the players.”
