2x lower infrastructure costs
35 billion+ events per day processed for analytics
5x faster data pipelines
Enabling machines to “see and understand” requires processing a tremendous amount of data in near real-time. GumGum’s advertising business teams, for example, need to extract real-time insights to make crucial business decisions for successful ad delivery and performance.
The ongoing hunt for data insights that can improve a brand’s audience-buying strategy, combined with a rapidly changing programmatic advertising landscape, makes it hard to provide accurate real-time bidstream analytics. Operating at a scale of more than 35 billion bidstream data points per day, GumGum’s business teams are constantly looking for real-time bidstream insights to increase campaign performance.
Of course, processing a dataset of that size every day across more than 100 data pipelines running 24×7 isn’t an easy proposition, especially without the right technology under the hood. Prior to Databricks, GumGum struggled to manage compute-intensive workloads on AWS EMR and Apache Storm. As a result, the team could not auto-scale, allocate spot instances, perform quick data analytics, or collaborate across teams.
“We’re in the digital advertising business which means our traffic trends vary vastly from Q1 to Q4, so we needed to be able to scale up and down effortlessly,” said Rashmina Menon, Senior Data Engineer at GumGum. “And apart from the different traffic trends, we’re growing as a company, the data is exploding, and we need tools that can grow with us.”
For GumGum’s data engineering team, that type of growth meant they needed a tool that would make it easy to access data and build ETL pipelines. The data science team needed to scale data exploration and model training, and the data analysts required access to timely business insights.
They needed a single platform that would foster collaboration across the board, raising productivity and shortening time-to-value.
Since implementing Databricks, GumGum has quickly transitioned its distributed data processing workloads from Apache Storm and AWS EMR to Spark Streaming on fully managed compute clusters that auto-scale as needed. With Delta Lake, the team has achieved reliable, cost-effective, and fast query performance, and spot instances are now used on all workloads to improve operational efficiency and reduce costs.
“Databricks with Delta Lake has not only enabled us to achieve faster query performance, but we are also able to make the entire project more cost-effective,” added Jatinder Assi, a Data Engineering Manager at GumGum. “With Delta Caching, our data is entirely stored on the disk which frees the memory for map-reduce operations.”
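To make the auto-scaling, spot-backed clusters described above more concrete, here is a minimal sketch of what such a cluster definition might look like. It only builds and prints a request body shaped like the Databricks Clusters API (`autoscale`, `aws_attributes`). The cluster name, worker counts, and placeholder runtime and node type are illustrative assumptions, not GumGum’s actual configuration, which the source does not describe.

```python
import json


def build_cluster_spec(name, min_workers, max_workers, spot_bid_price_percent=100):
    """Build a cluster definition shaped like a Databricks Clusters API body.

    All values here are hypothetical; the helper itself is not part of any SDK.
    """
    # Local sanity check (an assumption of this sketch, not an API rule).
    if not 1 <= min_workers <= max_workers:
        raise ValueError("expected 1 <= min_workers <= max_workers")
    return {
        "cluster_name": name,
        "spark_version": "<runtime-version>",  # placeholder, fill in for your workspace
        "node_type_id": "<node-type>",         # placeholder, fill in for your workspace
        # Let the platform add or remove workers between these bounds.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        "aws_attributes": {
            # Keep the first node (the driver) on-demand; the rest may be spot.
            "first_on_demand": 1,
            # Use spot instances, falling back to on-demand if spot is unavailable.
            "availability": "SPOT_WITH_FALLBACK",
            "spot_bid_price_percent": spot_bid_price_percent,
        },
    }


spec = build_cluster_spec("bidstream-streaming", min_workers=2, max_workers=20)
print(json.dumps(spec, indent=2))
```

Keeping the bounds as parameters reflects the seasonal traffic swings Menon describes: the same definition can be widened for peak quarters and narrowed for quieter ones without changing the pipeline code.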
The data then flows downstream to the data scientists and analysts. The data scientists use MLflow to streamline model management, including packaging code for reusability and creating a multi-step model evaluation pipeline for improved deployments. The analyst team uses the Looker integration for business intelligence to make smarter decisions.
As for collaboration, Databricks interactive notebooks have brought the team together in a big way. Team members can now share analytics work and results across data functions, and support for multiple languages (Scala, Python, SQL, R) has brought diverse workloads and users, including data engineers, analysts, and data scientists, on board.
The Databricks unified data analytics platform with Delta Lake has allowed the GumGum team to accelerate their data processing and reporting capabilities by 5x while reducing infrastructure costs by 2x. “Today our ad inventory forecasting application now boasts response times of fewer than 30 secs,” said Menon. “And we are delivering this performance in a cost optimized manner.”
With Databricks making QA, release, and deployment cycles not only faster but also more cost-efficient than their previous pipelines on EMR, the GumGum data team is confident in its ability to drive the business forward.
“Databricks has given us the perfect balance of cost optimization and query performance,” concluded Assi. “It has allowed us to build and deploy a data-intensive workload faster and more cost-effectively, which in turn allows our sellers and our key business stakeholders to drive business in a data-driven fashion.”