SESSION

Rapid PySpark Custom Processing on Time Series Big Data in Databricks

OVERVIEW

EXPERIENCE: In Person
TYPE: Breakout
TRACK: Data Science and Machine Learning
INDUSTRY: Health and Life Sciences, Retail and CPG - Food
TECHNOLOGIES: Apache Spark, Delta Lake
SKILL LEVEL: Advanced
DURATION: 40 min

Sleep Number Smartbeds were equipped with sensors underneath each leg to generate personalized sleeper insights from the weight in the bed. The raw readings were inherently noisy because of movement and position in bed, so an intricate quality assessment was needed to select stable, low-entropy segments. To compute this at a granular level, a custom user-defined function for entropy was applied to rolling windows of time series big data. The initial Pandas implementation could not meet the memory and time constraints, so the operation was reimplemented in PySpark on Databricks. An efficient method and a brute-force method were compared at varying data sizes and cluster configurations. The recommended PySpark method processed 50 million records in nearly 0.3 seconds in Databricks, and it performed the complex custom calculations on rolling windows of time series big data in constant time, irrespective of data size.
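
The rolling-window entropy idea described above can be illustrated with a minimal, single-machine sketch. This is not the speakers' implementation: the abstract does not define the entropy measure, so Shannon entropy over binned readings is assumed here, and the names `shannon_entropy`, `rolling_entropy`, `bin_width`, the threshold of 0.5 bits, and the sample readings are all hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values, bin_width=1.0):
    """Shannon entropy in bits of `values` after binning (assumed measure)."""
    if not values:
        return 0.0
    # Bin each reading so that small sensor jitter lands in the same bucket.
    counts = Counter(math.floor(v / bin_width) for v in values)
    n = len(values)
    # H = sum over bins of p * log2(1 / p), with p = count / n.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def rolling_entropy(series, window, bin_width=1.0):
    """Entropy of every full trailing window of length `window`."""
    return [
        shannon_entropy(series[i - window + 1 : i + 1], bin_width)
        for i in range(window - 1, len(series))
    ]


# Hypothetical weight readings: a stable stretch, then movement in bed.
readings = [150.0, 150.2, 150.1, 150.3, 162.0, 141.5, 155.8, 150.1, 150.0, 150.2]
entropies = rolling_entropy(readings, window=4, bin_width=1.0)

# Windows below an (assumed) threshold are treated as stable segments.
stable_windows = [i for i, h in enumerate(entropies) if h < 0.5]
```

In this toy run only the first window, which covers the four steady readings, falls below the threshold. On a Spark cluster, one possible way to scale the same idea would be to gather each window's readings with `collect_list` over a `Window` ordered by timestamp with `rowsBetween` bounds and apply the entropy function to each resulting array; whether that matches the method recommended in the session is not stated in the abstract.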

SESSION SPEAKERS

Megha Rajam Rao

Research Scientist
Sleep Number

Gary Garcia Molina

Senior Principal Scientist
Sleep Number