Omkar Joshi is a senior software engineer on Uber’s Hadoop platform team, where he’s architecting Marmaray. Omkar has a keen interest in solving large-scale distributed systems problems. Previously, he led object store and NFS solutions at Hedvig and was an initial contributor to Hadoop’s YARN scheduler.
June 24, 2020 05:00 PM PT
Omkar Joshi offers an overview of how performance challenges were addressed at Uber while rolling out Marmaray, its newly built, open-sourced flagship ingestion system for ingesting data from sources such as Kafka, MySQL, Cassandra, and Hadoop. The system has been running in production for over a year, with more ingestion systems onboarded on top of it. Omkar and team made heavy use of jvm-profiler during their analysis to gain valuable insights. The system is built on the Spark framework and is designed to ingest billions of Kafka messages per topic from thousands of topics every 30 minutes. The pipeline handles data on the order of hundreds of terabytes, a scale at which every byte and millisecond saved counts. Omkar details how to tackle such problems and shares insights into the optimizations already made in production.
Some key highlights are:
The team used a variety of techniques to reduce memory footprint, runtime, and on-disk usage of the running applications, achieving significant savings of roughly 10%–40% in each.
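As a rough illustration of the kind of setup described above, the sketch below shows a Spark batch job reading a Kafka topic with a JVM profiling agent attached to each executor. It is a minimal, hypothetical example rather than Marmaray's actual code: the profiler jar path, broker address, topic name, and output path are placeholders, and the Kafka read requires the spark-sql-kafka connector on the classpath.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object ProfiledIngestionJobSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical location of a JVM profiler agent jar (e.g., jvm-profiler),
    // assumed to be available on every executor host or shipped via --jars.
    val profilerJar = "/opt/profiler/jvm-profiler.jar"

    val conf = new SparkConf()
      .setAppName("kafka-ingestion-with-profiling")
      // Attach the profiler as a -javaagent to each executor JVM so memory and
      // CPU metrics are collected while the ingestion job runs. A driver-side
      // agent would typically be passed on the spark-submit command line.
      .set("spark.executor.extraJavaOptions", s"-javaagent:$profilerJar")

    val spark = SparkSession.builder().config(conf).getOrCreate()

    // Placeholder ingestion logic: batch-read one Kafka topic and persist it.
    // Broker, topic, and output path below are illustrative values only.
    val batch = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka-broker:9092")
      .option("subscribe", "example-topic")
      .load()

    batch.write.format("parquet").save("/tmp/ingested/example-topic")

    spark.stop()
  }
}
```

With the agent attached this way, per-executor memory and CPU profiles can be compared across runs, which is the kind of measurement that guided the memory, runtime, and disk savings highlighted above.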