Shankar leads Engineering at Recko. Recko has built a Financial Operations Platform and provides data reconciliation as a service to modern internet companies. In a career spanning 20+ years, he has built a variety of distributed systems. Before Recko, he led Financial Engineering at Flipkart, the leading e-commerce company in India; this work was developed during his stint there. At LinkedIn, he led optimization and productivity improvements for their Hadoop platform. Before that, he was with Microsoft, where he helped build the middle-tier platform for Bing Search and a highly successful distributed test automation system for Windows clusters. Much of his recent work has been presented at major industry conferences such as Kafka Summit, Spark Summit, and DataWorks Summit, and at many big data meetups in Bangalore and the Bay Area.
May 27, 2021 11:35 AM PT
Availability of high-quality data is central to the success of any organization in the current era. As every organization ramps up its collection and storage of data, the data's usefulness largely depends on confidence in its quality. In the Financial Data Engineering team at Flipkart, where the bar for data quality is 100% correctness and completeness, this problem takes on a wholly different dimension. Currently, countless data analysts and engineers hunt for issues in the financial data to keep it that way. We wanted an approach that is less manual, more scalable, and cost-effective.
As we evaluated various solutions available in the public domain, we found quite a few gaps.
In this presentation, we discuss how we developed a comprehensive data quality framework. We built it on the assumption that the people interested in and involved in fixing these issues are not necessarily data engineers, so the framework is largely config-driven, with pluggable logic for categorisation and cleaning. We will then talk about how it helped us fix data quality issues at scale and reduced many of the repeated issues.
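To make the "config-driven with pluggable logic" idea concrete, here is a minimal sketch of such an engine. This is purely illustrative and not the Flipkart framework's actual API: the rule names, fixer names, and `run_checks` helper are all hypothetical. The point is that a non-engineer can add a row to the config without touching the engine code.

```python
# Hypothetical sketch of a config-driven data quality engine.
# Checks and fixers are pluggable; rules are plain config entries.

CHECKS = {
    "not_null": lambda v: v is not None and v != "",
    "positive_amount": lambda v: isinstance(v, (int, float)) and v >= 0,
}

FIXERS = {
    "default_zero": lambda v: 0,
    "strip": lambda v: v.strip() if isinstance(v, str) else v,
}

# Config, not code: extending coverage means adding a rule here.
CONFIG = [
    {"field": "order_id", "check": "not_null"},
    {"field": "amount", "check": "positive_amount", "fix": "default_zero"},
]

def run_checks(record, config=CONFIG):
    """Return (cleaned_record, issues) for one record."""
    issues = []
    cleaned = dict(record)
    for rule in config:
        check = CHECKS[rule["check"]]
        field = rule["field"]
        if not check(cleaned.get(field)):
            issues.append((field, rule["check"]))
            if "fix" in rule:  # apply the pluggable cleaning logic
                cleaned[field] = FIXERS[rule["fix"]](cleaned.get(field))
    return cleaned, issues

cleaned, issues = run_checks({"order_id": "O1", "amount": -5})
# issues -> [("amount", "positive_amount")]; cleaned["amount"] -> 0
```

A real system would of course load the config from files, run against datasets rather than single records, and route issues to categorisation pipelines, but the separation of engine, pluggable logic, and config is the core of the design.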
October 24, 2017 05:00 PM PT
As a Spark developer, do you want to develop your Spark workflows quickly? Do you want to test your workflows in a sandboxed environment similar to production? Do you want to write end-to-end tests for your workflows and add assertions on top of them?

In just a few years, the number of users writing Spark jobs at LinkedIn has grown from tens to hundreds, and the number of jobs running every day has grown from hundreds to thousands. With the ever-increasing number of users and jobs, it becomes crucial to reduce the development time for these jobs. It is also important to test these jobs thoroughly before they go to production. Currently, there is no way for users to test their Spark jobs end-to-end; the only option is to divide a job into functions and unit test those functions.

We've tried to address these issues by creating a testing framework for Spark workflows. The testing framework enables users to run their jobs in an environment similar to production and on data sampled from the original data. It consists of a test deployment system, a data generation pipeline to produce the sampled data, a data management system to help users manage and search the sampled data, and an assertion engine to validate the test output. In this talk, we will discuss the motivation behind the testing framework before deep-diving into its design. We will further discuss how the testing framework is helping Spark users at LinkedIn be more productive.
Session hashtag: #EUde12