Karol Sawicz is an IT Business Analyst at RB, where he is involved in various data projects with responsibilities ranging from building full end to end reporting solutions as well as project managing rollouts of global reporting platforms. Karol holds a B.Eng in Computer Science from the Polish-Japanese Academy of Information Technology.
May 26, 2021 12:05 PM PT
Reckitt is a fast-moving consumer goods company with a portfolio of famous brands and over 30k employees worldwide. With that scale small projects can quickly grow into big datasets, and processing and cleaning all that data can become a challenge. To solve that challenge we have created a metadata driven ETL framework for orchestrating data transformations through parametrised SQL scripts. It allows us to create various paths for our data as well as easily version control them. The approach of standardising incoming datasets and creating reusable SQL processes has proven to be a winning formula. It has helped simplify complicated landing/stage/merge processes and allowed them to be self-documenting.
But this is only half the battle, we also want to create data products. Documented, quality assured data sets that are intuitive to use. As we move to a CI/CD approach, increasing the frequency of deployments, the demand of keeping documentation and data quality assessments up to date becomes increasingly challenging. To solve this problem, we have expanded our ETL framework to include SQL processes that automate data quality activities. Using the Hive metastore as a starting point, we have leveraged this framework to automate the maintenance of a data dictionary and reduce documenting, model refinement, testing data quality and filtering out bad data to a box filling exercise. In this talk we discuss our approach to maintaining high quality data products and share examples of how we automate data quality processes.