How to Save up to 50 Percent on Azure ETL While Improving Data Quality
January 21, 2021 in Partners
The challenges of data quality
One of the most common issues our customers face is maintaining high data quality standards, especially as they rapidly increase the volume of data they process, analyze and publish. Data validation, data transformation and de-identification can be complex and time-consuming. As data volumes grow and new downstream use cases and applications emerge, expectations of timely delivery of high-quality data rise, which makes fast and reliable data transformation, validation, de-duplication and error correction more important. Over time, a widening variety of data sources and types adds processing overhead and increases the risk of errors being introduced into a growing number of data pipelines as streaming and batch data are merged, validated and analyzed.
City-scale data processing
The City of Spokane, located in Washington state, is committed to providing information that promotes government transparency and accountability and understands firsthand the challenges of data quality. The City of Spokane deals with an enormous amount of critical data that is required for many of its operations, including financial reports, city council meeting agendas and minutes, issued and pending permits, as well as map and Geographic Information System (GIS) data for road construction, crime reporting and snow removal. With their legacy architecture, it was nearly impossible to obtain operational analytics and real-time reports. They needed a method of publishing and disseminating city datasets from various sources for analytics and reporting purposes through a central location that could efficiently process data to ensure data consistency and quality.
How the City of Spokane improved data quality while lowering costs
To abstract their entire ETL process and achieve consistent data through data quality and master data management services, the City of Spokane leveraged DQLabs and Azure Databricks. They merged a variety of data sources, removed duplicate data and curated the data in Azure Data Lake Storage (ADLS).
“Transparency and accountability are high priorities for the City of Spokane,” said Eric Finch, Chief Innovation and Technology Officer, City of Spokane. “DQLabs and Azure Databricks enable us to deliver a consistent source of cleansed data to address concerns for high-risk populations and to improve public safety and community planning.”
Using this joint solution, the City of Spokane increased government transparency and accountability and can provide citizens with information that encourages and invites public participation and feedback. With the integrated golden record view, datasets became easily accessible for reporting and analytics. The result was an 80% reduction in duplicates, significantly improving data quality. With DQLabs and Azure Databricks, the City of Spokane also achieved a 50% lower total cost of ownership (TCO) by reducing the amount of manual labor required to classify, organize, de-identify, de-duplicate and correct incoming data, and by lowering the cost of maintaining and operating its information systems as data volumes increase.
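To make the idea of merging sources and removing duplicates into a golden record more concrete, here is a minimal sketch in plain Python. It is not DQLabs' implementation: the matching rule (normalized name plus phone digits), the field names and the sample records are all assumptions made for illustration.

```python
def normalize_key(record):
    """Build a matching key from name and phone (hypothetical matching rule)."""
    name = " ".join(record.get("name", "").lower().split())
    phone = "".join(ch for ch in record.get("phone", "") if ch.isdigit())
    return (name, phone)


def build_golden_records(records):
    """Merge records that share a key; the first non-empty value per field wins."""
    golden = {}
    for record in records:
        merged = golden.setdefault(normalize_key(record), {})
        for field, value in record.items():
            # Only fill fields that are still missing or empty.
            if value and not merged.get(field):
                merged[field] = value
    return list(golden.values())


# Hypothetical records from two source systems describing the same person.
permits = [{"name": "Jane  Doe", "phone": "(509) 555-0100", "address": ""}]
gis = [
    {"name": "jane doe", "phone": "509-555-0100", "address": "1 Main St"},
    {"name": "John Roe", "phone": "509-555-0101", "address": "2 Oak Ave"},
]

golden = build_golden_records(permits + gis)
```

In this sketch the two "Jane Doe" rows collapse into one record, and the empty address from the first source is filled in from the second. A production system would typically use fuzzier matching and explicit survivorship rules.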
How DQLabs leverages Azure Databricks to improve data quality
“DQLabs is an augmented data quality platform, helping organizations manage data smarter,” said Raj Joseph, CEO, DQLabs. “With over two decades of experience in data and data science solutions and products, what I find is that organizations struggle a lot in terms of consolidating data from different locations. Data is commonly stored in different forms and locations, such as PDFs, databases, and other file types scattered across a variety of locations such as on-premises systems, cloud APIs, and third-party systems.”
Helping customers make sense of their data and answer even simple questions such as “is it good?” or “is it bad?” is far more complicated than organizations ever anticipated. To solve these challenges, DQLabs built an augmented data quality platform. DQLabs helped the City of Spokane create an automated cloud data architecture using Azure Databricks to process a wide variety of data formats, including JSON and relational databases. They first leveraged Azure Data Factory (ADF) with DQLabs’ built-in data integration tools to connect the various data sources and orchestrate data ingestion at different velocities, for both full and incremental updates.
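The difference between full and incremental updates can be illustrated with a small watermark-based sketch. This is plain Python rather than an ADF pipeline, and the `modified_at` column, the `id` key and the sample permit rows are assumptions for illustration only.

```python
from datetime import datetime


def select_rows(rows, last_watermark=None):
    """Full load when there is no watermark, otherwise only newer rows."""
    if last_watermark is None:
        return list(rows)
    return [r for r in rows if r["modified_at"] > last_watermark]


def upsert(target, rows):
    """Insert or overwrite rows by id, then return the new high-water mark."""
    for r in rows:
        target[r["id"]] = r
    return max((r["modified_at"] for r in target.values()), default=None)


# Hypothetical source table of permits.
source = [
    {"id": 1, "status": "pending", "modified_at": datetime(2021, 1, 1)},
    {"id": 2, "status": "issued", "modified_at": datetime(2021, 1, 2)},
]

target = {}
watermark = upsert(target, select_rows(source))  # initial full load

# One row changes in the source after the first load.
source[0] = {"id": 1, "status": "issued", "modified_at": datetime(2021, 1, 3)}

delta = select_rows(source, watermark)  # incremental: changed rows only
watermark = upsert(target, delta)
```

The second run moves only the row that changed since the stored watermark, which is what keeps incremental loads cheap as tables grow.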
DQLabs uses Azure Databricks to process and de-identify both streaming and batch data in real time for data quality profiling. This data is then staged and curated for machine learning models using PySpark MLlib.
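One common way to de-identify records is keyed hashing of the sensitive fields, so that values stay joinable without being readable. The sketch below uses Python's standard `hmac` module. The article does not say which technique DQLabs uses, so the approach, the field list and the key are all assumptions.

```python
import hashlib
import hmac

# Hypothetical values: a real key would come from a managed secret store.
SECRET_KEY = b"replace-with-a-managed-secret"
PII_FIELDS = {"name", "ssn", "phone"}


def pseudonymize(value, key=SECRET_KEY):
    """Return a deterministic HMAC-SHA256 hex digest of the value."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def de_identify(record, pii_fields=PII_FIELDS):
    """Replace PII fields with pseudonyms and leave other fields untouched."""
    result = {}
    for field, value in record.items():
        if field in pii_fields and value is not None:
            result[field] = pseudonymize(str(value))
        else:
            result[field] = value
    return result


# Made-up record with an obviously fake Social Security number.
record = {"name": "Jane Doe", "ssn": "000-00-0000", "permit_type": "building"}
masked = de_identify(record)
```

Because the hash is deterministic for a given key, the same person maps to the same pseudonym across datasets, which keeps de-duplication and joins possible after masking.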
Incoming data are evaluated to determine their semantic types using DQLabs’ artificial intelligence (AI) module, DataSense. This helps organizations classify, catalog, and govern their data, including sensitive data such as personally identifiable information (PII), which includes contact information and Social Security numbers.
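DataSense is described as an AI module and its internals are not covered here. As a much simpler stand-in, the following sketch shows the general idea of inferring a column's semantic type from its values using regular expressions. The patterns, the 80% threshold and the sample columns are assumptions for illustration.

```python
import re

# Hypothetical, deliberately simple patterns (US-style formats only).
PATTERNS = {
    "ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "phone": re.compile(r"\(?\d{3}\)?[ -]?\d{3}-\d{4}"),
}
SENSITIVE_TYPES = {"ssn", "email", "phone"}


def infer_semantic_type(values, threshold=0.8):
    """Return the first type whose pattern matches enough non-empty values."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return "unknown"
    for label, pattern in PATTERNS.items():
        hits = sum(1 for v in non_empty if pattern.fullmatch(v))
        if hits / len(non_empty) >= threshold:
            return label
    return "unknown"


# Made-up sample columns.
columns = {
    "col_a": ["000-00-0000", "111-11-1111"],
    "col_b": ["(509) 555-0100", "509-555-0101", ""],
    "col_c": ["Spokane", "Seattle"],
}

catalog = {name: infer_semantic_type(vals) for name, vals in columns.items()}
sensitive_columns = sorted(c for c, t in catalog.items() if t in SENSITIVE_TYPES)
```

The resulting catalog can then drive governance decisions, for example flagging every column tagged with a sensitive type for masking before it is shared.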
Based on the DataSense classifications, additional checks and custom rules can be applied to ensure data is managed and shared according to the city’s guidelines. Data quality scores can be monitored to catch errors quickly. Master Data Models (MDM) are defined at different levels. For example, contact information can include name, address and phone number.
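Custom rules and a data quality score can be sketched as a set of named checks applied to every record, with the score defined as the share of checks that pass. The rule names, the scoring formula and the sample contact records below are assumptions for illustration, not DQLabs' actual scoring method.

```python
def not_empty(field):
    """Rule factory: the field must contain non-whitespace text."""
    return lambda record: bool(str(record.get(field) or "").strip())


def digits_length(field, length):
    """Rule factory: the field must contain exactly `length` digits."""
    def check(record):
        digits = "".join(ch for ch in str(record.get(field) or "") if ch.isdigit())
        return len(digits) == length
    return check


# Hypothetical rules for a contact master data model.
RULES = {
    "name_present": not_empty("name"),
    "address_present": not_empty("address"),
    "phone_has_10_digits": digits_length("phone", 10),
}


def evaluate(records, rules=RULES):
    """Return (score, failures): share of passed checks and failures per rule."""
    failures = {name: 0 for name in rules}
    for record in records:
        for name, rule in rules.items():
            if not rule(record):
                failures[name] += 1
    total = len(records) * len(rules)
    score = (total - sum(failures.values())) / total if total else 0.0
    return score, failures


contacts = [
    {"name": "Jane Doe", "address": "1 Main St", "phone": "509-555-0100"},
    {"name": "", "address": "2 Oak Ave", "phone": "555-0101"},
]

score, failures = evaluate(contacts)
```

Tracking the per-rule failure counts alongside the overall score makes it easier to see which rule is responsible when a score drops.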
Refined data are published as golden views for downstream analysis, reporting and analytics. Thanks to DQLabs and Azure Databricks, this process is fast and efficient, putting organizations like the City of Spokane in a leading position to leverage their data for operations, decision-making and future planning.
Get started with DQLabs and Azure Databricks to improve data quality
Learn more about DQLabs by registering for a live event with Databricks, Microsoft, and DQLabs. Get started with Azure Databricks with a Quickstart Lab and this 3-part webinar training series.