Databricks used to rely on a static, manually maintained wiki page for internal data exploration. We will discuss how we leverage Amundsen, an open source data discovery tool from Linux Foundation AI & Data, to improve productivity with trust by programmatically surfacing the most relevant datasets and SQL analytics dashboards, along with their important information, at Databricks internally.
We will also talk about how we integrate Amundsen with Databricks' world-class infrastructure to surface metadata.
Last but not least, we will discuss how we incorporate internal user feedback and how we could provide the same discovery productivity improvements to Databricks customers in the future.
Tao Feng: Hello everyone. Today, we are going to talk about data discovery at Databricks with Amundsen. My name is Tao, I’m an engineer at Databricks. I’m a co-creator of the Amundsen project and an Apache Airflow PMC member. Previously, I worked at Lyft, LinkedIn, and Oracle. My colleague Tianru is also an engineer at Databricks, and previously worked on AWS Elasticsearch. So first, let me talk a bit about data discovery and its challenges. Before we talk about why we need data discovery, let’s talk about how to make a good decision: good decisions are all based on data. So, who needs data? The answer is basically anyone, any persona who needs to make good decisions with data, including data analysts, data scientists, product managers, general managers, engineers, and experimenters.
For example, an HR team that wants to ensure salaries are competitive with the market, or a politician who wants to optimize a campaign strategy, when they make the decision, they all need to base it on some kind of data. So, the typical flow of a data-driven decision is: first the data is collected, and an analyst finds the right data to use. Then they move on, and the analyst needs to understand the data. After understanding the data and exploring a bit, the analyst creates a report, then shares the result with the stakeholders. But if the result is not good and they find out they were using the wrong data, they need to go back to step two and redo the process; they iterate until the results are satisfying.
And then someone, with the result, will make a decision. In this workflow, steps two and three constitute the data discovery and data exploration phase. In fact, studies have found that data scientists can spend up to 30% of their time on data discovery, and we could tell that is not very effective. You want people to spend as little time as possible in the data discovery phase and most of their time doing data analysis, finding the right insights from the data. So, how can we shorten the time spent in data discovery? The answer is to use a metadata/data catalog solution.
So, what is a data catalog? A good data catalog helps with the ease of documentation and discovery. It provides a single searchable portal and displays the dependencies and lineage between data entities. It helps answer lots of questions like: Where can I find the data? What is the context of this data? Who are the owners? How was the data created? How frequently is it refreshed? Is the data trustworthy? Now, let me briefly introduce Amundsen. So, what is Amundsen? In a nutshell, it is an open-source project focusing on data discovery, data catalog, and metadata. It aims to improve the productivity of data scientists, data analysts, and data engineers when interacting with data.
Currently, the project is hosted at Linux Foundation AI & Data as an incubation project with open governance and an RFC process. Now, let me share some of the notable Amundsen features. Here is the typical homepage for Amundsen. You can see there’s a search bar, with which you can search any entity that has been indexed in Amundsen. Then you can see some tabs, which group important datasets based on certain tags. You can also see the datasets you have recently used and bookmarked. Lastly, there’s a popular tables section showing some of the most used tables in the organization. Here is a dataset detail page. It includes some of the important metadata and information about the dataset, like descriptions, partition range, owners, last updated, frequent users, et cetera.
It can also show the lineage between dashboards and datasets, namely which dashboards have been using this dataset. You can also search existing dashboards. Now, let’s see the dashboard detail page. The dashboard detail page shows things like the owners and the last successful run, and even shows a dashboard preview. On the right-hand side, you can see which tables are used in this dashboard. You can even search for a co-worker and see, for a given co-worker, which tables they own or use. There are also some other useful features, for example the announcement page; it provides a plugin client that lets admin users surface new features of the tool or announce new datasets.
Amundsen can also be configured as a central data quality issue portal, allowing users to report any data quality issue related to a given dataset by connecting to, say, Jira. You can also use it for data preview; the plugin client supports different BI and visualization tools like Superset, BigQuery, and Redash. Now, let’s talk a bit about how we use Amundsen at Databricks internally. Databricks is a data and AI company that provides a unified platform for all your data and AI workloads. It works with more than 5,000 customers across the globe. It is also the original creator of some notable open-source projects like Apache Spark, Delta Lake, and MLflow.
So at a high level, the Databricks Lakehouse supports structured, semi-structured, and unstructured data, consumed into a data lake, which can be queried and surfaced through different tools, including SQL Analytics with Redash, the Data Science Workspace, or MLflow, based on the persona. For Amundsen, our focus here is how we can surface and discover Delta tables and Redash dashboards; those are the two main entities in our deployment. Currently at Databricks, internal dataset discovery still relies on a statically maintained wiki page of golden tables in a central workspace, and the metadata easily becomes stale as it is manually edited. So, how do we solve it? We are using Amundsen to improve the data discovery experience. Now, my colleague Tianru will talk about our internal deployment.
Tianru Zhou: Next, I will talk about how we deploy Amundsen at Databricks. Our Amundsen deployment consists of three main components: the Amundsen frontend, Amundsen search, and Amundsen metadata services, with our own customization on top of the open-source version. All of them run in the Databricks control plane as Kubernetes pods. So, what is the Databricks control plane? In a nutshell, the Databricks control plane is a Kubernetes cluster that contains all Databricks services, for example the authentication service, which is responsible for user login; the web app, which serves the workspace UI we see every day; and other online infrastructure.
In contrast, the data plane is the environment where our Spark clusters reside to execute different tasks. Amundsen’s frontend is integrated with Okta to do end-to-end authentication; all Databricks full-time employees have permission to log in. More information about the OpenID Connect setup can be found in the Amundsen documentation. The frontend service serves as the web portal for user interaction. It is a Flask-based web app whose presentation layer is built with React with Redux, Bootstrap, Webpack, and Babel.
We also leverage the frontend service to do fine-grained user action tracking. It provides a hook called action logging, which allows us to customize how usage logs are handled. In our use case, we write these logs to an AWS MySQL server, which can be accessed from Databricks notebooks for further analysis. The Amundsen search service leverages Elasticsearch’s search functionality and provides a RESTful API to serve search requests from the frontend service. Currently, both table and dashboard resources are indexed in Elasticsearch. We use AWS Elasticsearch for convenience. The Amundsen metadata service currently uses the Neo4j proxy to interact with the Neo4j graph database and serves the frontend metadata.
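The action-logging hook described above can be sketched as a simple decorator. Everything here is illustrative, not the actual Databricks implementation: the in-memory `USAGE_LOG` list stands in for the MySQL table, and the field names are assumptions.

```python
import functools
import json
import time

# Stand-in for the AWS MySQL usage-log table (illustrative only).
USAGE_LOG = []

def action_logging(func):
    """Record every call to a frontend endpoint as a usage-log row."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        USAGE_LOG.append({
            "action": func.__name__,
            "params": json.dumps(kwargs),
            "duration_ms": round((time.time() - start) * 1000),
        })
        return result
    return wrapper

@action_logging
def view_table(*, table_key):
    # A hypothetical frontend endpoint that renders a table page.
    return f"rendered {table_key}"

view_table(table_key="hive.schema.users")
```

With the rows collected this way, the later analysis in Databricks notebooks is just queries over this table.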
The metadata is then represented as a graph model. For the Neo4j graph database, we have our own deployment as a Kubernetes pod in the control plane as well, and an EBS volume is attached as a persistent volume to store the actual data. We leverage EBS snapshots to do the data backup. Since we can rebuild the Elasticsearch index from the Neo4j data pretty fast, a snapshot for Elasticsearch is not needed. For the actual Kubernetes deployment, we have replicas for the frontend, metadata, and search services. However, we only have a single Neo4j pod, since the EBS-backed persistent volume is auto-provisioned and EBS does not support the ReadWriteMany access mode.
Amundsen provides a data ingestion library called Databuilder, which lets users build any sort of metadata ingestion. By default, people use Apache Airflow to orchestrate metadata extraction and ingestion. Apache Airflow is an open-source platform to author, schedule, and monitor workflows. It helps us create workflows in the Python programming language, and these workflows can be scheduled and monitored easily. However, at Databricks, we use a similar alternative called the Jobs service. A job is a non-interactive way to run an application in a Databricks cluster, for example an ETL job. We can run jobs immediately or on a scheduled basis, and we can monitor job run results in the UI, using the CLI, by querying the API, or through email alerts.
These characteristics fit perfectly into our use case. For more information about Databricks jobs, please check out the official documentation. We leverage the Databricks Jobs service to run recurring jobs that ingest data into the Neo4j database daily and update the corresponding Elasticsearch index. The metadata extraction and ingestion logic resides in several Databricks notebooks; we will talk about the details later. Communication between the control plane and the data plane is done through IP allowlisting: our data plane cluster has a stable IP address, which is allowlisted in both the Elasticsearch and Neo4j services.
So previously, we mentioned that for the Databricks deployment, we need to do customization on top of the open-source projects. Our customization can be separated into two parts: one is configuration changes and the other is function overrides. For configuration changes, we change the default logo from Amundsen to Databricks. We integrate Amundsen with Okta to do authentication. We change the in-memory session database to an AWS MySQL server to store session information so that we can scale out our frontend deployment easily. We also make some UI changes, like the number of tags and popular tables to show.
For function overrides, we customize the behavior of action logging to track user actions; all usage logs are written to the AWS MySQL server. We change the heuristic for popular table calculation from purely usage-based to also include the dashboard reference count, based on our internal use case. We also change the tag updating logic to make tags searchable immediately after modification. So, how do we do the deployment with so many customizations? Basically, we mirror the open-source Amundsen services into our own private repo and use them as a base layer. We apply all our local changes on top, overriding the original functionality. We then upload the newly built images to the Databricks private Docker repository and use them for our Kubernetes deployment. All our metadata ingestion logic resides in several Databricks notebooks.
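The modified popularity heuristic can be illustrated with a toy scoring function. The weights and the counts below are made-up assumptions for illustration, not Databricks' actual values:

```python
def popularity_score(read_count, distinct_readers, dashboard_refs,
                     w_reads=1.0, w_readers=5.0, w_dash=10.0):
    """Combine raw usage with dashboard reference count.

    The weights are illustrative: dashboard references and distinct
    readers count for more than raw read volume.
    """
    return (w_reads * read_count
            + w_readers * distinct_readers
            + w_dash * dashboard_refs)

# A table hammered by one pipeline vs. one widely referenced by people
# and dashboards (numbers are invented).
tables = {
    "sales.orders": popularity_score(400, 4, 0),    # 420.0
    "core.users": popularity_score(500, 30, 12),    # 770.0
}
popular = max(tables, key=tables.get)
```

Under a purely usage-based heuristic a table read heavily by a single automated job can look "popular"; folding in the dashboard reference count favors tables that humans actually build on.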
So, how do we do version control over the development process? This is very important, especially when multiple people are working on the same notebook; in case we mess up the code, it also allows us to quickly recover. Databricks notebooks allow us to sync changes to and from our own GitHub repo. The configuration is simple: we just need to generate a personal access token that has repo permission and grant it to the Databricks workspace. More setup information can be found in the Databricks official documentation.
Next, I will talk about the metadata surfaced in Databricks’ Amundsen. It can be separated into three parts: lineage information, statistics, and extended information. I will dive deep into each part. So first, lineage information. Let’s take a look at how lineage information is presented in the UI. On the upper-left side, you can see the writer/owner of the table. On the bottom left, we can see downstream jobs, downstream users, and upstream and downstream tables. For upstream and downstream tables, we can follow the URL to their own dataset pages. For downstream jobs, the link leads us to the actual Databricks jobs. And in the upper-right corner, there is also a link pointing to the Databricks job that writes to this table.
All the lineage information here is extracted from a pre-created lineage table, which I will talk more about later. So, what is table lineage? You can describe the table lineage for a table with a simple graph structure: a parent-child table relationship can be established by looking at the common workload that ties the parent and child tables together. In the graph, we have a workload, distinguished by a unique ID, that reads from an upstream table and writes to a downstream table. Through these actions performed by the workload, the parent and child tables enter the lineage of each other. Here, you can think of the workload as a daily Databricks job.
This slide shows the details of how we generate the lineage table. The usage logs act as the source of truth for our lineage information. They contain two main tables, a read events table and a write events table; their names pretty much explain everything. We need to join them with the entity table as well, since the latter contains the necessary mapping from table ID to real name. Now, we have the raw lineage table. We need to do some further processing, like normalizing table IDs by matching them to their storage paths; we also need to map them back to table names whenever possible. After all the post-processing, we get the final version of the lineage table, which gives us all the necessary information presented earlier.
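The join described above can be sketched in plain Python. The event rows, workload IDs, and table names here are toy stand-ins for the real usage-log tables, and the real pipeline runs as Spark SQL over much larger data:

```python
# Toy read/write event rows; the real ones come from the usage logs.
read_events = [
    {"workload_id": "job-1", "table_id": 101},
    {"workload_id": "job-1", "table_id": 102},
]
write_events = [
    {"workload_id": "job-1", "table_id": 103},
]
# Mapping from internal table ID to a human-readable name (the "entity" join).
table_names = {101: "raw.clicks", 102: "raw.users", 103: "gold.daily_clicks"}

def build_lineage(reads, writes, names):
    """Join read and write events on workload ID to get parent -> child edges."""
    edges = []
    for w in writes:
        for r in reads:
            if r["workload_id"] == w["workload_id"]:
                edges.append((names.get(r["table_id"], r["table_id"]),
                              w["workload_id"],
                              names.get(w["table_id"], w["table_id"])))
    return edges

lineage = build_lineage(read_events, write_events, table_names)
```

Each resulting `(parent, workload, child)` triple is one row of the raw lineage table; the normalization and name-mapping steps then clean up the two table columns.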
Next, we are going to talk about statistics information. For column statistics, we run Spark’s ANALYZE TABLE ... COMPUTE STATISTICS command on Delta tables, which scans all the rows to do the computation for numeric-type columns. Then, we can run DESCRIBE EXTENDED to fetch the data computed above. However, since we need to scan the whole table, or at least all the rows for the specific columns, to get the final results, this step is time-consuming. This becomes intolerable, especially when we have a large number of tables. One mitigation here is to pre-compute and store the stats in the Delta log so that we can extract the information directly without computing on demand. For frequent users, we get the usage data from our usage logs table as well and do aggregation to get the usage count for each user-table pair.
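The kind of per-column statistics that ANALYZE TABLE produces can be mimicked locally with a small helper. This only illustrates the statistics involved and why they require a full scan; it is not how Spark computes them:

```python
# Toy rows standing in for a Delta table (values are invented).
rows = [
    {"price": 10.0, "qty": 2},
    {"price": None, "qty": 5},
    {"price": 30.0, "qty": 1},
]

def column_stats(rows, column):
    """min/max/null-count/distinct-count: the stats need every row, which is
    why computing them on demand over large tables is expensive."""
    values = [r[column] for r in rows if r[column] is not None]
    return {
        "min": min(values),
        "max": max(values),
        "num_nulls": sum(1 for r in rows if r[column] is None),
        "distinct_count": len(set(values)),
    }

stats = column_stats(rows, "price")
```

Because every statistic here touches every row, pre-computing and storing the results (so a later lookup is just a metadata read) is the natural mitigation the talk describes.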
The metadata service then generates a list of frequent users for each table based on the aggregated heuristics we discussed above. We also surface Delta extended metadata in our Amundsen deployment, like the mount point, table properties, and recent table history. We can get this information by describing the corresponding Delta table or view name. The Delta extended metadata leverages a feature called programmatic description in Amundsen; the feature is designed so that any company can provide its own customized table metadata. Basically, you provide key-value pairs of table metadata, and they are linked to the table entity and displayed.
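A minimal sketch of flattening Delta detail output into the key/value pairs that a programmatic description carries. The input dictionary keys and the output field names are assumptions for illustration, not Amundsen's exact API:

```python
# Hypothetical (truncated) output of describing a Delta table.
delta_detail = {
    "location": "dbfs:/mnt/gold/daily_clicks",
    "numFiles": 42,
    "lastModified": "2021-05-20 03:00:00",
}

def to_programmatic_description(detail):
    """Flatten Delta detail into key/value pairs to attach to a table
    entity, in the spirit of Amundsen's programmatic description feature."""
    return [{"description_source": key, "text": str(value)}
            for key, value in detail.items()]

descriptions = to_programmatic_description(delta_detail)
```

The point of the feature is exactly this shape: arbitrary company-specific key/value metadata that Amundsen renders on the table detail page without any schema change.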
Next, I will use the lineage metadata and Delta table extended metadata ingestion as an example to talk about our notebook structure. The open-source Amundsen Databuilder library already provides a ready-made table metadata model, and it can even be extended to surface more information. As shown in the picture, for each graph table node we can attach as much information as we want, for example the table owner and the upstream and downstream table structures. After we have a fully populated model, we just need to chain it with the Neo4j publisher and the Elasticsearch publisher to publish the data. There are two things worth mentioning here.
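A simplified stand-in for Databuilder's table metadata model gives a feel for what the notebooks populate (the real class, `databuilder.models.table_metadata.TableMetadata`, carries more fields and serializes itself into Neo4j nodes and relations). The key format shown follows Amundsen's usual `database://cluster.schema/table` convention:

```python
class TableMetadata:
    """Simplified sketch of a Databuilder-style table metadata record."""

    def __init__(self, database, cluster, schema, name,
                 description, owners=None):
        self.database = database
        self.cluster = cluster
        self.schema = schema
        self.name = name
        self.description = description
        self.owners = owners or []

    def key(self):
        # Amundsen-style table key: database://cluster.schema/table
        return f"{self.database}://{self.cluster}.{self.schema}/{self.name}"

table = TableMetadata("delta", "prod", "gold", "daily_clicks",
                      "Daily click aggregates", owners=["tianru"])
```

In the real pipeline, a collection of such records is handed to the Neo4j publisher, and a second job builds the Elasticsearch index from the published graph.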
First, we use the Neo4j extractor to extract information from the Neo4j database and build the Elasticsearch index from there, which means that we don’t need to snapshot Elasticsearch as long as our graph database data is safely backed up. Second, when we publish data, we tag it with today’s date, which can save us a lot of trouble when cleaning stale data later. This slide shows the process to clean up stale data; remember to specify the tag of the nodes you want to delete. Amundsen has native support for dashboards. We integrate it with Redash dashboards, which is also an internal service at Databricks. From the picture, we can see all the dashboards that use the current dataset. We can also follow a dashboard link to the detailed dashboard page.
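The date-tag cleanup idea can be sketched as a simple filter. In the real deployment this is a query against Neo4j rather than a Python list, and the tag values here are invented:

```python
# Every published node carries the tag of the ingestion run that wrote it.
nodes = [
    {"key": "delta://prod.gold/daily_clicks", "published_tag": "2021-05-20"},
    {"key": "delta://prod.raw/old_table", "published_tag": "2021-05-13"},
]

def remove_stale(nodes, current_tag):
    """Keep only nodes written by the latest run; anything with an older
    tag was not re-published today, so it is stale and can be dropped."""
    return [n for n in nodes if n["published_tag"] == current_tag]

fresh = remove_stale(nodes, "2021-05-20")
```

Tagging every publish with the run date makes "stale" a one-line predicate, which is the trouble-saving the talk mentions.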
From the dashboard page, we can find dashboard metadata like the description, owner, create time, and last update time in the left panel. On the right side, we have the tables tab, which gives us a list of tables used by the current dashboard. We also have the charts belonging to it, and the queries that compose these charts. For these queries, we can copy them and even view them in the Redash dashboard. For each table or dataset, we also provide a useful link that leads us to the actual dataset page, which shows the full table schema as well as some sample data. Lastly, this is our weekly active usage for the Amundsen deployment at Databricks; we are gaining adoption internally. That was a general introduction to the Amundsen deployment within Databricks. I will hand it back to Tao to talk about Amundsen open source in general.
Tao Feng: Thanks, Tianru. So now, let me talk a bit about Amundsen open source in general. Since Amundsen was open-sourced two years ago in 2019, it has gained a lot of adoption. As of today, we have more than 1,500 members in our Slack community, more than 2,000 stars on our repos, and more than 30 companies using it in production. It is also among the top 20 most popular open-source data projects in 2021 based on a recent Data Council survey. For Amundsen, here are a few notable RFCs and PRs that happened recently. For example, the community added support for using AWS-managed Neptune, a managed graph database, as another offering for the metadata store.
Similarly, you can now use MySQL as the metadata store as well. There are also new features like native support for showing lineage in both the frontend and the backend. And currently, Amundsen mostly follows a pull model for ingesting metadata; there is another pull request that shows how we could do a push model. Other RFCs can be found in the following link. So in summary, we first talked about data discovery in general, the related challenges, and how you can solve them with Amundsen. Then we talked about how we integrate Amundsen with Databricks infrastructure internally and how you could do something similar as well. And lastly, the Amundsen open-source community has been growing tremendously, and we want you to be part of it to solve the data discovery and data catalog challenges together. Thank you.
Tao Feng is an engineer at Databricks. Tao is the co-creator of Amundsen, an open source data discovery and metadata platform project, and a committer and PMC of Apache Airflow. Previously, Tao worke...
Tianru Zhou is currently working at Databricks on data discovery related projects, including integrating Amundsen with existing infrastructure to do data discovery. Previously, he worked at AWS Ela...