Amundsen is a data discovery and metadata platform that originated at Lyft and was recently donated to Linux Foundation AI. Since it was open-sourced, Amundsen has been used and extended by many different companies within our community.
In short, Amundsen is built on 3 key pillars:
1. Augmented Data Graph: Amundsen uses a graph database (Neo4j by default) under the hood to store relationships between various data assets (tables, dashboards, protobuf events, etc.). What’s unique to Amundsen is that we bring all related metadata (usage, last updated, watermark, stats, etc.) into this graph. For example, we also treat people as a first-class data asset: there’s a graph node for each person in the organization that connects to other nodes (like tables and dashboards). This helps with problems such as ramping up new team members by answering questions like “which tables do my team members use frequently?”
2. Intuitive User Experience: Amundsen strives to deliver data discovery relevant to the user by running PageRank using data from access logs to power search ranking, similar to how Google ranks web pages on the internet.
3. Centralized Metadata from different sources: Amundsen gathers metadata from various sources (Hive, Presto, Airflow, etc.) and exposes it in one central place. The right place to store all this metadata is a work in progress. It also provides data lineage across different sources so that users can understand how data is connected.
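The first pillar, where people are graph nodes connected to tables, can be illustrated with a minimal in-memory sketch. This is not Amundsen's API or data model: the `MetadataGraph` class, the `HAS_MEMBER` and `READ` relation names, and the sample keys are all hypothetical, and a real deployment would run an equivalent query against the graph database.

```python
from collections import defaultdict


class MetadataGraph:
    """Tiny in-memory stand-in for a graph store (hypothetical, not Amundsen's API)."""

    def __init__(self):
        # edges[(source, relation)] maps each target node to an edge weight
        self.edges = defaultdict(dict)

    def add_edge(self, source, relation, target, weight=1):
        self.edges[(source, relation)][target] = weight

    def neighbors(self, source, relation):
        # .get avoids creating empty entries for unknown nodes
        return self.edges.get((source, relation), {})


def frequently_used_tables(graph, team, top_n=3):
    """Sum READ counts over every member of a team and rank the tables."""
    totals = defaultdict(int)
    for person in graph.neighbors(team, "HAS_MEMBER"):
        for table, reads in graph.neighbors(person, "READ").items():
            totals[table] += reads
    # Highest read count first; ties broken by table name for stable output
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [table for table, _ in ranked[:top_n]]


graph = MetadataGraph()
graph.add_edge("team:data-platform", "HAS_MEMBER", "user:alice")
graph.add_edge("team:data-platform", "HAS_MEMBER", "user:bob")
graph.add_edge("user:alice", "READ", "table:core.rides", weight=40)
graph.add_edge("user:alice", "READ", "table:core.users", weight=5)
graph.add_edge("user:bob", "READ", "table:core.rides", weight=10)
graph.add_edge("user:bob", "READ", "table:events.app_open", weight=25)

print(frequently_used_tables(graph, "team:data-platform"))
```

Because people, teams, and tables are all just nodes, the "what does my team use" question becomes a two-hop traversal rather than a join across separate systems.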
In this talk, we will discuss what a data discovery experience would look like in an ideal world and what Lyft has done to make that possible. Then we will deep dive into Amundsen’s architecture and discuss how it achieves the three design pillars. More importantly, we will discuss how Amundsen can be customized and extended to fit other companies’ data ecosystems. Lastly, we will close with the future roadmap of the project, what problems remain unsolved, and how we can work together to solve them.
Speaker: Tao Feng
– Hello everyone, today I’m going to talk about solving data discovery challenges with Amundsen, an open-source metadata platform. My name is Tao. A bit of an intro about myself: I’m an engineer on the Lyft Data Platform and Tools team. I’m an Apache Airflow PMC member and committer. At Lyft, I work on different data products, including Airflow and Amundsen, and I led the data org cost attribution effort. Previously I worked at LinkedIn and Oracle.
Here is today’s agenda. I will first talk about what data discovery is, and then move on to the challenges in data discovery. Then I will introduce Amundsen and its architecture, do a bit of a deep dive, and lastly talk about the impact of Amundsen and future work.
So what is data discovery? To answer that, let’s first talk about data-driven decisions. I hope everyone agrees that all good decisions are based on data. In that case, who needs data? Every persona needs data: analysts, data scientists, from product managers to general managers, to engineers, to experimenters, everyone. Whoever needs to make good decisions needs to make them based on data. To take a few more examples, HR wants to ensure that candidates’ salaries are competitive with the market, and politicians need to optimize their campaign strategy based on data.
The typical data-driven decision flow looks like this. First, assume that the data is collected. Once the data is collected, an analyst will try to locate and find the right data. Once they find the right data, they will try to understand it and dive a bit deeper into the context of the data they found. After they understand the data, the analyst will start creating reports, like dashboards and research studies. Once they feel comfortable, they will share the results.
But this is not a one-way flow. If the result is not good, and they find that the data is not correct, or that they are not using the right data, or that there is not enough data, they may need to go back to step two and redo the whole process, until the result is accepted by the stakeholders and someone can make a decision based on it. In this flow, steps two and three are mostly the data discovery portion.
So let’s talk about the challenges in data discovery, using a case study. Let’s say we want to predict meetup attendance, before the pandemic. Why? Because even though we know a large number of people RSVP, some of those people will not show up. We want to procure pizzas, drinks, and chairs without over-provisioning or under-provisioning. How are we going to do that? We will use the data from past meetups to build a predictive model. Once we know the goal, we need to locate and find that data. We can ask a friend or an expert if they know of any past study that has been done, or ask the question in a Slack channel. Lastly, if none of this works, we will search all the GitHub repos or other documents to locate any past study.
Now say we find a table named core_meetup events with the following columns: attending, not attending, date, and initial date. The question is, what do these columns mean? For example, does the attending column mean people actually showed up, or just that they RSVPed? And what’s the difference between date and initial date? Lastly, even once I understand the context and semantics of this table, is the data trustworthy and reliable enough to be used? The usual approach is to ask the table or data owner. Then, how do I find the owner?
I’m going to look at, for example, the documentation, who were the last people to commit any change to this table in GitHub, and the Confluence page documents. Lastly, I also want to understand a bit more about the data shape, so I run a query like SELECT * FROM this table LIMIT 100 to see the shape of the data. You can see this whole process is not very productive. Based on a research study, data scientists spend more than 30% of their time on data discovery and take… And data discovery in itself provides little to no intrinsic value. When you want to make a data-driven decision, the most impactful work happens in the analysis. So we want to shorten the time spent on data discovery. How do we do that? The answer is by leveraging metadata.
Let me give a bit of an intro to Amundsen. What is Amundsen? In a nutshell, Amundsen is a data discovery and metadata platform for improving the productivity of different personas when interacting with data. Currently, the project is hosted at Linux Foundation AI, a.k.a. LF AI, as an incubation project with open governance and an RFC process.
Taking a step back, before Amundsen existed, what did data discovery at Lyft look like? As you can see on the left-hand side, we used to have only a few static Confluence pages documenting the core tables, with a few pieces of static metadata like descriptions, owners, ETL SLA, et cetera. The metadata was refreshed through a cron job, with no human curation, and the metadata was not easy to extend.
Now let’s see what Amundsen looks like. Here is the homepage of Amundsen. It has a generic search bar to search any dataset indexed in Amundsen. Then you can see all the popular tags, which group different datasets. Then you can see your bookmarks, the tables you have been using frequently, and also the popular tables, the tables used most frequently across the whole company.
Let’s dive into the search bar. For example, if I type a certain term, behind the scenes Amundsen will rank based on both popularity and relevancy to find the most relevant tables. Once you select a dataset, you go to the table detail page. You can see the schema name, table name, description, date range, tags, last updated timestamp, the frequent users, the owners, and a lot of unstructured metadata underneath as well. On the right-hand side, you can see all the columns that belong to this dataset, with column names and descriptions. Once you click any of the columns, it expands to surface the column description as well as the column statistics, for example distinct count, number of nulls, et cetera, based on the type. You can also link and surface all the dashboards that use this table. A dashboard is like a research study: if the table has been used in dashboards, in particular the main dashboards, you can take that as a sign that it is a trustworthy table.
You can also search for dashboards. For example, in the search bar you can search for “Amundsen” and it returns any indexed dashboard related to Amundsen. Once you click through to a dashboard page, you will see a few pieces of metadata related to the dashboard: description, owners, tags, when it last ran, and its last successful run time. You can see a preview of the dashboard on the dashboard page. On the right-hand side, you will see which tables are used by this dashboard, and what the chart names and the query names are.
You can also search for users. For example, if I’m a new hire, I want to know which tables the people on my team have been using the most. So I would go to their page and see which tables they own, which dashboards they own, what they have bookmarked, et cetera.
So now let’s go to the Amundsen architecture. Amundsen’s architecture is a microservice architecture.
It has a few services: a frontend service, a metadata service, and a search service. The metadata service serves metadata requests from the frontend service as well as from other microservices. The metadata store is pluggable and supports multiple types. Ingestion of metadata goes through an ingestion framework called Databuilder, which can connect to different heterogeneous metadata sources and persist the metadata into Amundsen.
The frontend service is written in a modern frontend stack: TypeScript, ReactJS, and Redux for state management.
Let’s talk a bit about the metadata service. It’s a proxy layer to interact with different graph databases through an API. Currently we support graph databases like Neo4j, which is Cypher-based, and AWS Neptune, which is Gremlin-based. We also support Apache Atlas, another popular metadata framework, as the metadata store. We also support REST APIs for other services to push and pull metadata directly. At Lyft, communication between different microservices is authorized through Envoy.
Let’s move on to the search service. It’s also a proxy layer, to interact with the search backend. It supports Elasticsearch and Apache Atlas out of the box. We support different search patterns, like fuzzy search, search based on popularity, and multi-facet search for people who want a more pinpointed search.
Let’s talk a bit about the Databuilder framework. Databuilder is a generic ingestion framework which is highly inspired by Apache Gobblin. It includes a few stages: extractor, transformer, loader, and publisher. The extractor connects to different heterogeneous data sources, fetches records one at a time, and passes them to the transformer stage. The transformer does any transformation, for example changing a description or adding more text, and then passes the record along to the loader.
The loader loads the data into a staging area. Once all the data has been loaded into the staging area, it is passed along to the publisher, which publishes it to the downstream sink. Metadata can be very heterogeneous. For example, at Lyft we support many different metadata sources like Redshift, Hive, Druid, Postgres, and Presto. Even for dashboards, we have three types: Apache Superset, Mode Analytics, and Tableau. We want to support all of that, which is why we built Databuilder.
Here is an example of how Databuilder works. A typical scenario would be that I want to fetch metadata from Hive and persist it into Neo4j. We use the Hive table metadata extractor, which is a subclass of the SQLAlchemy extractor. Once we have fetched a record, we pass it along to the no-op transformer, because we don’t want to do any transformation, and then pass it along to the file system Neo4j loader, which loads it into staging. After loading finishes, it is passed along to a publisher to publish to Neo4j.
So how is Databuilder orchestrated? Amundsen leverages workflow engines, for example Apache Airflow, which supports distributed, parallel workflow execution and also allows you to specify dependencies, to orchestrate Databuilder jobs. For example, you can see in the graph that many different metadata jobs execute in parallel. They also have a certain sequence: for example, some of the metadata jobs need to run before the table metadata job starts.
Currently, Amundsen supports a rich set of built-in connectors, thanks to our community: whenever they build something, they contribute it back to the Amundsen open source repo. If you have anything that is not included and have built your own, please contribute it back to us.
So let’s do a bit of a deep dive into different aspects of Amundsen’s architecture. First, the metadata model. How we model metadata is very important.
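The extractor, transformer, loader, and publisher stages described above can be illustrated with a minimal sketch. This is not the Databuilder API: every class and function name here (`ListExtractor`, `NoopTransformer`, `CsvStagingLoader`, `DictPublisher`, `run_job`) is hypothetical, a CSV file stands in for the staging area, and a plain dict stands in for the graph store.

```python
import csv
import os
import tempfile


class ListExtractor:
    """Yields one record at a time, standing in for a Hive or Postgres extractor."""

    def __init__(self, records):
        self._records = iter(records)

    def extract(self):
        # None signals that the source is exhausted
        return next(self._records, None)


class NoopTransformer:
    """Passes records through unchanged."""

    def transform(self, record):
        return record


class CsvStagingLoader:
    """Writes every record to a CSV file that acts as the staging area."""

    def __init__(self, path):
        self.path = path
        self._file = open(path, "w", newline="")
        self._writer = csv.DictWriter(self._file, fieldnames=["table", "description"])
        self._writer.writeheader()

    def load(self, record):
        self._writer.writerow(record)

    def close(self):
        self._file.close()


class DictPublisher:
    """Publishes staged rows into a dict that plays the role of the graph store."""

    def __init__(self, sink):
        self.sink = sink

    def publish(self, path):
        with open(path, newline="") as staged:
            for row in csv.DictReader(staged):
                self.sink[row["table"]] = row["description"]


def run_job(extractor, transformer, loader, publisher):
    """Extract, transform, and load record by record; publish only after staging is complete."""
    record = extractor.extract()
    while record is not None:
        loader.load(transformer.transform(record))
        record = extractor.extract()
    loader.close()
    publisher.publish(loader.path)


sink = {}
staging = os.path.join(tempfile.mkdtemp(), "tables.csv")
run_job(
    ListExtractor([
        {"table": "core.rides", "description": "One row per ride"},
        {"table": "core.users", "description": "One row per user"},
    ]),
    NoopTransformer(),
    CsvStagingLoader(staging),
    DictPublisher(sink),
)
print(sink)
```

Separating the loader from the publisher means a failed extraction leaves the sink untouched, since nothing is published until the staging file is fully written.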
To start answering that question, first let’s ask what kind of information we consider as metadata. A term from the Ground paper, the ABCs of metadata, can help answer the question. A stands for application context, meaning all the metadata needed by humans or applications to operate, which we want to store. For example, where is the data, and what are the semantics or context of the data? B stands for behavior: how is the data created and used over time? For example, who is using the data, and who created the data? C stands for change: how does the data change over time? How is the data evolving, and how has the code that generates the data been changing?
Now that we have talked about what metadata is, let’s talk about what data we want to index. The short answer is any data in your organization. The long answer would be, for example, anything from data stores, to people, to dashboards, to ETL processes, to notebooks, to event streams: we want to index them all. At this point, Amundsen supports datasets, people, and dashboards, and we’re moving to support more entities in the future.
So Amundsen has three main first-class entities so far: table, user, and dashboard. Take the dataset as an example. The table is the central node in the dataset model. It connects to other extended metadata, such as column names, column statistics, descriptions, watermarks, and so on. As of today, the dataset metadata includes both manually curated and programmatically curated metadata. It supports many kinds of general metadata, like description, partition date range, tags, the GitHub source definition, and which ETL job produced the table. We also support unstructured metadata. Not every company shares the same set of metadata, so we allow people to ingest any arbitrary unstructured metadata and surface it in Amundsen as well.
One of the main challenges is that even within the same organization, not every dataset defines the same set of metadata or follows the same practices. For example, some metadata is only available for some datasets, like tier, SLA, and other operational metadata.
Next, the user model. We believe users have the most context and tribal knowledge about data assets. That’s why we want to index users and connect them with the other entities to surface that tribal knowledge.
Next, dashboards. A dashboard represents existing user research and studies. The model extends and connects as follows: a dashboard has a dashboard group, queries, metrics, and charts, and it also connects with users and back to tables. Currently, dashboard metadata includes description, owners, timestamps, which tables the dashboard uses, preview, tags, et cetera. The main challenge is that although we modeled dashboards generically, we took Mode Analytics as the MVP dashboard type. So not every piece of dashboard metadata is applicable to other dashboard types. For example, Redash doesn’t have a dashboard group concept.
Next, let’s talk a bit about push versus pull. There are two typical approaches to ingesting metadata: one is the pull model, and one is the push model. The pull model means periodically updating the index by pulling from the source systems through ingestion jobs. It is preferred if waiting for indexing is okay, and it also makes it easy to bootstrap and build up the central metadata for a company. The push model means the upstream source or client pushes the metadata directly to a messaging queue, from which the downstream services can persist the information into the graph. This is preferred if you require near-real-time indexing of metadata, as long as you still have a clear interface to define the metadata. Currently, if you are using Neo4j or AWS Neptune as the backend, Amundsen leverages the pull model, but you could easily extend it to a hybrid push-and-pull model.
Take the diagram as an example. For external services, we could build an SDK with a defined schema, and the external services could push metadata through it. The SDK pushes the metadata to a Kafka topic, and then we could use Databuilder to subscribe to the relevant Kafka topic and persist the information into the graph. If you are using Apache Atlas as the metadata backend for Amundsen, then it will be the push model. Apache Atlas leverages different hooks in external sources to capture the metadata during query execution. The downside of this approach is that Atlas can’t support an external source, for example Redshift, if that source doesn’t support the hook interface. Let’s talk a bit about why we chose a graph database.
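The push side of a hybrid model, as described above, can be sketched in a few lines. This is an illustration under stated assumptions rather than Amundsen code: `queue.Queue` stands in for the Kafka topic, a dict stands in for the graph, and `REQUIRED_FIELDS`, `push_metadata`, and `consume` are hypothetical names for the schema check, the client SDK call, and the ingestion consumer.

```python
import json
import queue

# Hypothetical message schema that the SDK and the consumer agree on
REQUIRED_FIELDS = {"entity_type", "key", "attributes"}


def push_metadata(topic, message):
    """Client side: validate against the agreed schema, then enqueue."""
    missing = REQUIRED_FIELDS - message.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    topic.put(json.dumps(message))


def consume(topic, graph):
    """Ingestion side: drain the topic and upsert each entity into the store."""
    processed = 0
    while True:
        try:
            raw = topic.get_nowait()
        except queue.Empty:
            return processed
        message = json.loads(raw)
        # Create the node on first sight, then merge in the pushed attributes
        node = graph.setdefault(message["key"], {"entity_type": message["entity_type"]})
        node.update(message["attributes"])
        processed += 1


topic = queue.Queue()
graph = {}
push_metadata(topic, {
    "entity_type": "table",
    "key": "core.rides",
    "attributes": {"description": "One row per ride"},
})
push_metadata(topic, {
    "entity_type": "table",
    "key": "core.rides",
    "attributes": {"tier": "gold"},
})
processed = consume(topic, graph)
print(processed, graph)
```

Validating on the client side is what keeps "a clear interface to define the metadata": a malformed message is rejected before it ever reaches the queue.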
– Why choose a graph database? Because data entities and their relationships can be easily represented and modeled as a graph. Also, the performance is much better than an RDBMS once the number of nodes and relationships in the graph reaches a large scale. And adding a new kind of structured metadata to the graph is super easy: it’s just adding a new node type with its connections, and boom, it’s done.
Let’s talk a bit about the search trade-off. Search results are ranked on both relevancy and popularity. So what is relevancy? Take a search for “apple” on Google. If you get orange as a result, that means low relevancy. If you get an actual apple, that means high relevancy. Now popularity. Taking the same example, if you get an actual apple, that is kind of low popularity, but if you get Apple the company and what it sells, that is high popularity, because Apple the company is pretty popular, right? We try to strike a balance that involves both considerations. We take the name, description, and other related fields for relevancy when doing the search. In the meantime, we also take popularity into consideration: we use query activity to weight the results.
Next, let’s talk a bit about the metadata source of truth. What does that mean? Metadata is often pretty fragmented. Amundsen is built to centralize all this fragmented metadata, both manually curated and programmatically curated. If the metadata is both manually and programmatically curated, we treat the Amundsen graph as the source of truth. But if the metadata is only programmatically curated and always updated from an upstream source, we disable manual curation and treat the upstream source as the source of truth. Take the diagram as an example: if there’s a description available in the GitHub source when the table is first created, it becomes available in the Hive metastore and is persisted into Neo4j.
After that, if a human uses Amundsen to modify the description, we won’t persist it back to Hive and GitHub, and we treat the Amundsen graph as the only source of truth.
Now, other features. First is the announcement page: we built a pluggable client to support a new “Announcement” feature, for things like new dataset announcements. Next, the central data quality issue portal. We built a client to integrate with JIRA and allow users to report any data issue. You can also see all the past data quality issues for a given dataset. You can even request more context or descriptions from the owners through the portal. Data preview: we support data preview as well, through a pluggable client for different BI and visualization tools. Data exploration: we also support integration with other BI tools for doing complex data exploration.
Let’s talk about impact. Amundsen is pretty popular at Lyft. We have 750 WAUs, 150k tables, 4,000 employee pages, and 10,000 dashboards. For Amundsen open source, we have more than 950 people in the community channel, more than 150 companies in the community, and more than 25 companies using it in production. Here is a landscape of the community. Amundsen is pretty extensible, and different companies use it differently. Some are using it to integrate with their data quality services as well as Databricks’ Delta platform. ING is using Amundsen with Atlas for their data discovery. Workday is building a whole analytics platform named Goku, and Amundsen is the landing page for Goku. Square is using Amundsen to solve compliance and regulatory use cases, and they contribute a lot. And here are the recent contributions from the community, like the Redash dashboard integration, the Tableau dashboard integration, Looker from Brex, and Delta analytics from Edmunds.
Now let’s talk a bit about the future. Our main focus in Q4 will be data lineage.
An RFC will be coming out to cover how we surface lineage using different mechanisms, such as push-based and SQL-parsing mechanisms. ML features as an entity: we are going to ship ML features as another first-class entity that surfaces the features, each feature’s upstream datasets, and other metadata around ML features. Metadata platform: we want to support other services accessing metadata programmatically, using a GraphQL API. For example, use cases like exposing metadata to BI tools, integrating with data quality services to surface a health score, and supporting hybrid metadata ingestion, as we mentioned before. So thank you. Questions?
Tao Feng is an engineer at Databricks. Tao is the co-creator of Amundsen, an open source data discovery and metadata platform project, and a committer and PMC member of Apache Airflow. Previously, Tao worked at Lyft, LinkedIn and Oracle on data infrastructure, tooling, and performance.