Today a Data Platform is expected to process and analyze a multitude of sources spanning batch files, streaming sources, backend databases, REST APIs, and more. There is a clear need for a standardized platform that scales and stays flexible, letting data engineers and data scientists focus on business problems rather than managing infrastructure and backend services. Another key aspect of the platform is multi-tenancy, to isolate workloads and track cost usage per tenant.
In this talk, Richa Singhal and Esha Shah will cover how to build a scalable Data Platform using Databricks and deploy your data pipelines effectively while managing costs.
Richa Singhal: Hello everyone, welcome to Data + AI Summit 2021. I’m Richa Singhal. My colleague Esha and I are here to talk about managing and scaling data pipelines with Databricks. We are both Senior Data Engineers and part of the Go-To-Market Data Engineering team at Atlassian. A little bit about our team: we enable marketing and sales teams to drive company growth and strategic business priorities by democratizing data.
Esha Shah: The main focus of our session is to share how Databricks has helped us scale. We start with who we are, followed by a glimpse of the challenges we faced a few years ago, and then we’ll walk you through how we use Databricks. Now for folks who don’t know about Atlassian, we believe that teams can change the world, and it is our mission to unleash the potential in every team. Our products, some of which you see on the screen, help teams organize, discuss, and complete their work. Over the last five years, we’ve seen rapid growth, both in terms of data and in terms of the users accessing it for insights. Thankfully we’ve been able to scale and keep up with the business’s growing needs, and today we operate at a multi-petabyte scale. We’d like to share what we think helped us the most in our journey.
So this is the architecture from five years ago, and as you can see here, our key architectural components are common to most analytics flows. We ingest data from a bunch of in-house as well as enterprise sources, in addition to real-time events, using third-party connectors, APIs, event streams, and batch processes. For critical sources, we implement change data capture in the transformation layer. Our storage layer consists of a data lake and our analytical data model, which is then consumed for operations and analytics. As you can see here, we used to rely heavily on EMR clusters and Spark for all of our ETL use cases, and all of our querying was mainly done using Athena.
Now, with EMR, we had the ability to transform data at scale, but we ran into issues. For one, our dev cycle took too long, and even basic development operations were extremely hard. For example, every single time we iterated and made changes to code files, those had to be copied over to the clusters. Accessing and inspecting logs was hard, and we had no clear visibility into historical job runs. Also, since we were responsible for spinning up and shutting down clusters, we started running into instances where we’d see idle clusters which continued to exist and add to cost, just because developers had forgotten to shut them down. Now, another major issue for us was cross-team dependencies. Analytics was heavily dependent on data engineering and the engineers on the [inaudible] team.
Our biggest overhead was supporting analysts when it came to creating datasets. Our analysts had to wait on DE to unblock them for basic tasks. Also, when it came to visualization during the dev process or during the UAT process, that was a completely different task, and there was no easy way to share those results. We were dependent on the platform team for all cluster management activities, software updates, additional libraries, and resource permissions. To add to this, our auto-scaling algorithm using spot nodes was flawed, and we constantly ran into issues while managing workloads. You could say that collaboration was non-existent unless you were sharing screens, and all context was lost when you shut the clusters down.
Richa Singhal: So we definitely needed a platform that not only caters to the performance needs of our growing data, but also provides self-service, to empower business users and decision makers to do their jobs effectively; standardization, to make data processes consistent across teams; automation, to provide easy access to data, improve productivity, and maximize returns; agility, to be responsive in a rapidly changing environment; and last but not least, the ability to track and optimize platform and operational cost at scale. Databricks seemed to fit all areas of our checklist and provided much more, and hence our platform team made a decision to invest in the Databricks platform. So this is how our data architecture looks today. We started using the Databricks platform heavily across four main pillars, highlighted in blue here. Use cases include streaming and batch ingestion, data backfills, and change data capture for data processing.
We have started using Databricks Delta for our Delta Lake, Databricks Notebooks for analytics and ad hoc analysis, and MLflow to operationalize our data science platform. So let’s look at our success story. In the last three years, we have seen key improvements by using the Databricks platform, primarily around rapid development, collaboration, scaling, and self-service. For rapid development: with a ready-to-use platform, we were able to reduce overall development time while increasing code reliability due to better DevOps practices. Collaboration: we reduced duplicate effort and code redundancy through simplified sharing and co-authoring. Scaling: as data grows, we were able to support data scaling needs while managing infrastructure usage cost. And self-service: we made teams self-serve and self-reliant, removing cross-dependencies on other teams for simple tasks. All of these key improvements helped in reducing the overall platform cost.
Let’s discuss our Databricks adoption journey and deep dive into the following areas. We will be discussing how we build data pipelines, orchestration, Delta Lake, analytics, and data science use cases. Building a data pipeline is the core part of any data team; it can range from ETL batch processing to running data science models, et cetera. The goal here is to standardize the data pipeline processes across different teams. At Atlassian, teams spanning data engineering, analytics, and data science use two different ways to maintain their data pipelines: one is using Databricks Notebooks, and the other is the DB-Connect library. So why did we adopt two different development methods? Notebook-style development is primarily used for lightweight and ad hoc development that requires interactive access to data. DB-Connect, on the other hand, is useful during engineering-intensive pipeline development. We’ll discuss these two development methods in the next few slides, and how we couple them with Atlassian’s Bitbucket for a seamless DevOps experience.
So let’s start with development using Databricks Notebooks. The flow here shows a typical development cycle using a Databricks Notebook. Starting from the top left: at Atlassian, all development starts with creating a Jira ticket, which is great for project planning and tracking. Next, the developer starts working on the ticket by creating a branch off the work repository; the integration between Jira and Bitbucket makes this process seamless for developers, and at the same time it tracks the work progress. To start development, a developer pulls the branch to their local machine, and they can either choose to start working on their local branch, or they can connect to the Databricks workspace to work directly on Notebooks. Also, testing is really easy, as developers can attach a notebook to either a Databricks interactive cluster or an ephemeral job cluster, which uses AWS resources under the hood. This gives developers the flexibility to perform data load testing by setting the cluster configuration similar to that of production. Once the dev effort is completed, the developer exports the notebook from the Databricks workspace to their local machine and creates a PR to merge to the staging environment.
Let’s talk about setting up a multi-stage environment using Databricks workspaces. When it comes to workspaces, each team has a separate folder on the Databricks workspace, which is further divided into environment folders like dev, stage, and prod. The flow here shows a multi-stage environment for a single repository or code base. For local development, a developer works in the team’s dev folder in the Databricks workspace. When it comes to deploying to stage and prod, the CICD process syncs code to the respective Databricks workspace folder. Code in these folders can spin up environment-specific Databricks clusters. While dev folders have open access for developers working on repos, staging and prod folders have restricted access, and code can only be written via Bitbucket CICD. Let’s deep dive into what a typical Bitbucket CICD process looks like.
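As a rough sketch of the folder convention described here, the mapping from team and environment to a workspace path might look like the following. The folder layout and team name are illustrative assumptions, not Atlassian's actual convention.

```python
# Hypothetical helper: resolve a team's environment folder in the
# Databricks workspace, mirroring the dev/stage/prod split above.

VALID_ENVS = {"dev", "stage", "prod"}

def workspace_folder(team: str, env: str) -> str:
    """Return the workspace path for a team's environment folder."""
    if env not in VALID_ENVS:
        raise ValueError(f"unknown environment: {env}")
    return f"/Workspace/{team}/{env}"

path = workspace_folder("gtm-data-eng", "prod")
```

A CICD job deploying to prod would sync code only into the `prod` folder, matching the restricted-access model described above.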
With Bitbucket, it is easy to set up CICD, which gives visibility into what is being deployed. It not only maintains versions, but also triggers CICD every time we push code to the main branch. The code here shows the CICD pipeline for Databricks Notebook deployment. It includes three steps. The first is to check the configuration file; it installs libraries, runs lint checks, et cetera. The second is to move code to Databricks: this syncs the source code from Bitbucket to a specific Databricks workspace folder, and in this case, you can see it’s prod. The last one is to update the Databricks job metadata. Although this shows a three-step CICD, there can be multiple other operational activities performed as part of deployment. To standardize these processes across different teams, we maintain a single repository named Databricks CICD. Any team at Atlassian that uses Databricks Notebooks and wishes to establish a standardized CICD process can use the Databricks CICD repo as a [inaudible].
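The "move code to Databricks" step can be sketched as a call to the Databricks Workspace API (`POST /api/2.0/workspace/import`), which takes base64-encoded source. The notebook path below is a hypothetical example, and this sketch only builds the request payload rather than sending it.

```python
import base64

def import_payload(local_source: str, workspace_path: str) -> dict:
    """Build the body for the workspace import endpoint: the notebook
    source is base64-encoded, and overwrite replaces the prior version."""
    return {
        "path": workspace_path,   # e.g. a team's prod folder
        "format": "SOURCE",
        "language": "PYTHON",
        "overwrite": True,
        "content": base64.b64encode(local_source.encode()).decode(),
    }

payload = import_payload("print('hello')", "/team/prod/etl_notebook")
```

In a real pipeline this payload would be POSTed with a bearer token for the target workspace, once per notebook in the repository.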
Before going over development using the DB-Connect library, let’s reiterate the tradeoffs of developing with Notebooks. Although Notebooks are excellent for ad hoc development and have multiple advantages (they simplify infrastructure for developers, are great for exploring data, and are easy to collaborate on and share), they lack some key software development features: there are challenges around building and importing classes and modules, low test coverage, and difficulty maintaining code quality with engineering standards.
So some of the teams that were building more engineering-intensive pipelines were drawn towards using the DB-Connect library for local development and testing. DB-Connect, you could say, is like a magical local instance of Spark: your machine will think it’s using a local installation of Spark; however, in reality, it will use a remote Databricks instance. The development flow here remains pretty much the same as discussed in the earlier slides; the only difference is that the notebooks are now replaced with your favorite IDE. Unlike notebook-style development, developers can start working directly on their local machine and do not need to import or export notebook code from the Databricks workspace, which also simplifies the PR process. Testing is really straightforward here as well: developers can connect to a remote Databricks cluster via the DB-Connect library and start testing the code. It feels as if you are running locally, but you still have the power of cloud resources.
So let’s talk about a multi-stage environment using AWS S3. This is how the multi-stage environment looks for data pipelines using the DB-Connect library. For local development, the developer works in their local IDE and can connect to a Databricks cluster via the DB-Connect library. The developer can also access AWS resources like S3 via the remote cluster. For staging and production, once your PR is reviewed and approved, the CICD process syncs code to the corresponding S3 bucket; these code files can then be invoked via a Spark Submit task from a Databricks job cluster. Let’s talk about orchestration. Orchestration is a key component of any data pipeline; in layman’s terms, you can say it is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. At Atlassian, we have standardized on Airflow as our data orchestration tool.
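A Spark Submit run against code synced to S3 can be sketched as a Databricks Jobs API request body. The bucket name, entry-point file, instance type, and cluster sizing below are illustrative assumptions.

```python
# Sketch of a Jobs API body that runs CICD-synced S3 code via a
# spark-submit task on an ephemeral job cluster.

def spark_submit_job(env: str) -> dict:
    return {
        "run_name": f"etl-{env}",
        "new_cluster": {
            "spark_version": "7.3.x-scala2.12",  # example runtime
            "node_type_id": "i3.xlarge",         # example instance type
            "num_workers": 4,
        },
        "spark_submit_task": {
            # First parameter is the application file synced by CICD.
            "parameters": [
                f"s3://example-code-bucket/{env}/main.py",
                "--env", env,
            ],
        },
    }

job = spark_submit_job("prod")
```

The same body, with the env swapped, covers staging, which keeps the two environments structurally identical.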
Our Airflow instances utilize the Kubernetes operator for resource optimization and scalability. Data teams at Atlassian have built their own Airflow operators by creating a wrapper around the Airflow Kubernetes operator. This wrapper includes integrations with other systems that provide additional functionality for the job task. For example, in this flow, the Airflow Kubernetes operator has integrations with different platforms and tools. Let’s start with Slack, on top: it sends task failure and success notifications to the respective team channels. SignalFx, on the left, routes task failure signals to Atlassian’s Opsgenie, which is an incident management tool, which in turn alerts the on-call person. Then there is YODA, which is our in-house data quality tool, and the Databricks platform, which supports both Notebook tasks as well as Spark Submit tasks. So, how do we track resource usage and cost? To monitor cost and accurately attribute Databricks usage to the organization’s business units and teams, we tag workspaces and job clusters.
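The wrapper-operator idea can be sketched as a failure callback that fans out to the integrations named above. The channel name, message shape, and routing record are hypothetical; a real wrapper would call the Slack and SignalFx/Opsgenie APIs.

```python
# Illustrative sketch: what a wrapped operator's on-failure hook might
# produce for the Slack and incident-management integrations.

def slack_message(task_id: str, status: str) -> dict:
    return {
        "channel": "#team-data-alerts",           # hypothetical channel
        "text": f"Task {task_id} finished with status: {status}",
    }

def route_failure(task_id: str) -> list:
    """On task failure, notify Slack and emit an incident signal that
    would be routed to the on-call person via Opsgenie."""
    return [
        slack_message(task_id, "FAILED"),
        {"incident_tool": "Opsgenie", "alert_for": task_id},
    ]

alerts = route_failure("load_orders")
```

Success notifications would reuse `slack_message` with a different status and skip the incident signal.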
In this flow, all Databricks jobs, triggered either via an interactive cluster or via a Databricks job cluster, are associated with tags, which can point to specific teams, business units, environments, resource owners, et cetera. These tags are available as part of the job metadata, which is stored in our Data Lake and used for reporting by different teams. This gives teams visibility into resource usage and cost at the job level. So with the Databricks platform, we have not only achieved multi-tenancy to isolate workloads, but also reduced the platform management overhead on admins and the [inaudible] teams. In the next section, we will discuss how we manage our data at scale.
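Cost attribution from tagged job metadata can be sketched as a simple rollup. The field names and cost figures below are illustrative, not the actual Databricks usage-log schema.

```python
from collections import defaultdict

# Sketch: aggregate per-run cost by the "team" cluster tag, treating
# untagged runs as a bucket of their own.

def cost_by_team(job_runs: list) -> dict:
    totals = defaultdict(float)
    for run in job_runs:
        team = run["tags"].get("team", "untagged")
        totals[team] += run["cost_usd"]
    return dict(totals)

runs = [
    {"tags": {"team": "analytics", "env": "prod"}, "cost_usd": 12.5},
    {"tags": {"team": "data-eng", "env": "prod"}, "cost_usd": 30.0},
    {"tags": {"team": "analytics", "env": "dev"}, "cost_usd": 2.5},
]
totals = cost_by_team(runs)
```

Grouping by other tags (environment, resource owner) follows the same pattern and yields the per-tenant view mentioned in the abstract.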
Esha Shah: Now, with DB-Connect, CICD repos, and [inaudible], we started seeing the advantages of standardization at scale: reusable components, easier maintenance, and continuous improvements, with more and more users contributing changes. Another component critical to scale for us was transitioning to Databricks Delta. Delta makes our multi-petabyte Data Lake behave like a data warehouse when it comes to operations; it unlocks capabilities that were not feasible earlier and enables easier operational maintenance. Let’s see how.
Until now, anytime we wanted to reproduce results and we knew that our inputs could change, we would create copies of our input datasets; the same held true for creating datasets for [inaudible]. With Delta, we get this by default: it maintains versions of your data to track all inserts, updates, and deletes. What it also allows you to do is point to a specific version by simply adding the keywords VERSION AS OF or TIMESTAMP AS OF.
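The time-travel syntax looks like the following; these are plain Delta SQL strings, and the table name is a placeholder. (The PySpark equivalent is `spark.read.format("delta").option("versionAsOf", n)`.)

```python
# Sketch of Delta time-travel queries built as SQL strings.

def as_of_version(table: str, version: int) -> str:
    """Read the table exactly as it was at a given version number."""
    return f"SELECT * FROM {table} VERSION AS OF {version}"

def as_of_timestamp(table: str, ts: str) -> str:
    """Read the table as it was at a given point in time."""
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{ts}'"

q = as_of_version("sales.orders", 12)
```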
And this time travel feature is a lifesaver for recovering from corrupted data or accidental overwrites. Now, versioning is a very attractive feature for source ingestion, right? But it is especially so with the easy CDC implementation using the merge functionality. With Delta’s merge function, our files behave like tables, and we write merge statements the way we would in the relational world. Most of you working with Data Lakes would know that updating records in Parquet files is not possible, and expensive rebuilds are just not a feasible option. So we had moved to an append model where incremental data was appended in date partitions. Now with Delta, we unlocked powerful BI with the introduction of SCD type 2 dimensions, which we had forgotten ever since we moved to [inaudible] from a relational data warehouse. On top of this, we also have the option of automatic schema evolution, where a new column just automatically shows up in a Delta table if it has been added at the source; this simplifies source CDC maintenance by leaps and bounds.
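A typical Delta CDC upsert with `MERGE INTO` looks like the statement below. Table names, column names, and the delete flag are illustrative, not the actual pipeline schema.

```python
# Sketch of the Delta MERGE pattern for CDC: staged changes are matched
# on a key, then deleted, updated, or inserted into the target table.

MERGE_SQL = """
MERGE INTO target t
USING staged_updates s
  ON t.id = s.id
WHEN MATCHED AND s.op = 'DELETE' THEN DELETE
WHEN MATCHED THEN
  UPDATE SET t.value = s.value, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (id, value, updated_at) VALUES (s.id, s.value, s.updated_at)
"""
```

On a cluster this would run as `spark.sql(MERGE_SQL)`; it replaces the append-only date-partition workaround described above with true in-place updates.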
And there are more maintenance improvements: there’s auto-optimize. For streaming data, auto compaction just merges files to optimal sizes; you don’t have to do anything at all. Delta tables in general have inbuilt optimizations which result in better query performance. We’ve seen our query performance jump by almost 70% in some cases, and hence we’ve started moving to a model where almost all of the intermediate tables in our data pipelines are Delta. It helps with derived fields where we need the historical as well as the current picture to calculate values. So in conclusion: source data, intermediate tables, or your final dimensional model, Delta has wins at every single stage.
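The auto-optimize behavior mentioned above is enabled via Delta table properties; on a cluster the statement below would run as SQL, and the table name is a placeholder.

```python
# Sketch: enable optimized writes and automatic compaction on a Delta
# table via table properties.

AUTO_OPTIMIZE_SQL = (
    "ALTER TABLE events SET TBLPROPERTIES ("
    "delta.autoOptimize.optimizeWrite = true, "
    "delta.autoOptimize.autoCompact = true)"
)
```

Once set, small files produced by streaming writes are compacted without any scheduled maintenance job.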
Now, until now we’ve been focusing on the data engineering side of things, and for us Databricks adoption grew organically with word of mouth. So let’s talk a little bit about how other teams at Atlassian have been using it. As you can see here, the analytics team uses Databricks for exploratory analysis, for data-driven strategic decisions, for proofs of concept when it comes to new metrics, to create and manage their own datasets or [inaudible], and even for creating templates with data exercises to help new team members onboard. Analytics folks use it for pretty much everything; the only component outside of Databricks is visualization for the broader audience, which is mainly done via Redash or Tableau. For us, one of the biggest steps towards scaling was self-service, and this is also one of our biggest wins: analytics being able to self-serve without data engineering being the bottleneck.
So today, when any of our analytics team members need ad hoc datasets with derived data, to maybe simplify a bigger query or to build a dashboard, they have Databricks Notebooks to create their own tables. If their analysis or transformations need additional libraries, simply using dbutils to install libraries gets the work done. And our platform team has also created a utility to enable anyone at Atlassian to upload files to create tables in our Data Lake. So you could just upload CSVs and start querying them from our Data Lake. This utility even hashes PII during the upload, further enabling self-serve. [inaudible] are no longer the bottleneck for creating datasets. Analysts usually clone production notebooks, make business logic changes, create their own test data, and at Atlassian they even create Bitbucket pull requests for us, all of this without needing to know anything about spinning up clusters or submitting jobs.
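The PII hashing step of such an upload utility can be sketched as below. The column names and salt are hypothetical, and this is just the transformation, not the full upload path.

```python
import csv, hashlib, io

# Sketch: hash configured PII columns with salted SHA-256 before
# uploaded CSV rows land in the Data Lake.

def hash_pii(csv_text: str, pii_columns: set, salt: str = "example-salt") -> list:
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        for col in pii_columns:
            if col in row:
                digest = hashlib.sha256((salt + row[col]).encode())
                row[col] = digest.hexdigest()
    return rows

rows = hash_pii("email,plan\nalice@example.com,premium\n", {"email"})
```

Because the hash is deterministic for a given salt, the hashed column can still be used for joins and distinct counts without exposing the raw value.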
Now, another major win for us was collaboration. It is just so much easier to share a link to a notebook with all your code, and it isn’t just code: you can organize it with documentation cells and cells where you visualize your results. These notebooks are reusable, and you can always clone them and make more modifications, or simply make modifications and then revert to a prior version. All of our investigations and [inaudible] are now in notebooks; we just add a notebook link to the Jira ticket. We are also moving towards storing dataset-specific manual UAT or quality checks as templates for reusability. And then, needless to say, teamwork is so much easier when you can collaborate in real time.
So moving on to data science and our biggest wins here. We use Databricks for all possible stages, thanks to the open and seamless MLflow functionality. The team uses it for the end-to-end cycle, starting with exploration, moving to training, scoring, experimentation, performance analysis, and even model serving. Our biggest win: no infrastructure overhead. Our data scientists no longer share clusters and they don’t wait for resources, nor do they manage infrastructure at any stage of the data science life cycle. The reduced overhead helps them focus on other value-adding activities. They’re self-reliant when it comes to ingesting data from streams, APIs, or batch loads using Delta. Delta also helps them tag and track inputs to use at a later stage with the same or a different model or parameters. Now, when it comes to productionizing code, there are hardly any steps to go from writing and testing code in your notebooks to pushing it to the cloud, and development can happen using multiple languages in the same notebook.
Once verified, scheduling the repeatable code via Airflow or serving the model is all that is needed. And this is one of the biggest wins, where the overall cycle is reduced from weeks to days. With MLflow tracking, we bring in standardization and governance, because we now have lineage and logging of parameters, code versions, metrics, and artifacts for all possible ML activities in the same place. The organized tracking makes it easier to look at results and choose the optimal version. This also facilitates reusability and shared learnings, because multiple teams can look at prior versions and associated metrics and then decide to reuse existing models or build new ones. Lastly, since MLflow’s abstractions and integrations make it so much easier to implement machine learning models, we’ve seen teams other than data science self-serve and use it for their specific use cases.
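Choosing the optimal version from tracked runs can be sketched as below. The run records mimic what experiment tracking stores (parameters plus metrics), but the schema here is an illustration, not MLflow's actual API.

```python
# Sketch: pick the best model version by comparing a tracked metric
# across experiment runs.

def best_run(runs: list, metric: str = "auc") -> dict:
    """Return the run record with the highest value for the metric."""
    return max(runs, key=lambda r: r["metrics"][metric])

runs = [
    {"run_id": "a1", "params": {"depth": 4}, "metrics": {"auc": 0.81}},
    {"run_id": "b2", "params": {"depth": 6}, "metrics": {"auc": 0.86}},
]
top = best_run(runs)
```

Because every run carries its parameters alongside its metrics, the winning configuration can be reused or retrained directly from the record.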
So that is the end of our session, and to summarize, we’d like to wrap up with our key takeaways. Well, we are Databricks fans, and the numbers on our slides speak for themselves. Our dev cycles are shorter and more reliable with easier testing, so overall delivery time has been reduced considerably. We’ve also decreased infrastructure costs, and [inaudible] talking about clusters using the same number of nodes: just transitioning from EMR to Databricks without any other changes has resulted in cost savings, because Databricks is much cheaper for smaller EC2 instances. We’ve seen rapid adoption, and almost half of us at Atlassian now use Databricks in one way or another, just because it’s so easy to use and so intuitive.
Lastly, more and more teams which work with data are now self-reliant. What we are trying to say is that if analytics was dependent on DE for, let’s say, 10 things, now they can do seven of those 10 things on their own. And the same holds true for dependencies between data engineering and the platform team. So we sincerely hope you had some takeaways from this talk. Thank you so much for listening. And just a side note: we are hiring, so visit Atlassian careers and come join us.
Also, a huge thank you to all the Atlassian folks who helped us with our content. Lastly, just a reminder: your feedback is important to us, so if you have some time, please don’t forget to rate and review the sessions. Thanks.
Richa Singhal is a Senior Data Engineer at Atlassian building and deploying data pipelines enabling analytics and data science teams across Atlassian. She has 10+ years of experience building large-sc...
Esha Shah has 8+ years of experience in the data engineering space across finance and marketing domains. She is currently managing data engineering pipeline initiatives across marketing, sales, and enterprise at Atlassian.