DoD Data Decrees and the path to Lakehouse
April 13, 2022 in Industries
Throughout both private industry and government, data-driven decision-making has made the quantity and quality of information critical for organizations. In 2018, Congress passed the Foundations for Evidence-Based Policymaking Act, which establishes a framework for using data to facilitate evidence-based policy making. More recently, in May 2021, the Department of Defense (DoD) issued five decrees to “create a data advantage” by improving data sharing throughout the department and helping to eliminate data silos. According to the memo, the DoD rightly understands that becoming a “data-centric organization is critical to improving performance and creating decision advantage at all echelons from the battlespace to the board room.”
These five decrees recognize data’s importance as a strategic asset and aim to help the DoD build capabilities around data engineering and analysis, which are just as important to national security as any weapon system. Without good data and strategies around data management, engineering and security, the DoD could fall behind in critical command and control of data.
The DoD’s Five Data Decrees
As data grows across government, bad patterns can emerge without careful attention to management. Whether it's data siloed in databases, creating sharing challenges, or data of uncertain quality being used for decision making, poor data practices can and will proliferate without clearly defined governance. The DoD data decrees aim to reduce these negative scenarios.
The five decrees are as follows:
- Maximize data sharing and rights for data use: All DoD data is an enterprise resource.
- Publish data assets in the DoD federated data catalog along with common interface specifications.
- Use automated data interfaces that are externally accessible and machine-readable; ensure interfaces use industry-standard, non-proprietary, preferably open source, technologies, protocols and payloads.
- Store data in a manner that is platform- and environment-agnostic, uncoupled from hardware or software dependencies.
- Implement industry best practices for secure authentication, access management, encryption, monitoring and protection of data at rest, in transit and in use.
Meeting the tenets of the Decrees
The Databricks Lakehouse Platform meets each of the tenets described in the memo by combining the best elements of data warehouses and data lakes to provide strong data governance with the performance of a data warehouse.
Data sharing and catalog
The DoD has moved from a “need to know” security basis, where access is granted only to the data necessary for one's official duties, to a “need to share” approach that fosters much broader sharing of data between departments and agencies. Even so, sharing data between agencies can be challenging, and risky, with legacy tools that allow copies of data to proliferate. Data sharing can improve information-gathering within the department and facilitate better cooperation with allies, but unencrypted legacy technologies, such as FTP, make it difficult to share data easily and securely. Technology that advances data sharing in a secure and open fashion is the key to this tenet.
Databricks’ Unity Catalog meets the data catalog tenet, providing a single interface to discover, audit, trace lineage and govern data assets in one place. Its features include role- and attribute-based security and metadata, such as tags on columns or tables, which make data more identifiable and secure. Unity Catalog also builds on the open source Delta Sharing protocol to manage and govern shared assets within an organization, allowing you to publish data assets in the federated DoD catalog along with common interface specifications.
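Conceptually, tag-driven governance is a form of attribute-based access control: a reader sees a column only when their clearance covers every tag on it. The minimal pure-Python sketch below illustrates that model; the roles, tags and policy names are hypothetical, and Unity Catalog implements the real thing through SQL grants and catalog metadata rather than code like this.

```python
# Illustrative sketch of attribute-based access control (ABAC), the
# model behind role- and tag-based column governance in a data catalog.
# All names here (columns, tags, roles) are hypothetical examples.

# Columns carry governance tags, as they would in a data catalog.
column_tags = {
    "unit_name": set(),
    "grid_coords": {"pii", "restricted"},
    "readiness_score": {"restricted"},
}

# A policy maps each role to the set of tags it is cleared to read.
policy = {
    "analyst": set(),                       # untagged columns only
    "cleared_analyst": {"restricted"},      # restricted, but not PII
    "data_steward": {"pii", "restricted"},  # everything
}

def readable_columns(role: str) -> list[str]:
    """Return the columns a role may read: every tag on a column
    must be covered by the role's cleared tags."""
    allowed = policy[role]
    return [col for col, tags in column_tags.items() if tags <= allowed]

print(readable_columns("analyst"))       # ['unit_name']
print(readable_columns("data_steward"))  # all three columns
```

The useful property of this model is that access decisions follow the metadata: tagging a new column `pii` restricts it everywhere at once, with no per-table grant changes.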
Each of these guiding principles aims to get data out of closed systems that cannot be easily shared across the DoD. Tenets 3 and 4 set a direct course toward modern data platforms like Databricks, which store datasets in low-cost object storage, separated from compute. This allows flexibility in choosing your compute tier and, more importantly, makes sharing data far easier than when it is locked in proprietary databases within agency and department walls.
Using open source technologies and uncoupling storage from compute
By moving data out of proprietary databases -- where it needs to be extracted and loaded into another system -- and into a data lake model, sharing data within the DoD becomes a much easier task. While there is always a lot of data gravity involved when you have petabytes and exabytes of data, modern storage makes sharing much easier.
Traditionally, analytics, business intelligence, data science and machine learning workloads ran on separate systems, creating silos within an organization. With the data lakehouse architecture, Databricks brings all of those tools together in one open system.
Databricks for DoD
The Databricks platform is built on top of Delta Lake, an open source storage layer that brings reliability, security and performance to your data lake for streaming and batch workloads. It provides a single location where you can store structured data like CSV files, semi-structured data like JSON or XML, and unstructured files like video and audio. Delta Lake provides a single source of truth, with the transactional support and schema enforcement that many data lakes and other big data platforms lack.
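Schema enforcement is what separates a governed lakehouse table from a raw file dump: a write that does not match the table's declared schema is rejected outright rather than silently corrupting downstream data. The pure-Python sketch below illustrates that behavior in miniature; the schema, field names and records are hypothetical, and Delta Lake performs this validation transactionally on every write rather than with application code like this.

```python
# Sketch of Delta Lake-style schema enforcement: an append either
# matches the declared schema or the whole batch is rejected.
# Schema, field names and records are illustrative examples.

SCHEMA = {"mission_id": int, "sensor": str, "reading": float}

def append(table: list[dict], records: list[dict]) -> None:
    """Validate every record against SCHEMA before committing any of
    them, all-or-nothing, like a transactional write."""
    for rec in records:
        if set(rec) != set(SCHEMA):
            raise ValueError(f"schema mismatch: {sorted(rec)}")
        for field, expected in SCHEMA.items():
            if not isinstance(rec[field], expected):
                raise ValueError(f"bad type for {field!r}")
    table.extend(records)  # commit only after all records validate

table: list[dict] = []
append(table, [{"mission_id": 1, "sensor": "ir", "reading": 0.42}])

try:  # extra column -> whole batch rejected, table unchanged
    append(table, [{"mission_id": 2, "sensor": "ir",
                    "reading": 1.0, "extra": "oops"}])
except ValueError as e:
    print("rejected:", e)

print(len(table))  # still 1 committed row
```

Because validation happens before the commit, a bad batch leaves the table exactly as it was, which is the guarantee that makes the table a trustworthy single source of truth.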
Databricks also provides a single interface for managing access to the data assets in the catalog. In addition to multicloud support, Databricks implements best practices for secure authentication, access management and data protection, and meets demanding federal compliance standards such as FedRAMP High and DoD Impact Level 6 (IL6).
Making the shift
Becoming more open with data can be a challenge both organizationally and technically, but the shift becomes easier with broad organizational support. Databricks can help the DoD meet these goals and support better data sharing throughout the department in the following ways:
- Enabling better access to, and easier sharing of, key data assets across the DoD.
- Federating data assets into a catalog that is accessible throughout the organization.
- Building on robust open source technologies like Apache Spark™ and Delta Lake.
- Decoupling storage from compute, allowing great flexibility in the tools used for data analysis and reducing data gravity.
- Providing strong security controls that meet the highest-level DoD standards.
To learn more about how Databricks enables the DoD to create a data advantage, visit our Federal solution page.