Database modeling: a practical guide to techniques and best practices

Learn the phases, models and best practices of effective database design

by Databricks Staff

  • Database modeling is the process of defining a database's structure, relationships and constraints before implementation begins, serving as a blueprint that helps teams align on requirements and avoid costly design mistakes.
  • The modeling process moves through three phases — conceptual, logical and physical — progressing from high-level entity mapping to a fully optimized, platform-specific schema ready for implementation.
  • Choosing the right database model used to mean picking between relational, NoSQL or dimensional, but modern lakehouse platforms like Databricks Lakebase collapse that decision, letting teams run transactional and analytical workloads on a single, unified platform without forcing the tradeoff.

The database landscape is shifting. For decades, teams had to choose between systems for transactions, analytics and flexible data structures. This separation shaped how organizations built applications and data architectures.

AI agents and real‑time applications are collapsing the boundaries between transactional and analytical workloads. As these systems grow more capable, the decisions made at the modeling stage matter more than ever. A well-structured schema can determine what downstream analytics, BI and machine learning can actually do with the data.

Database modeling is the process of defining structure, relationships and constraints before a database is built. It provides the blueprint that keeps systems coherent, whether they’re serving OLTP workloads, powering dashboards or feeding feature pipelines. Modeling is how teams ensure that data remains consistent, interpretable and scalable as the system evolves.

It’s worth noting that database modeling sits within the broader field of data modeling, which includes conceptual domain modeling, semantic layers, governance and analytical design. (For a deeper overview of that broader discipline, see the Databricks guide to data modeling.) This blog focuses on the key phases of the database modeling process, best practices and common mistakes, and how the process plays out in modern, cloud-native environments.

The database design process

Conceptual design

The conceptual design phase for building a database establishes its foundation. At this stage, the focus is on identifying the real‑world entities that the organization cares about, such as customers, orders, products or accounts, and defining how they relate to one another. These entities and relationships help business stakeholders, analysts and technical teams align on what the database needs to do.

Conceptual design avoids technical detail. Instead, the emphasis is on accuracy and clarity: capturing the essential structure of the business domain in a way that is easy to discuss and validate. This makes conceptual models a communication tool as much as a design artifact, helping teams surface gaps, resolve ambiguities and ensure that the data model reflects how the business actually operates.

The primary output of this phase is a conceptual entity‑relationship diagram (ERD) or a simple entity map. A strong conceptual design provides the blueprint for the more detailed modeling work that follows.

Logical design

The logical design phase adds structure and precision to the conceptual model while remaining independent of any specific database technology. In this phase, each entity is expanded into a fully defined data object, including attributes, data types and constraints. Designers identify primary keys that uniquely identify each record and foreign keys that establish referential integrity between related entities. Relationship cardinality — one‑to‑one, one‑to‑many or many‑to‑many — is explicitly mapped to reflect how the data behaves in the real world.

This phase is also where normalization principles begin to shape the model. Redundant attributes are removed, composite fields are broken into their various components and relationships are reorganized to reduce anomalies and improve data quality. The goal is to create a logical structure that is internally consistent, scalable and aligned with the organization’s analytical and operational needs.

Even with this added detail, the logical model remains technology‑agnostic. It doesn’t assume any specific database engine or storage system. Instead, it defines the data in a way that can be implemented across multiple systems. The output is a detailed ERD — including entities, attributes, keys and relationships — that is ready to be translated into a physical schema.

Physical design

Physical design transforms the logical model into a specific implementation tailored to a particular database management system. This is where tables, columns, indexes, constraints and storage parameters are defined according to the platform's capabilities and conventions. It is also where decisions about partitioning, clustering, file formats and distribution strategies come into play.

Performance optimization is a major focus here. Designers must evaluate indexing strategies, consider denormalization to support high‑volume analytical queries and plan for how data will be accessed, updated and stored.

The final output of physical design is an implementation‑ready schema, typically expressed as SQL DDL or an equivalent definition. This schema reflects not only the logical structure of the data but also the operational realities of the platform on which it will run.
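As a hedged illustration, the sketch below shows what a small piece of that output might look like in PostgreSQL-flavored SQL. The entities (accounts and logins), columns and index are hypothetical, and details such as identity columns, index options, partitioning and clustering syntax vary by platform.

    -- Hypothetical implementation-ready DDL (PostgreSQL-flavored; adapt for your engine)
    CREATE TABLE accounts (
        account_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- surrogate key
        email      VARCHAR(255) NOT NULL UNIQUE,
        created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
    );

    CREATE TABLE logins (
        login_id     BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        account_id   BIGINT NOT NULL REFERENCES accounts (account_id),
        logged_in_at TIMESTAMP NOT NULL
        -- on large platforms this table might also be partitioned or clustered by time
    );

    -- Index chosen for a known access pattern: recent logins for an account
    CREATE INDEX idx_logins_account_time ON logins (account_id, logged_in_at DESC);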

Choosing the right database model

Relational model

The relational model organizes data into tables made of rows and columns, with relationships enforced through primary and foreign keys. SQL provides a powerful, declarative way to query and join these tables, making relational systems ideal for workloads that require strong consistency, structured schemas and well‑defined relationships between entities.
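As a small illustration of that declarative style, the query below joins two assumed tables, customers and orders, to answer a business question; the table and column names are placeholders, not a prescribed schema.

    -- Hypothetical join: total order value per customer since the start of the year
    SELECT
        c.customer_id,
        c.full_name,
        SUM(o.total_amount) AS total_spend
    FROM customers AS c
    JOIN orders    AS o
      ON o.customer_id = c.customer_id      -- relationship backed by a foreign key
    WHERE o.order_date >= DATE '2025-01-01'
    GROUP BY c.customer_id, c.full_name
    ORDER BY total_spend DESC;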

Because of their reliability and maturity, relational databases remain the most widely adopted option across industries, powering everything from transactional systems to operational analytics. The relational model continues to evolve with cloud‑native capabilities, advanced indexing strategies and increasingly sophisticated query optimizers.

Document and NoSQL models

Document-oriented and NoSQL databases take a schema‑flexible approach, allowing data structures to evolve without rigid table definitions. These systems excel at handling unstructured or semi‑structured data, supporting rapid iteration and scaling horizontally across distributed environments. Their flexibility makes them well-suited for applications with frequently changing data shapes, high‑velocity ingestion or large‑scale distributed workloads.

However, this adaptability comes with tradeoffs: consistency guarantees may be weaker than in relational systems, and complex querying, especially involving multi‑document relationships, can be challenging. NoSQL models shine when agility, scale and schema evolution outweigh the need for strict relational structure.

Dimensional model

The dimensional model is purpose-built for analytics and data warehousing, organizing data into fact tables that store measurable events and dimension tables that provide descriptive context. Star and snowflake schemas simplify how analysts query data by aligning the structure with common business questions, enabling fast aggregations and intuitive navigation.

Because dimensional models are optimized for read‑heavy analytical workloads, they are not intended for transactional systems that require frequent updates or strict normalization. Instead, they support business intelligence (BI) tools, dashboarding and large‑scale analytical processing where clarity, performance and usability are essential. In modern lakehouse architectures, dimensional modeling continues to play a central role in shaping curated, analytics‑ready datasets.
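To make the star-schema idea concrete, here is a minimal, hypothetical sketch: one fact table of sales events surrounded by two dimension tables, followed by the kind of aggregation the structure is designed to serve. Names, columns and granularity are illustrative only.

    -- Illustrative star schema (table and column names are hypothetical)
    -- Dimension tables: descriptive context
    CREATE TABLE dim_date (
        date_key   INT PRIMARY KEY,           -- e.g., 20250115
        full_date  DATE NOT NULL,
        month_name VARCHAR(20) NOT NULL,
        year       INT NOT NULL
    );

    CREATE TABLE dim_product (
        product_key  INT PRIMARY KEY,
        product_name VARCHAR(200) NOT NULL,
        category     VARCHAR(100) NOT NULL    -- deliberately denormalized for fast scans
    );

    -- Fact table: measurable events keyed to the dimensions
    CREATE TABLE fact_sales (
        date_key    INT NOT NULL REFERENCES dim_date (date_key),
        product_key INT NOT NULL REFERENCES dim_product (product_key),
        quantity    INT NOT NULL,
        revenue     NUMERIC(12, 2) NOT NULL
    );

    -- A typical business question the star schema answers with simple joins
    SELECT d.year, p.category, SUM(f.revenue) AS total_revenue
    FROM fact_sales AS f
    JOIN dim_date AS d    ON f.date_key = d.date_key
    JOIN dim_product AS p ON f.product_key = p.product_key
    GROUP BY d.year, p.category;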

Hierarchical and network models

Hierarchical databases follow a tree‑like, parent‑child structure. Network models extend this approach by allowing many‑to‑many relationships through graph‑like connections. While historically important, both models are now mostly of academic or legacy interest. Their rigid traversal paths and limited flexibility make them a rare choice for new systems, though familiarity with them can provide useful context for understanding how modern models evolved.

How to match the model to the use case

Selecting the right database model depends on the shape of your data, the workload and your consistency requirements. Systems with structured, transactional data and complex relationships align naturally with the relational model. Applications that rely on flexible schemas, rapidly changing data structures or document‑centric storage benefit from document-oriented or NoSQL databases. Analytical workloads that power BI dashboards or reporting environments are best served by dimensional models designed for fast, predictable queries. When the core challenge involves highly interconnected data, such as social networks, recommendation engines or fraud detection, graph databases offer the best fit.

A simple decision matrix that maps workload type, data structure and consistency requirements to recommended models can help teams quickly narrow the options and choose the most effective approach.

Building ERDs

ERDs are the primary visual language of database modeling, providing a clear way to represent how data is structured and how different entities relate across the three design phases.

At their core, ERDs use a small set of visual elements: entities (typically rectangles), attributes (listed inside the entity or shown as connected ovals, depending on notation) and relationships (lines that describe how entities interact). These simple components make ERDs accessible to both technical and non‑technical stakeholders, which is why they are foundational in modern data modeling.

There are two major notation styles for building an ERD. Crow’s foot notation is the most widely used in industry because it visually encodes cardinality directly on the connecting lines. Chen notation, more common in academic settings, separates entities, attributes and relationships into distinct shapes, making it useful for teaching but less compact for real‑world design.

Regardless of notation style, the goal is the same: to create a shared, accurate representation of the data domain. A simple e‑commerce example illustrates how ERDs bring structure to a domain. A Customer places many Orders, and each Order belongs to exactly one Customer, forming a classic one‑to‑many relationship. Orders also contain multiple Products, and each Product can appear in many Orders. This many‑to‑many relationship is resolved through a junction table — Order_Items — which links Orders and Products while capturing additional details such as Quantity or Price at the time of purchase. Even in a small model, ERDs make these relationships explicit and easy to reason about.
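Translated into table definitions, that example might look like the following sketch (PostgreSQL-flavored, with hypothetical columns); the order_items junction table is what resolves the many-to-many relationship between orders and products.

    -- Hypothetical translation of the e-commerce ERD into tables
    CREATE TABLE customers (
        customer_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        full_name   VARCHAR(200) NOT NULL
    );

    CREATE TABLE orders (
        order_id    BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        customer_id BIGINT NOT NULL REFERENCES customers (customer_id),  -- one customer, many orders
        order_date  DATE NOT NULL
    );

    CREATE TABLE products (
        product_id   BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        product_name VARCHAR(200) NOT NULL
    );

    -- Junction table resolving the many-to-many between orders and products
    CREATE TABLE order_items (
        order_id   BIGINT NOT NULL REFERENCES orders (order_id),
        product_id BIGINT NOT NULL REFERENCES products (product_id),
        quantity   INT NOT NULL,
        unit_price NUMERIC(10, 2) NOT NULL,   -- price captured at time of purchase
        PRIMARY KEY (order_id, product_id)
    );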

Modern tooling makes ERD creation fast and collaborative. A wide range of diagramming and modeling tools support shared editing, versioning and export to SQL. The most effective workflow begins with a conceptual ERD to align stakeholders on entities and relationships, then progressively adds attributes, keys and constraints during the logical and physical design stages. This iterative refinement ensures the final schema is both technically sound and grounded in the real‑world processes it represents.

Applying normalization

Normalization is the process of structuring relational tables to eliminate redundant data and prevent the three classic anomalies that lead to inconsistencies over time: insertion, update and deletion anomalies. By organizing data so that each fact is stored once and referenced cleanly, normalized schemas improve integrity, reduce storage waste and make write operations predictable and safe.

The process is typically described through a series of normal forms. First normal form (1NF) requires that every column contain atomic values: no lists, nested structures or repeating groups inside a single row. Second normal form (2NF) builds on this by ensuring that every non‑key attribute depends on the entire primary key, eliminating partial dependencies that occur in tables with composite keys. Third normal form (3NF) goes a step further: non‑key attributes must not depend on other non‑key attributes, removing transitive dependencies that blur the boundaries between entities.

Here’s why normalization matters. Imagine a denormalized Orders table that repeats customer name, email and address on every row. Updating a customer’s email requires touching every order that the customer has placed. Plus, deleting their last order could accidentally erase their contact information. Normalizing this structure produces two tables, Customers and Orders, which are linked by Customer_ID. Customer details live in one place, orders reference them cleanly, and the anomalies disappear.
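A hedged sketch of that before and after: in the denormalized table, a simple email change becomes a multi-row update, while the normalized structure makes it a single-row change. Table and column names are illustrative.

    -- Illustrative sketch. Before: customer details repeated on every order row
    CREATE TABLE orders_denormalized (
        order_id       BIGINT PRIMARY KEY,
        order_date     DATE NOT NULL,
        customer_name  VARCHAR(200) NOT NULL,
        customer_email VARCHAR(255) NOT NULL   -- duplicated across all of a customer's orders
    );

    -- Updating one customer's email means touching every one of their orders
    UPDATE orders_denormalized
    SET customer_email = 'new@example.com'
    WHERE customer_email = 'old@example.com';

    -- After: each fact stored once and referenced by key
    CREATE TABLE customers (
        customer_id BIGINT PRIMARY KEY,
        full_name   VARCHAR(200) NOT NULL,
        email       VARCHAR(255) NOT NULL
    );

    CREATE TABLE orders (
        order_id    BIGINT PRIMARY KEY,
        customer_id BIGINT NOT NULL REFERENCES customers (customer_id),
        order_date  DATE NOT NULL
    );

    -- The same change is now a single-row update, and deleting an order
    -- can no longer erase the customer's contact information
    UPDATE customers SET email = 'new@example.com' WHERE customer_id = 42;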

Normalization is not an absolute rule, though. In read‑heavy analytical systems, especially data warehouses, designers often denormalize intentionally to reduce joins and simplify queries. Star schemas, for example, duplicate descriptive attributes in dimension tables to optimize scan performance.

The tradeoff is clear: normalization maximizes write integrity and storage efficiency, while denormalization maximizes read speed and query simplicity. The right balance depends on workload patterns and the system’s role in the broader architecture.

Database modeling best practices

Align all stakeholders on the database requirements

The most reliable database designs begin with a clear understanding of requirements: the business processes, access patterns and constraints the database must support. Choosing a model type or opening a diagramming tool too early often leads to structures that look tidy on paper but fail under real workloads. Grounding the design in real use cases ensures the schema reflects how data actually moves through the system.

Create and document consistent naming conventions

Clear, consistent naming conventions make a schema self‑documenting. Tables and columns should communicate their purpose without guesswork. For instance, customer_id is immediately understandable, while cid is not. Naming consistency also improves query readability and reduces onboarding time for new developers.

Choose a well‑defined primary key

Surrogate keys, such as auto‑incrementing integers or UUIDs, are often more reliable than natural keys, which can change or become ambiguous over time. Composite keys work in some cases, but they complicate joins and should be used only when they reflect a genuine business rule.
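For instance, a minimal sketch (PostgreSQL-flavored, with hypothetical names) of a surrogate key used as the primary key while a natural identifier is kept as a unique attribute:

    -- Hypothetical employees table (PostgreSQL-flavored)
    CREATE TABLE employees (
        employee_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- surrogate key: stable and meaningless
        -- natural identifier kept unique, but not used as the primary key
        -- because it can change or be re-issued over time
        national_id VARCHAR(20) NOT NULL UNIQUE,
        full_name   VARCHAR(200) NOT NULL
    );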

Relationships and constraints should be explicit

Foreign keys, unique constraints and NOT NULL rules enforce the integrity the model was designed to protect. When these rules live only in tribal knowledge or application code, inconsistencies inevitably creep in.
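A short sketch of pushing those rules into the schema itself rather than leaving them to application code (illustrative names; the constraint syntax shown is broadly portable ANSI SQL):

    -- Illustrative tables; constraint names and business rules are hypothetical
    CREATE TABLE accounts (
        account_id BIGINT PRIMARY KEY
    );

    CREATE TABLE invoices (
        invoice_id BIGINT PRIMARY KEY,
        account_id BIGINT NOT NULL,             -- required: no invoice without an account
        invoice_no VARCHAR(30) NOT NULL,
        amount     NUMERIC(12, 2) NOT NULL,
        CONSTRAINT uq_invoice_no      UNIQUE (invoice_no),       -- business rule: numbers never repeat
        CONSTRAINT ck_amount_positive CHECK (amount > 0),        -- integrity enforced by the database
        CONSTRAINT fk_invoice_account FOREIGN KEY (account_id)
            REFERENCES accounts (account_id)                     -- no orphaned invoices
    );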

Consider future needs and scale

A balanced design aligns with workload patterns and the system’s role while anticipating growth, but without drifting into over‑engineering. Excessive normalization can create schemas that require dozens of joins for simple queries, while skipping normalization entirely can lead to redundancy and anomalies.

Validating the model with sample queries is one of the most effective ways to expose design flaws early. Testing common reporting queries, transactional lookups and filtering patterns reveals whether the structure supports real usage efficiently.
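One lightweight way to do this is to run representative queries against the draft schema, loaded with realistic sample data, and inspect the plans. The query below is a hypothetical reporting pattern that assumes the customers and orders tables sketched earlier; EXPLAIN output and tuning options vary by engine.

    -- Hypothetical reporting query: does the draft schema answer a common
    -- question without excessive joins or full scans?
    EXPLAIN
    SELECT c.full_name, COUNT(*) AS orders_this_quarter
    FROM customers AS c
    JOIN orders    AS o ON o.customer_id = c.customer_id
    WHERE o.order_date >= DATE '2025-07-01'
    GROUP BY c.full_name;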

Plan for schema evolution

Schemas evolve, so treat them like application code. Version‑controlling DDL changes, especially alongside migrations, creates a reliable history, supports collaboration and prevents drift between environments.
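In practice this often means small, ordered migration scripts checked into the same repository as the application. The file name, change and rollback below are hypothetical, and most migration tools impose their own conventions.

    -- migrations/0042_add_loyalty_tier_to_customers.sql (hypothetical file name)
    -- Forward migration: an additive change that is safe to apply without downtime
    ALTER TABLE customers
        ADD COLUMN loyalty_tier VARCHAR(20) NOT NULL DEFAULT 'standard';

    -- Rollback, kept alongside the forward script by many migration tools:
    -- ALTER TABLE customers DROP COLUMN loyalty_tier;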

Common database modeling mistakes

Many modeling problems originate from skipping foundational steps or making assumptions that don’t hold once the system grows. A few patterns recur across teams and technologies, so recognizing them early can help prevent costly structural issues later.

One of the most common pitfalls is jumping straight to physical design, i.e., creating tables, indexes or diagrams without first defining the conceptual and logical models. This leads to schemas optimized for a single query or feature rather than the broader domain, and can eventually create brittle structures that resist change.

Closely related is the issue of missing or incorrect foreign keys. When relationships aren’t explicitly defined, orphaned records accumulate, joins break and data integrity becomes dependent on application logic rather than the database itself.

Naming inconsistencies also cause friction that compounds over time, generating bugs and onboarding headaches.

Several mistakes stem from misunderstanding normalization. Over‑normalizing transactional systems can turn simple operations into multi‑table join chains, degrading performance. Under‑normalizing analytical systems has the opposite effect: it forces downstream ETL jobs to reshape data that should have been modeled for read‑heavy workloads in the first place.

Other recurring issues include:

  • Ignoring indexing until performance degrades — index strategy belongs in physical design, not in emergency triage
  • Not accounting for NULL behavior — unclear or inconsistent NULL handling leads to unpredictable query results and application errors, as the short sketch after this list illustrates
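To illustrate the NULL point, the first query below silently drops orders that have no discount code, because in standard SQL a comparison with NULL evaluates to UNKNOWN rather than TRUE; the table and column names are hypothetical.

    -- Hypothetical orders table: rows with a NULL discount_code are excluded
    -- here, often unintentionally
    SELECT order_id
    FROM orders
    WHERE discount_code <> 'SUMMER24';

    -- Making the NULL case explicit removes the surprise
    SELECT order_id
    FROM orders
    WHERE discount_code <> 'SUMMER24'
       OR discount_code IS NULL;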

Avoiding these mistakes requires discipline in early design stages and a willingness to validate assumptions before implementation.

Putting database modeling into practice

Strong database modeling is the foundation that keeps systems clear, consistent and adaptable as they grow. Principles such as requirements‑driven design, normalization, explicit constraints and balanced physical modeling remain essential regardless of scale, workload type or architectural pattern.

What has changed is the environment in which these models now operate. The long‑standing practice of choosing between a transactional database and an analytical system is becoming less common thanks to platforms that can support both. Modern applications need ACID‑reliable operations and large‑scale analytics, and maintaining separate systems for each carries significant costs in infrastructure, latency and engineering overhead.

Databricks Lakebase is built to address this change. Working alongside the ACID-compliant capabilities already part of the Databricks Lakehouse Architecture, Lakebase adds a fully featured transactional database engine designed for high‑concurrency operational workloads. This enables the schemas you design (using the techniques in this guide) to power operational applications and analytical workloads on a single platform. No duplication, no parallel architectures, no compromise.

If your team wants to move beyond maintaining separate systems and instead build on a unified platform where one database model serves every workload, it’s time to learn more about Databricks Lakebase.
