What is a Vector Database?

by Databricks Staff

A vector database is a specialized database designed to store and manage data as high-dimensional vectors. The term comes from vectors, which are mathematical representations of features or attributes contained in data. In contrast to traditional databases, which are well suited to handling structured data organized in rows and columns, the vector database structure arranges information as vector representations with a fixed number of dimensions grouped according to their similarity.

Each vector within a vector database consists of a specific number of dimensions, which can vary from just a few dozen to several thousand. The number of dimensions depends on the complexity and granularity of the data. This structure allows vector databases to efficiently handle complex, multifaceted information and perform rapid similarity-based searches and analyses.

When would I use a vector database?

According to International Data Corporation (IDC), 90% of the new data created is unstructured data, such as text, images and video. Learning-based models, such as deep neural networks, are increasingly used to manage this unstructured data for applications across industries, from e-commerce to healthcare. These applications work by turning the unstructured data into embedding vectors. Once the data has been “vectorized,” tasks such as search, recommendation and analysis can be implemented via similarity-based vector search. The management of vector data takes place in vector databases.

Knowing when to use vector databases depends on the other processes and technologies you are using. They are a key component in powering many AI systems, and some (but not all) large language model (LLM) applications use vector databases for fast similarity searches or to provide context or domain knowledge. For example, they play a crucial role in retrieval augmented generation (RAG), an approach where the vector database is used to enhance the prompt passed to the LLM by adding additional context alongside the query.

Vector databases also enable hybrid search. This approach combines traditional keyword-based search with semantic similarity search to locate relevant information even when keywords are not an exact match. Vector databases can also be used for a number of natural language processing (NLP) tasks, including semantic and sentiment analysis, or in training machine learning (ML) models.

What is a vector?

A vector is a high-dimensional numerical array that expresses the location of a particular point across several dimensions. Picture a word vector space as a three-dimensional cloud where words are represented as points. In this space, words with related meanings cluster together. For example, the point representing “apple” would be positioned closer to “pear” than to “car.” This spatial arrangement reflects the semantic relationships between words, with proximity indicating similarity in meaning.
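
The “apple,” “pear” and “car” picture can be made concrete with a small sketch. The three-dimensional coordinates below are invented for illustration (real embeddings come from a trained model and usually have far more dimensions), so the sketch only shows how proximity can be measured as straight-line distance.

```python
import math

# Hypothetical 3-D word vectors; the coordinates are made up for illustration.
words = {
    "apple": (0.9, 0.8, 0.1),
    "pear": (0.8, 0.9, 0.2),
    "car": (0.1, 0.2, 0.9),
}


def distance(a: str, b: str) -> float:
    """Euclidean (straight-line) distance between two word vectors."""
    return math.dist(words[a], words[b])


apple_to_pear = distance("apple", "pear")
apple_to_car = distance("apple", "car")

# A smaller distance means the two words sit closer together in the space.
print(f"apple-pear: {apple_to_pear:.3f}, apple-car: {apple_to_car:.3f}")
```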

What is vector embedding?

A vector is generated by applying an embedding function to the raw data to transform it into a representation. These representations are called “embeddings” because an ML model takes a representative grouping and embeds it into a vector space. The vectors are embedded as lists of numbers, making it easier for ML models to perform operations with the data. In fact, the performance of ML methods critically depends on the quality of the vector representations. A whole paragraph of text or a group of numbers can be reduced to a vector, allowing the model to perform operations efficiently.

Key terms and definitions

  • Vector: A sequence of numbers that represents an object — such as a word, image or document — as a point in multi-dimensional space, enabling algorithms to mathematically compare objects and compute how similar or different they are
  • Embedding: A learned vector representation that maps discrete objects (words, documents and images) into a continuous vector space, so that semantically similar items end up geometrically close to one another
  • Cosine similarity: Measures the cosine of the angle between two vectors, capturing how similar their directions are regardless of their size, with values ranging from −1 (opposites) to 1 (identical direction): cos(θ) = (A · B) / (‖A‖ × ‖B‖)
  • Euclidean distance: The straight-line distance between two points in vector space, measuring how far apart they are in absolute terms rather than by directional alignment: d(A, B) = √(Σᵢ (Aᵢ − Bᵢ)²)
  • Approximate nearest neighbor (ANN): A family of search algorithms that find vectors close to a query by scanning only a subset of the index, trading a small drop in accuracy for dramatically faster retrieval at scale
  • Hierarchical navigable small world (HNSW): A graph-based index that builds multiple layers of proximity connections, allowing queries to navigate quickly from coarse to fine neighbors
  • Inverted file index (IVF): Divides the vector space into clusters, then at query time searches only the nearest clusters, offering a practical balance between index build time and query speed
  • Locality-sensitive hashing (LSH): Hashes vectors so that similar ones are likely to land in the same bucket, enabling fast approximate search with low memory overhead
  • Metadata filtering: The practice of narrowing vector search results using structured attributes, such as date, category or user ID, so that results must satisfy hard business rules, not just semantic similarity
  • Hybrid search: Combines dense vector search (semantic meaning) with sparse keyword search (exact-match relevance via BM25/TF-IDF), then merges the two ranked lists, typically using Reciprocal Rank Fusion (RRF), to get the best of both approaches
  • Multi-vector search: Represents each record with several separate vectors (such as one each for the title, body and image) and searches across all of them, aggregating scores to surface the single most relevant result
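
The cosine similarity and Euclidean distance definitions above can be checked with a few lines of Python. This is a minimal sketch with made-up example vectors, assuming non-zero inputs; it illustrates that cosine similarity ignores vector length while Euclidean distance does not.

```python
import math


def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (|A| * |B|); assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """d(A, B) = sqrt(sum_i (A_i - B_i)^2)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]     # same direction as a, twice the length
c = [-1.0, -2.0, -3.0]  # opposite direction to a

same_direction = cosine_similarity(a, b)  # close to 1: directions match
opposite = cosine_similarity(a, c)        # close to -1: directions oppose
apart = euclidean_distance(a, b)          # non-zero: the points differ in position
```

Because `a` and `b` point the same way, their cosine similarity is about 1 even though they are a distance of √14 apart, which is why the choice of metric matters when vector magnitudes carry no meaning.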

How do vector databases work?

Vector databases are designed to efficiently store, index and query data as high-dimensional vector embeddings. Data is vectorized and indexed as it is ingested, and each incoming query is vectorized in the same way before it is compared against the index. The overall sequence is as follows:

  1. Vectorization: This first step involves generating embeddings from multimodal content, which could include text, images, audio or video. This process captures the semantic relationships in the data. For example, in text data, words with similar meanings are mapped to vectors that sit close to each other in the vector space.
  2. Vector indexing: The next step sets vector databases apart from traditional databases. ML algorithms, such as product quantization or HNSW, are applied to the data to map the vectors to new data structures. These structures enable faster similarity or distance searches, such as nearest neighbor searches between vectors. This indexing process is essential for the database’s performance, as it allows for quick retrieval of similar vectors.
  3. Query execution: In the final stage, the initial query vector is compared against the indexed vectors in the database. The system retrieves the vectors with the strongest relationships, effectively finding the most relevant information based on semantic similarity rather than exact keyword matches.

These processes allow vector databases to perform semantic searches and similarity-based retrievals, making them ideal for applications like recommendation systems, image and video recognition, text analysis and anomaly detection.
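
The three stages can be sketched end to end in a few lines. Everything here is a simplifying assumption: `embed` is a toy stand-in for a real embedding model (it only counts words from a small hypothetical vocabulary, so it captures word overlap rather than meaning), and the "index" is a plain dictionary scanned exhaustively where a real system would build an ANN structure such as HNSW.

```python
import math

# Hypothetical vocabulary for the toy embedding function.
VOCAB = ["vector", "database", "search", "image", "recipe", "pasta"]


def embed(text):
    """Toy stand-in for an embedding model: normalized word counts over VOCAB."""
    tokens = text.lower().split()
    counts = [float(tokens.count(term)) for term in VOCAB]
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0  # avoid dividing by zero
    return [c / norm for c in counts]


documents = {
    "doc1": "vector database search",
    "doc2": "image search",
    "doc3": "pasta recipe",
}

# Steps 1 and 2: vectorize every document and store the vectors.
# A real vector database would build an ANN index here instead of a dict.
index = {doc_id: embed(text) for doc_id, text in documents.items()}


def query(text, k=2):
    """Step 3: embed the query and rank stored vectors by dot product."""
    q = embed(text)
    scored = [
        (sum(a * b for a, b in zip(q, vec)), doc_id)
        for doc_id, vec in index.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


results = query("search a vector database")
```

Because the vectors are normalized to unit length, the dot product used for ranking is equivalent to cosine similarity.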

Benefits of vector databases

Vector databases offer a range of benefits:

  • High speed and performance: Vector databases can rapidly locate similar data using vector distance or similarity metrics, a process that is integral to NLP, computer vision and recommendation systems. Unlike traditional databases, which are limited to exact matches or predefined criteria, vector databases capture semantic and contextual meaning. This optimizes data retrieval by enabling the performance of more nuanced, context-aware searches beyond simple keyword matching.
  • Scalability: While traditional databases may face challenges with scalability bottlenecks, latency issues or concurrency conflicts when dealing with big data, vector databases are built to handle vast amounts of data. Vector databases enhance scalability by using techniques like sharding, partitioning, caching and replication to distribute the workload and optimize resource utilization across multiple machines or clusters.
  • Versatility: Whether the data contains images, videos or other multimodal data, vector databases are built to be versatile. Given their ability to handle multiple use cases, ranging from semantic search to conversational AI applications, vector databases can be customized to meet a variety of business requirements.
  • Cost-effectiveness: Vector databases offer lower costs due to their efficient handling of high-dimensional data. Unlike querying ML models directly, which can be computationally intensive and time-consuming, vector databases use model embeddings to process the dataset more efficiently.
  • ML integration: Vector databases make it easier for ML models to recall previous inputs, allowing ML to power semantic search, classification and recommendation engines. Data can be identified based on similarity metrics instead of exact matches, making it possible for a model to understand the context of the data.

Five vector database use cases

Vector databases are used across industries for a diverse range of applications and use cases. Here are some of the most common vector database examples:

Large language models (LLMs)

The rise of LLMs for tasks like information retrieval, alongside the increased popularity of e-commerce and recommendation platforms, requires vector database management systems that can deliver query optimization capabilities for unstructured data.

In multimodal applications, data is embedded and stored in vector databases, facilitating efficient retrieval of vector representations. When a user submits a text query, the system uses both the LLM and the vector database. The LLM provides NLP capabilities, while the vector database’s algorithms perform ANN searches. This approach can produce better results compared to using either component in isolation.

Vector databases are increasingly being applied to LLMs through RAG, which allows for increased explainability by applying context to LLM outputs. User prompts can be augmented through the inclusion of context to mitigate core LLM challenges, such as hallucination or bias.

Image recognition

Vector databases can play a key role in image recognition by storing high-dimensional embeddings of images generated by ML models. As vector databases are optimized for similarity search tasks, this makes them ideal for applications such as object detection, facial recognition and image search.

Vector databases are fine-tuned for the rapid retrieval of context through similarity. E-commerce platforms can use vector databases to find products with similar visual attributes, while social media sites can suggest related images to users. An illustrative example is Pinterest, where vector databases power content discovery by representing each image as a high-dimensional vector. When a user pins an image of a coastal sunset, the system can swiftly search its vector database to suggest visually similar images, like other beach landscapes or sunsets.

Natural language processing (NLP)

Vector databases have revolutionized NLP by enabling efficient storage and retrieval of distributed word representations. Models like Word2Vec, GloVe and BERT are trained on massive text datasets to generate high-dimensional word embeddings that capture semantic relationships, which are then stored in vector databases for fast access.

As they enable rapid similarity searches, vector databases allow models to find contextually relevant words or phrases. This capability is particularly valuable for tasks like semantic search, question answering, text classification and named entity extraction. Moreover, vector databases can store sentence-level embeddings, capturing word contexts and enabling more nuanced language understanding.

Recommendation systems and personalization

Once a vector database has been populated with item embeddings generated by an embedding model, it can be used to generate personalized recommendations. When a user interacts with the system, their behavior and preferences are used to generate the user’s embedding. For example, a user can ask an LLM for a TV series recommendation, and the vector database can suggest TV series that have plots or ratings similar to the user’s preferences. The TV series whose embeddings are closest to the user’s embedding are then recommended.
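
As a rough sketch of this idea, the example below builds a user embedding by averaging the embeddings of series the user has already watched, then recommends the unwatched series closest to it. Averaging is only one simple assumption about how a user embedding might be formed, and the series names and vector values are hypothetical.

```python
import math

# Hypothetical 3-D embeddings for TV series (values invented for illustration).
series = {
    "space_drama": [0.9, 0.1, 0.2],
    "alien_thriller": [0.8, 0.2, 0.3],
    "baking_show": [0.1, 0.9, 0.1],
    "cooking_contest": [0.2, 0.8, 0.2],
}


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def user_embedding(watched):
    """Average the embeddings of the series a user has watched."""
    vectors = [series[name] for name in watched]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def recommend(watched, k=1):
    """Return the k unwatched series most similar to the user embedding."""
    user = user_embedding(watched)
    candidates = [name for name in series if name not in watched]
    candidates.sort(key=lambda name: cosine(user, series[name]), reverse=True)
    return candidates[:k]


recommendations = recommend(["space_drama"])
```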

Fraud detection

Financial institutions use vector databases to detect fraudulent transactions. By encoding transaction data as vectors, these databases can identify patterns that indicate fraudulent activity and compare transaction vectors with known fraud patterns in real time. The scalability of vector databases also allows institutions to manage risk and gain new insights into consumer behavior. Furthermore, they facilitate the evaluation of creditworthiness and consumer segmentation by analyzing data to improve the decision-making process.

Common challenges of vector databases

Despite their many benefits and use cases, a complete understanding of vector databases needs to include their challenges as well.

New data pipelines

Vector databases require efficient data ingestion pipelines where raw, unprocessed data from various sources can be cleaned, processed and embedded with an ML model before it is stored as vectors in the database.

Databricks AI Search offers a comprehensive solution for this challenge. It automates vector generation, management and optimization, handling real-time synchronization of source data with corresponding vector indices. The software manages failures, optimizes throughput and performs automatic batch size tuning and autoscaling without the need for manual intervention.

This approach reduces the need for separate data ingestion pipelines, minimizing “developer toil” and allowing teams to focus on higher-level tasks that directly add business value rather than spending time on building and maintaining complex data preparation processes.

Increased security and governance

Vector databases require additional security, access controls and data governance, along with the necessary maintenance and management. Enterprise organizations require strict security and access controls over data so that unauthorized users cannot access GenAI models that link to confidential data.

Many current vector databases either do not have robust security and access controls in place or require organizations to build and maintain a separate set of security policies. Databricks AI Search provides a unified interface that defines data policies to track data lineage automatically without the need for additional tools. This ensures LLMs won’t expose confidential data to users who shouldn’t have access.

High level of technical knowledge

As they offer powerful capabilities for similarity searches and the handling of high-dimensional data, vector databases are essential tools for data scientists working with AI and ML models. Databricks AI Search stands out as a serverless vector database that eliminates the need for manual configuration, allowing data scientists to focus on core work rather than infrastructure management.

Key advantages of Databricks AI Search include seamless integration with lakehouse architecture, automated data ingestion and up to five times faster results compared to other popular vector databases. It is also compatible with existing data governance and security tools through Unity Catalog, ensuring data protection and compliance.

Databricks AI Search offers flexibility for both novice and advanced users, with automated scaling for data ingestion and querying, as well as plug-and-replace APIs for those who prefer more control over their pipelines. This combination of ease of use and powerful performance simplifies building a vector database for data scientists at all levels of expertise.

Vector databases vs. graph databases

Vector databases organize data as points in a multidimensional vector space. Each point represents a piece of data, and the location reflects its characteristics relative to other pieces of data. This vector database structure is well suited to many GenAI applications, as vector embeddings are generated by LLMs and data can be searched and retrieved easily.

By contrast, graph databases organize data by storing it in a graph structure. Entities are represented as nodes on a graph, while the connections between these data points are represented as edges. The graph structure enables the data items in the store to be a collection of nodes and edges, with the edges representing the relationships between the nodes. The interconnected structure of graph databases makes them well suited for scenarios where the connections between data points are as important as the data itself.

Comparison: Vector database vs. vector index vs. traditional RDBMS vs. graph DB

Use this table to quickly compare how each database type stores data, handles queries and fits different workloads.

| | Vector database | Vector index | Traditional RDBMS | Graph DB |
| Data model | Vector embeddings plus metadata | Vector embeddings only | Structured rows and columns | Nodes and edges |
| Query types | Similarity (nearest-neighbor) search, often with metadata filters | Similarity (nearest-neighbor) search | Exact matches and predefined criteria | Relationship traversal |
| Transactional guarantees | Eventual consistency typical | None, read-only search layer | Full ACID | ACID (varies by tool) |
| Governance / security | Improving, varies by vendor | Minimal, relies on host system | Mature RBAC, audit logs, encryption | Moderate, varies by vendor |
| Common tools | Pinecone, Weaviate, Qdrant | FAISS, hnswlib, ScaNN | PostgreSQL, MySQL, SQL Server | Neo4j, Amazon Neptune, ArangoDB |

What’s the difference between a vector index and a vector database?

A vector index and a vector database serve distinct but complementary roles in handling high-dimensional data.

Vector index: A vector index is a specialized data structure designed to facilitate fast similarity searches among vector embeddings. It significantly enhances search speed by organizing vectors in a way that allows efficient retrieval. Examples of vector indices include Facebook AI Similarity Search (FAISS), HNSW and LSH. These indices can be used as stand-alone algorithmic processes or integrated into larger systems to optimize search operations.
Vector database: A vector database is a comprehensive data management solution that not only incorporates vector indexing but also provides additional functionalities like data storage; create, read, update and delete (CRUD) operations; metadata filtering and horizontal scaling. It is designed to manage and query vector embeddings efficiently, supporting complex operations and ensuring data integrity and security.
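
To illustrate the distinction, the sketch below wraps a brute-force nearest-neighbor scan (standing in for a vector index) with the kind of features a vector database layers on top: upserts, deletes and metadata filtering. `TinyVectorStore` and its method names are hypothetical and do not correspond to any real product's API.

```python
import math


class TinyVectorStore:
    """Illustrative only: a brute-force 'index' plus CRUD and metadata filtering."""

    def __init__(self):
        self._rows = {}  # item id -> (vector, metadata)

    def upsert(self, item_id, vector, metadata):
        """Create the row, or overwrite it if the id already exists."""
        self._rows[item_id] = (list(vector), dict(metadata))

    def delete(self, item_id):
        """Remove a row; deleting a missing id is a no-op."""
        self._rows.pop(item_id, None)

    def search(self, query, k=3, where=None):
        """Return the k nearest ids by Euclidean distance, keeping only rows
        whose metadata matches every key/value pair in `where`."""
        where = where or {}
        hits = []
        for item_id, (vector, metadata) in self._rows.items():
            if all(metadata.get(key) == value for key, value in where.items()):
                hits.append((math.dist(query, vector), item_id))
        hits.sort()  # smallest distance first
        return [item_id for _, item_id in hits[:k]]


store = TinyVectorStore()
store.upsert("a", [0.0, 0.0], {"category": "shoes"})
store.upsert("b", [0.1, 0.0], {"category": "hats"})
store.upsert("c", [0.2, 0.0], {"category": "shoes"})

nearest = store.search([0.0, 0.0], k=2)
filtered = store.search([0.0, 0.0], k=2, where={"category": "shoes"})

store.delete("a")
after_delete = store.search([0.0, 0.0], k=2)
```

A production system would add persistence, access controls and distribution across machines, none of which this sketch attempts.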

How to choose a vector database

Choosing the right vector database depends on your specific workload demands, how large you expect your data to grow and how well the database fits into your existing technology stack. A solution that works perfectly for a small prototype may struggle under enterprise-scale traffic, while a feature-rich platform might be overly complex for simpler use cases. Keep these criteria in mind to choose a vector database that scales with your needs and plays well with existing systems.

  • Performance & latency: Understand what level of search accuracy (recall) and query response time are acceptable for your use case
  • Embedding dimensionality support: Make sure the database can handle the output size of your specific AI model, whether that's 768, 1536 or even higher
  • Supported index types: Confirm the database offers the right indexing algorithms for your data, such as HNSW, IVF or LSH, since these directly affect speed and accuracy tradeoffs
  • Hybrid search: Look for the ability to combine traditional keyword search (BM25) with semantic vector search in a single query
  • Exact + ANN fallback: Check whether you can switch between approximate and exact nearest-neighbor search, depending on how much precision you need
  • Metadata filtering: Ensure you can narrow results by structured fields like date or category alongside vector similarity
  • CRUD and ACID support: Evaluate whether the database supports full data operations and transactional guarantees, which is especially important if your data changes frequently
  • RBAC/ABAC and multitenancy: Verify that the database offers role- or attribute-based access controls and can keep different teams’ or customers' data properly isolated
  • Observability and evaluation: Look for built-in monitoring, logging and tools to measure search quality over time, so you can catch and fix performance issues early
  • Hardware acceleration: Consider whether GPU-accelerated indexing and search is supported and whether your current infrastructure can take advantage of it
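
The tradeoff behind the "Exact + ANN fallback" criterion can be seen in a toy IVF-style search. This is a minimal sketch with hypothetical 2-D data and hand-picked centroids (a real IVF index would typically learn its centroids with k-means); `nprobe` here simply names the number of clusters scanned per query.

```python
import math

# Hand-picked centroids and data points, invented for illustration.
centroids = [(0.0, 0.0), (10.0, 0.0)]
vectors = {
    "a": (1.0, 1.0),
    "b": (4.9, 0.0),
    "c": (6.0, 0.0),
    "d": (9.0, 1.0),
}


def nearest_centroid(vec):
    """Index of the centroid closest to vec."""
    return min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))


# Build the inverted lists: centroid index -> ids assigned to that cluster.
inverted_lists = {i: [] for i in range(len(centroids))}
for vid, vec in vectors.items():
    inverted_lists[nearest_centroid(vec)].append(vid)


def exact_search(query, k=1):
    """Compare the query against every stored vector."""
    ranked = sorted(vectors, key=lambda vid: math.dist(query, vectors[vid]))
    return ranked[:k]


def ivf_search(query, k=1, nprobe=1):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    probe = sorted(
        range(len(centroids)), key=lambda i: math.dist(query, centroids[i])
    )[:nprobe]
    candidates = [vid for i in probe for vid in inverted_lists[i]]
    candidates.sort(key=lambda vid: math.dist(query, vectors[vid]))
    return candidates[:k]


q = (5.2, 0.0)
exact = exact_search(q)              # true nearest neighbor
approx = ivf_search(q, nprobe=1)     # scans one cluster and misses it
fallback = ivf_search(q, nprobe=2)   # scanning every cluster matches exact search
```

The query falls just inside the second cluster while its true nearest neighbor, `b`, was assigned to the first, so a single-cluster probe returns `c` instead. Raising `nprobe` recovers the exact answer at the cost of more distance calculations.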

Common pitfalls and best practices

  • Embedding drift → Establish a regular re-embedding schedule so that as your source data or underlying models evolve, your vectors stay current and accurately reflect what you're searching
  • Unversioned embeddings → Track which model version generated which vectors so you can reliably reproduce results, compare performance over time and roll back if something goes wrong
  • Stale indexes → Define clear index refresh policies upfront, setting rebuild and update frequency based on how often your data changes 
  • Poor chunking for RAG → Test a range of chunk sizes (256–1024 tokens) with 10–20% overlap and evaluate retrieval quality at each setting
  • Near-duplicate content pollution → Run deduplication before indexing to remove redundant or near-identical content 
  • No evaluation metrics → Regularly benchmark using Recall@k, nDCG and MRR — aiming for benchmarks like Recall@10 above 0.85 for most production workloads — so you have a clear signal when search quality slips
  • PII exposure in embeddings → Mask or exclude sensitive personal data before it ever reaches the embedding stage and enforce fine-grained access controls on the vector store to limit who can query what
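
Two of these practices, overlapping chunks and retrieval metrics, are easy to prototype. The sketch below is one possible approach under simple assumptions: tokens are already available as a list, overlap is given as a token count, and relevance judgments are plain sets of document IDs. The function names and sample IDs are hypothetical.

```python
def chunk_tokens(tokens, chunk_size=256, overlap_tokens=38):
    """Split tokens into windows of chunk_size that share overlap_tokens
    with the previous window (38 of 256 is roughly 15% overlap)."""
    if not 0 <= overlap_tokens < chunk_size:
        raise ValueError("overlap_tokens must be in [0, chunk_size)")
    step = chunk_size - overlap_tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant ids that appear in the top-k retrieved ids."""
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant id, or 0.0 if none is retrieved.
    MRR is the mean of this value over a set of queries."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


chunks = chunk_tokens(list(range(25)), chunk_size=10, overlap_tokens=2)

retrieved = ["d3", "d1", "d7", "d2"]  # hypothetical ranked search results
relevant = {"d1", "d2"}               # hypothetical ground-truth labels
recall_3 = recall_at_k(retrieved, relevant, k=3)
rr = reciprocal_rank(retrieved, relevant)
```

Running the same metrics across several chunk sizes gives a like-for-like signal for choosing a chunking setting.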

Q&A

Vector database vs. vector index — what's the difference?

These two terms are often used interchangeably, but they refer to different layers of the system.

Scope: A vector index is a single data structure — like HNSW or IVF — optimized to speed up nearest-neighbor search. In contrast, a vector database is a full system built around one or more of these indexes along with storage and query capabilities.
CRUD support: Vector indexes often have limited or inefficient support for updates and deletes. Vector databases provide robust create, read, update and delete operations on top of the index layer.
Scaling: A stand-alone index lives in memory and doesn't manage distribution or replication. A vector database, however, handles horizontal scaling, sharding and persistence across infrastructure.
Stand-alone vs. integrated: Vector indexes can be embedded directly in application code (e.g., FAISS). Vector databases are services with APIs, access controls and management tooling built in.

Is a vector database required for RAG?

A vector database is a common choice for production RAG pipelines, but it isn't always necessary. The right answer depends on your scale and complexity.

  • For production RAG at scale, a vector database becomes valuable when you need persistent storage, metadata filtering, access controls and the ability to update your dataset over time
  • Multi-tenant or regulated environments almost always warrant a vector database, since they require tenant isolation, audit logging and fine-grained access controls that stand-alone indexes don't provide
  • When your dataset is static and small, the overhead of a vector database may outweigh the benefits, since a lightweight index loaded at startup can handle retrieval just as well
  • For prototyping, an in-memory index like FAISS or a simple file-based store is often sufficient and far easier to set up than a full vector database
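
For the small, static case, retrieval for RAG can be as simple as a list held in memory. The sketch below assumes chunk embeddings were computed ahead of time (the texts and vector values are hypothetical) and that the query vector comes from the same embedding model. It shows only the retrieval and prompt-augmentation steps, not the LLM call.

```python
import math

# Hypothetical pre-computed chunk embeddings (values invented for illustration).
chunks = [
    {"text": "Refunds are issued within 14 days.", "vector": [0.9, 0.1, 0.0]},
    {"text": "Shipping takes 3 to 5 business days.", "vector": [0.1, 0.9, 0.0]},
    {"text": "Support is available by email.", "vector": [0.0, 0.2, 0.9]},
]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vector, k=1):
    """Return the text of the k chunks most similar to the query vector."""
    ranked = sorted(
        chunks, key=lambda c: cosine(query_vector, c["vector"]), reverse=True
    )
    return [c["text"] for c in ranked[:k]]


def build_prompt(question, query_vector, k=1):
    """Augment the user's question with the retrieved context."""
    context = "\n".join(f"- {text}" for text in retrieve(query_vector, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# In practice the query vector would be produced by the embedding model.
prompt = build_prompt("How long do refunds take?", [0.8, 0.2, 0.1])
```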

How does hybrid (BM25 + vector) search work?

Hybrid search combines two fundamentally different retrieval signals — keyword matching and semantic similarity — into a single query result.

  • BM25 handles exact and keyword-based matches, scoring documents based on term frequency and relevance, which makes it reliable for precise queries like product names, codes or proper nouns
  • Vector search handles semantic matches, retrieving results based on meaning and context even when the query doesn't share exact words with the document
  • Score fusion merges the two signals into a single ranked list — Reciprocal Rank Fusion (RRF) is a common approach that combines rankings from each method without requiring careful score normalization
  • Hybrid search improves both precision and recall and is especially valuable in enterprise or domain-specific applications where users mix precise technical queries with broader conceptual searches
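
Score fusion with RRF is compact enough to sketch directly. Each document's fused score is the sum of 1 / (k + rank) over the lists in which it appears, where `k = 60` is a commonly used default constant. The document IDs and rankings below are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked id lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically by id.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_ranking = ["doc_a", "doc_b", "doc_c"]    # hypothetical keyword results
vector_ranking = ["doc_b", "doc_d", "doc_a"]  # hypothetical semantic results

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

Because only ranks are used, the BM25 scores and vector similarities never need to be placed on a common scale, and documents that appear in both lists (`doc_b` and `doc_a` here) rise above those found by only one method.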

When is a vector database unnecessary?

Vector databases add real operational overhead, and there are several scenarios where that complexity simply isn't justified.

  • Small datasets that fit in memory are usually better served by a lightweight in-memory index like FAISS or Annoy, which can be loaded directly into your application without deploying a separate service
  • Use cases where exact keyword search is sufficient — like internal document lookup by title or ID — don't benefit from semantic search, making a traditional search index or database a simpler and more reliable choice
  • When you're already running PostgreSQL, the pgvector extension adds vector similarity search directly to your existing database, eliminating the need for a separate vector store and reducing infrastructure complexity
  • Low-traffic or single-user applications rarely need the scaling, replication or multitenancy features that justify a dedicated vector database, so the operational cost outweighs the benefit
  • If your dataset is static or changes infrequently, rebuilding or reloading an index periodically can be simpler than maintaining a fully managed vector database

Future trends for vector databases

The recent rise of LLMs and GenAI applications more generally has driven a corresponding rise in the adoption of vector databases. As AI applications continue to mature, new products and changing user needs will shape future trends in vector databases. However, there are some generally expected directions for this technology.

Increased integration with ML models: The relationship between vector databases and ML models is the subject of increased research. These efforts aim to reduce the size and dimensionality of vectors, minimizing storage requirements for large datasets and boosting computational efficiency.
RAG customization: RAG is an approach used to improve the context provided to an LLM in GenAI use cases, including chatbot and general question-answer applications. The vector database is used to enhance the prompt passed to the LLM by adding extra context alongside the query.
Multi-vector search: Further research is expected on improving multi-vector search capabilities, which is important for applications such as face recognition. Current techniques often rely on combining individual scores, but this approach can be computationally expensive, as it increases the number of distance calculations required.
Hybrid search: The evolution of search systems has led to a growing adoption of hybrid approaches that combine traditional keyword-based methods with modern vector retrieval techniques.

How to create a vector database with Databricks

Databricks AI Search is Databricks’ integrated vector database solution for the Data Intelligence Platform. This fully integrated system eliminates the need for separate data ingestion pipelines and applies security controls and data governance mechanisms, ensuring consistent protection across all data assets. 

Databricks AI Search provides a high-performance, out-of-the-box experience, allowing LLMs to quickly retrieve relevant results with minimal latency. Users benefit from automatic scaling and optimization, removing the need for manual tuning of the database. This integration streamlines the process of storing, managing and querying vector embeddings, making it easier for organizations to implement AI applications, such as recommender systems and semantic searches, while maintaining data security and governance standards.

Where can I find more information about vector databases and vector search?

There are many resources available for learning more about vector databases and vector search, including blogs, eBooks and demos.

Contact Databricks to schedule a demo and talk to someone about your LLM and vector databases.
