RAG with Vector Databases: Embeddings, HNSW Indexing, and Filters
Have you ever asked an AI a question about your company’s internal policies, only to get a confident but completely wrong answer? That is the hallucination problem. Large Language Models (LLMs) are impressive, but they are not search engines. They predict the next word based on patterns in their training data, which means they often guess when they don’t know the facts. This is where Retrieval-Augmented Generation, or RAG, comes in. It is a technique that gives your AI access to real, up-to-date information before it answers.
RAG works by connecting your LLM to a vector database. Think of this database as the AI’s long-term memory. Instead of relying on what the model learned during its initial training, the system retrieves specific documents from your knowledge base and feeds them into the prompt. This grounds the AI in reality. To make this retrieval fast and accurate, we rely on three core technologies: embeddings, HNSW indexing, and metadata filters. Let’s break down how these pieces fit together to build reliable AI applications.
The Role of Embeddings in Semantic Search
Before a vector database can find relevant information, it needs to understand what words mean. Computers do not naturally understand language; they only understand numbers. Embeddings bridge this gap. An embedding is a mathematical representation of text, converted into a high-dimensional vector: a long array of numbers. These numbers capture the semantic meaning of the content. For example, the words "car" and "automobile" will have very similar vectors, even though they are different strings of text.
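To make this concrete, here is a minimal sketch using the open-source Sentence-Transformers library with the model discussed below; exact similarity scores will vary by model version:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L12-v2")  # 384-dimensional vectors

# Encode three words into embedding vectors.
vectors = model.encode(["car", "automobile", "banana"])

# Cosine similarity is near 1.0 for synonyms, lower for unrelated terms.
print(util.cos_sim(vectors[0], vectors[1]))  # car vs. automobile -> high
print(util.cos_sim(vectors[0], vectors[2]))  # car vs. banana     -> low
```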
In a RAG workflow, the process starts by taking your domain-specific data, such as PDFs, manuals, or chat logs, and breaking it into smaller chunks. You then run each chunk through an embedding model. Popular choices include Amazon Titan Text Embeddings V2 for enterprise environments or the open-source all-MiniLM-L12-v2 model from the Sentence-Transformers library. When a user asks a question, the system converts that query into a vector using the exact same model. The database then searches for vectors that are mathematically close to the query vector. This is why embeddings are the foundation of semantic search: they allow the system to find related concepts, not just matching keywords.
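The ingestion-and-query loop can be sketched in a few lines. This is a toy version with naive fixed-width chunking and a brute-force search; a real deployment would hand the search step to a vector database, and the file name here is purely illustrative:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L12-v2")

def chunk(text: str, size: int = 500) -> list[str]:
    # Naive fixed-width chunking; production systems usually split on
    # sentence or paragraph boundaries instead.
    return [text[i:i + size] for i in range(0, len(text), size)]

document = open("policy_manual.txt").read()  # hypothetical source file
chunks = chunk(document)
chunk_vectors = model.encode(chunks, normalize_embeddings=True)

# The query must be embedded with the SAME model used at ingestion.
query_vector = model.encode(["What is the refund policy?"],
                            normalize_embeddings=True)[0]

# With normalized vectors, the dot product equals cosine similarity.
scores = chunk_vectors @ query_vector
top_k = np.argsort(scores)[::-1][:3]
print([chunks[i] for i in top_k])
```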
Why HNSW Indexing Matters for Speed
Finding the nearest neighbors in a list of millions of vectors sounds simple, but it is computationally expensive. If you compare a query vector against every single document in your database one by one, it takes too long for real-time applications. This is where Hierarchical Navigable Small World, or HNSW, changes the game. HNSW is a graph-based indexing algorithm designed for approximate nearest neighbor search.
HNSW organizes data points into a multi-layered graph structure. Imagine a map with highways at the top layer and local streets at the bottom. When a query enters the system, HNSW starts at the top layer, quickly navigating across large distances to get close to the target area. As it moves down through the layers, the connections become finer and more detailed, allowing the system to pinpoint the most relevant results with high accuracy. This hierarchical approach strikes a powerful balance between speed and precision.
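You can see this behavior in isolation with the hnswlib library. The parameters below are illustrative defaults, not tuned values:

```python
# pip install hnswlib
import hnswlib
import numpy as np

dim, num_elements = 384, 100_000
data = np.random.rand(num_elements, dim).astype("float32")

index = hnswlib.Index(space="cosine", dim=dim)
# M controls graph connectivity; ef_construction trades build time for quality.
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(data, np.arange(num_elements))

index.set_ef(50)  # ef at query time: higher = more accurate but slower
labels, distances = index.knn_query(data[:1], k=5)
print(labels, distances)
```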
Real-world performance metrics highlight the impact of HNSW. In a study involving a PostgreSQL database with the pgvector extension and a dataset of 1 million rows, building an HNSW index took about 33 minutes. However, once indexed, similarity search times dropped from several seconds per query to mere milliseconds. That is a performance improvement of over 100x. For any application requiring instant responses, such as customer support bots or interactive research tools, HNSW makes RAG commercially viable.
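In pgvector, building that index is a single statement. The sketch below uses psycopg 3; the connection string, table, and column names are assumptions that match the table layout described later in this article:

```python
# pip install psycopg
import psycopg

with psycopg.connect("dbname=rag_demo") as conn:  # hypothetical DSN
    # m and ef_construction mirror hnswlib's parameters: raising them
    # improves recall at the cost of a longer build (the 33 minutes above).
    conn.execute(
        """
        CREATE INDEX ON documents
        USING hnsw (embeddings vector_cosine_ops)
        WITH (m = 16, ef_construction = 64);
        """
    )
```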
Alternative Indexing Methods: IVFFlat vs. HNSW
While HNSW is widely regarded as one of the top performers for accuracy and speed, it is not the only option. Another common method is IVFFlat (Inverted File with Flat storage). IVFFlat uses geometric partitioning to divide the vector space into smaller clusters, or sub-indexes. When a query arrives, the system identifies which cluster is likely to contain the answer and only searches within that specific partition. This reduces the amount of compute power needed compared to scanning the entire dataset.
| Feature | HNSW | IVFFlat |
|---|---|---|
| Search Accuracy | Very High | Moderate to High |
| Query Latency | Low (Milliseconds) | Low to Moderate |
| Index Build Time | Moderate | Fast |
| Memory Usage | Higher | Lower |
| Best Use Case | High-precision, real-time apps | Large datasets, cost-sensitive projects |
The choice between HNSW and IVFFlat depends on your specific constraints. If you need the highest possible recall and can afford higher memory usage, HNSW is usually the better choice. If you are working with massive datasets where index build time and storage costs are primary concerns, IVFFlat might be more suitable. Some advanced systems even combine multiple methods using techniques like Reciprocal Rank Fusion to improve overall result quality.
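For comparison, here is what the IVFFlat equivalent might look like in pgvector, again with illustrative names. The `lists` parameter sets the number of clusters, and `ivfflat.probes` controls how many clusters each query scans (more probes means better recall but slower queries):

```python
import psycopg

with psycopg.connect("dbname=rag_demo") as conn:  # hypothetical DSN
    # pgvector's docs suggest roughly rows / 1000 lists for up to ~1M rows.
    conn.execute(
        """
        CREATE INDEX ON documents
        USING ivfflat (embeddings vector_cosine_ops)
        WITH (lists = 1000);
        """
    )
    conn.execute("SET ivfflat.probes = 10;")
```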
Using Filters for Precise Retrieval
Semantic similarity alone is not always enough. Imagine a multi-tenant SaaS platform where Company A should never see documents belonging to Company B. Or consider a medical AI that must only retrieve guidelines approved after a certain date. This is where metadata filters come into play.
Filters allow you to constrain the vector search to a specific subset of data based on attributes like user ID, document type, creation date, or department. Most modern vector databases support pre-filtering, which applies these constraints before performing the similarity search. This ensures that the retrieved results are not only semantically relevant but also contextually appropriate and compliant with business rules. By separating storage from compute and using efficient filtering mechanisms, you can maintain security and relevance without sacrificing performance.
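As a sketch, a pre-filtered pgvector query might look like the following, assuming the table layout described in the next section (the tenant ID and top-k value are illustrative). `<=>` is pgvector's cosine-distance operator:

```python
import psycopg

query_vector = [0.01] * 384  # in practice, the output of your embedding model

with psycopg.connect("dbname=rag_demo") as conn:  # hypothetical DSN
    rows = conn.execute(
        """
        SELECT content
        FROM documents
        WHERE metadata->>'tenant_id' = %s      -- filter first
        ORDER BY embeddings <=> %s::vector     -- then rank by similarity
        LIMIT 5;
        """,
        ("company_a", str(query_vector)),
    ).fetchall()
```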
Implementing RAG Architecture
Building a robust RAG system requires careful architectural decisions. First, ensure that your embeddings are stored close to your source data to minimize latency. If you are using PostgreSQL with the pgvector extension, you can generate embeddings directly within the database using stored procedures. This abstraction simplifies the application layer and keeps your data pipeline clean.
A typical table structure for storing embeddings might look like this:
- id: SERIAL PRIMARY KEY
- embeddings: vector(384), matching the dimensionality of your embedding model
- content: TEXT NOT NULL, the original text chunk
- metadata: JSONB, for storing filters like author, date, or tags
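A minimal DDL sketch for that table, assuming the pgvector extension is available (384 matches all-MiniLM-L12-v2's output dimension):

```python
import psycopg

with psycopg.connect("dbname=rag_demo") as conn:  # hypothetical DSN
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS documents (
            id         SERIAL PRIMARY KEY,
            embeddings vector(384),    -- one vector per text chunk
            content    TEXT NOT NULL,  -- the original chunk
            metadata   JSONB           -- author, date, tags, tenant, ...
        );
        """
    )
```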
When serving queries, the system follows a clear path: embed the user's question with the same model used at ingestion, apply any metadata filters, search the vector database for the top-k nearest neighbors, and append the retrieved chunks to the LLM prompt. This workflow ensures that the AI has the correct context to generate accurate answers, significantly reducing the risk of hallucinations.
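Tied together, the serving path might look like this sketch; the model, DSN, and prompt template are all placeholders to swap for your own:

```python
import psycopg
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L12-v2")

def build_rag_prompt(question: str, tenant_id: str, k: int = 5) -> str:
    # 1. Embed the question with the same model used at ingestion.
    query_vector = model.encode([question], normalize_embeddings=True)[0]
    # 2. Filtered top-k similarity search.
    with psycopg.connect("dbname=rag_demo") as conn:  # hypothetical DSN
        rows = conn.execute(
            """
            SELECT content FROM documents
            WHERE metadata->>'tenant_id' = %s
            ORDER BY embeddings <=> %s::vector
            LIMIT %s;
            """,
            (tenant_id, str(query_vector.tolist()), k),
        ).fetchall()
    # 3. Ground the LLM prompt in the retrieved chunks.
    context = "\n\n".join(row[0] for row in rows)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```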
Choosing Similarity Metrics
How does the database measure "closeness" between vectors? There are several similarity metrics available, and choosing the right one matters. Cosine similarity measures the angle between two vectors and is particularly effective for text embeddings because it focuses on direction rather than magnitude. Euclidean distance calculates the straight-line distance between points, while negative inner product offers another alternative. Most modern embedding models, including Titan Text Embedding v2 and all-MiniLM-L12-v2, are optimized for cosine similarity. Always check the documentation of your specific embedding provider to see which metric they recommend.
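Written out with numpy, the three metrics look like this (for unit-length vectors, cosine similarity and inner product produce the same ranking):

```python
import numpy as np

def cosine_similarity(a, b):
    # Angle between vectors; ignores magnitude.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    # Straight-line distance between points.
    return np.linalg.norm(a - b)

def negative_inner_product(a, b):
    # pgvector's <#> operator returns the negative inner product so that
    # smaller values mean "closer", consistent with its distance operators.
    return -np.dot(a, b)
```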
What is the main benefit of using RAG over standard LLM prompts?
The main benefit is reduced hallucination and access to up-to-date information. Standard LLMs rely on static training data, so they cannot know about recent events or private company documents. RAG retrieves factual data from a vector database and includes it in the prompt, grounding the AI's response in verified sources.
Why is HNSW preferred over brute-force search?
Brute-force search compares a query against every single vector in the database, which becomes incredibly slow as data grows. HNSW uses a hierarchical graph structure to navigate quickly to the most relevant areas, reducing search times from seconds to milliseconds while maintaining high accuracy.
Can I use IVFFlat instead of HNSW?
Yes, IVFFlat is a valid alternative, especially if you have strict memory constraints or very large datasets. It partitions data into clusters, which can be faster to build and lighter on storage, though it may offer slightly lower recall accuracy compared to HNSW.
How do embeddings work in simple terms?
Embeddings convert text into arrays of numbers. Words with similar meanings end up with similar number patterns. This allows computers to calculate semantic relationships mathematically, enabling them to find related documents even if they don't share exact keywords.
What role do filters play in vector databases?
Filters restrict the search results based on metadata, such as user permissions, document dates, or categories. This ensures that the AI only retrieves information that is relevant and authorized for the specific context, adding a layer of security and precision to semantic search.