Vector Databases: Ingestion, Search, and the Tradeoffs Nobody Warns You About
A complete technical guide to vector database ingestion pipelines, index types, search configurations, and real-world tradeoffs — with working Python examples using Milvus and a semantic search engine over an Amazon product catalog.
Amazon's search engine processes roughly 3.5 billion queries a day. A keyword search for "running shoes comfortable long distance" matches documents containing those exact tokens. It will find a product titled "Comfortable Running Shoes for Long Distance" and miss an identically suited product titled "Marathon Training Footwear — Cushioned, Lightweight."
A semantic search over the same query encodes the intent — not the tokens — and finds both. It also finds the foam-soled trail shoes the copywriter labelled "all-day endurance footwear" and the product with no description at all, ranked by reviews that say "wore these for my first marathon, feet felt fine."
That gap — between matching words and matching meaning — is the entire reason vector databases exist. This guide covers how they work from the ground up: ingestion pipelines, index types, search configuration tradeoffs, and a full working example building semantic search over 50,000 Amazon product descriptions.
What a Vector Database Actually Is
A traditional database stores rows. It indexes by exact values. WHERE price = 49.99 is O(log n) with a B-tree index. It is fast and correct because the match condition is binary.
A vector database stores embeddings — high-dimensional float arrays that encode meaning. A sentence, an image, a product description, a user's click history: all can be transformed into a list of floats by a neural network. Two embeddings are "similar" if their vectors are geometrically close in that high-dimensional space.
The retrieval question is not "does this value equal this value?" but "which of these million vectors is most geometrically similar to this query vector?"
You cannot do this efficiently with the indexes a relational database such as Postgres ships with by default (extensions like pgvector exist precisely to close that gap). Exact nearest-neighbour search over 1 million vectors of 1536 dimensions each is O(n × d) per query: the query is compared against every stored vector, dimension by dimension. At 1M vectors × 1536 dims, that is roughly 1.5 billion multiply-add operations per query. At a hundred queries per second, your database is spending its entire CPU budget on arithmetic.
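To make that cost concrete, here is a minimal sketch of exact nearest-neighbour search in plain Python. The function names are illustrative and not part of any library; a real engine would use vectorised arithmetic, but the amount of work per query is the same.

```python
import heapq

def dot(a, b):
    """Inner product of two equal-length float lists."""
    return sum(x * y for x, y in zip(a, b))

def exact_top_k(query, vectors, k):
    """Score every stored vector against the query: O(n * d) work per query."""
    scored = ((dot(query, v), i) for i, v in enumerate(vectors))
    return heapq.nlargest(k, scored)

def multiply_adds_per_query(n_vectors, dims):
    """Number of multiply-add operations a full scan performs."""
    return n_vectors * dims

vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
top = exact_top_k([1.0, 0.1], vectors, k=2)   # list of (score, index) pairs
ops = multiply_adds_per_query(1_000_000, 1536)
```

Every query touches every vector, so the cost grows linearly with collection size. That linear growth is what ANN indexes are built to avoid.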
Approximate Nearest Neighbour (ANN) algorithms solve this by trading a small amount of recall accuracy for orders-of-magnitude speed improvements. Vector databases are engines built around these algorithms, with the storage, indexing, and distribution infrastructure to run them at scale.
Milvus is one of the most widely used open-source vector databases. It is cloud-native (Kubernetes-first), supports multiple index types, handles billions of vectors, and has a first-class Python SDK. Zilliz Cloud is the managed version if you want to skip the infrastructure. This guide uses Milvus throughout.
The Ingestion Pipeline
Getting data into a vector database has four steps. Each has a decision point that affects everything downstream.
Step 1: Chunking
You rarely embed entire documents. A 10,000-word product manual embedded as one vector loses the signal from any specific section — every query about the manual returns it, and the embedding averages across all its content.
Chunking splits documents into segments before embedding. Three strategies:
Fixed-size chunking — split every N characters or N words with optional overlap.
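A minimal sketch of character-based fixed-size chunking with overlap. The function name and default sizes are assumptions chosen for illustration.

```python
def chunk_fixed(text, size=200, overlap=20):
    """Split text into windows of `size` characters.

    Each window shares `overlap` characters with the previous one.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

chunks = chunk_fixed("abcdefghij", size=4, overlap=1)
```

The overlap means the last character of one chunk is repeated at the start of the next, which softens boundary cuts without removing them.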
Simple and predictable. The problem: a sentence split in the middle of a key phrase loses meaning at the boundary.
Sentence-boundary chunking — respect sentence endings, group into chunks up to a token limit.
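One way to sketch this, assuming a naive regex sentence splitter and counting whitespace-separated words as "tokens". A real pipeline would count tokens with the embedding model's own tokenizer.

```python
import re

def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def chunk_by_sentence(text, max_tokens=256):
    """Group whole sentences into chunks of at most max_tokens words.

    A single sentence longer than max_tokens becomes its own oversized chunk.
    """
    chunks, current, count = [], [], 0
    for sentence in split_sentences(text):
        n = len(sentence.split())
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

text = "Light shoe. Great for marathons! Is it durable? Yes."
chunks = chunk_by_sentence(text, max_tokens=5)
```

No sentence is ever cut in half; the price is that chunk sizes vary.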
Better for prose. For product descriptions — typically 1–3 sentences — a whole description often fits in one chunk, which is what you want.
The tradeoff: Smaller chunks → more precise retrieval but more vectors to store and search. Larger chunks → better context per result but noisier similarity scores. For product search, embedding the full description as one chunk is usually correct. For document QA (RAG), 256–512 tokens with 10% overlap is the standard starting point.
Step 2: Embedding Model Choice
The embedding model maps your text to a vector. The choice sets your dimension count, your accuracy ceiling, and your cost.
| Model | Dims | Context | Best for |
|---|---|---|---|
| text-embedding-3-small (OpenAI) | 1536 | 8191 tokens | General purpose, API-hosted |
| text-embedding-3-large (OpenAI) | 3072 | 8191 tokens | Higher accuracy, higher cost |
| BAAI/bge-large-en-v1.5 | 1024 | 512 tokens | Open-source, strong benchmarks |
| sentence-transformers/all-MiniLM-L6-v2 | 384 | 256 tokens | Fast, lightweight, good enough |
| BAAI/bge-m3 | 1024 | 8192 tokens | Multilingual, sparse+dense hybrid |
For product search in English at scale, bge-large-en-v1.5 or all-MiniLM-L6-v2 running locally beats paying per-token to OpenAI. For a RAG pipeline where embedding happens infrequently, text-embedding-3-small is fine.
Always normalise embeddings at ingestion time if you plan to use inner product similarity — it makes inner product equivalent to cosine similarity, and inner product search is faster in Milvus.
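The equivalence is easy to check by hand. This sketch uses plain Python lists in place of real model output; the helper names are illustrative.

```python
import math

def l2_normalise(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
ip_of_normalised = dot(l2_normalise(a), l2_normalise(b))  # inner product after normalising
cos_of_raw = cosine(a, b)                                 # cosine on the raw vectors
```

Both values come out the same (0.96 for this pair). If you embed with sentence-transformers, passing `normalize_embeddings=True` to `encode` should give you unit-length vectors directly, so you would not need a helper like this.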
Step 3: Schema Design in Milvus
A Milvus collection is the equivalent of a table. You define a schema with typed fields. The schema determines what metadata you can filter on at search time.
Schema design decisions:
- Add fields you will filter on: price, category, rating. Milvus pushes scalar filters down before the ANN search, but a filter on a field with no scalar index is evaluated by scanning values rather than by an index lookup.
- enable_dynamic_field=True lets you insert extra fields without pre-declaring them. Useful during development; turn it off in production for schema enforcement.
- auto_id=False lets you control primary keys. Keep them as your source system's product ID so you can join back to your product database.
Step 4: Loading the Index and Batch Ingestion
After inserting data you must build an index before searching. In Milvus, unindexed collections fall back to brute-force FLAT search.
Now the full ingestion pipeline — load the Amazon dataset, embed in batches, insert:
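The batching logic itself can be sketched without a model or a running database. Here `embed_fn` and `insert_fn` are hypothetical callables standing in for your embedding model and your Milvus insert call, and the row field names (`id`, `embedding`, `title`) are assumptions.

```python
from itertools import islice

def batched(items, batch_size):
    """Yield successive lists of up to batch_size items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def ingest(products, embed_fn, insert_fn, batch_size=256):
    """Embed product descriptions batch by batch and hand rows to insert_fn.

    embed_fn maps a list of strings to a list of vectors (hypothetical).
    insert_fn writes a list of row dicts to the store (hypothetical).
    """
    total = 0
    for batch in batched(products, batch_size):
        vectors = embed_fn([p["description"] for p in batch])
        rows = [
            {"id": p["id"], "embedding": v, "title": p["title"]}
            for p, v in zip(batch, vectors)
        ]
        insert_fn(rows)
        total += len(rows)
    return total

# Stand-ins so the sketch runs without a model or a database.
products = [{"id": i, "title": f"item {i}", "description": "x" * i} for i in range(5)]
stored = []
fake_embed = lambda texts: [[float(len(t))] for t in texts]
count = ingest(products, fake_embed, stored.extend, batch_size=2)
```

Batching keeps memory bounded and amortises per-call overhead on both the model and the database. With pymilvus, `insert_fn` would typically wrap something like `client.insert(collection_name=..., data=rows)`.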
Index Types: The Decision That Sets Your Speed-Recall Ceiling
The index is the data structure Milvus builds over your vectors to enable fast approximate search. Choose the wrong one and you either get poor recall or unacceptable latency. There is no universally correct answer — the right choice depends on your dataset size, query latency budget, and available RAM.
FLAT — Brute Force
No approximation. Scans every vector. 100% recall, slowest query time.
Index params: {}
Search params: {}
Use when: Fewer than 100,000 vectors, or when you need exact results for evaluation or testing. Never in production at scale.
IVF_FLAT — Inverted File Index
Partitions vectors into nlist clusters at build time using k-means. At query time, searches only the nprobe nearest clusters — not all of them.
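The mechanism fits in a toy pure-Python version. This is a sketch of the general IVF idea under simplifying assumptions (plain Lloyd's k-means, squared L2 distance, everything in memory), not a description of Milvus internals.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda c: sq_dist(point, centroids[c]))

def kmeans(vectors, nlist, iters=10, seed=0):
    """Plain Lloyd's k-means; returns nlist centroids."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, nlist)]
    for _ in range(iters):
        buckets = [[] for _ in range(nlist)]
        for v in vectors:
            buckets[nearest(v, centroids)].append(v)
        for c, members in enumerate(buckets):
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids

def build_ivf(vectors, nlist):
    """Build inverted lists: centroid id -> ids of the vectors assigned to it."""
    centroids = kmeans(vectors, nlist)
    lists = [[] for _ in range(nlist)]
    for i, v in enumerate(vectors):
        lists[nearest(v, centroids)].append(i)
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in lists[c]]
    return sorted(candidates, key=lambda i: sq_dist(query, vectors[i]))[:k]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate search also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

rng = random.Random(42)
vectors = [[rng.random() for _ in range(8)] for _ in range(500)]
centroids, lists = build_ivf(vectors, nlist=16)
query = [rng.random() for _ in range(8)]
exact = sorted(range(len(vectors)), key=lambda i: sq_dist(query, vectors[i]))[:10]
full_probe = ivf_search(query, vectors, centroids, lists, k=10, nprobe=16)
low_probe = ivf_search(query, vectors, centroids, lists, k=10, nprobe=2)
```

With `nprobe` equal to `nlist` every cluster is scanned and the result matches brute force. Lowering `nprobe` scans fewer vectors, and `recall_at_k(low_probe, exact)` tells you how much of the true top 10 survived.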
nlist tradeoff: More clusters → finer partitioning → better recall at low nprobe, but longer index build time.
nprobe tradeoff: The core speed-recall dial.
| nprobe | Recall@10 | Latency (1M vectors) |
|---|---|---|
| 4 | ~82% | ~12ms |
| 16 | ~92% | ~35ms |
| 64 | ~98% | ~110ms |
| 256 | ~99.5% | ~380ms |
Use when: Millions of vectors, moderate recall requirements, want predictable memory usage.
IVF_SQ8 — IVF with Scalar Quantisation
Same as IVF_FLAT but compresses each float32 (4 bytes) to int8 (1 byte) — 4× memory reduction with ~2–3% recall drop.
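Here is a sketch of the general idea behind scalar quantisation, assuming a simple per-dimension min/max mapping onto one-byte codes (0 to 255). The exact encoding Milvus uses internally may differ.

```python
def sq8_train(vectors):
    """Find the per-dimension min and max needed to map floats onto 0..255."""
    lows = [min(dim) for dim in zip(*vectors)]
    highs = [max(dim) for dim in zip(*vectors)]
    return lows, highs

def sq8_encode(vec, lows, highs):
    """Quantise each float to one byte (an integer code in 0..255)."""
    codes = []
    for x, lo, hi in zip(vec, lows, highs):
        span = hi - lo
        code = 0 if span == 0 else round((x - lo) / span * 255)
        codes.append(min(255, max(0, code)))  # clamp values outside the trained range
    return bytes(codes)

def sq8_decode(codes, lows, highs):
    """Reconstruct approximate floats from the byte codes."""
    return [lo + c / 255 * (hi - lo) for c, lo, hi in zip(codes, lows, highs)]

vectors = [[0.0, -1.0], [0.5, 0.0], [1.0, 1.0]]
lows, highs = sq8_train(vectors)
encoded = sq8_encode([0.5, 0.0], lows, highs)   # 2 bytes instead of 2 float32s (8 bytes)
decoded = sq8_decode(encoded, lows, highs)
```

The decoded values are close to the originals but not identical. That rounding error is where the small recall drop comes from.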
Use when: Memory is the constraint. A 1M × 1024-dim collection in IVF_FLAT uses ~4GB RAM. In IVF_SQ8: ~1GB.
HNSW — Hierarchical Navigable Small World
A graph-based index. Builds a multi-layer proximity graph at ingestion time. Queries traverse the graph starting from coarse layers, converging on neighbours in fine layers. Fastest query times of any Milvus index at high recall.
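HNSW has two build-time knobs (`M`, `efConstruction`) and one query-time knob (`ef`). The dicts below follow the shape Milvus accepts for index and search parameters. The specific values, the `IP` metric, and the helper function are assumptions for illustration.

```python
# Build-time parameters, in the dict shape Milvus uses for index creation.
hnsw_index_params = {
    "index_type": "HNSW",
    "metric_type": "IP",  # assumes vectors were normalised at ingestion
    "params": {"M": 16, "efConstruction": 200},
}

def hnsw_search_params(top_k, ef=64):
    """Return search params, raising ef to top_k when it would be too small."""
    return {"metric_type": "IP", "params": {"ef": max(ef, top_k)}}

params = hnsw_search_params(top_k=100, ef=64)
```

The helper clamps `ef` so it is never smaller than the number of results requested; asking for 100 results with `ef=64` yields `ef=100`.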
M tradeoff: M=8 → small graph, fast build, lower recall. M=32 → large graph, slow build, near-perfect recall. M=16 is the standard default.
ef tradeoff: ef=32 → fast, ~95% recall. ef=128 → slower, ~99% recall. ef must be ≥ top-k.
| ef | Recall@10 | Latency (1M vectors) |
|---|---|---|
| 16 | ~94% | ~2ms |
| 64 | ~98% | ~6ms |
| 256 | ~99.5% | ~18ms |
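Memory cost: higher than IVF_FLAT for the same dataset, because the graph's neighbour lists are held in RAM alongside the full-precision vectors. The overhead grows with M and is proportionally larger for low-dimensional vectors.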
Use when: Query latency is the primary constraint. E-commerce search, real-time recommendation, chatbot retrieval. This is what Amazon OpenSearch k-NN and Milvus-backed production systems use.
DISKANN — Graph Index on Disk
Like HNSW but the graph lives on SSD, not RAM. Slower than HNSW but enables datasets that exceed available RAM.
Use when: Billion-scale datasets, limited RAM. A 1B-vector HNSW index has to hold every vector in memory: at 1024 dimensions the raw float32 data alone is about 4TB, before any graph overhead. DISKANN makes this feasible on a commodity server.
Similarity Metrics
For dense float vectors, Milvus supports three metrics. The choice is not arbitrary.
Cosine Similarity
Measures the angle between two vectors. Ignores magnitude, cares only about direction.
similarity = (A · B) / (||A|| × ||B||)
Range: -1 (opposite) to 1 (identical direction). Best for text embeddings — two sentences can have very different word counts (different magnitudes) but the same semantic meaning.
In Milvus, use COSINE as metric type, or normalise vectors at ingestion and use IP (mathematically equivalent, slightly faster).
L2 (Euclidean Distance)
Measures absolute distance between vector endpoints.
distance = sqrt(sum((A_i - B_i)^2))
Range: 0 (identical) to ∞. Best for image embeddings and spatial data where magnitude carries meaning.
Inner Product (IP)
score = A · B = sum(A_i × B_i)
Range: unbounded. With normalised vectors, IP == cosine similarity. Use IP when vectors are pre-normalised — it skips the normalisation step at search time and is marginally faster.
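The three formulas translate directly into code. This sketch uses plain Python lists and illustrative function names.

```python
import math

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    norms = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    return inner_product(a, b) / norms

# Same direction, different magnitude: cosine sees them as identical, L2 does not.
short, long_ = [1.0, 2.0], [10.0, 20.0]
cos_same_dir = cosine_similarity(short, long_)
l2_same_dir = l2_distance(short, long_)

# For unit vectors, ||u - v||^2 = 2 - 2 * (u . v), so all three metrics rank alike.
u, v = [0.6, 0.8], [1.0, 0.0]
lhs = l2_distance(u, v) ** 2
rhs = 2 - 2 * inner_product(u, v)
```

The last identity is why the metric choice stops mattering for ranking once vectors are normalised: a larger inner product always means a smaller L2 distance.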
Search: Configurations and Filtered Queries
Basic Semantic Search
Filtered Search
Add a scalar filter. Milvus applies the filter before the ANN scan — it partitions the candidate set, then searches only within matching vectors.
Filter pushdown caveat: Milvus filters before ANN search only when the filter selectivity is above a threshold. If only 0.01% of your collection matches the filter, Milvus may fall back to a post-scan. Add scalar indexes (INVERTED for string fields, STL_SORT for numeric) to keep filters fast.
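Why the order of filtering and searching matters can be shown with a toy in-memory example. This is a conceptual sketch with made-up rows, not Milvus code. In Milvus the same predicate would be written as a filter expression string such as `category == "Shoes" and price < 100`.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pre_filter_search(query, rows, predicate, k):
    """Filter first, then rank only the matching rows."""
    matching = [r for r in rows if predicate(r)]
    return sorted(matching, key=lambda r: sq_dist(query, r["vec"]))[:k]

def post_filter_search(query, rows, predicate, k):
    """Rank everything, keep the top k, then drop non-matching rows."""
    top = sorted(rows, key=lambda r: sq_dist(query, r["vec"]))[:k]
    return [r for r in top if predicate(r)]

rows = [
    {"id": 1, "vec": [0.0, 0.0], "category": "Shoes", "price": 180.0},
    {"id": 2, "vec": [0.1, 0.0], "category": "Shoes", "price": 60.0},
    {"id": 3, "vec": [0.2, 0.0], "category": "Socks", "price": 9.0},
    {"id": 4, "vec": [0.9, 0.0], "category": "Shoes", "price": 75.0},
]

def cheap_shoes(row):
    return row["category"] == "Shoes" and row["price"] < 100

query = [0.0, 0.0]
pre = [r["id"] for r in pre_filter_search(query, rows, cheap_shoes, k=2)]
post = [r["id"] for r in post_filter_search(query, rows, cheap_shoes, k=2)]
```

Pre-filtering returns two matching products. Post-filtering returns only one, because the nearest vector overall fails the price filter and takes a slot in the top k before being discarded.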
Real-World Example: Semantic Search Over Amazon Products
This is the full end-to-end comparison — keyword search (what Amazon originally did) vs semantic search (what it does now) vs hybrid (the current production pattern).
Keyword Search Baseline
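A keyword baseline can be sketched with a minimal Okapi BM25 scorer in pure Python. The class, the whitespace tokenizer, and the sample titles are assumptions for illustration; a production system would use a search engine's BM25 implementation.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

class BM25:
    """Minimal Okapi BM25 over an in-memory list of documents."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_freqs = [Counter(t) for t in self.doc_tokens]
        self.avg_len = sum(len(t) for t in self.doc_tokens) / len(docs)
        df = Counter(term for tokens in self.doc_tokens for term in set(tokens))
        n = len(docs)
        # Lucene-style idf, which stays non-negative for very common terms.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, index):
        freqs, length = self.doc_freqs[index], len(self.doc_tokens[index])
        total = 0.0
        for term in tokenize(query):
            tf = freqs.get(term, 0)
            if tf == 0:
                continue
            norm = tf + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf[term] * tf * (self.k1 + 1) / norm
        return total

    def search(self, query, k=10):
        """Return indices of documents with a non-zero score, best first."""
        scored = [(self.score(query, i), i) for i in range(len(self.doc_tokens))]
        return [i for s, i in sorted(scored, reverse=True)[:k] if s > 0]

titles = [
    "Comfortable Running Shoes for Long Distance",
    "Marathon Training Footwear Cushioned Lightweight",
    "Leather Office Shoes",
]
hits = BM25(titles).search("running shoes comfortable long distance")
```

The marathon footwear title shares no token with the query, so it scores zero and never appears, while the office shoes match on "shoes" alone.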
Hybrid Search (Dense + Sparse + RRF Re-ranking)
This is the pattern that Amazon, Shopify, and Elasticsearch 8.x all implement — dense semantic vectors combined with sparse BM25 keyword vectors, fused with Reciprocal Rank Fusion.
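Reciprocal Rank Fusion itself is only a few lines. This sketch fuses ranked lists of document ids; the constant `k=60` is the value commonly used in the RRF literature, and the sample ids are made up.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked id lists.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], str(d)))

dense_hits = ["marathon-footwear", "running-shoes", "trail-runners"]
sparse_hits = ["running-shoes", "office-shoes"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

RRF uses only rank positions, never raw scores, so it needs no calibration between BM25 scores and vector similarities. A document that appears in both lists ("running-shoes" here) rises to the top. Recent pymilvus releases ship an `RRFRanker` for use with hybrid search, so in practice you may not need to write this yourself.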
Running the Comparison
What you observe:
- Semantic search returns "marathon training footwear", "cushioned endurance sneakers", and "ultralight trail runners" — none containing the word "running" in their title
- Keyword search returns only products with "running shoes" in the title — accurate for exact matches, blind to synonyms
- Hybrid (RRF fusion) gets the best of both: exact-match products ranked high, semantically similar products that BM25 would have missed also surfaced
- Filtered search narrows to the relevant category and price band without a noticeable recall drop because the category and price scalar indexes are in place
This is the architecture Amazon's A9/A10 algorithm has moved toward: a dense retrieval stage (semantic) fused with a sparse retrieval stage (BM25), re-ranked by relevance and personalisation signals. Milvus replicates this in ~80 lines of Python.
Latency Benchmarks: HNSW ef vs Recall
Running on a single node with 50k vectors (bge-large-en-v1.5, 1024 dims):
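A small harness for this kind of measurement might look like the following. `search_fn` is a hypothetical wrapper around whatever client call runs the query, and the stand-in search exists only so the sketch runs without a database; its timings mean nothing.

```python
import statistics
import time

def time_search(search_fn, queries, ef_values, repeats=3):
    """Return {ef: median latency in ms} for a callable search_fn(query, ef)."""
    results = {}
    for ef in ef_values:
        samples = []
        for _ in range(repeats):
            for q in queries:
                start = time.perf_counter()
                search_fn(q, ef)
                samples.append((time.perf_counter() - start) * 1000.0)
        results[ef] = statistics.median(samples)
    return results

# Stand-in search: does an amount of work that grows with ef.
def fake_search(query, ef):
    return sorted(range(ef * 50), key=lambda i: (i * 31 + query) % 97)[:10]

latencies = time_search(fake_search, queries=[1, 2, 3], ef_values=[16, 64, 256])
```

Reporting the median rather than the mean keeps one slow outlier (a cold cache, a GC pause) from distorting the comparison between `ef` values.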
Typical output at 50k vectors:
ef latency_ms results
--------------------------------
16 1.2ms 10
32 1.8ms 10
64 2.9ms 10
128 4.7ms 10
256 8.1ms 10
At 50k vectors, latency is dominated by network round-trip, not compute. The difference becomes significant at 10M+ vectors where ef=64 might be 8ms and ef=256 might be 45ms — at that point the ef choice is real product engineering, not academic.
Production Configuration Checklist
Partitions for Multi-Tenancy
If you serve multiple customers or product categories, use partition keys to isolate data within a single collection:
Partition keys route data to physical partitions. A search with a matching partition key filter scans only that partition — dramatically faster than scanning the full collection.
Consistency Levels
Milvus is a distributed system. Data written on one node may not be immediately visible on all query nodes. Choose the right consistency level for your use case:
| Level | Behaviour | Latency impact |
|---|---|---|
| Strong | Read reflects all writes up to this moment | High — waits for sync |
| Bounded | Read reflects writes within a staleness window (default: 5s) | Low |
| Session | Read reflects all writes from this session | Medium |
| Eventually | No guarantee; maximum throughput | Minimal |
For e-commerce product search: Bounded is correct. A newly listed product appearing in search within 5 seconds is fine. Strong consistency is appropriate only when exact read-after-write guarantees are required (financial records, audit logs).
Quantisation to Cut Memory
For very large collections, INT8 scalar quantisation reduces memory 4× at ~2–3% recall cost:
At 100M vectors × 1024 dims: float32 = ~400GB RAM, SQ8 = ~100GB RAM. The recall drop is acceptable for most retrieval-augmented use cases.
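The arithmetic behind these figures is simple enough to keep as a helper. This sketch counts raw vector storage only, in decimal gigabytes, and ignores index overhead.

```python
def vector_memory_gb(n_vectors, dims, bytes_per_value):
    """Raw vector storage in decimal gigabytes, ignoring index overhead."""
    return n_vectors * dims * bytes_per_value / 1e9

float32_gb = vector_memory_gb(100_000_000, 1024, 4)  # 4 bytes per float32
sq8_gb = vector_memory_gb(100_000_000, 1024, 1)      # 1 byte per quantised value
small_gb = vector_memory_gb(1_000_000, 1024, 4)      # the 1M-vector case
```

These come out at about 410GB, 102GB, and 4.1GB respectively, matching the rounded figures quoted in this guide.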
Monitoring What Matters
Three metrics define a healthy vector search deployment:
How Amazon Actually Does It
Amazon's current search architecture (A10 algorithm, as described in public engineering posts) uses a multi-stage retrieval pipeline:
- Stage 1 (candidate retrieval): A dense bi-encoder retrieves the top ~1000 candidates from a billion-product index using ANN search (HNSW-equivalent). The bi-encoder is trained on Amazon's own click and purchase data, making the embedding space personalised to shopping intent specifically.
- Stage 2 (re-ranking): A cross-encoder re-ranks the 1000 candidates using a transformer that jointly attends to both the query and each product. Cross-encoders are too slow for first-stage retrieval but accurate enough for re-ranking a small candidate set.
- Stage 3 (business rules): Price, Prime eligibility, seller rating, and relevance scores are combined in a learned ranking function.
AWS exposes comparable building blocks as managed services:
- Amazon OpenSearch with k-NN plugin (HNSW, NMSLIB, Faiss backends) for Stage 1
- Amazon Bedrock Knowledge Bases for a fully managed RAG pipeline using the same retrieval primitives
- Amazon Personalize for the personalisation layer
Milvus replicates Stage 1 directly. For Stage 2, you can add a cross-encoder re-ranker using sentence-transformers cross-encoders or a Cohere Re-rank API call after the Milvus retrieval.
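The re-ranking step has the same shape whichever scorer you plug in. In this sketch `score_fn` is a hypothetical callable that takes a list of (query, text) pairs and returns one score per pair. The word-overlap scorer is only a stand-in so the example runs without downloading a model.

```python
def rerank(query, candidates, score_fn, top_n=10):
    """Re-score (query, text) pairs jointly and return the top_n candidates."""
    pairs = [(query, c["text"]) for c in candidates]
    scores = score_fn(pairs)
    ranked = sorted(zip(scores, candidates), key=lambda sc: sc[0], reverse=True)
    return [dict(c, rerank_score=s) for s, c in ranked[:top_n]]

# Stand-in scorer: fraction of query words present in the text.
def overlap_scorer(pairs):
    out = []
    for q, text in pairs:
        words = set(q.lower().split())
        out.append(len(words & set(text.lower().split())) / len(words))
    return out

candidates = [
    {"id": 1, "text": "leather office shoes"},
    {"id": 2, "text": "cushioned running shoes for long distance"},
]
best = rerank("long distance running shoes", candidates, overlap_scorer, top_n=1)
```

With sentence-transformers, the `predict` method of a `CrossEncoder` accepts a list of pairs in this form, so it could be passed as `score_fn` in place of the stand-in.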
The cross-encoder reads both the query and the full product text together — much more accurate than the bi-encoder's separate encoding, and fast enough for 50 candidates in ~20ms.
