Agent Vector Store Integration Specification: Pinecone, Weaviate, and pgvector
Production AI agents require a vector store that handles embeddings, metadata filtering, hybrid search, and per-tenant isolation. This specification compares Pinecone, Weaviate, and pgvector across indexing (HNSW vs IVF), consistency, and cost so teams can pick a backend matched to their RAG workload.
TL;DR
Use pgvector when your data already lives in Postgres and you need transactional consistency. Choose Weaviate when hybrid search (BM25 + vector) and modular embedding pipelines matter. Pick Pinecone when you want a managed serverless backend with minimal ops and predictable scaling. All three support HNSW indexing; only Weaviate and pgvector expose tuning parameters.
Why this specification exists
Agent retrieval quality depends almost entirely on the vector store and how it is wired into the agent loop. A poorly chosen backend caps recall, leaks tenant data across namespaces, or creates query latency that breaks streaming UX. This specification standardizes the integration contract so a team can swap backends without rewriting agent code.
Scope and assumptions
- The agent uses an embedding model with stable output dimensions (e.g., 1,536 or 3,072).
- Documents are pre-chunked and enriched with metadata before write.
- The agent expects k-NN retrieval with optional metadata filters.
- Hybrid search (sparse + dense) is desirable but optional.
- Multi-tenant isolation is required at the namespace or row level.
Backend comparison
| Capability | Pinecone | Weaviate | pgvector | Qdrant | Milvus |
|---|---|---|---|---|---|
| Hosting | Serverless or pod | Self-host or cloud | Postgres extension | Self-host or cloud | Self-host or cloud |
| Index types | Proprietary HNSW | HNSW, flat | HNSW, IVFFlat | HNSW | HNSW, IVF, DiskANN |
| HNSW tuning | Not exposed (Pinecone docs) | efConstruction, maxConnections, ef | m, ef_construction, ef_search | m, ef_construct, ef | Full tuning |
| Metadata filter | Yes | Yes | Yes (SQL WHERE) | Yes | Yes |
| Hybrid search | Sparse-dense vectors | Native BM25 + vector | Manual (tsvector + cosine) | Native | Native |
| Multi-tenancy | Namespaces | Tenants | Schema or tenant_id column | Collections | Partitions |
| ACID guarantees | No | No | Yes (Postgres) | No | No |
Reported p50 latency on a 1M-vector, 1,536-dim workload sits in the 3-12 ms range across these systems with HNSW configured for accuracy, with pgvector and Qdrant at the fast end and Pinecone Serverless in the 10-12 ms range (Vecstore benchmark, 2026). Treat these numbers as directional; your network, embedding dimensionality, and filter selectivity dominate real workloads.
Index selection: HNSW vs IVF
HNSW (Hierarchical Navigable Small World) builds a multi-layer proximity graph. It delivers sub-10 ms p50 on millions of vectors with 95%+ recall when tuned, and it accepts incremental inserts without a full rebuild (pgvector docs). The cost is RAM: HNSW typically consumes 2-5x the memory of IVFFlat and has slower build times.
IVF (Inverted File Index) clusters vectors into Voronoi cells and probes only the nearest cells at query time. Builds are faster and memory is cheaper, but recall is sensitive to nprobe and to data-distribution drift after inserts. Milvus recommends IVF when the corpus exceeds 100M vectors and the memory budget is tight (Milvus blog, 2026).
Default rule for agent workloads. Use HNSW unless you cannot fit the graph in RAM. Tune ef_search per query class: low for autocomplete-style retrieval, high for grounded RAG.
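As a concrete example, a minimal pgvector sketch: build the HNSW index once, then vary ef_search per query class. It assumes a chunks table with a vector(1536) embedding column, the psycopg 3 client, and the embed() helper from the query contract later in this spec; names and the DSN are illustrative.

```python
import psycopg

conn = psycopg.connect("postgresql://localhost/agentdb", autocommit=True)

# One-time build. m and ef_construction trade build time and RAM for recall;
# the values below are pgvector's documented defaults.
conn.execute(
    "CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw ON chunks "
    "USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64)"
)

def search(query_vec: list[float], top_k: int, ef_search: int):
    # SET cannot take bind parameters, so use set_config(); false = session scope.
    conn.execute("SELECT set_config('hnsw.ef_search', %s, false)", (str(ef_search),))
    # pgvector accepts the '[x, y, ...]' text format, so str(list) casts cleanly.
    qv = str(query_vec)
    return conn.execute(
        "SELECT id, embedding <=> %s::vector AS cosine_distance FROM chunks "
        "ORDER BY embedding <=> %s::vector LIMIT %s",
        (qv, qv, top_k),
    ).fetchall()

quick = search(embed("reset API key"), top_k=5, ef_search=40)       # autocomplete-style
grounded = search(embed("reset API key"), top_k=12, ef_search=200)  # grounded RAG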
Namespace and multi-tenant patterns
- Namespace per tenant (Pinecone, Qdrant collections). Strongest isolation; cross-tenant analytics require fan-out.
- Shared index with tenant_id filter (Weaviate tenants, pgvector column). Lower ops; rely on filter correctness for isolation.
- Shared index with row-level security (pgvector + Postgres RLS). Strongest correctness guarantee for regulated data; small filter cost.
Agents must always pass a tenant identifier into retrieval and reject any response that contains documents outside that scope. Treat the tenant ID as a security boundary, not a hint.
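For the RLS pattern, a minimal sketch reusing the psycopg connection from the index sketch above and assuming a tenant_id column on chunks; policy and setting names are illustrative:

```python
# One-time setup, run as the table owner.
conn.execute("ALTER TABLE chunks ENABLE ROW LEVEL SECURITY")
conn.execute(
    "CREATE POLICY tenant_isolation ON chunks "
    "USING (tenant_id = current_setting('app.tenant_id'))"
)

def query_as_tenant(conn, tenant_id, query_vec, top_k=12):
    # Run queries as a non-owner role: Postgres bypasses RLS for the table
    # owner unless you also FORCE ROW LEVEL SECURITY.
    with conn.transaction():
        # is_local=true scopes the tenant setting to this transaction only.
        conn.execute("SELECT set_config('app.tenant_id', %s, true)", (tenant_id,))
        return conn.execute(
            "SELECT id FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
            (str(query_vec), top_k),
        ).fetchall()
```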
Hybrid search (sparse + dense)
Dense vectors capture semantics; sparse retrievers like BM25 capture rare tokens (product codes, error strings). Hybrid search combines both via reciprocal rank fusion (RRF) or weighted scoring.
- Weaviate: native hybrid query with alpha parameter to balance vector vs BM25.
- Pinecone: sparse-dense vectors require generating a sparse representation (e.g., SPLADE) at index time.
- pgvector: combine tsvector full-text search with cosine distance and rank-fuse in SQL.
For agent retrieval, hybrid search typically improves recall on out-of-distribution queries containing identifiers, codes, or proper nouns.
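Weaviate and Qdrant fuse natively; on pgvector or Pinecone, one portable approach is client-side reciprocal rank fusion. A minimal sketch, assuming two id lists already in rank order and the conventional damping constant k=60:

```python
def rrf_fuse(dense_ids, sparse_ids, k=60, top_k=12):
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores = {}
    for ranking in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ids seen by only one retriever still rank.
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Fuse dense k-NN hits with BM25 hits.
fused = rrf_fuse(["a", "b", "c"], ["b", "d"], top_k=3)  # -> ["b", "a", "d"]
```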
Write-time vs read-time consistency
- Write-time consistency. Agent writes (e.g., new memory) must be queryable before the next turn. pgvector delivers this via Postgres transactions. Pinecone Serverless and Weaviate are eventually consistent on write; expect propagation lag from milliseconds to seconds.
- Read-time consistency. Replica lag affects retrieval freshness in distributed deployments. Pin reads to the primary for memory writes that must be visible immediately.
If your agent depends on "read your own writes" semantics within a single conversation, prefer pgvector or document a write-then-poll pattern in the agent loop.
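A sketch of that write-then-poll pattern, against the hypothetical store client used in the reference contracts later in this spec (upsert and fetch are illustrative names, not a specific vendor SDK):

```python
import time

def upsert_and_confirm(store, tenant_id, record, timeout_s=2.0, poll_s=0.1):
    """Write a record, then poll until it is readable or the deadline passes."""
    store.upsert(tenant_id=tenant_id, records=[record])
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        # Fetch-by-id is cheaper than a vector query for a visibility check.
        if store.fetch(tenant_id=tenant_id, ids=[record["id"]]):
            return True
        time.sleep(poll_s)
    # Not yet visible: keep the content pinned in working context (see FAQ).
    return False
```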
Refresh cadence
| Source | Recommended cadence | Trigger |
|---|---|---|
| Static reference docs | Weekly batch | Doc change webhook |
| Product catalog | Hourly | CDC stream |
| User memory | Real-time | Per turn |
| Public web | Daily | Scheduled crawl |
Rebuild full indexes only when the embedding model changes; otherwise upsert in place, as in the sketch below.
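A sketch of an in-place upsert against the same hypothetical store client: the deterministic id makes refreshes idempotent, and the namespace label (illustrative here) encodes model + dimension so a model upgrade lands in a fresh namespace rather than mixing vectors.

```python
from datetime import datetime, timezone

def refresh_chunk(store, tenant_id, doc_id, chunk_index, text,
                  model_ns="embed-v2-1536d"):  # hypothetical namespace label
    store.upsert(
        tenant_id=tenant_id,
        namespace=model_ns,  # namespace by model + dimension (see pitfalls)
        records=[{
            # Deterministic id: re-running a refresh overwrites, never duplicates.
            "id": f"{doc_id}#chunk_{chunk_index}",
            "vector": embed(text),
            "metadata": {
                "tenant_id": tenant_id,
                "chunk_index": chunk_index,
                "updated_at": datetime.now(timezone.utc).isoformat(),
            },
        }],
    )
```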
Cost modeling
Agent retrieval costs decompose into storage, query, and embedding regeneration.
- Storage: vector size = dim × 4 bytes (float32) or dim × 1 byte (int8 quantized). HNSW adds 2-5x for graph edges.
- Query: managed services charge per read unit or per query; self-hosted is dominated by RAM and CPU.
- Embedding regeneration: triggered by model upgrades or chunking changes.
For a 10M-vector, 1,536-dim corpus, public benchmarks show monthly costs ranging roughly from low hundreds (pgvector on managed Postgres) to several hundreds (managed serverless), depending on QPS and retention (Vecstore benchmark, 2026). Treat any vendor pricing page as the source of truth.
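A back-of-envelope check of the storage term for that 10M-vector, 1,536-dim example, applying the 2-5x HNSW footprint range from the bullets above:

```python
vectors, dim = 10_000_000, 1536
gib = 1024 ** 3

raw_f32 = vectors * dim * 4   # float32 payload
raw_i8 = vectors * dim * 1    # int8-quantized payload

print(f"float32 raw:  {raw_f32 / gib:.1f} GiB")                        # ~57.2 GiB
print(f"int8 raw:     {raw_i8 / gib:.1f} GiB")                         # ~14.3 GiB
print(f"float32 HNSW: {2 * raw_f32 / gib:.0f}-{5 * raw_f32 / gib:.0f} GiB")  # 114-286 GiB
```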
Reference write contract
```json
{
  "id": "doc_123#chunk_4",
  "vector": [0.12, -0.04, 0.91],
  "sparse_vector": {"indices": [42, 1019], "values": [0.7, 0.3]},
  "metadata": {
    "tenant_id": "acme",
    "source": "https://example.com/article",
    "chunk_index": 4,
    "updated_at": "2026-05-03T10:00:00Z",
    "acl": ["role:reader"]
  }
}
```
Reference query contract
```python
results = store.query(
    tenant_id="acme",
    vector=embed(query_text),
    sparse_vector=sparse_encode(query_text),
    top_k=12,
    filter={"updated_at": {"$gte": "2026-01-01"}},
    hybrid_alpha=0.6,
)
```
Common pitfalls
- Mixing embedding models in one index. Always namespace by model + dimension.
- Forgetting to re-embed after a chunking change.
- Using a top_k larger than the agent's context budget can ingest.
- Treating namespace filters as security without server-side enforcement.
- Skipping hybrid search and then complaining the agent cannot find SKUs.
FAQ
Q: Which vector store should a small team start with?
Start with pgvector if you already run Postgres. You get HNSW indexing, transactional writes, and SQL filters without adding a new system. Move to a dedicated store when memory pressure or QPS exceeds what your Postgres can support.
Q: When does HNSW stop being the right choice?
When the graph no longer fits in RAM or when ingest dominates queries. At hundreds of millions of vectors, IVF or DiskANN can be cheaper if you accept a small recall hit.
Q: How should agents handle a stale read after a memory write?
Treat the write as a future read. Either confirm the write before the next agent turn, or include the just-written content in the working context until the index propagates.
Q: Is hybrid search worth the complexity?
For agent workloads with codes, identifiers, or domain jargon, yes. Pure dense retrieval often misses exact-match tokens. Start with Weaviate or Qdrant native hybrid; build it manually only on pgvector.
Q: How do I avoid leaking documents across tenants?
Pass tenant_id into every read and write. Enforce filters server-side via row-level security (pgvector) or per-tenant indexes (Pinecone, Qdrant). Never rely on application code alone.