Vector Database Benchmarks

Performance comparison across leading vector databases on 1M vectors, 768 dimensions (OpenAI embeddings). Updated October 2025.

| Database | QPS (Queries/sec) | Recall@10 | P95 Latency | Monthly Cost (AWS) | Best For |
|----------|-------------------|-----------|-------------|--------------------|----------|
| Pinecone | 12,500 | 0.98 | 18ms | $420 | Managed, Low Latency |
| Weaviate | 10,800 | 0.96 | 22ms | $340 | Hybrid Search |
| Milvus | 15,200 | 0.95 | 25ms | $280 | High Throughput |
| Qdrant | 11,500 | 0.97 | 20ms | $310 | On-Prem Control |
| pgvector | 3,200 | 0.92 | 45ms | $180 | Budget, Simplicity |
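
For reference, Recall@10 is conventionally computed as the fraction of each query's true top-10 nearest neighbors (from exact, brute-force search) that the index returns in its approximate top 10, averaged over the query set. A minimal sketch of the calculation (function and variable names are illustrative):

```python
def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the exact top-k neighbors that the ANN index also returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

def mean_recall_at_k(approx_results, exact_results, k=10):
    """Average recall@k over a query set (one list of result IDs per query)."""
    per_query = [recall_at_k(a, e, k) for a, e in zip(approx_results, exact_results)]
    return sum(per_query) / len(per_query)
```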

Selection Guide

Choose Pinecone if:

  • You need a managed service
  • Lowest latency is critical
  • Budget allows premium pricing

Choose Milvus if:

  • High throughput required
  • Self-hosted deployment
  • Cost-effective at scale

Choose Weaviate if:

  • Hybrid search needed
  • GraphQL interface preferred
  • Semantic + keyword search

Choose pgvector if:

  • Already using PostgreSQL (see the sketch below)
  • Simple use case
  • Tight budget constraints
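
For the pgvector route, a minimal setup sketch (assuming PostgreSQL with the pgvector extension installed and the psycopg driver; table, column, and connection details are illustrative):

```python
import psycopg  # psycopg 3

conn = psycopg.connect("postgresql://localhost/mydb")  # placeholder connection string
with conn, conn.cursor() as cur:
    # One-time setup: enable the extension and create a 768-dimension vector column.
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS documents (
            id bigserial PRIMARY KEY,
            content text,
            embedding vector(768)
        )
    """)
    # Optional HNSW index (cosine distance); this is what trades recall for speed.
    cur.execute(
        "CREATE INDEX IF NOT EXISTS documents_embedding_idx "
        "ON documents USING hnsw (embedding vector_cosine_ops)"
    )

    # Top-10 nearest neighbors by cosine distance for a query embedding.
    query_embedding = [0.01] * 768  # placeholder; use a real embedding
    vec_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    cur.execute(
        "SELECT id, content FROM documents ORDER BY embedding <=> %s::vector LIMIT 10",
        (vec_literal,),
    )
    print(cur.fetchall())
```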

LLM Inference Framework Comparison

Benchmark results for Llama 2 70B on an NVIDIA A100 (80GB), tested October 2025.

vLLM

Throughput: 2,850 tokens/sec
Latency (TTFT): 180ms
Batch Size: 128
GPU Utilization: 92%
Best Feature: PagedAttention
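
A minimal vLLM usage sketch for offline batched generation (the checkpoint name, GPU count, and sampling settings are assumptions; adjust tensor parallelism or quantization for your hardware):

```python
from vllm import LLM, SamplingParams  # pip install vllm

# The engine handles PagedAttention and continuous batching internally;
# callers just submit a batch of prompts.
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",  # assumed checkpoint
    tensor_parallel_size=4,                  # assumed GPU count for an fp16 70B model
)

sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)
prompts = [
    "Explain PagedAttention in one sentence.",
    "List three uses of vector databases.",
]

for output in llm.generate(prompts, sampling):
    print(output.prompt, "->", output.outputs[0].text.strip())
```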

TensorRT-LLM

Throughput: 3,120 tokens/sec
Latency (TTFT): 165ms
Batch Size: 96
GPU Utilization: 95%
Best Feature: Kernel Fusion

Text Generation Inference

Throughput: 2,420 tokens/sec
Latency (TTFT): 210ms
Batch Size: 64
GPU Utilization: 85%
Best Feature: Easy Setup

Framework Selection Matrix

| Scenario | Recommended Framework | Reasoning |
|----------|-----------------------|-----------|
| Maximum throughput needed | TensorRT-LLM | Highest measured throughput (~10% above vLLM), best GPU utilization |
| Quick deployment, ease of use | Text Generation Inference | Docker-based, minimal configuration required |
| High concurrency, variable request sizes | vLLM | PagedAttention handles dynamic batching efficiently |
| Production NVIDIA infrastructure | TensorRT-LLM | Native optimization for NVIDIA GPUs |
| Open-source ecosystem preferred | vLLM | Active community, frequent updates |
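
To reproduce TTFT and throughput numbers like the ones above, one approach is to stream from an OpenAI-compatible HTTP endpoint (vLLM and Text Generation Inference expose one natively; for TensorRT-LLM this depends on the serving layer) and time the first and last chunks. A rough sketch; the base URL, API key, and model name are placeholders:

```python
import time
from openai import OpenAI  # pip install openai

# Placeholder endpoint for a locally served model behind an OpenAI-compatible API.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def measure(prompt: str, model: str = "meta-llama/Llama-2-70b-chat-hf"):
    start = time.perf_counter()
    first_token_at = None
    n_chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()  # time to first token
            n_chunks += 1
    total = time.perf_counter() - start
    ttft_ms = (first_token_at - start) * 1000
    tokens_per_sec = n_chunks / total  # rough: assumes ~one token per streamed chunk
    return ttft_ms, tokens_per_sec

print(measure("Summarize the benefits of continuous batching."))
```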

Prompt Benchmark Database

Searchable database of 500+ prompt variations with performance metrics across models and tasks. Filter by task type, model, and input length.

| Task | Model | Prompt Pattern | Input Length | Accuracy | Latency |
|------|-------|----------------|--------------|----------|---------|
| Code Generation | GPT-4 | Few-shot with examples | 1,200 tokens | 94% | 2.3s |
| Summarization | Claude 3.5 | Chain-of-thought | 4,500 tokens | 91% | 3.1s |
| Q&A | Llama 3.1 | RAG with context | 800 tokens | 88% | 1.8s |
| Classification | GPT-4 | Zero-shot with definitions | 350 tokens | 96% | 0.9s |
| Translation | Claude 3.5 | Direct instruction | 600 tokens | 93% | 1.4s |

Note: Accuracy metrics are based on human evaluation and automated benchmarks; latency is measured on standard API endpoints. The database is updated monthly with new prompts and model versions.
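
If the database is exported to a flat file, filtering it programmatically might look like the following (the file name and column names are assumptions modeled on the table above):

```python
import pandas as pd

# Assumed CSV export of the prompt benchmark database.
df = pd.read_csv("prompt_benchmarks.csv")

# Example filter: code-generation prompts for GPT-4 with inputs under 2,000 tokens,
# sorted by accuracy.
subset = (
    df[(df["task"] == "Code Generation")
       & (df["model"] == "GPT-4")
       & (df["input_length_tokens"] < 2000)]
    .sort_values("accuracy", ascending=False)
)
print(subset[["prompt_pattern", "accuracy", "latency_s"]].head())
```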