Vector Database Benchmarks

Performance comparison across leading vector databases on 1M vectors, 768 dimensions (OpenAI embeddings). Updated October 2025.

| Database | QPS (Queries/sec) | Recall@10 | P95 Latency | Monthly Cost (AWS) | Best For |
|----------|-------------------|-----------|-------------|--------------------|----------|
| Pinecone | 12,500 | 0.98 | 18ms | $420 | Managed, Low Latency |
| Weaviate | 10,800 | 0.96 | 22ms | $340 | Hybrid Search |
| Milvus | 15,200 | 0.95 | 25ms | $280 | High Throughput |
| Qdrant | 11,500 | 0.97 | 20ms | $310 | On-Prem Control |
| pgvector | 3,200 | 0.92 | 45ms | $180 | Budget, Simplicity |
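
For reference, Recall@10 is conventionally computed as the fraction of each query's true top-10 nearest neighbors (from exact, brute-force search) that the index returns in its approximate top 10, averaged over the query set. A minimal sketch of the calculation (function and variable names are illustrative):

```python
def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the exact top-k neighbors that the ANN index also returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

def mean_recall_at_k(approx_results, exact_results, k=10):
    """Average recall@k over a query set (one list of result IDs per query)."""
    per_query = [recall_at_k(a, e, k) for a, e in zip(approx_results, exact_results)]
    return sum(per_query) / len(per_query)
```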

Selection Guide

Choose Pinecone if:

  • You need a managed service
  • Lowest latency is critical
  • Budget allows premium pricing

Choose Milvus if:

  • High throughput required
  • Self-hosted deployment
  • Cost-effective at scale

Choose Weaviate if:

  • Hybrid search needed
  • GraphQL interface preferred
  • Semantic + keyword search

Choose pgvector if:

  • Already using PostgreSQL (see the sketch below)
  • Simple use case
  • Tight budget constraints
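
For the pgvector route, a minimal setup sketch (assuming PostgreSQL with the pgvector extension installed and the psycopg driver; table, column, and connection details are illustrative):

```python
import psycopg  # psycopg 3

conn = psycopg.connect("postgresql://localhost/mydb")  # placeholder connection string
with conn, conn.cursor() as cur:
    # One-time setup: enable the extension and create a 768-dimension vector column.
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS documents (
            id bigserial PRIMARY KEY,
            content text,
            embedding vector(768)
        )
    """)
    # Optional HNSW index (cosine distance); this is what trades recall for speed.
    cur.execute(
        "CREATE INDEX IF NOT EXISTS documents_embedding_idx "
        "ON documents USING hnsw (embedding vector_cosine_ops)"
    )

    # Top-10 nearest neighbors by cosine distance for a query embedding.
    query_embedding = [0.01] * 768  # placeholder; use a real embedding
    vec_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    cur.execute(
        "SELECT id, content FROM documents ORDER BY embedding <=> %s::vector LIMIT 10",
        (vec_literal,),
    )
    print(cur.fetchall())
```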

LLM Inference Framework Comparison

Benchmark results for Llama 2 70B on an NVIDIA A100 (80GB), tested October 2025.

vLLM

Throughput: 2,850 tokens/sec
Latency (TTFT): 180ms
Batch Size: 128
GPU Utilization: 92%
Best Feature: PagedAttention
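
A minimal vLLM usage sketch for offline batched generation (the checkpoint name, GPU count, and sampling settings are assumptions; adjust tensor parallelism or quantization for your hardware):

```python
from vllm import LLM, SamplingParams  # pip install vllm

# The engine handles PagedAttention and continuous batching internally;
# callers just submit a batch of prompts.
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",  # assumed checkpoint
    tensor_parallel_size=4,                  # assumed GPU count for an fp16 70B model
)

sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)
prompts = [
    "Explain PagedAttention in one sentence.",
    "List three uses of vector databases.",
]

for output in llm.generate(prompts, sampling):
    print(output.prompt, "->", output.outputs[0].text.strip())
```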

TensorRT-LLM

Throughput: 3,120 tokens/sec
Latency (TTFT): 165ms
Batch Size: 96
GPU Utilization: 95%
Best Feature: Kernel Fusion

Text Generation Inference

Throughput: 2,420 tokens/sec
Latency (TTFT): 210ms
Batch Size: 64
GPU Utilization: 85%
Best Feature: Easy Setup

Framework Selection Matrix

| Scenario | Recommended Framework | Reasoning |
|----------|-----------------------|-----------|
| Maximum throughput needed | TensorRT-LLM | Highest measured throughput (~10% above vLLM), best GPU utilization |
| Quick deployment, ease of use | Text Generation Inference | Docker-based, minimal configuration required |
| High concurrency, variable request sizes | vLLM | PagedAttention handles dynamic batching efficiently |
| Production NVIDIA infrastructure | TensorRT-LLM | Native optimization for NVIDIA GPUs |
| Open-source ecosystem preferred | vLLM | Active community, frequent updates |
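
To reproduce TTFT and throughput numbers like the ones above, one approach is to stream from an OpenAI-compatible HTTP endpoint (vLLM and Text Generation Inference expose one natively; for TensorRT-LLM this depends on the serving layer) and time the first and last chunks. A rough sketch; the base URL, API key, and model name are placeholders:

```python
import time
from openai import OpenAI  # pip install openai

# Placeholder endpoint for a locally served model behind an OpenAI-compatible API.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def measure(prompt: str, model: str = "meta-llama/Llama-2-70b-chat-hf"):
    start = time.perf_counter()
    first_token_at = None
    n_chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()  # time to first token
            n_chunks += 1
    total = time.perf_counter() - start
    ttft_ms = (first_token_at - start) * 1000
    tokens_per_sec = n_chunks / total  # rough: assumes ~one token per streamed chunk
    return ttft_ms, tokens_per_sec

print(measure("Summarize the benefits of continuous batching."))
```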

Prompt Benchmark Database

Searchable database of 500+ prompt variations with performance metrics across models and tasks. Filter by task type, model, and input length.

| Task | Model | Prompt Pattern | Input Length | Accuracy | Latency |
|------|-------|----------------|--------------|----------|---------|
| Code Generation | GPT-4 | Few-shot with examples | 1,200 tokens | 94% | 2.3s |
| Summarization | Claude 3.5 | Chain-of-thought | 4,500 tokens | 91% | 3.1s |
| Q&A | Llama 3.1 | RAG with context | 800 tokens | 88% | 1.8s |
| Classification | GPT-4 | Zero-shot with definitions | 350 tokens | 96% | 0.9s |
| Translation | Claude 3.5 | Direct instruction | 600 tokens | 93% | 1.4s |

Note: Accuracy metrics are based on human evaluation and automated benchmarks; latency is measured on standard API endpoints. The database is updated monthly with new prompts and model versions.
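
If the database is exported to a flat file, filtering it programmatically might look like the following (the file name and column names are assumptions modeled on the table above):

```python
import pandas as pd

# Assumed CSV export of the prompt benchmark database.
df = pd.read_csv("prompt_benchmarks.csv")

# Example filter: code-generation prompts for GPT-4 with inputs under 2,000 tokens,
# sorted by accuracy.
subset = (
    df[(df["task"] == "Code Generation")
       & (df["model"] == "GPT-4")
       & (df["input_length_tokens"] < 2000)]
    .sort_values("accuracy", ascending=False)
)
print(subset[["prompt_pattern", "accuracy", "latency_s"]].head())
```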