Server data from the Official MCP Registry
Persistent memory for AI agents — semantic + recency search, ONNX embeddings, Docker Compose.
Valid MCP server (2 strong, 3 medium validity signals). 1 known CVE in dependencies. Imported from the Official MCP Registry.
10 files analyzed · 1 issue found
Security scores are indicators to help you make informed decisions, not guarantees. Always review permissions before connecting any MCP server.
From the project's GitHub README.
A production-grade persistent memory service for AI agents. Agents forget everything between sessions by default — memex fixes that. It stores, retrieves, and ranks conversation memory using semantic search with recency decay, so agents surface what's relevant and recent, not just what's semantically closest.
POST /v1/memories → store a memory, embed it, persist to Postgres
POST /v1/memories/search → retrieve top-k memories ranked by similarity + recency
DELETE /v1/memories/{id} → forget a specific memory
GET /v1/memories/count → how many memories does this agent/user have
GET /health → liveness + DB connectivity check
GET /metrics → Prometheus metrics
caller (agent / app)
        │
        ▼
  FastAPI (async)
        │
   ┌────┴────┐
   │         │
embeddings   asyncpg pool (min=5, max=20)
(fastembed        │
 ONNX,            ▼
 local)      PostgreSQL 16
             pgvector extension
             ivfflat index (cosine)
Write path: content → fastembed ONNX inference (local, ~12 ms CPU, BAAI/bge-small-en-v1.5) → INSERT with 384-dim vector → return memory ID.
Read path: query → embed → pgvector cosine search (top_k × 3 candidates) → re-rank with recency decay in Python → return top_k results with scores.
Pure vector similarity returns the most semantically similar memories, not the most useful ones. A fact from 90 days ago that's a 0.95 similarity match is often less useful than a 0.80 match from yesterday.
Score formula:
score = α × cosine_similarity + (1 − α) × exp(−λ × age_days)
Where λ = ln(2) / half_life_days (default: 30 days, so a 30-day-old memory has 50% recency weight).
α is configurable per request (default 0.7). Task-focused agents use higher α (semantic dominates). Conversational agents use lower α (recency matters more).
The pgvector query returns top_k × 3 candidates sorted by pure similarity. Python re-ranks with the decay formula and slices to top_k. This prevents recency decay from starving high-similarity older memories — they're still in the candidate pool.
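The scoring and re-ranking step can be sketched in a few lines of Python. This is a minimal illustration of the formula above, not the project's actual code: the `Candidate` class and the function names are hypothetical, and only the formula, the default α of 0.7, and the 30-day half-life come from the text.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical container for one row returned by the vector search."""
    memory_id: str
    similarity: float  # cosine similarity from the vector search
    age_days: float    # time since the memory was created, in days


def recency_score(similarity: float, age_days: float,
                  alpha: float = 0.7, half_life_days: float = 30.0) -> float:
    """score = alpha * similarity + (1 - alpha) * exp(-lambda * age_days)."""
    decay_rate = math.log(2) / half_life_days  # lambda in the formula
    return alpha * similarity + (1 - alpha) * math.exp(-decay_rate * age_days)


def rerank(candidates: list[Candidate], top_k: int, **score_kwargs) -> list[Candidate]:
    """Re-rank an over-fetched candidate pool and keep the best top_k."""
    ranked = sorted(
        candidates,
        key=lambda c: recency_score(c.similarity, c.age_days, **score_kwargs),
        reverse=True,
    )
    return ranked[:top_k]


# The example from the text: a 0.95 match from 90 days ago
# versus a 0.80 match from yesterday.
old = Candidate("old-fact", similarity=0.95, age_days=90)
new = Candidate("yesterday", similarity=0.80, age_days=1)
print([c.memory_id for c in rerank([old, new], top_k=2)])
# -> ['yesterday', 'old-fact']
```

With the defaults, the 90-day-old memory scores about 0.70 and yesterday's about 0.85. Raising α to 0.95 flips the order, which matches the guidance that task-focused agents should use a higher α.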
At 10× scale (>1M memories per agent): push the scoring into a Postgres function (defined with CREATE FUNCTION in SQL or PL/pgSQL) so candidates are ranked inside the database and the Python re-ranking round-trip is eliminated.
SQLAlchemy adds ORM overhead on every query. The hot retrieval path — embed, query, re-rank — needs to be tight. asyncpg gives direct control over pool min/max (same instinct as tuning HikariCP in Java). pgvector queries require raw SQL for the <=> operator anyway.
Pool defaults: min=5, max=20, right-sized for a single-instance deployment. Override the maximum via the DB_MAX_POOL_SIZE env var.
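As a rough sketch of what the raw-SQL retrieval path might look like with asyncpg: the table name `memories`, its columns, and the helper names below are assumptions for illustration and may not match the project's schema. The vector is passed as text and cast in SQL, so no custom asyncpg codec is needed.

```python
# Hypothetical table and column names; the project's real schema may differ.
SEARCH_SQL = """
SELECT id, content, created_at,
       1 - (embedding <=> $1::text::vector) AS similarity
FROM memories
WHERE agent_id = $2 AND user_id = $3
ORDER BY embedding <=> $1::text::vector
LIMIT $4
"""


def to_pgvector(embedding: list[float]) -> str:
    """Format a vector as pgvector's text literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


async def create_pool(dsn: str):
    """Create a pool with the defaults quoted above (min=5, max=20)."""
    import asyncpg  # third-party: pip install asyncpg

    return await asyncpg.create_pool(dsn, min_size=5, max_size=20)


async def fetch_candidates(pool, embedding: list[float],
                           agent_id: str, user_id: str, top_k: int):
    """Over-fetch top_k * 3 rows by cosine distance for later re-ranking."""
    return await pool.fetch(
        SEARCH_SQL, to_pgvector(embedding), agent_id, user_id, top_k * 3
    )
```

`<=>` is pgvector's cosine distance operator, so `1 - distance` gives the similarity used in the score formula.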
Rate limiting is a sliding window counter kept in Postgres via upsert, which means one fewer dependency than a Redis-backed limiter. It is correct under concurrent requests because the upsert is transactional. At 10× scale with distributed deployments: replace with Redis INCR + EXPIRE (atomic operations, no lock contention).
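To make the counting idea concrete, here is a small in-memory sketch of a sliding window counter. It is an assumption about the general technique, not the project's implementation: a dict stands in for the Postgres table, the increment stands in for the upsert, and the class name, limit, and window length are all hypothetical.

```python
import time
from collections import defaultdict
from typing import Optional


class SlidingWindowCounter:
    """Approximate sliding window: weight the previous fixed window by how
    much of it still overlaps the sliding window, then add the current one."""

    def __init__(self, limit: int, window_seconds: float = 60.0):
        self.limit = limit
        self.window = window_seconds
        # (key, window_index) -> request count; stands in for a DB table.
        self.counts = defaultdict(int)

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        index = int(now // self.window)
        elapsed_fraction = (now % self.window) / self.window
        previous = self.counts.get((key, index - 1), 0)
        current = self.counts.get((key, index), 0)
        estimated = previous * (1 - elapsed_fraction) + current
        if estimated >= self.limit:
            return False
        self.counts[(key, index)] += 1  # the step an upsert would perform
        return True


limiter = SlidingWindowCounter(limit=2, window_seconds=10)
print(limiter.allow("agent-a", now=0.0))  # -> True
print(limiter.allow("agent-a", now=1.0))  # -> True
print(limiter.allow("agent-a", now=2.0))  # -> False
```

In a database-backed version, the read and the increment would need to happen in one transaction to stay correct under concurrency, which is the property the text attributes to the transactional upsert.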
ivfflat has a lower build cost and a lower memory footprint than HNSW, which is the right tradeoff at small-to-medium scale (<1M vectors). lists=100 works well up to ~1M rows. At 10× scale: switch to HNSW (m=16, ef_construction=64) for better recall at the cost of higher memory and build time.
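For reference, the two index choices map onto pgvector DDL roughly as follows. The helper below is only a sketch: the table name `memories` and column name `embedding` are assumed, while the index options are the ones quoted in the text.

```python
def vector_index_ddl(kind: str, table: str = "memories",
                     column: str = "embedding") -> str:
    """Build a pgvector CREATE INDEX statement for cosine distance.

    Table and column names are hypothetical defaults.
    """
    if kind == "ivfflat":
        options = "lists = 100"
    elif kind == "hnsw":
        options = "m = 16, ef_construction = 64"
    else:
        raise ValueError(f"unsupported index kind: {kind!r}")
    return (
        f"CREATE INDEX ON {table} USING {kind} "
        f"({column} vector_cosine_ops) WITH ({options});"
    )


print(vector_index_ddl("ivfflat"))
print(vector_index_ddl("hnsw"))
```

`vector_cosine_ops` is the operator class that makes the index usable for `<=>` (cosine distance) queries.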
Prerequisites: Docker and Docker Compose. No API keys required — the entire stack runs locally.
git clone https://github.com/ayushagrawal288/memex
cd memex
docker compose up
The API is live at http://localhost:8000. Interactive docs at http://localhost:8000/docs.
curl -X POST http://localhost:8000/v1/memories \
  -H "Content-Type: application/json" \
  -d '{
    "agent_id": "my-agent",
    "user_id": "user-123",
    "content": "User prefers concise responses and dislikes verbose explanations.",
    "memory_type": "semantic",
    "importance": 1.2
  }'
{
  "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "agent_id": "my-agent",
  "user_id": "user-123",
  "content": "User prefers concise responses and dislikes verbose explanations.",
  "importance": 1.2,
  "memory_type": "semantic",
  "created_at": "2026-05-26T10:30:00Z",
  "score": null
}
curl -X POST http://localhost:8000/v1/memories/search \
  -H "Content-Type: application/json" \
  -d '{
    "agent_id": "my-agent",
    "user_id": "user-123",
    "query": "how does this user like to communicate",
    "top_k": 5,
    "alpha": 0.7
  }'
{
  "results": [
    {
      "id": "3fa85f64-...",
      "content": "User prefers concise responses and dislikes verbose explanations.",
      "memory_type": "semantic",
      "created_at": "2026-05-26T10:30:00Z",
      "score": 0.8921
    }
  ],
  "query": "how does this user like to communicate",
  "total": 1
}
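The same two calls can be made from Python with only the standard library. This is a minimal client sketch under the assumption that the request and response shapes match the curl examples above; the function names are made up for illustration and are not part of memex.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # address from the quick start


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for one of the memory endpoints."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path: str, payload: dict) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_request(path, payload), timeout=10) as resp:
        return json.load(resp)


def store_memory(agent_id: str, user_id: str, content: str,
                 importance: float, memory_type: str = "semantic") -> dict:
    return post_json("/v1/memories", {
        "agent_id": agent_id,
        "user_id": user_id,
        "content": content,
        "memory_type": memory_type,
        "importance": importance,
    })


def search_memories(agent_id: str, user_id: str, query: str,
                    top_k: int = 5, alpha: float = 0.7) -> list:
    body = post_json("/v1/memories/search", {
        "agent_id": agent_id,
        "user_id": user_id,
        "query": query,
        "top_k": top_k,
        "alpha": alpha,
    })
    return body["results"]
```

With the stack running, `search_memories("my-agent", "user-123", "how does this user like to communicate")` would be expected to return the list shown under `results` above.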
| Type | Use for |
|---|---|
| episodic | Specific events, past conversations |
| semantic | Facts, preferences, general knowledge |
| procedural | Workflows, how-to instructions |
Run on a MacBook M-series, Docker Desktop, single Postgres instance:
locust -f scripts/load_test.py --host=http://localhost:8000 \
--headless -u 50 -r 10 -t 60s
Realistic load (50 users, 100–300 ms think time — models actual agent traffic):
| Endpoint | RPS | p50 (ms) | p95 (ms) | p99 (ms) | Error rate |
|---|---|---|---|---|---|
| POST /v1/memories (write) | 27 | 160 | 270 | 330 | 0% |
| POST /v1/memories/search | 83 | 110 | 200 | 250 | 0% |
| Aggregated | 113 | 120 | 230 | 300 | 0% |
Saturation test (500 users, minimal think time — finds the throughput ceiling):
| Endpoint | RPS (plateau) | p50 (ms) | p99 (ms) | Error rate |
|---|---|---|---|---|
| POST /v1/memories (write) | 28 | 3,900 | 6,100 | 0% |
| POST /v1/memories/search | 91 | 3,600 | 5,800 | 0% |
| Aggregated | ~120 | 3,700 | 5,900 | 0% |
Run on MacBook M-series, Docker Desktop (4 CPUs), 4 uvicorn workers, 16 threads/worker.
Embeddings: local ONNX (BAAI/bge-small-en-v1.5) — zero external API calls, zero cost.
Why the ceiling is ~120 RPS:
Every write and every search requires one ONNX inference (~10–15 ms on CPU). With 4 Docker CPUs: 4 cores / 12 ms ≈ 333 embeddings/s theoretical max. After Python overhead, DB queries, and asyncio scheduling: ~120 RPS actual.
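The back-of-envelope estimate can be reproduced directly. The sketch below uses only the figures quoted in the paragraph above (4 CPUs, ~12 ms per inference, ~120 RPS observed); the variable names are illustrative.

```python
cores = 4                   # Docker Desktop CPU allocation
inference_seconds = 0.012   # ~12 ms per ONNX embedding on CPU
observed_rps = 120          # measured plateau from the saturation test

# Each core can run 1 / 0.012 inferences per second if it does nothing else.
theoretical_rps = cores / inference_seconds
efficiency = observed_rps / theoretical_rps

print(f"theoretical max: {theoretical_rps:.0f} embeddings/s")  # -> 333
print(f"achieved: {efficiency:.0%} of theoretical")            # -> 36%
```

The remaining ~64% is the Python overhead, DB queries, and asyncio scheduling mentioned above.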
Path to higher throughput:
| Approach | Expected gain | Complexity |
|---|---|---|
| Embedding cache (Redis, key = SHA256 of text) | 2–3× (40–60% hit rate on repeated agent queries) | Low |
| Horizontal scaling (N replicas behind a load balancer) | N× linear | Medium |
| GPU inference (swap ONNX runtime → CUDA) | 10–50× | Medium |
| Voyage-3 API (offload embedding to Voyage AI's hosted inference) | Scales to thousands of RPS, limited by API quota | Low code change |
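The first row of the table, an embedding cache keyed by the SHA256 of the text, could look something like this. It is a sketch only: an in-memory dict stands in for Redis, and `EmbeddingCache` and `fake_embed` are hypothetical names rather than anything in the repository.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """Cache embeddings by the SHA256 of the input text.

    A dict stands in for Redis here; the key scheme is the same.
    """

    def __init__(self, embed: Callable[[str], list]):
        self._embed = embed
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text: str) -> list:
        cache_key = self.key(text)
        if cache_key in self._store:
            self.hits += 1
            return self._store[cache_key]
        self.misses += 1
        vector = self._embed(text)  # the expensive ONNX call in practice
        self._store[cache_key] = vector
        return vector


def fake_embed(text: str) -> list:
    """Stand-in for the real model: returns a tiny deterministic vector."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


cache = EmbeddingCache(fake_embed)
cache.get("how does this user like to communicate")
cache.get("how does this user like to communicate")
print(cache.hits, cache.misses)  # -> 1 1
```

Only repeated, byte-identical queries hit the cache, so the 40–60% hit rate quoted in the table depends on how often agents reissue the same text.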
memex/
├── app/
│ ├── main.py # REST API — FastAPI, lifespan, router registration
│ ├── mcp_server.py # MCP server — single-worker FastAPI on port 8001
│ ├── core/
│ │ └── config.py # All settings, loaded from env
│ ├── db/
│ │ └── pool.py # asyncpg pool, migrations
│ ├── models/
│ │ └── schemas.py # Pydantic request/response models
│ ├── services/
│ │ ├── embeddings.py # fastembed ONNX inference (local, zero API calls)
│ │ ├── local_summarizer.py # Extractive summariser — Jaccard dedup + TF scoring
│ │ ├── memory.py # Core write/search/scoring logic
│ │ ├── metrics.py # Prometheus metric definitions
│ │ ├── summarizer.py # Background summarisation job
│ │ └── rate_limit.py # Sliding window rate limiter
│ └── api/routes/
│ ├── memories.py # Memory endpoints
│ ├── health.py # Health + readiness
│ └── mcp_tools.py # MCP tool definitions (store, search, delete, count)
├── scripts/
│ └── load_test.py # Locust load test
├── docker-compose.yml
├── Dockerfile
└── requirements.txt
docker compose up starts Prometheus and Grafana alongside the API:
| Service | URL | Credentials |
|---|---|---|
| REST API docs | http://localhost:8000/docs | — |
| MCP server | http://localhost:8001/mcp/ | — |
| Prometheus | http://localhost:9090 | — |
| Grafana | http://localhost:3000 | admin / admin |
The Grafana dashboard is provisioned automatically. Its panels cover the standard prometheus-fastapi-instrumentator metrics and the embedding metrics (embed / embed_batch).
Custom metrics are in app/services/metrics.py and exposed on /metrics alongside the standard FastAPI instrumentator metrics.
memex exposes itself as an MCP server so any MCP-aware agent (Claude Desktop, Claude Code, custom agents) can store and retrieve memories without custom HTTP integration.
Transport: Streamable HTTP (introduced in the MCP 2025-03-26 spec revision). Single-worker process on port 8001. Session state is in-process, so a separate service avoids sticky-session complexity while keeping the REST API's multi-worker throughput.
Tools:
| Tool | Description |
|---|---|
| store_memory | Embed + persist a memory (type, importance configurable) |
| search_memories | Semantic + recency ranked retrieval with configurable alpha |
| delete_memory | Forget a specific memory by UUID |
| count_memories | How many memories an agent/user pair has |
Add to ~/.config/claude/claude_desktop_config.json:
{
  "mcpServers": {
    "memex": {
      "type": "streamable-http",
      "url": "http://localhost:8001/mcp/"
    }
  }
}
claude mcp add --transport http memex http://localhost:8001/mcp/
The MCP Streamable HTTP transport is session-stateful — initialize, tools/list, and tools/call must all reach the same server process. The REST API runs 4 uvicorn workers with round-robin routing; routing different MCP requests to different workers breaks session state.
Running a dedicated single-worker MCP service on port 8001 avoids sticky-session infrastructure (nginx ip_hash, Redis session store) while keeping the REST API fully multi-worker.
Runs as a background asyncio task on a configurable interval (default: every 5 minutes). Finds any (agent_id, user_id) pair where episodic memory count exceeds a threshold, condenses the oldest batch into a single semantic memory, then deletes the originals. Fully local — no LLM API calls.
How it summarises: Pure Python extractive algorithm. Sentences are deduplicated by Jaccard similarity (≥ 0.7 threshold), scored by word frequency (TF), and the top-N are returned in original order. ~1 ms per summarisation, zero dependencies beyond the standard library.
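A standard-library sketch of this kind of extractive summariser is shown below. It follows the three steps described above (Jaccard dedup at 0.7, term-frequency scoring, top-N in original order), but the sentence splitting, tokenisation, and length normalisation are assumptions, and the function names are hypothetical rather than those in local_summarizer.py.

```python
import re
from collections import Counter


def tokens(sentence: str) -> list:
    return re.findall(r"\w+", sentence.lower())


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two word sets; two empty sets count as equal."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def summarise(texts: list, top_n: int = 3, dedup_threshold: float = 0.7) -> str:
    # 1. Split every memory into sentences (naive punctuation-based split).
    sentences = [
        s.strip()
        for text in texts
        for s in re.split(r"(?<=[.!?])\s+", text)
        if s.strip()
    ]
    # 2. Drop sentences that are near-duplicates of one already kept.
    kept = []
    for sentence in sentences:
        words = set(tokens(sentence))
        if all(jaccard(words, seen) < dedup_threshold for _, seen in kept):
            kept.append((sentence, words))
    # 3. Score each kept sentence by mean term frequency of its words.
    tf = Counter(word for sentence, _ in kept for word in tokens(sentence))

    def score(sentence: str) -> float:
        words = tokens(sentence)
        return sum(tf[w] for w in words) / len(words) if words else 0.0

    best = sorted(range(len(kept)), key=lambda i: score(kept[i][0]), reverse=True)
    # 4. Emit the top-N sentences in their original order.
    return " ".join(kept[i][0] for i in sorted(best[:top_n]))


memories = [
    "User likes tea. User likes tea!",
    "User likes green tea in the morning.",
    "Weather was nice.",
]
print(summarise(memories, top_n=2))
# -> User likes tea. User likes green tea in the morning.
```

The exact duplicate is removed by the Jaccard check, and the off-topic sentence loses on term frequency, which is the behaviour the description implies.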
Why episodic-only: Episodic memories are conversation events with natural time-based obsolescence. Semantic and procedural memories encode facts and skills — silently condensing them risks precision loss; they age out via recency decay instead.
Concurrency safety: Uses pg_try_advisory_xact_lock keyed on hashtext(agent_id|user_id). The lock is held only during the DB write transaction, not during the embedding call.
Tune via env vars:
| Var | Default | Description |
|---|---|---|
| SUMMARIZATION_ENABLED | true | Toggle the background job |
| SUMMARIZATION_THRESHOLD | 100 | Episodic count to trigger per pair |
| SUMMARIZATION_BATCH_SIZE | 50 | Oldest N memories to condense per run |
| SUMMARIZATION_INTERVAL_SECONDS | 300 | How often the job wakes up |
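One way these variables might be read at startup is sketched below. The variable names and defaults come from the table; the `SummarizationSettings` class, the `load_settings` function, and the accepted spellings of boolean values are assumptions and may differ from app/core/config.py.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class SummarizationSettings:
    enabled: bool
    threshold: int
    batch_size: int
    interval_seconds: int


def load_settings(env: Mapping[str, str] = os.environ) -> SummarizationSettings:
    """Read the summarisation settings, falling back to the documented defaults."""
    enabled_raw = env.get("SUMMARIZATION_ENABLED", "true")
    return SummarizationSettings(
        # Assumed truthy spellings; the real parser may accept others.
        enabled=enabled_raw.strip().lower() in {"1", "true", "yes"},
        threshold=int(env.get("SUMMARIZATION_THRESHOLD", "100")),
        batch_size=int(env.get("SUMMARIZATION_BATCH_SIZE", "50")),
        interval_seconds=int(env.get("SUMMARIZATION_INTERVAL_SECONDS", "300")),
    )


print(load_settings({}))
print(load_settings({"SUMMARIZATION_ENABLED": "false", "SUMMARIZATION_THRESHOLD": "10"}))
```

Passing a plain dict instead of `os.environ` keeps the function easy to test without touching the process environment.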
importance score into ranking formula alongside similarity and recency
| Layer | Choice | Why |
|---|---|---|
| API | FastAPI + uvicorn | Async-first, fast, excellent OpenAPI generation |
| Embeddings | fastembed ONNX (BAAI/bge-small-en-v1.5) | Local, zero API calls, ~12 ms CPU inference, 384-dim |
| Database | PostgreSQL 16 + pgvector | Relational + vector in one system, no extra infra |
| Vector index | ivfflat | Lower build cost than HNSW at this scale |
| Pool | asyncpg | Direct control, zero ORM overhead |
| Summariser | Pure Python extractive | Jaccard dedup + TF scoring, zero ML deps, ~1 ms |
| Retry | tenacity | Jitter-based backoff on transient errors |
| Metrics | Prometheus + prometheus-fastapi-instrumentator | Standard observability |
| Load testing | Locust | Python-native, realistic user simulation |