Server data from the Official MCP Registry
Forecast future events and scan prediction-market edges.
Remote endpoints: streamable-http: https://foresea.ink/mcp/
The MCP server is a complex forecasting API with reasonable authentication patterns and permissions appropriate for its purpose (network APIs, file I/O, environment variables for credentials). However, several security and code-quality concerns lower the score: hardcoded/exposed credentials in documentation (Google OAuth client ID, GitHub callback URLs, deployment details), weak fallback auth patterns (fail-open caching), insufficient input validation on critical paths, overly broad exception handling that masks errors, and potential timing-attack vectors in auth flows. The server also lacks CSRF protection on state-changing endpoints despite supporting OAuth. These are moderate rather than critical because authentication is present, credentials are primarily in documentation rather than code, and the server does validate API keys where configured. Supply chain analysis found 11 known vulnerabilities in dependencies (2 critical, 0 high severity).
4 files analyzed · 24 issues found
Security scores are indicators to help you make informed decisions, not guarantees. Always review permissions before connecting any MCP server.
This plugin requests these system permissions. Most are normal for its category.
Available as Local & Remote
This plugin can run on your machine or connect to a hosted endpoint.
From the project's GitHub README.
Conference artifact for studying how explicit rationale instructions affect LLM forecasting behavior on Metaculus-style binary forecasting questions. The codebase contains the prompt variants, batch inference runner, generated result tables, and plotting/analysis scripts used for the paper figures. The live Foresea API also supports prediction-market intelligence: typed forecasts, evidence retrieval, and model-vs-market edge analysis for binary and multiple-choice markets.
Deployed on Google Cloud Run — model gpt-oss-120b, variant variant0_neutral_baseline:
https://foresea.ink
(The URL is printed in the GitHub Actions deploy-step output after the first push to main.)
# Health check
curl https://foresea.ink/health
# Single-record prediction
curl -X POST https://foresea.ink/predict \
-H "Content-Type: application/json" \
-d '{
"question": "Will X happen by date Y?",
"question_type": "binary",
"description": "Context here.",
"news_articles": [],
"attach_evidence": true,
"evidence_top_k": 5,
"market_platform": "Polymarket",
"market_probability": 0.42,
"variant": "variant0_neutral_baseline"
}'
When attach_evidence is true and no news_articles are supplied, /predict
fetches and ranks current news evidence from GDELT, Google News RSS, and Stooq by
default, injects it into the model prompt, and returns the selected
evidence_articles with the forecast. Supplying news_articles skips automatic
retrieval and uses the caller-provided evidence.
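The evidence rule above can be sketched as a small decision function. This is a minimal illustration, not the server's code: `select_evidence` and its `fetch` callback are hypothetical names, and the branch for "evidence disabled and none supplied" is an assumption.

```python
from typing import Callable, Optional

Article = dict


def select_evidence(
    attach_evidence: bool,
    news_articles: Optional[list[Article]],
    fetch: Callable[[], list[Article]],
) -> list[Article]:
    """Return the evidence a request would use under the rule described above."""
    if news_articles:
        # Caller-supplied evidence wins and skips automatic retrieval.
        return list(news_articles)
    if attach_evidence:
        # No caller evidence: fall back to automatic retrieval.
        return fetch()
    # Assumption: with evidence disabled and none supplied, no articles are attached.
    return []
```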
The response includes both the forecast and the evidence used by the model:
{
"question_type": "binary",
"predicted_answer": "Yes",
"confidence": 0.86,
"options": [],
"range_forecast": null,
"rationale": "Model-generated explanation for the forecast.",
"model_rationale": "Model-generated explanation for the forecast.",
"variant": "variant0_neutral_baseline",
"model_key": "gpt-oss-120b",
"evidence_sources": [
{
"source": "Reuters",
"title": "Article headline",
"url": "https://example.com/article",
"publish_date": "2026-05-29T00:00:00Z",
"relevance_score": 0.82
}
],
"evidence_articles": [
{
"title": "Article headline",
"summary": "Cleaned article summary.",
"source": "Reuters",
"url": "https://example.com/article",
"publish_date": "2026-05-29T00:00:00Z",
"relevance_score": 0.82,
"search_query": "query used for retrieval"
}
],
"evidence_error": null,
"market_analysis": {
"platform": "Polymarket",
"market_url": "https://example.com/market",
"outcome": "Yes",
"market_probability": 0.42,
"model_probability": 0.86,
"edge": 0.44,
"stance": "model_above_market",
"summary": "Foresea is 44 percentage points above the market on Yes."
}
}
Use evidence_sources when a client only needs the source list and links. Use
evidence_articles when a client needs the article-level details that were
attached to the model prompt. rationale and model_rationale are generated by
gpt-oss-120b and explain why the model chose its answer and confidence.
When market_probability is supplied, market_analysis is computed
deterministically from the model probability and the market-implied probability.
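As a rough illustration of that deterministic step, the sketch below computes the edge from the two probabilities. The function names are hypothetical. The "values above 1 are percentages" rule and every stance label other than `model_above_market` (which appears in the sample response) are assumptions, not the server's documented behavior.

```python
def normalize_probability(value: float) -> float:
    """Accept 0.42 or 42 and return a probability in [0, 1].

    Assumption: values above 1 are percentages; the server's real rule may differ.
    """
    p = value / 100.0 if value > 1 else float(value)
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"probability out of range: {value}")
    return p


def market_edge(model_probability: float, market_probability: float) -> dict:
    """Compute edge = model_probability - market_probability."""
    market = normalize_probability(market_probability)
    edge = round(model_probability - market, 4)
    if edge > 0:
        stance = "model_above_market"  # label taken from the sample response
    elif edge < 0:
        stance = "model_below_market"  # hypothetical label
    else:
        stance = "model_at_market"  # hypothetical label
    return {
        "market_probability": market,
        "model_probability": model_probability,
        "edge": edge,
        "stance": stance,
    }
```

With the sample values above, `market_edge(0.86, 42)` and `market_edge(0.86, 0.42)` both give an edge of 0.44.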
Production is served from the custom domain:
https://foresea.ink
The Cloud Run service is:
project: brave-drive-471109-d9
region: us-central1
service: analyzing-llm-rationale
Required runtime environment:
- SCADS_AI_API_KEY: Secret Manager secret used by hosted model calls.
- MODEL_DEVICE=cpu: production Cloud Run runs the CPU image.
- CUSTOM_DOMAIN=foresea.ink: redirects *.run.app requests to the public domain.
- GOOGLE_CLIENT_ID: Google OAuth web client ID used by /auth/config.
- GITHUB_CLIENT_ID / GITHUB_CLIENT_SECRET: GitHub OAuth app credentials. The
  OAuth app's callback URL must be the site origin (e.g. https://foresea.ink/).
  When unset, the "Continue with GitHub" button is hidden and /auth/github
  returns 503. Sign-in also works with Google and email/password.
- SESSION_SECRET: long random string used to sign browser session JWTs.

Current Google OAuth client ID:
664177666636-s186jhl522b9vh5enj211tu6t5c13m97.apps.googleusercontent.com
The OAuth client must allow these JavaScript origins:
https://foresea.ink
https://www.foresea.ink
https://analyzing-llm-rationale-hy7gvnvt4a-uc.a.run.app
To update non-secret environment variables without replacing the existing
SESSION_SECRET, use --update-env-vars:
gcloud run services update analyzing-llm-rationale \
--region us-central1 \
--project brave-drive-471109-d9 \
--update-env-vars MODEL_DEVICE=cpu,CUSTOM_DOMAIN=foresea.ink,GOOGLE_CLIENT_ID='664177666636-s186jhl522b9vh5enj211tu6t5c13m97.apps.googleusercontent.com'
Verify the deployed auth config and health endpoint:
curl https://foresea.ink/auth/config
curl https://foresea.ink/health
The server is built to scale horizontally on Cloud Run:
- Email/password accounts (/auth/register, /auth/login). Passwords are stored as salted
  PBKDF2-HMAC-SHA256 hashes; accounts live in Cloud Datastore.
- Rate limits and caches use Redis when REDIS_URL is set, so they are
  shared across instances; otherwise they fall back to per-instance in-memory
  state and fail open. /predict (non-personalised requests), evidence
  retrieval, and /extract URL fetches are cached; public GETs send
  Cache-Control.

| Var | Default | Description |
|---|---|---|
| REDIS_URL | unset | Memorystore/Redis URL. Shares cache + rate limits across instances. |
| PREDICT_CACHE_TTL | 600 | Cache TTL (s) for non-personalised /predict responses. 0 disables. |
| EVIDENCE_CACHE_TTL | 900 | Cache TTL (s) for evidence retrieval. |
| EXTRACT_CACHE_TTL | 3600 | Cache TTL (s) for /extract URL fetches. |
| LOCAL_CACHE_MAX | 1024 | Max entries in the in-memory fallback cache. |
| SEARXNG_URL / TAVILY_API_KEY / SERPER_API_KEY / BRAVE_API_KEY | unset | Enable web search as an evidence source. A self-hosted SearXNG is preferred when set, then Tavily, Serper, Brave. Tavily/Serper have free no-card tiers. When none is set, evidence comes from GDELT, Google News, and RSS. |
| NEWSAPI_KEY | unset | Enables NewsAPI as an evidence source. |
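To make the in-memory fallback concrete, here is a minimal sketch of a bounded cache with per-entry expiry, in the spirit of LOCAL_CACHE_MAX and the *_CACHE_TTL settings. `LocalTTLCache` is a hypothetical class, not the server's implementation, and the oldest-first eviction policy is an assumption.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


class LocalTTLCache:
    """Bounded in-memory cache with per-entry expiry (illustrative only)."""

    def __init__(self, max_entries: int = 1024, clock: Callable[[], float] = time.monotonic):
        self._max_entries = max_entries
        self._clock = clock
        self._entries: "OrderedDict[str, tuple[float, Any]]" = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        item = self._entries.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self._clock() >= expires_at:
            # Expired: drop the entry and report a miss.
            del self._entries[key]
            return None
        return value

    def set(self, key: str, value: Any, ttl: float) -> None:
        if ttl <= 0:
            # Mirrors "0 disables" from the PREDICT_CACHE_TTL description.
            return
        self._entries[key] = (self._clock() + ttl, value)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_entries:
            # Assumption: evict the oldest-written entry first.
            self._entries.popitem(last=False)
```

Because this state lives in one process, each Cloud Run instance would hold its own copy, which is why the text recommends REDIS_URL once more than one instance is running.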
GET /track-record serves the public forecast track record. The heavy tick loop
does not run on Cloud Run: .github/workflows/track-record-tick.yml runs hourly
on GitHub Actions, updates data/track_record_store.json as the source-of-truth
entity store, writes the public aggregate to static/track_record_live.json, and
commits both files back to main. At runtime, Cloud Run fetches the committed
aggregate from raw GitHub, falling back to the bundled file and then the static
backtest in static/track_record.json.
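The runtime fallback order described above (committed aggregate, then bundled file, then static backtest) can be sketched as an ordered list of loaders. `first_available` and `read_json` are hypothetical helpers, and treating an empty payload as "unavailable" is an assumption.

```python
import json
from typing import Callable, Iterable, Optional


def read_json(path: str) -> dict:
    """Load a JSON file; raises OSError or ValueError on failure."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def first_available(
    loaders: Iterable[tuple[str, Callable[[], dict]]],
) -> tuple[Optional[str], Optional[dict]]:
    """Try each loader in order and return (name, data) from the first that works."""
    for name, load in loaders:
        try:
            data = load()
        except (OSError, ValueError):
            # File or network errors and invalid JSON both mean: try the next source.
            continue
        if data:
            return name, data
    return None, None
```

A caller would pass something like `[("remote", fetch_remote), ("bundled", lambda: read_json("static/track_record_live.json")), ("backtest", lambda: read_json("static/track_record.json"))]`, where `fetch_remote` is a placeholder for whatever HTTP call retrieves the committed aggregate.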
The Action calls /predict once per newly snapshotted market. If /predict is
protected, set the GitHub secret PREDICT_API_KEY; no TRACK_RECORD_TOKEN or
server-side /track-record/tick endpoint is required.
Raise the Cloud Run throughput ceiling (no idle cost while min-instances=0):
gcloud run services update analyzing-llm-rationale --region us-central1 \
--max-instances 20 --concurrency 40 --memory 1Gi
Once max-instances > 1, provision Memorystore for Redis (billable) and set
REDIS_URL so rate limiting and caching stay correct across instances:
gcloud services enable redis.googleapis.com vpcaccess.googleapis.com compute.googleapis.com
gcloud redis instances create foresea-cache --size=1 --region=us-central1 --tier=basic
gcloud compute networks vpc-access connectors create foresea-vpc \
--region=us-central1 --range=10.8.0.0/28
gcloud run services update analyzing-llm-rationale --region us-central1 \
--vpc-connector foresea-vpc \
--update-env-vars REDIS_URL=redis://<instance-host>:6379
The public Cloud Run API is the easiest integration target. It accepts forecasting questions and returns a typed forecast, model rationale, and optional evidence articles. It is built for resolvable forecasts, not general Q&A.
- GET /health: service health check.
- GET /track-record: public live track record, falling back to the static backtest.
- GET /track-record/digest: shareable markdown summary of the live track record.
- POST /predict: public prediction endpoint.
- GET /markets/polymarket: fetch a live Polymarket quote (see below).
- GET /markets/kalshi: fetch a live Kalshi quote (see below).
- POST /agent/analyze: orchestrated end-to-end analysis of a live question (see below).
- GET /agent/scan: scan a venue for mispriced markets, ranked by edge (see below).
- GET /trading/accounts: authenticated trading-readiness status, no secrets returned.
- POST /trading/preview: authenticated dry-run order normalization.
- POST /trading/orders: authenticated live order submission with explicit confirmation.

POST /agent/analyze runs the whole pipeline autonomously: resolve the market
(fetch a live Polymarket/Kalshi price when an identifier is given) → gather
evidence + forecast → price the edge → run any custom skills →
recommend. It returns one structured report.
curl -X POST https://foresea.ink/agent/analyze \
-H "Content-Type: application/json" \
-d '{
"platform": "polymarket",
"slug": "will-the-fed-cut-rates-in-2026",
"skills": [
{"name": "Base rate check", "instruction": "Compare to historical base rates."},
{"name": "Risk", "instruction": "What would most change this forecast?"}
]
}'
Custom skills are your own analysis steps — each runs as an extra model pass
over the question, forecast, and evidence, and comes back as a named section in
the report. Provide a question directly, or a platform + market identifier
(slug/market_id for Polymarket, ticker for Kalshi). Pass history (prior
turns) for multi-turn follow-ups — with history, short follow-ups like "why?" or
"what about June?" are answered in context. BYOK fields (openrouter_api_key,
openrouter_model, provider_base_url) apply here too.
The report includes recommendation (buy_yes/buy_no/hold/no_market_price),
edge, model_probability, market_probability, thesis, evidence_sources,
and pipeline (the ordered steps that ran).
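A client may want to assemble the /agent/analyze body according to the rules above (a question, or a platform plus the matching market identifier). The helper below is a hypothetical client-side sketch. The server performs its own validation, and the shape of `history` entries is not specified here, so it is passed through untouched.

```python
from typing import Optional


def build_analyze_payload(
    question: Optional[str] = None,
    platform: Optional[str] = None,
    slug: Optional[str] = None,
    market_id: Optional[str] = None,
    ticker: Optional[str] = None,
    skills: Optional[list[dict]] = None,
    history: Optional[list[dict]] = None,
) -> dict:
    """Build a request body: a question, or a platform plus market identifier."""
    payload: dict = {}
    if question:
        payload["question"] = question
    if platform:
        venue = platform.lower()
        if venue == "polymarket":
            identifier = {"slug": slug} if slug else {"market_id": market_id}
        elif venue == "kalshi":
            identifier = {"ticker": ticker}
        else:
            raise ValueError(f"unsupported platform: {platform}")
        identifier = {key: value for key, value in identifier.items() if value}
        if identifier:
            payload["platform"] = venue
            payload.update(identifier)
    if not payload:
        raise ValueError("provide a question, or a platform plus a market identifier")
    if skills:
        payload["skills"] = skills
    if history:
        payload["history"] = history  # entry shape is not documented here
    return payload
```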
GET /agent/scan lists live markets on a venue, forecasts each, and returns the
ones whose model-vs-market gap clears min_edge, ranked by |edge|.
curl "https://foresea.ink/agent/scan?platform=polymarket&limit=4&min_edge=0.1"
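The filter-and-rank step behind the scan can be approximated as follows. This is a sketch under assumptions: `rank_opportunities` is a hypothetical name, "clears min_edge" is read as `abs(edge) >= min_edge`, and markets without a price are skipped.

```python
def rank_opportunities(markets: list[dict], min_edge: float = 0.1) -> list[dict]:
    """Keep markets whose |edge| clears min_edge, largest gap first."""
    scored = []
    for market in markets:
        model_p = market.get("model_probability")
        market_p = market.get("market_probability")
        if model_p is None or market_p is None:
            # Assumption: unpriced markets have no edge to rank.
            continue
        edge = round(model_p - market_p, 4)
        if abs(edge) >= min_edge:
            scored.append({**market, "edge": edge})
    return sorted(scored, key=lambda item: abs(item["edge"]), reverse=True)
```

Ranking by absolute value means a market the model rates far below its price ranks alongside one it rates far above; the sign of `edge` still says which side the model favors.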
Params: platform (polymarket or kalshi), limit (markets to analyse, max 8),
min_edge (default 0.1), evidence_top_k. Each market runs a full forecast, so
it's bounded by limit and the result is cached briefly. Response: {platform, scanned, opportunities: [{question, market_url, market_probability, model_probability, edge, recommendation}]}. In the web app, the desk's
"⚡ Scan Polymarket for mispriced markets" button calls this.
Foresea exposes a public remote MCP server at:
https://foresea.ink/mcp/
It is advertised for discovery at:
https://foresea.ink/.well-known/mcp/server.json
The remote MCP server is a thin tool layer over the public API. It exposes:
- foresea_forecast: calls POST /predict.
- foresea_analyze_market: calls POST /agent/analyze.
- foresea_scan_markets: calls GET /agent/scan.
- foresea_track_record: calls GET /track-record.
- foresea_edge_board: calls GET /edge-board: live model-vs-market disagreements ranked, each tagged with the resolved track record of gaps that size (by_edge calibration + lead_lag).
- Resources: foresea://track-record and foresea://openapi.json.

Use https://foresea.ink/mcp/ directly in MCP clients that support remote
Streamable HTTP servers. For clients that still require a local stdio command,
run the wrapper locally.
The repo targets Python 3.10+ because the official MCP Python SDK requires it.
To create a repo-local Python 3.11 MCP environment with uv:
uv venv --python 3.11 .venv-mcp
uv pip install --python .venv-mcp/bin/python --no-deps -e .
uv pip install --python .venv-mcp/bin/python "mcp>=1.27.1" requests pyyaml pip
source .venv-mcp/bin/activate
analyze-llm-rationale mcp-server
That lightweight install avoids pulling the full inference dependency stack
(notably Torch/CUDA) when all you need is the MCP wrapper. In a full development
environment, pip install -e ".[mcp]" is also valid.
MCP client config example:
{
"mcpServers": {
"foresea": {
"url": "https://foresea.ink/mcp/"
}
}
}
For a local HTTP MCP endpoint:
.venv-mcp/bin/analyze-llm-rationale mcp-server \
--transport streamable-http \
--host 127.0.0.1 \
--port 8787
Connect MCP clients to http://127.0.0.1:8787/mcp. If a private deployment
requires auth, set FORESEA_API_KEY or pass --api-key; the wrapper forwards it
as X-API-Key.
Quick verification:
.venv-mcp/bin/python - <<'PY'
import importlib.metadata as md
from analyzing_llm_rationale.mcp_server import create_mcp_server
print(md.version("mcp"))
print(create_mcp_server().name)
PY
Pull the current market-implied probability straight from a venue, then feed it
into /predict as market_probability to compute an edge.
# Polymarket — by market slug (or ?id=<numeric id>)
curl "https://foresea.ink/markets/polymarket?slug=will-the-fed-cut-rates-in-2026"
# Kalshi — by market ticker
curl "https://foresea.ink/markets/kalshi?ticker=KXFED-26SEP-C"
Both return a normalised quote:
{
"platform": "Polymarket",
"question": "Will the Fed cut rates in 2026?",
"market_url": "https://polymarket.com/market/...",
"outcome": "Yes",
"probability": 0.54,
"outcomes": [
{"label": "Yes", "probability": 0.54},
{"label": "No", "probability": 0.46}
]
}
probability is null for unpriced/illiquid markets. Quotes are cached briefly
(MARKET_CACHE_TTL, default 30s).
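The hand-off from a quote to /predict can be sketched as a small mapping. `quote_to_predict_fields` is a hypothetical helper; the field names come from the quote and request formats in this README, and returning `None` for an unpriced market is an assumption about how a client might handle it.

```python
from typing import Optional


def quote_to_predict_fields(quote: dict) -> Optional[dict]:
    """Map a normalised quote to the market_* fields accepted by /predict."""
    probability = quote.get("probability")
    if probability is None:
        # Unpriced or illiquid market: there is no price to compare against.
        return None
    return {
        "question": quote["question"],
        "market_platform": quote["platform"],
        "market_url": quote.get("market_url"),
        "market_outcome": quote.get("outcome", "Yes"),
        "market_probability": probability,
    }
```

The returned dictionary can be merged into a /predict payload so that the response includes market_analysis.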
Foresea can submit guarded prediction-market orders, but live execution is
disabled by default. Keep this separate from /agent/analyze: the agent can
recommend buy_yes/buy_no, but order submission requires a signed-in user,
server-side exchange credentials, FORESEA_ENABLE_TRADING=true, execute=true,
and the exact confirmation phrase PLACE REAL ORDER.
Credentials are read only from the server environment, so use Cloud Run Secret Manager mounts or environment secrets. Do not collect private keys in the browser or store exchange secrets in Datastore.
# Global guardrails
export FORESEA_ENABLE_TRADING=false # must be true for live orders
export FORESEA_MAX_ORDER_NOTIONAL=50 # local cap per order, USD
export FORESEA_ALLOW_MARKET_ORDERS=false # separate gate for IOC/FOK-style orders
# Kalshi authenticated REST (RSA-PSS signing)
export KALSHI_API_KEY_ID=<kalshi-key-id>
export KALSHI_PRIVATE_KEY_FILE=/secrets/kalshi-private-key.pem
export KALSHI_BASE_URL=https://external-api.kalshi.com/trade-api/v2
# Polymarket CLOB SDK
export POLYMARKET_PRIVATE_KEY=<wallet-private-key>
export POLYMARKET_API_KEY=<clob-api-key>
export POLYMARKET_API_SECRET=<clob-api-secret>
export POLYMARKET_API_PASSPHRASE=<clob-api-passphrase>
export POLYMARKET_FUNDER_ADDRESS=<optional-funder-address>
export POLYMARKET_SIGNATURE_TYPE=<optional-signature-type>
Install the optional SDKs in production with:
pip install -e ".[serve,trading]"
The Docker image installs trading, so Cloud Run only needs secrets/env vars.
Check configured venues:
curl https://foresea.ink/trading/accounts \
-H "Authorization: Bearer $FORESEA_SESSION"
Preview a Kalshi order without execution:
curl -X POST https://foresea.ink/trading/preview \
-H "Authorization: Bearer $FORESEA_SESSION" \
-H "Content-Type: application/json" \
-d '{
"platform": "kalshi",
"ticker": "KXFED-26SEP-C",
"action": "buy",
"outcome": "yes",
"price": 0.42,
"quantity": 1
}'
Submit a live order only after reviewing the preview:
curl -X POST https://foresea.ink/trading/orders \
-H "Authorization: Bearer $FORESEA_SESSION" \
-H "Content-Type: application/json" \
-d '{
"platform": "kalshi",
"ticker": "KXFED-26SEP-C",
"action": "buy",
"outcome": "yes",
"price": 0.42,
"quantity": 1,
"execute": true,
"confirmation": "PLACE REAL ORDER"
}'
For Polymarket, pass the CLOB token_id for the exact outcome, or pass
slug/market_id plus outcome and Foresea will resolve the token id from the
public market record. Limit orders use quantity as shares. Market-buy orders
use max_cost as USD spend when supplied and remain blocked unless
FORESEA_ALLOW_MARKET_ORDERS=true.
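The live-order gates described in this section can be summarised as a checklist. The sketch below is illustrative only: `live_order_blockers` is a hypothetical function, the notional is assumed to be `price * quantity` for a limit order, and the default cap of 50 simply mirrors the example export rather than a documented default.

```python
CONFIRMATION_PHRASE = "PLACE REAL ORDER"


def env_flag(env: dict, name: str) -> bool:
    """Read a true/false environment value; anything but 'true' counts as off."""
    return env.get(name, "false").strip().lower() == "true"


def live_order_blockers(order: dict, env: dict, signed_in: bool, has_credentials: bool) -> list[str]:
    """Return the reasons a live order would be refused; empty means all gates pass."""
    blockers = []
    if not signed_in:
        blockers.append("user is not signed in")
    if not has_credentials:
        blockers.append("exchange credentials are not configured")
    if not env_flag(env, "FORESEA_ENABLE_TRADING"):
        blockers.append("FORESEA_ENABLE_TRADING is not true")
    if order.get("execute") is not True:
        blockers.append("execute is not true")
    if order.get("confirmation") != CONFIRMATION_PHRASE:
        blockers.append("confirmation phrase does not match")
    # Assumption: notional for a limit order is price times share quantity.
    cap = float(env.get("FORESEA_MAX_ORDER_NOTIONAL", "50"))
    if order["price"] * order["quantity"] > cap:
        blockers.append("order notional exceeds FORESEA_MAX_ORDER_NOTIONAL")
    return blockers
```

Returning every failed gate rather than stopping at the first makes it easier to show a user why a preview cannot yet become a live order.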
Required:
- question: forecasting question, such as "Will X happen by date Y?",
  "Who will win X?", "What will X be?", or "When will X happen?".

Optional:
- question_type: binary, multiple_choice, numeric, or date. If omitted,
  the model attempts to infer the type.
- options: answer choices for multiple_choice questions.
- description: extra context for the question.
- resolution_criteria: how the question should resolve or be measured.
- categories: list of topic labels.
- news_articles: caller-supplied evidence articles. If provided, automatic
  evidence retrieval is skipped.
- attach_evidence: defaults to true. When true and news_articles is empty,
  the API fetches current evidence from GDELT, Google News RSS, and Stooq.
- evidence_top_k: number of evidence articles to attach, capped by the server.
- market_platform: prediction market venue such as Polymarket, Kalshi,
  Manifold, or Metaculus.
- market_url: URL for the market being analyzed.
- market_outcome: outcome whose market price is supplied. Defaults to Yes
  for binary markets.
- market_probability: current market-implied probability for
  market_outcome. Use 0.42 or 42; the API normalizes percentages.
- variant: prompt variant. Defaults to variant0_neutral_baseline.
- created_time, publish_time, resolve_time, days_open: optional
  forecasting metadata.
- openrouter_api_key + openrouter_model: run the forecast on your own model
  instead of the server default (see "Bring your own model" below).
- provider_base_url: optional OpenAI-compatible /chat/completions endpoint to
  use with your key/model instead of OpenRouter. Must be public HTTPS.

By default /predict runs on the server's hosted model. To use your own:
- Via OpenRouter: pass openrouter_api_key and openrouter_model (e.g.
  openai/gpt-4o, anthropic/claude-sonnet-4-5). The request is proxied through
  OpenRouter.
- Via another OpenAI-compatible provider: pass provider_base_url (e.g.
  https://api.openai.com/v1/chat/completions) with the matching openrouter_model
  (here just the provider's model ID, e.g. gpt-4o) and your key.

For safety, provider_base_url must be public HTTPS; loopback, private,
link-local, and cloud-metadata hosts are rejected. In the web app, the sidebar's
"Use your own model" panel exposes the provider, endpoint, key, and model.
curl -X POST https://foresea.ink/predict \
-H "Content-Type: application/json" \
-d '{
"question": "Will X happen by 2027?",
"question_type": "binary",
"openrouter_api_key": "YOUR_KEY",
"openrouter_model": "gpt-4o",
"provider_base_url": "https://api.openai.com/v1/chat/completions"
}'
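The provider_base_url restriction (public HTTPS only, with loopback, private, link-local, and cloud-metadata hosts rejected) can be illustrated with the standard library. This is a simplified sketch, not the server's check: `is_public_https_url` and `BLOCKED_HOSTNAMES` are hypothetical, the hostname list is not exhaustive, and hostnames are not resolved, so a name that points at a private address would pass here.

```python
import ipaddress
from urllib.parse import urlparse

# Illustrative, not exhaustive.
BLOCKED_HOSTNAMES = {"localhost", "metadata.google.internal"}


def is_public_https_url(url: str) -> bool:
    """Return True when the URL is HTTPS and its literal host looks public."""
    parsed = urlparse(url)
    host = parsed.hostname
    if parsed.scheme != "https" or not host:
        return False
    if host.lower() in BLOCKED_HOSTNAMES:
        return False
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. A stricter check would resolve the name and
        # test every resolved address; this sketch stops at the literal host.
        return True
    # is_global is False for loopback, private, and link-local ranges,
    # which includes the 169.254.169.254 metadata address.
    return address.is_global
```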
curl -X POST https://foresea.ink/predict \
-H "Content-Type: application/json" \
-d '{
"question": "Will the Federal Reserve cut interest rates at least once before September 30, 2026?",
"question_type": "binary",
"market_platform": "Polymarket",
"market_probability": 42
}'
curl -X POST https://foresea.ink/predict \
-H "Content-Type: application/json" \
-d '{
"question": "Who will win the 2026 Formula 1 drivers championship?",
"question_type": "multiple_choice",
"options": ["Max Verstappen", "Lando Norris", "Charles Leclerc", "Lewis Hamilton", "Other"],
"attach_evidence": false
}'
curl -X POST https://foresea.ink/predict \
-H "Content-Type: application/json" \
-d '{
"question": "What will US CPI inflation be in December 2026?",
"question_type": "numeric",
"resolution_criteria": "Use the year-over-year CPI-U inflation rate for December 2026."
}'
curl -X POST https://foresea.ink/predict \
-H "Content-Type: application/json" \
-d '{
"question": "Will Company X report positive net income in Q4 2026?",
"description": "Resolve using the company earnings release.",
"resolution_criteria": "Yes if reported GAAP net income is positive.",
"attach_evidence": false,
"news_articles": [
{
"title": "Company X raises full-year guidance",
"source": "Example Business News",
"url": "https://example.com/company-x-guidance",
"publish_date": "2026-05-29",
"summary": "Company X raised revenue guidance and reported margin expansion."
}
]
}'
import requests

# Binary question with automatic evidence retrieval and a market price to compare against.
payload = {
    "question": "Will the Federal Reserve cut interest rates at least once before September 30, 2026?",
    "question_type": "binary",
    "attach_evidence": True,
    "evidence_top_k": 3,
    "market_platform": "Polymarket",
    "market_probability": 42,
}
# Evidence retrieval plus inference can be slow, hence the generous timeout.
response = requests.post(
    "https://foresea.ink/predict",
    json=payload,
    timeout=180,
)
response.raise_for_status()
prediction = response.json()
print(prediction["predicted_answer"], prediction["confidence"])
print(prediction["model_rationale"])
# market_analysis is present only when a market price was supplied.
if prediction.get("market_analysis"):
    print(prediction["market_analysis"]["summary"])
for source in prediction["evidence_sources"]:
    print(source["source"], source["url"])
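The client above prints confidence directly, which suits binary and multiple-choice forecasts. Numeric and date forecasts return confidence as null and carry p10, p50, and p90 in range_forecast, so a client may want to branch on question_type. The helper below is a hypothetical sketch of that branch; the output format is arbitrary.

```python
def summarize_forecast(prediction: dict) -> str:
    """Render a one-line summary that works for every question type."""
    kind = prediction.get("question_type")
    answer = prediction.get("predicted_answer")
    if kind in ("numeric", "date"):
        # confidence is null here; the spread lives in range_forecast.
        spread = prediction.get("range_forecast") or {}
        unit = spread.get("unit") or ""
        return (
            f"{answer}{unit} "
            f"(p10={spread.get('p10')}, p50={spread.get('p50')}, p90={spread.get('p90')})"
        )
    confidence = prediction.get("confidence")
    if confidence is None:
        return str(answer)
    return f"{answer} (confidence {confidence:.0%})"
```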
- question_type: detected or requested type: binary, multiple_choice,
  numeric, or date.
- predicted_answer: "Yes", "No", the top multiple-choice option, or the
  median numeric/date estimate.
- confidence: model confidence as a number from 0 to 1 for binary and
  multiple-choice forecasts; null for numeric/date forecasts.
- options: per-option probabilities for multiple-choice forecasts.
- range_forecast: p10, p50, p90, and optional unit for numeric/date
  forecasts.
- rationale: model-generated explanation.
- model_rationale: alias for the model-generated explanation, intended for API
  clients.
- evidence_sources: compact source list with article title, URL, publication
  date, and relevance score.
- evidence_articles: full evidence records attached to the prompt.
- evidence_error: retrieval error message, or null when evidence retrieval
  succeeds.
- market_analysis: optional comparison against a supplied market price:
  market_probability, model_probability, edge, stance, and a short
  summary. edge is model_probability - market_probability.

- src/analyzing_llm_rationale/: packaged inference, provider, validation, and CLI logic.
- configs/: model and rationale-variant definitions.
- prompts/: system prompt and the nine rationale-variant prompts.
- scripts/: evaluation, recovery, SHAP, plotting, and utility scripts.
- slurm/: HPC launchers for the variant/temperature sweeps.
- results/: model outputs and run metadata.
- analysis/: aggregate metric tables and rationale-analysis outputs.
- paper/: paper figures, Draw.io sources, PDFs, and qualitative case studies.
- tests/: unit tests for the package and metric parsing.

See ARTIFACT_MANIFEST.md for the submission checklist and file-level notes.
python -m venv .venv
source .venv/bin/activate
python -m pip install -e ".[dev,analysis]"
Use .[dev] for the core runner and tests only. Use .[analysis] when
regenerating plots or SHAP analyses.
PYTHONPATH=src python -m analyzing_llm_rationale validate-dataset
python -m unittest discover -s tests
ruff check src tests scripts/*.py
PYTHONPATH=src is useful when the repository has not been installed yet or an
older user-local install shadows the working tree.
Run the variant 3 pipeline with the packaged CLI:
analyze-llm-rationale run-batch --variant variant3_reasoning_type
For a remote OpenAI-compatible provider:
export PROVIDER_API_KEY=your_token
analyze-llm-rationale run-batch --variant variant3_reasoning_type --model llama-3.3-70b-instruct
If you do not want to install the package into the environment, invoke it directly:
PYTHONPATH=src python -m analyzing_llm_rationale run-batch --variant variant3_reasoning_type
Useful options:
- --variant variant6_step_by_step_reasoning: choose the prompt/output contract.
- --model qwen2.5-7b-instruct: choose a configured model definition.
- --temperature 0.7: control generation temperature and output directory.
- --max-records 10: process only a bounded number of records.
- --reprocess-nulls: rerun existing rows with predicted_answer = null.
- --drop-article-text: remove raw article text from prompts before inference.
- --device auto: select cuda when available, otherwise cpu.
- verify-results --variant ...: verify completeness, duplicates, malformed rows, and missing IDs.
- validate-dataset: validate the dataset schema before a run.

Foresea has a Karpathy-style autoresearch harness for prompt experiments: edit
one candidate prompt, run a fixed benchmark slice, score one metric, and append
an auditable experiment log. The research surface is
autoresearch/candidate_prompt.txt; agent instructions live in
autoresearch/program.md. The default --model gpt-oss-120b uses the
SCADS-hosted OpenAI-compatible endpoint from configs/models.yaml
(SCADS_AI_API_KEY or SCADS_AI_API_KEY.txt).
Run one candidate experiment:
PYTHONPATH=src python -m analyzing_llm_rationale autoresearch \
--model gpt-oss-120b \
--candidate-prompt-path autoresearch/candidate_prompt.txt \
--max-records 50 \
--metric brier_score
Compare against a baseline and promote only if the candidate improves:
PYTHONPATH=src python -m analyzing_llm_rationale autoresearch \
--model gpt-oss-120b \
--candidate-prompt-path autoresearch/candidate_prompt.txt \
--baseline-results-path results/GPT-OSS-120B/temperature_00/results_variant0_neutral_baseline.json \
--promote-to prompts/variant0_neutral_baseline.txt \
--max-records 50 \
--metric brier_score \
--min-delta 0.001
Each run writes analysis/autoresearch/runs/<run_id>/score.json and appends a
machine-readable row to analysis/autoresearch/experiments.jsonl.
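For orientation, the scoring and promotion decision might look like the sketch below. It uses the standard Brier score for binary outcomes (mean squared error between forecast probability and the 0/1 outcome, lower is better). The promotion rule shown, "improve on the baseline by at least min-delta", is an assumption about how --min-delta is applied, and both function names are hypothetical.

```python
def brier_score(probabilities: list[float], outcomes: list[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if not probabilities or len(probabilities) != len(outcomes):
        raise ValueError("need equally sized, non-empty inputs")
    total = sum((p - o) ** 2 for p, o in zip(probabilities, outcomes))
    return total / len(probabilities)


def should_promote(candidate_score: float, baseline_score: float, min_delta: float = 0.001) -> bool:
    """Promote only when the candidate's Brier score beats the baseline by min_delta.

    Assumption: lower is better, so improvement is baseline minus candidate.
    """
    return (baseline_score - candidate_score) >= min_delta
```

For example, forecasts of 0.9 and 0.2 against outcomes 1 and 0 give a Brier score of about 0.025.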
Validate an existing result file:
PYTHONPATH=src python -m analyzing_llm_rationale verify-results \
--model qwen2.5-7b-instruct \
--variant variant3_reasoning_type \
--temperature 0.0 \
--temperature-tag temperature_000
Regenerate aggregate metrics from results/:
python scripts/evaluate_metrics.py
Run the DuckDB SQL analytics suite over the real Metaculus-style dataset and saved model outputs:
python scripts/sql_analytics.py \
--db analysis/forecasting_analytics.duckdb \
--ingest --replace \
--output-dir analysis/sql_analytics
This writes a markdown report plus one CSV per query for 10 medium-level SQL problems: model accuracy, best variants, calibration bins, Brier score, consensus/disagreement cases, prompt lift over baseline, temperature sensitivity, overconfident errors, and category difficulty.
Run the LangChain-powered news retrieval wrapper:
PYTHONPATH=src analyze-llm-rationale fetch-and-rank \
--question "Will X happen by date Y?" \
--source gdelt \
--source google-news \
--source stooq \
--top-k 5
The news pipeline uses LangChain for a query-planning step, article
summarization, and embedding-based relevance ranking before inference. Evidence
sources are configurable with --source for the CLI and --evidence-source
when serving the API.
Run or schedule the Prefect DAG for RSS/news fetch, inference, and DuckDB logging:
# One question
python flows/forecasting_flow.py --question-id 124 --top-k 5
# Small batch from the dataset
python flows/forecasting_flow.py --limit 3 --top-k 5
# Daily scheduled deployment at 06:00 UTC
prefect server start
python flows/forecasting_flow.py --deploy --limit 3 --cron "0 6 * * *"
Regenerate paper figures after metrics are present:
python scripts/plot_model_variant_metric_heatmap.py
python scripts/plot_variant_delta_from_v0.py
python scripts/plot_temperature_frontier.py
python scripts/plot_frs_ablation_slopegraph.py
python scripts/plot_uncertainty_language_calibration_disconnect.py
python scripts/plot_shap_importance_attribute_gaps.py
Common runner and verification commands:
python scripts/run_variant.py --variant variant5_key_conditions
python scripts/run_variant.py --variant variant3_reasoning_type --temperature 0.7 --temperature-tag temperature_07
python scripts/run_variant.py --variant variant4_credibility --model llama-3.3-70b-instruct
python scripts/verify_results.py --variant variant3_reasoning_type
python download_qwen_model.py
python test_local_inference.py

Repo layout:
- scripts/: modular runner entrypoint
- slurm/: batch launchers

Auditability:
- run_metadata_<variant>.json next to the results file.

python -m unittest discover -s tests
ruff check src tests scripts/*.py
The included dataset is forecasting_qa_news_metaculus_2025-02-01_to_today.metaculus_frs_format.json.
Model access is configured in configs/models.yaml. Open-weight Qwen models run
locally through Hugging Face; hosted models use OpenAI-compatible endpoints and
require API keys through environment variables or local key files.
Never commit key files or tokens. Large local caches (.cache/, envs/, .venv/)
are intentionally ignored and excluded from source archives.
If this repository supports a publication, cite the artifact with the metadata in
CITATION.cff and cite the upstream datasets/models according to their licenses.