Server data from the Official MCP Registry
Clean raw HTML into LLM-ready text before agents spend tokens.
Valid MCP server (3 strong, 4 medium validity signals). No known CVEs in dependencies. Package registry verified. Imported from the Official MCP Registry.
4 files analyzed · 1 issue found
Security scores are indicators to help you make informed decisions, not guarantees. Always review permissions before connecting any MCP server.
Set these up before or after installing:
Environment variable: APIFY_TOKEN
Add this to your MCP configuration file:
{
"mcpServers": {
"io-github-larelabs-refinery-mcp": {
"env": {
"APIFY_TOKEN": "your-apify-token-here"
},
"args": [
"-y",
"@larelabs/refinery-mcp"
],
"command": "npx"
}
}
}
From the project's GitHub README.
Clean HTML before your agent burns tokens.
Refinery MCP wraps the Refinery Apify Actor as an MCP server so Claude, Cursor, and other agents can turn raw HTML or URLs into clean LLM-ready text plus word_count.

flowchart LR
A[Agent needs web context] --> B[Fetch URL or raw HTML]
B --> C[Refinery MCP]
C --> D[Refinery Apify Actor]
D --> E[Clean text + word_count]
E --> F[RAG / embeddings / LLM context]
Agents are getting good at fetching web pages. The problem is what they fetch:
<html>
<head>
<script>gtag("event", "page_view")</script>
<style>.nav,.cookie,.footer{display:block}</style>
</head>
<body>
<nav>Home · Pricing · Login · Docs · Blog · Careers</nav>
<aside>Subscribe to our newsletter</aside>
<article>
<h1>How ACME cut support ticket routing time by 63%</h1>
<p>ACME routes 40,000 monthly support tickets through an AI triage system.</p>
<p>The team reduced retrieval noise by cleaning HTML before chunking.</p>
</article>
<footer>Legal · Privacy · Cookie settings · LinkedIn · X</footer>
</body>
</html>
The model does not need most of that. It needs this:
How ACME cut support ticket routing time by 63%
ACME routes 40,000 monthly support tickets through an AI triage system.
The team reduced retrieval noise by cleaning HTML before chunking.

Refinery MCP gives your agent a tool for that middle step:
fetch page -> refine HTML -> send clean text to RAG / embeddings / LLM
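To make the "refine HTML" step concrete, here is a minimal local sketch of the idea using only Python's standard library. It is not Refinery's implementation: the tag lists (`SKIP_TAGS`, `BLOCK_TAGS`) and the whitespace-based word count are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Tags whose contents are dropped entirely. Illustrative assumption,
# not Refinery's actual rule set.
SKIP_TAGS = {"script", "style", "nav", "aside", "footer"}
# Tags that end a block of text and start a new output line.
BLOCK_TAGS = {"h1", "h2", "h3", "p", "li"}


class TextExtractor(HTMLParser):
    """Collects visible text, skipping boilerplate containers."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a skipped tag
        self.blocks = []      # finished lines of clean text
        self.current = []     # text fragments of the block in progress

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.current.append(data)

    def flush(self):
        # Collapse runs of whitespace and keep only non-empty blocks.
        text = " ".join("".join(self.current).split())
        if text:
            self.blocks.append(text)
        self.current = []


def clean_html(html):
    """Return clean text plus a whitespace-based word count."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # pick up any trailing text outside block tags
    text = "\n".join(parser.blocks)
    return {"text": text, "word_count": len(text.split())}
```

Run against the ACME page above, a pass like this keeps the headline and the two paragraphs and drops the script, style, nav, aside, and footer content. A production cleaner has to handle far messier markup, which is the part Refinery is meant to take off your hands.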
Agents can fetch pages, but raw HTML is noisy and expensive.
Refinery is the middle step your agent can call before it stuffs web context into a prompt:
fetch/render -> clean/refine -> chunk/embed/answer
It is not a crawler. Use Firecrawl, Crawl4AI, Playwright, browser automation, or your own fetcher when you need rendering. Use Refinery when you already have a URL or raw HTML and want a cheap cleanup pass before the LLM.
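The last stage of that pipeline, chunking the cleaned text before embedding, might look like the sketch below. It is not part of Refinery MCP: `chunk_words`, its word-based sizing, and the default values are hypothetical choices for illustration.

```python
def chunk_words(text, max_words=200, overlap=20):
    """Split clean text into overlapping chunks of at most max_words words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once this chunk has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

Because the input is already clean text, chunk boundaries fall on real content rather than on navigation links or inline scripts.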
Use Refinery MCP when:
word_count / token-ish savings before embedding
Do not use it as your browser renderer, anti-bot layer, or site crawler.
clean_url
Fetches a URL through the Refinery Apify Actor and returns dataset rows with clean text and metadata.
Example input:
{
"url": "https://docs.stripe.com/payments",
"removeScripts": true,
"removeStyles": true
}
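MCP clients normally build the wire message for you. For reference, a tool call is a JSON-RPC 2.0 request with the method `tools/call`, and the example input above travels as its `arguments`. The sketch below builds that request; the helper name `build_tool_call` and the request id counter are hypothetical.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC ids only need to be unique per session


def build_tool_call(name, arguments):
    """Wrap a tool name and its input in an MCP tools/call request."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


request = build_tool_call(
    "clean_url",
    {
        "url": "https://docs.stripe.com/payments",
        "removeScripts": True,
        "removeStyles": True,
    },
)
print(json.dumps(request, indent=2))
```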
clean_html
Cleans raw HTML your agent, crawler, or browser session already fetched.
Example input:
{
"html": "<html><body><nav>Home Pricing Login</nav><article><h1>Vendor security update</h1><p>We now support SOC 2 exports for enterprise accounts.</p></article><footer>Legal Privacy Careers</footer></body></html>",
"extractMentions": false,
"extractHashtags": false
}
Example result:
{
"text": "Vendor security update\n\nWe now support SOC 2 exports for enterprise accounts.",
"word_count": 10,
"content_type": "web",
"language": "en",
"processing_time_ms": 44.96,
"success": true
}
estimate_savings
Local helper that compares raw HTML vs cleaned text and estimates token savings. This does not call Apify.
Example output:
{
"raw_chars": 168,
"clean_chars": 41,
"estimated_raw_tokens": 42,
"estimated_clean_tokens": 11,
"estimated_token_savings": 31,
"reduction_pct": 76
}
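The sample figures are consistent with a rule of roughly four characters per token, rounded up, with `reduction_pct` computed on character counts (127 of 168 characters removed, about 76%). The sketch below assumes that rule. The real helper may use a different estimator, and the function signature here is hypothetical.

```python
import math

CHARS_PER_TOKEN = 4  # assumption: a common rough heuristic, not a tokenizer


def estimate_savings(raw_html, clean_text):
    """Estimate token savings from character counts alone."""
    raw_chars = len(raw_html)
    clean_chars = len(clean_text)
    raw_tokens = math.ceil(raw_chars / CHARS_PER_TOKEN)
    clean_tokens = math.ceil(clean_chars / CHARS_PER_TOKEN)
    # Percentage of characters removed; 0 when there is no input.
    reduction = round(100 * (raw_chars - clean_chars) / raw_chars) if raw_chars else 0
    return {
        "raw_chars": raw_chars,
        "clean_chars": clean_chars,
        "estimated_raw_tokens": raw_tokens,
        "estimated_clean_tokens": clean_tokens,
        "estimated_token_savings": raw_tokens - clean_tokens,
        "reduction_pct": reduction,
    }
```

A character-based estimate is cheap and needs no tokenizer, which fits a helper that runs locally. Treat the numbers as a rough guide rather than a billing figure.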
npx -y @larelabs/refinery-mcp
Set your Apify token:
export APIFY_TOKEN=apify_api_xxx
export REFINERY_ACTOR_ID=larelabs/refinery-html-to-llm-cleaner
Use the published package:
{
"mcpServers": {
"refinery": {
"command": "npx",
"args": ["-y", "@larelabs/refinery-mcp"],
"env": {
"APIFY_TOKEN": "apify_api_xxx",
"REFINERY_ACTOR_ID": "larelabs/refinery-html-to-llm-cleaner"
}
}
}
}
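If your client already has other servers configured, add the `refinery` entry alongside them rather than overwriting the file. A minimal sketch of that merge follows, operating on an already-parsed config. The helper name `add_refinery_server` is hypothetical, and where the config file lives depends on your client.

```python
import copy


def add_refinery_server(config, apify_token,
                        actor_id="larelabs/refinery-html-to-llm-cleaner"):
    """Return a copy of an MCP config with the refinery server added."""
    merged = copy.deepcopy(config)  # leave the caller's dict untouched
    servers = merged.setdefault("mcpServers", {})
    servers["refinery"] = {
        "command": "npx",
        "args": ["-y", "@larelabs/refinery-mcp"],
        "env": {
            "APIFY_TOKEN": apify_token,
            "REFINERY_ACTOR_ID": actor_id,
        },
    }
    return merged
```

Load the existing file with `json.load`, pass the result through this function, and write it back with `json.dump`. Keep the real token out of version control.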
Or run from source during development:
git clone https://github.com/LareLabs/refinery-mcp
cd refinery-mcp
npm install
npm run build
{
"mcpServers": {
"refinery": {
"command": "npm",
"args": ["run", "dev", "--prefix", "/absolute/path/to/refinery-mcp"],
"env": {
"APIFY_TOKEN": "apify_api_xxx"
}
}
}
}
npm run build
APIFY_TOKEN=apify_api_xxx npm run smoke
The smoke test starts the MCP server over stdio, lists tools, and calls estimate_savings without spending Apify credits.
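For a sense of what "over stdio" involves, the sketch below builds the first messages a client would write to the server's stdin, assuming the standard MCP stdio transport (one JSON-RPC message per line). It is not the project's smoke script: the client name and the protocol version string are assumptions, and a real client waits for the `initialize` response before sending the rest.

```python
import json


def frame(message):
    """Encode one JSON-RPC message as a single newline-terminated line."""
    # Compact json.dumps output contains no raw newlines, so each
    # message occupies exactly one line on the wire.
    return (json.dumps(message, separators=(",", ":")) + "\n").encode("utf-8")


def handshake_frames(client_name="smoke-client", protocol_version="2024-11-05"):
    """Build initialize, initialized, and tools/list frames in order."""
    initialize = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "initialize",
        "params": {
            "protocolVersion": protocol_version,  # assumed version string
            "capabilities": {},
            "clientInfo": {"name": client_name, "version": "0.0.0"},
        },
    }
    # Notifications carry no id and expect no response.
    initialized = {"jsonrpc": "2.0", "method": "notifications/initialized"}
    list_tools = {"jsonrpc": "2.0", "id": 2, "method": "tools/list"}
    return [frame(m) for m in (initialize, initialized, list_tools)]
```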
Use Refinery MCP to clean this docs page before summarizing it:
https://docs.stripe.com/payments
Return the clean text, word_count, and a short summary. Do not summarize raw HTML.
Another useful prompt:
I fetched this page HTML with Playwright. Use Refinery MCP clean_html before adding it to my RAG ingestion queue. Return the cleaned text and estimated token savings.
License: MIT