MCP Marketplace

Docpick MCP Server

by QuartzUnit
Developer Tools · Moderate 6.5 · MCP Registry · Local
Free

Server data from the Official MCP Registry

About

Schema-driven document extraction with local OCR + LLM. Document in, Structured JSON out.

Security Report

6.5 (Moderate Risk)

Valid MCP server (1 strong, 3 medium validity signals). 3 known CVEs in dependencies (1 critical, 1 high severity). Package registry verified. Imported from the Official MCP Registry.

7 files analyzed · 4 issues found

Security scores are indicators to help you make informed decisions, not guarantees. Always review permissions before connecting any MCP server.

Permissions Required

This plugin requests these system permissions. Most are normal for its category.

file_system

Check that this permission is expected for this type of plugin.

How to Install

Add this to your MCP configuration file:

{
  "mcpServers": {
    "io-github-arknill-docpick": {
      "args": [
        "docpick"
      ],
      "command": "uvx"
    }
  }
}
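
If your MCP configuration file already lists other servers, the entry above needs to be merged in rather than pasted over the file. A minimal sketch of that merge, assuming the configuration is a JSON object with a top-level `mcpServers` key as shown above; `add_docpick_server` is a hypothetical helper, not part of any MCP client.

```python
import copy
import json

# Entry copied from the listing above.
DOCPICK_ENTRY = {"command": "uvx", "args": ["docpick"]}


def add_docpick_server(config: dict, name: str = "io-github-arknill-docpick") -> dict:
    """Return a copy of `config` with the docpick entry added under mcpServers."""
    merged = copy.deepcopy(config)
    servers = merged.setdefault("mcpServers", {})
    servers[name] = copy.deepcopy(DOCPICK_ENTRY)
    return merged


if __name__ == "__main__":
    # Hypothetical existing configuration with one unrelated server.
    existing = {"mcpServers": {"other-server": {"command": "npx", "args": ["other"]}}}
    print(json.dumps(add_docpick_server(existing), indent=2))
```

The input dictionary is left untouched, so the merged result can be inspected before it is written back to disk.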

Documentation

From the project's GitHub README.

Docpick

Korean documentation · llms.txt

Document in, Structured JSON out. Locally. With your schema.

docpick is a lightweight, schema-driven document extraction pipeline that combines local OCR engines with local LLMs to extract structured JSON from any document — invoices, receipts, bills of lading, tax forms, and more.

  • Zero cloud dependency — runs entirely on your machine (CPU or GPU)
  • Custom schemas — define your own Pydantic models or use 8 built-in document schemas
  • Validation built-in — checkdigit verification, cross-field rules, cross-document consistency
  • Apache 2.0 — no GPL/AGPL dependencies

Install

pip install docpick              # core (LLM extraction only)
pip install "docpick[paddle]"    # + PaddleOCR (recommended)
pip install "docpick[easyocr]"   # + EasyOCR (Korean-optimized)
pip install "docpick[got]"       # + GOT-OCR2.0 (GPU, vision-language)
pip install "docpick[all]"       # all OCR backends (quotes keep shells like zsh from expanding the brackets)

Requirements: Python 3.11+ / LLM endpoint (vLLM, Ollama, or OpenAI-compatible)

Quick Start

Python API

from docpick import DocpickPipeline
from docpick.schemas import InvoiceSchema

pipeline = DocpickPipeline()
result = pipeline.extract("invoice.pdf", schema=InvoiceSchema)

print(result.data)           # Structured dict matching schema
print(result.validation)     # Validation errors/warnings
print(result.confidence)     # Per-field confidence scores
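
One common use of per-field confidence is to route uncertain fields to manual review. The sketch below assumes `result.confidence` is a flat mapping from field name to a score between 0 and 1; the README does not specify its shape, and `fields_needing_review` is a hypothetical helper rather than part of docpick.

```python
def fields_needing_review(confidence: dict[str, float], threshold: float = 0.8) -> list[str]:
    """Return field names scoring below `threshold`, lowest score first."""
    low = [(score, name) for name, score in confidence.items() if score < threshold]
    return [name for _score, name in sorted(low)]


if __name__ == "__main__":
    # Stand-in for result.confidence; real values would come from the pipeline.
    sample = {"invoice_number": 0.98, "total_amount": 0.62, "due_date": 0.75}
    print(fields_needing_review(sample))
```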

CLI

# Extract structured data
docpick extract invoice.pdf --schema invoice --output result.json

# OCR only (no LLM)
docpick ocr document.png --lang ko,en

# Validate extracted JSON
docpick validate result.json --schema invoice

# Batch process a directory
docpick batch ./documents/ --schema invoice --output ./results/ --concurrency 4

# List available schemas
docpick schemas list

# Show schema details
docpick schemas show invoice

Built-in Schemas

| Schema | Document Type | Key Validations |
|---|---|---|
| invoice | Commercial invoices | Line item sums, tax ID checkdigit, date order |
| receipt | Retail/restaurant receipts | Total = subtotal + tax + tip |
| bill_of_lading | Ocean/air B/L | Container weight sums, ISO 6346, HS code format |
| purchase_order | Purchase orders | PO total = line items, delivery date order |
| kr_tax_invoice | Korean e-tax invoice (세금계산서) | Business number checkdigit (x2), supply/tax/total sums |
| bank_statement | Bank statements | IBAN mod97, period date order |
| id_document | Passport/ID (ICAO 9303) | MRZ, ISO 3166 country codes, date ranges |
| certificate_of_origin | Certificate of Origin | ISO 3166 alpha-2 country codes |
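
To illustrate the kind of check-digit rule listed for `kr_tax_invoice`, here is a standalone sketch of the commonly documented check for 10-digit Korean business registration numbers. It is not docpick's own implementation, the function name is hypothetical, and the sample number in the usage lines is synthetic.

```python
def is_valid_kr_business_number(value: str) -> bool:
    """Verify the check digit of a 10-digit Korean business registration number."""
    cleaned = value.replace("-", "")
    if len(cleaned) != 10 or not (cleaned.isascii() and cleaned.isdigit()):
        return False
    digits = [int(ch) for ch in cleaned]
    weights = [1, 3, 7, 1, 3, 7, 1, 3, 5]
    total = sum(d * w for d, w in zip(digits, weights))
    # The ninth digit's product (digit * 5) also contributes its tens place.
    total += (digits[8] * 5) // 10
    return (10 - total % 10) % 10 == digits[9]


if __name__ == "__main__":
    # Synthetic number constructed to satisfy the check, not a real registration.
    print(is_valid_kr_business_number("123-45-67891"))
    print(is_valid_kr_business_number("123-45-67890"))
```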

Custom Schemas

Define your own schema with Pydantic:

from pydantic import BaseModel
from docpick import DocpickPipeline
from docpick.validation.rules import SumEqualsRule, RequiredFieldRule

class MyDocument(BaseModel):
    """Custom document schema."""
    company_name: str | None = None
    total_amount: float | None = None
    tax_amount: float | None = None
    net_amount: float | None = None
    items: list[dict] | None = None

    class ValidationRules:
        rules = [
            RequiredFieldRule("company_name"),
            SumEqualsRule(["net_amount", "tax_amount"], "total_amount"),
        ]

pipeline = DocpickPipeline()
result = pipeline.extract("my_document.pdf", schema=MyDocument)

Or use a JSON Schema file:

docpick extract document.pdf --schema my_schema.json
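
For reference, a JSON Schema roughly equivalent to the `MyDocument` model above could look like the dictionary below. This is a sketch using standard JSON Schema keywords; the README does not show which keywords or drafts docpick accepts, so treat the layout as an assumption.

```python
import json

# Nullable types mirror the `str | None` and `float | None` annotations above.
MY_SCHEMA = {
    "title": "MyDocument",
    "type": "object",
    "properties": {
        "company_name": {"type": ["string", "null"]},
        "total_amount": {"type": ["number", "null"]},
        "tax_amount": {"type": ["number", "null"]},
        "net_amount": {"type": ["number", "null"]},
        "items": {"type": ["array", "null"], "items": {"type": "object"}},
    },
}


def schema_json(schema: dict = MY_SCHEMA) -> str:
    """Serialize the schema; the returned text can be saved as my_schema.json."""
    return json.dumps(schema, indent=2)


if __name__ == "__main__":
    print(schema_json())
```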

Validation

Check Digit Algorithms

| Algorithm | Use Case |
|---|---|
| kr_business_number | Korean business registration number (10 digits) |
| luhn | Credit card numbers |
| iso_6346 | Shipping container numbers |
| iban_mod97 | International bank account numbers |
| awb_mod7 | Air waybill numbers |
| mrz | Machine Readable Zone (passport/ID) |

Cross-Field Rules

| Rule | Description |
|---|---|
| SumEqualsRule | Sum of fields equals target (with tolerance) |
| DateBeforeRule | Date A must precede Date B |
| RequiredFieldRule | Field must be non-null and non-empty |
| FieldEqualsRule | Two fields must be equal |
| RangeRule | Numeric field within min/max bounds |
| RegexRule | Field matches regex pattern |
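
The semantics of two of these rules can be sketched as plain functions. This is an illustration of the behavior described in the table, not the `docpick.validation.rules` implementation; the default tolerance of 0.01, the ISO 8601 date format, and the choice to fail on missing fields are assumptions.

```python
from datetime import date


def sum_equals(data: dict, parts: list[str], target: str, tolerance: float = 0.01) -> bool:
    """True when the named parts add up to the target, within `tolerance`."""
    values = [data.get(name) for name in parts]
    expected = data.get(target)
    if expected is None or any(value is None for value in values):
        return False  # a missing field cannot satisfy the rule
    return abs(sum(values) - expected) <= tolerance


def date_before(data: dict, earlier: str, later: str) -> bool:
    """True when `earlier` strictly precedes `later`; both hold ISO 8601 date strings."""
    first, second = data.get(earlier), data.get(later)
    if not first or not second:
        return False
    return date.fromisoformat(first) < date.fromisoformat(second)


if __name__ == "__main__":
    doc = {"net_amount": 100.0, "tax_amount": 10.0, "total_amount": 110.0,
           "issue_date": "2026-01-01", "due_date": "2026-01-31"}
    print(sum_equals(doc, ["net_amount", "tax_amount"], "total_amount"))
    print(date_before(doc, "issue_date", "due_date"))
```

A tolerance matters because amounts parsed from OCR text are floats, and sums such as 0.1 + 0.2 rarely compare exactly equal to their expected total.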

Cross-Document Validation

Validate consistency across related documents (e.g., Invoice + B/L + Packing List):

from docpick.validation.cross_document import create_trade_document_validator

validator = create_trade_document_validator()
result = validator.validate({
    "invoice": invoice_data,
    "bl": bl_data,
    "packing_list": packing_list_data,
    "certificate": certificate_data,
})
print(result.is_valid)
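
To make the idea concrete, the sketch below compares fields that should agree across related documents. It is a standalone illustration, not the logic behind `create_trade_document_validator`; the field names (`total_packages`, `gross_weight_kg`) and the `find_mismatches` helper are hypothetical.

```python
def find_mismatches(documents: dict[str, dict], shared_fields: list[str]) -> list[str]:
    """Describe each shared field whose non-null values differ between documents."""
    problems = []
    for field in shared_fields:
        # Collect the field's value from every document that actually has it.
        seen = {
            name: doc[field]
            for name, doc in documents.items()
            if doc.get(field) is not None
        }
        if len(set(seen.values())) > 1:
            details = ", ".join(f"{name}={value!r}" for name, value in seen.items())
            problems.append(f"{field}: {details}")
    return problems


if __name__ == "__main__":
    docs = {
        "invoice": {"total_packages": 10, "gross_weight_kg": 500.0},
        "bl": {"total_packages": 12, "gross_weight_kg": 500.0},
        "packing_list": {"total_packages": 10},
    }
    for problem in find_mismatches(docs, ["total_packages", "gross_weight_kg"]):
        print(problem)
```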

OCR Engines

| Engine | Type | GPU | Languages | Best For |
|---|---|---|---|---|
| PaddleOCR | Traditional OCR | Optional | 111 | General documents (default) |
| EasyOCR | Traditional OCR | Optional | 80+ | Korean text |
| GOT-OCR2.0 | Vision-Language | Required | Multi | Complex layouts |
| VLM | Vision-Language | Required | Multi | Direct image → JSON |

2-Tier Auto Engine

The default auto engine uses confidence-based fallback:

  1. Tier 1 (CPU): PaddleOCR → EasyOCR
  2. Tier 2 (GPU): GOT-OCR2.0 → VLM

If the Tier 1 average confidence falls below the threshold (default 0.7), the engine automatically escalates to Tier 2.

LLM Providers

| Provider | Endpoint | Default Model |
|---|---|---|
| vLLM | http://localhost:8000/v1 | Qwen/Qwen3.5-32B-AWQ |
| Ollama | http://localhost:11434 | qwen3.5:7b |

Configure via CLI or YAML:

docpick config set llm.provider ollama
docpick config set llm.base_url http://localhost:11434
docpick config set llm.model qwen3.5:7b
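
The dotted keys suggest a nested configuration. As an illustration only (the README does not show the YAML layout, so the nesting is an assumption), this sketch shows how keys such as `llm.provider` could map onto a nested mapping; `set_dotted` is a hypothetical helper.

```python
def set_dotted(config: dict, key: str, value: str) -> dict:
    """Set an `a.b.c`-style key in a nested dict, creating levels as needed."""
    node = config
    *parents, leaf = key.split(".")
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value
    return config


if __name__ == "__main__":
    config: dict = {}
    set_dotted(config, "llm.provider", "ollama")
    set_dotted(config, "llm.base_url", "http://localhost:11434")
    set_dotted(config, "llm.model", "qwen3.5:7b")
    print(config)
```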

Error Handling

The pipeline is designed to be resilient:

  • OCR failure → automatic fallback to next available engine
  • LLM JSON parse failure → automatic retry with correction prompt (up to 1 retry)
  • Partial results → returns whatever was extracted, with errors logged in result.errors
  • Document load failure → returns empty result with error message

result = pipeline.extract("damaged.pdf", schema=InvoiceSchema)
if result.errors:
    print("Pipeline warnings:", result.errors)
if result.data:
    print("Partial extraction:", result.data)

Batch Processing

Process entire directories with parallel workers:

from docpick.batch import BatchProcessor
from docpick.schemas import InvoiceSchema

processor = BatchProcessor(concurrency=4)
result = processor.process_directory(
    "./invoices/",
    schema=InvoiceSchema,
    recursive=True,
)

print(f"Processed {result.succeeded}/{result.total} files")
for path, extraction in result.results.items():
    print(f"{path}: {extraction.data.get('total_amount')}")
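
For intuition about what a concurrency setting does, here is a standalone sketch of parallel per-file processing with the standard library. It is not how `BatchProcessor` is implemented; the use of threads, the `process_files` name, and the `extract` callable are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable


def process_files(paths: list[Path], extract: Callable[[Path], dict],
                  concurrency: int = 4) -> tuple[dict[Path, dict], dict[Path, str]]:
    """Run `extract` over files in parallel; return (results, failures) keyed by path."""
    results: dict[Path, dict] = {}
    failures: dict[Path, str] = {}
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = {pool.submit(extract, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # one bad file should not stop the batch
                failures[path] = str(exc)
    return results, failures
```

Threads suit this workload when most of the time is spent waiting on an LLM endpoint; CPU-bound OCR would more likely call for separate processes.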

Architecture

flowchart TD
    A["📄 Document\n(PDF / Image)"] --> B["DocumentLoader\n(pypdfium2)"]
    B --> C["Tier 1: OCR\n(PaddleOCR / EasyOCR)\nCPU"]
    C --> D{"Confidence\n≥ threshold?"}
    D -->|"yes"| F["LLM Extractor\n(vLLM / Ollama)\nSchema prompt"]
    D -->|"no"| E["Tier 2: VLM\n(GOT / VLM)\nGPU"]
    E --> F
    F --> G["Pydantic Validation"]
    G --> H["✅ ExtractionResult"]

License

Apache 2.0 — all dependencies are Apache 2.0 or MIT licensed.


Part of the QuartzUnit ecosystem — composable Python libraries for data collection, extraction, search, and AI agent safety.

Details

Published March 17, 2026
Version 0.1.2
0 installs
Local Plugin
