Yes, Multimodal is free to use.

How do I install Multimodal?

Multimodal is a local plugin. Install it from the npm package @r16t/multimodal-mcp, add the generated configuration snippet to your AI app's MCP config file, and then restart the app.

What AI apps work with Multimodal?

Multimodal uses the Model Context Protocol (MCP) and works with any MCP-compatible AI app, including Claude, ChatGPT / Codex, Gemini, Copilot, Cursor, and more.

Multimodal MCP Server

by Rsmdt

Developer Tools · Low Risk · 10.0 · MCP Registry · Local

Free

Server data from the Official MCP Registry

Multi-provider media generation — images, video, audio, and transcription via a unified interface

Security Report

Security score: 10.0 (Low Risk)

Valid MCP server (1 strong, 1 medium validity signals). No known CVEs in dependencies. Package registry verified. Imported from the Official MCP Registry.

5 files analyzed · 1 issue found

Security scores are indicators to help you make informed decisions, not guarantees. Always review permissions before connecting any MCP server.

What You'll Need

Set these up before or after installing:

OpenAI API key for image, video, and audio generation and transcription (Required)

Environment variable: OPENAI_API_KEY

xAI API key for image and video generation (Required)

Environment variable: XAI_API_KEY

Google Gemini API key for image, video, and audio generation (Required)

Environment variable: GEMINI_API_KEY

ElevenLabs API key for audio generation and transcription (Required)

Environment variable: ELEVENLABS_API_KEY

BFL API key for FLUX image generation and editing (Required)

Environment variable: BFL_API_KEY

Directory for saved media files, defaults to cwd (Optional)

Environment variable: MEDIA_OUTPUT_DIR

How to Install

Add this to your MCP configuration file:

{
  "mcpServers": {
    "io-github-rsmdt-multimodal": {
      "env": {
        "BFL_API_KEY": "your-bfl-api-key-here",
        "XAI_API_KEY": "your-xai-api-key-here",
        "GEMINI_API_KEY": "your-gemini-api-key-here",
        "OPENAI_API_KEY": "your-openai-api-key-here",
        "MEDIA_OUTPUT_DIR": "your-media-output-dir-here",
        "ELEVENLABS_API_KEY": "your-elevenlabs-api-key-here"
      },
      "args": [
        "-y",
        "@r16t/multimodal-mcp"
      ],
      "command": "npx"
    }
  }
}
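
The snippet above lists a placeholder for every provider key, while the README further down says a single provider key is enough. As a rough sketch, the entry could be generated so that only the keys you actually have are written out. The helper name `buildServerEntry` is hypothetical and not part of the package, and whether the server treats leftover placeholder strings as real keys is not stated here.

```typescript
// Hypothetical helper (not part of @r16t/multimodal-mcp): builds an MCP
// config entry that contains only the environment variables supplied.
type EnvVars = Record<string, string | undefined>;

interface ServerEntry {
  command: string;
  args: string[];
  env: Record<string, string>;
}

// Variable names taken from the "What You'll Need" list above.
const KNOWN_VARS = [
  "OPENAI_API_KEY",
  "XAI_API_KEY",
  "GEMINI_API_KEY",
  "ELEVENLABS_API_KEY",
  "BFL_API_KEY",
  "MEDIA_OUTPUT_DIR",
];

function buildServerEntry(supplied: EnvVars): ServerEntry {
  const env: Record<string, string> = {};
  for (const name of KNOWN_VARS) {
    const value = supplied[name];
    // Skip unset or blank values so no placeholder ends up in the config.
    if (value !== undefined && value.trim() !== "") {
      env[name] = value;
    }
  }
  return { command: "npx", args: ["-y", "@r16t/multimodal-mcp"], env };
}

// Example: a config that enables only the OpenAI provider.
const config = {
  mcpServers: {
    "io-github-rsmdt-multimodal": buildServerEntry({ OPENAI_API_KEY: "sk-..." }),
  },
};
const configJson = JSON.stringify(config, null, 2);
```

The resulting `configJson` string has the same shape as the snippet above, minus the unused keys.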

Documentation

From the project's GitHub README.

multimodal-mcp

Multi-provider media generation MCP server. Generate images, videos, audio, and transcriptions from text prompts using OpenAI, xAI, Gemini, ElevenLabs, and BFL (FLUX) through a single unified interface.

Features

🎨 Image Generation — Generate images via OpenAI (gpt-image-1), xAI (grok-imagine-image), Gemini (imagen-4), or BFL (FLUX Pro 1.1)
✏️ Image Editing — Edit images via OpenAI, xAI, Gemini, or BFL (FLUX Kontext)
🎬 Video Generation — Generate videos via OpenAI (sora-2), xAI (grok-imagine-video), or Gemini (veo-3.1)
🔊 Audio Generation — Text-to-speech via OpenAI (tts-1), Gemini, or ElevenLabs (Flash v2.5). Sound effects via ElevenLabs
🎙️ Audio Transcription — Speech-to-text via OpenAI (Whisper) or ElevenLabs (Scribe)
🔄 Auto-Discovery — Automatically detects configured providers from environment variables
🎯 Provider Selection — Auto-select a provider or explicitly choose one per request
📁 File Output — Saves all generated media to disk with descriptive filenames

Quick Start

Set the API key for at least one provider. Most users only need one — add more to access additional providers.

# Using OpenAI
claude mcp add multimodal-mcp -e OPENAI_API_KEY=sk-... -- npx -y @r16t/multimodal-mcp@latest

# Or using xAI
# claude mcp add multimodal-mcp -e XAI_API_KEY=xai-... -- npx -y @r16t/multimodal-mcp@latest

# Or using Gemini
# claude mcp add multimodal-mcp -e GEMINI_API_KEY=AIza... -- npx -y @r16t/multimodal-mcp@latest

# Or using ElevenLabs (audio + transcription)
# claude mcp add multimodal-mcp -e ELEVENLABS_API_KEY=xi-... -- npx -y @r16t/multimodal-mcp@latest

# Or using BFL/FLUX (images)
# claude mcp add multimodal-mcp -e BFL_API_KEY=... -- npx -y @r16t/multimodal-mcp@latest

Using a different editor? See the Editor Setup section below for Claude Desktop, Cursor, VS Code, Windsurf, and Cline.

Environment Variables

Variable	Required	Description
`OPENAI_API_KEY`	At least one provider key	OpenAI API key — enables image, video, audio generation, and transcription via gpt-image-1, sora-2, tts-1, and whisper-1
`XAI_API_KEY`	At least one provider key	xAI API key — enables image and video generation via grok-imagine-image and grok-imagine-video
`GEMINI_API_KEY`	At least one provider key	Gemini API key — enables image, video, and audio generation via imagen-4, veo-3.1, and gemini-2.5-flash-preview-tts
`GOOGLE_API_KEY`	—	Alias for `GEMINI_API_KEY`; either name is accepted
`ELEVENLABS_API_KEY`	At least one provider key	ElevenLabs API key — enables audio generation (TTS, sound effects) and transcription via Flash v2.5 and Scribe v1
`BFL_API_KEY`	At least one provider key	BFL API key — enables image generation and editing via FLUX Pro 1.1 and FLUX Kontext
`MEDIA_OUTPUT_DIR`	No	Directory for saved media files. Defaults to the current working directory
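
To illustrate how auto-discovery from these variables might work, here is a minimal sketch. It is not the server's actual code: the function name `detectProviders` and the rule that a blank value counts as unset are assumptions. The variable names and the `GOOGLE_API_KEY` alias come from the table above, and the provider identifiers match those used in the tool parameters.

```typescript
type Env = Record<string, string | undefined>;
type Provider = "openai" | "xai" | "google" | "elevenlabs" | "bfl";

// Which environment variables enable which provider (from the table above).
const PROVIDER_KEYS: Record<Provider, string[]> = {
  openai: ["OPENAI_API_KEY"],
  xai: ["XAI_API_KEY"],
  google: ["GEMINI_API_KEY", "GOOGLE_API_KEY"], // either name is accepted
  elevenlabs: ["ELEVENLABS_API_KEY"],
  bfl: ["BFL_API_KEY"],
};

// Returns the providers that have at least one non-blank key set.
// Assumption: blank strings are treated the same as missing variables.
function detectProviders(env: Env): Provider[] {
  const providers = Object.keys(PROVIDER_KEYS) as Provider[];
  return providers.filter((provider) =>
    PROVIDER_KEYS[provider].some((name) => (env[name] ?? "").trim() !== "")
  );
}

// Example: only the Gemini alias is set, so only "google" is detected.
const detected = detectProviders({ GOOGLE_API_KEY: "AIza..." });
```

In a Node.js process the same function could be called with `process.env`. An empty result corresponds to the "No provider API keys detected" case described under Troubleshooting.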

Available Tools

`generate_image`

Generate an image from a text prompt.

Parameter	Type	Required	Description
`prompt`	string	Yes	Text description of the image to generate
`provider`	string	No	Provider to use: `openai`, `xai`, `google`, `bfl`. Auto-selects if omitted
`aspectRatio`	string	No	Aspect ratio: `1:1`, `16:9`, `9:16`, `4:3`, `3:4`
`quality`	string	No	Quality level: `low`, `standard`, `high`
`outputDirectory`	string	No	Directory to save the generated file. Absolute or relative path. Defaults to `MEDIA_OUTPUT_DIR` or cwd
`providerOptions`	object	No	Provider-specific parameters passed through directly
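
For reference, the parameters above can be written as a type and wrapped in an MCP `tools/call` request. This is a sketch only: the AI app's MCP client normally builds this message for you, and the names `GenerateImageArgs` and `buildGenerateImageRequest` are illustrative rather than taken from the package.

```typescript
// Argument shape for generate_image, transcribed from the table above.
interface GenerateImageArgs {
  prompt: string;
  provider?: "openai" | "xai" | "google" | "bfl";
  aspectRatio?: "1:1" | "16:9" | "9:16" | "4:3" | "3:4";
  quality?: "low" | "standard" | "high";
  outputDirectory?: string;
  providerOptions?: Record<string, unknown>;
}

// JSON-RPC 2.0 envelope used by MCP for tool invocations.
interface ToolCallRequest<A> {
  jsonrpc: string;
  id: number;
  method: string;
  params: { name: string; arguments: A };
}

function buildGenerateImageRequest(
  id: number,
  args: GenerateImageArgs
): ToolCallRequest<GenerateImageArgs> {
  // prompt is the only required parameter.
  if (args.prompt.trim() === "") {
    throw new Error("prompt is required");
  }
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: "generate_image", arguments: args },
  };
}

// provider is omitted here, so the server auto-selects one.
const imageRequest = buildGenerateImageRequest(1, {
  prompt: "A lighthouse at dusk, watercolor style",
  aspectRatio: "16:9",
  quality: "high",
});
```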

`generate_video`

Generate a video from a text prompt. Video generation is asynchronous and may take several minutes.

Parameter	Type	Required	Description
`prompt`	string	Yes	Text description of the video to generate
`provider`	string	No	Provider to use: `openai`, `xai`, `google`. Auto-selects if omitted
`duration`	number	No	Video duration in seconds (provider limits apply)
`aspectRatio`	string	No	Aspect ratio: `16:9`, `9:16`, `1:1`
`resolution`	string	No	Resolution: `480p`, `720p`, `1080p`
`outputDirectory`	string	No	Directory to save the generated file. Absolute or relative path. Defaults to `MEDIA_OUTPUT_DIR` or cwd
`providerOptions`	object	No	Provider-specific parameters passed through directly
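
The `outputDirectory` row above (shared by the other generation tools) implies a precedence: the per-request argument first, then `MEDIA_OUTPUT_DIR`, then the current working directory. A small sketch of that lookup follows. The function name is hypothetical, and how the server resolves relative paths is not documented here, so the sketch returns the chosen value unchanged.

```typescript
// Picks the directory for saved media files using the documented precedence:
// 1. outputDirectory argument, 2. MEDIA_OUTPUT_DIR, 3. current working directory.
function resolveOutputDirectory(
  outputDirectory: string | undefined,
  env: Record<string, string | undefined>,
  cwd: string
): string {
  const fromArgument = outputDirectory?.trim();
  if (fromArgument) {
    return fromArgument;
  }
  const fromEnv = env["MEDIA_OUTPUT_DIR"]?.trim();
  if (fromEnv) {
    return fromEnv;
  }
  return cwd;
}

// Example: no argument given, so the environment variable wins over cwd.
const chosenDir = resolveOutputDirectory(undefined, { MEDIA_OUTPUT_DIR: "/data/media" }, "/home/me/project");
```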

`generate_audio`

Generate audio from text. Supports text-to-speech and sound effects. Audio generation is synchronous.

Parameter	Type	Required	Description
`text`	string	Yes	Text to convert to speech, or a description of the sound effect to generate
`provider`	string	No	Provider to use: `openai`, `google`, `elevenlabs`. Auto-selects if omitted
`voice`	string	No	Voice name (provider-specific). OpenAI: `alloy`, `ash`, `coral`, `echo`, `fable`, `nova`, `onyx`, `sage`, `shimmer`. Google: `Kore`, `Charon`, `Fenrir`, `Aoede`, `Puck`, etc. ElevenLabs: voice ID
`speed`	number	No	Speech speed multiplier (OpenAI only): `0.25` to `4.0`
`format`	string	No	Output format (OpenAI only): `mp3`, `opus`, `aac`, `flac`, `wav`, `pcm`
`outputDirectory`	string	No	Directory to save the generated file. Absolute or relative path. Defaults to `MEDIA_OUTPUT_DIR` or cwd
`providerOptions`	object	No	Provider-specific parameters passed through directly. ElevenLabs: set `mode: "sound-effect"` for sound effects, `model` for TTS model selection
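
Two example argument objects may make the table above more concrete: one for text-to-speech and one for an ElevenLabs sound effect. The checker below is a hypothetical client-side helper that only mirrors the limits stated in the table. How the server itself reacts to, say, `speed` with a non-OpenAI provider is not documented here.

```typescript
// Argument shape for generate_audio, transcribed from the table above.
interface GenerateAudioArgs {
  text: string;
  provider?: "openai" | "google" | "elevenlabs";
  voice?: string;
  speed?: number;
  format?: "mp3" | "opus" | "aac" | "flac" | "wav" | "pcm";
  outputDirectory?: string;
  providerOptions?: Record<string, unknown>;
}

// Hypothetical pre-flight check; returns a list of problems (empty if none).
function validateAudioArgs(args: GenerateAudioArgs): string[] {
  const problems: string[] = [];
  const explicitNonOpenAI = args.provider !== undefined && args.provider !== "openai";
  if (args.text.trim() === "") {
    problems.push("text is required");
  }
  if (args.speed !== undefined) {
    if (explicitNonOpenAI) problems.push("speed is documented as OpenAI only");
    if (args.speed < 0.25 || args.speed > 4.0) {
      problems.push("speed must be between 0.25 and 4.0");
    }
  }
  if (args.format !== undefined && explicitNonOpenAI) {
    problems.push("format is documented as OpenAI only");
  }
  return problems;
}

// Text-to-speech with OpenAI, using a documented voice, speed, and format.
const speechArgs: GenerateAudioArgs = {
  text: "Welcome back. Your render has finished.",
  provider: "openai",
  voice: "nova",
  speed: 1.25,
  format: "mp3",
};

// Sound effect with ElevenLabs: text describes the sound, mode switches behavior.
const soundEffectArgs: GenerateAudioArgs = {
  text: "Heavy rain on a tin roof",
  provider: "elevenlabs",
  providerOptions: { mode: "sound-effect" },
};
```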

`transcribe_audio`

Transcribe audio to text (speech-to-text).

Parameter	Type	Required	Description
`audioPath`	string	Yes	Absolute path to the audio file to transcribe
`provider`	string	No	Provider to use: `openai`, `elevenlabs`. Auto-selects if omitted
`language`	string	No	Language code (e.g., `en`, `fr`, `es`) to hint the transcription language
`providerOptions`	object	No	Provider-specific parameters passed through directly
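
Because `audioPath` must be absolute, a caller may want to check the path before sending the request. The sketch below is illustrative only: `isAbsolutePath` and `buildTranscribeArgs` are hypothetical helpers, and the check covers just POSIX paths and Windows drive-letter paths.

```typescript
// Argument shape for transcribe_audio, transcribed from the table above.
interface TranscribeAudioArgs {
  audioPath: string;
  provider?: "openai" | "elevenlabs";
  language?: string;
  providerOptions?: Record<string, unknown>;
}

// Simplified absolute-path test: "/..." (POSIX) or "C:\..." / "C:/..." (Windows).
function isAbsolutePath(path: string): boolean {
  return path.charAt(0) === "/" || /^[A-Za-z]:[\\/]/.test(path);
}

function buildTranscribeArgs(audioPath: string, language?: string): TranscribeAudioArgs {
  if (!isAbsolutePath(audioPath)) {
    throw new Error(`audioPath must be absolute, got "${audioPath}"`);
  }
  const args: TranscribeAudioArgs = { audioPath };
  if (language !== undefined) {
    args.language = language; // optional hint such as "en", "fr", or "es"
  }
  return args;
}

// provider is omitted, so the server auto-selects a transcription provider.
const transcribeArgs = buildTranscribeArgs("/tmp/interview.mp3", "en");
```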

`list_providers`

List all configured media generation providers and their capabilities. Takes no parameters.

Provider Capabilities

Provider	Image	Image Editing	Video	Audio	Transcription	Key Models
OpenAI	✅	✅	✅	✅	✅	gpt-image-1, sora-2, tts-1, whisper-1
xAI	✅	✅	✅	—	—	grok-imagine-image, grok-imagine-video
Gemini	✅	✅	✅	✅	—	imagen-4, veo-3.1, gemini-2.5-flash-preview-tts
ElevenLabs	—	—	—	✅	✅	eleven_flash_v2_5, scribe_v1
BFL	✅	✅	—	—	—	flux-pro-1.1, flux-kontext-pro
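
The capability table above can be read as a lookup that drives provider selection. The following sketch encodes the table and picks a provider for a media type. The selection order (first configured provider that supports the type) is an assumption, since the server's actual preference order is not documented here, and `selectProvider` is a hypothetical name.

```typescript
type MediaType = "image" | "imageEditing" | "video" | "audio" | "transcription";
type ProviderId = "openai" | "xai" | "google" | "elevenlabs" | "bfl";

// Transcribed from the Provider Capabilities table.
const CAPABILITIES: Record<ProviderId, MediaType[]> = {
  openai: ["image", "imageEditing", "video", "audio", "transcription"],
  xai: ["image", "imageEditing", "video"],
  google: ["image", "imageEditing", "video", "audio"],
  elevenlabs: ["audio", "transcription"],
  bfl: ["image", "imageEditing"],
};

function supports(provider: ProviderId, mediaType: MediaType): boolean {
  return CAPABILITIES[provider].indexOf(mediaType) !== -1;
}

// Honors an explicit provider if it is configured and capable; otherwise
// falls back to the first configured provider that supports the media type.
function selectProvider(
  mediaType: MediaType,
  configured: ProviderId[],
  requested?: ProviderId
): ProviderId {
  if (requested !== undefined) {
    if (configured.indexOf(requested) === -1) {
      throw new Error(`Provider "${requested}" is not configured (no API key)`);
    }
    if (!supports(requested, mediaType)) {
      throw new Error(`Provider "${requested}" does not support ${mediaType}`);
    }
    return requested;
  }
  for (const provider of configured) {
    if (supports(provider, mediaType)) {
      return provider;
    }
  }
  throw new Error(`No configured provider supports ${mediaType}`);
}

// Example: xAI cannot transcribe, so ElevenLabs is chosen.
const transcriber = selectProvider("transcription", ["xai", "elevenlabs"]);
```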

Image Aspect Ratios

Provider	1:1	16:9	9:16	4:3	3:4
OpenAI	✅	✅	✅	✅	✅
xAI	✅	✅	✅	✅	✅
Gemini	✅	✅	✅	✅	✅
BFL	✅	✅	✅	✅	✅

Video Aspect Ratios & Resolutions

Provider	16:9	9:16	1:1	480p	720p	1080p
OpenAI	✅	✅	✅	✅	✅	✅
xAI	✅	✅	✅	—	✅	✅
Gemini	✅	✅	—	—	✅	✅
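
As a quick way to use the matrix above, the sketch below lists which video providers can satisfy a given aspect ratio and resolution. The data is transcribed from the table and the helper name `videoProvidersFor` is hypothetical.

```typescript
type VideoProvider = "openai" | "xai" | "google";

interface VideoSupport {
  aspectRatios: string[];
  resolutions: string[];
}

// Transcribed from the Video Aspect Ratios & Resolutions table.
const VIDEO_SUPPORT: Record<VideoProvider, VideoSupport> = {
  openai: { aspectRatios: ["16:9", "9:16", "1:1"], resolutions: ["480p", "720p", "1080p"] },
  xai: { aspectRatios: ["16:9", "9:16", "1:1"], resolutions: ["720p", "1080p"] },
  google: { aspectRatios: ["16:9", "9:16"], resolutions: ["720p", "1080p"] },
};

// Returns the providers that support every constraint that was given.
// An omitted constraint matches all providers.
function videoProvidersFor(aspectRatio?: string, resolution?: string): VideoProvider[] {
  const all = Object.keys(VIDEO_SUPPORT) as VideoProvider[];
  return all.filter((provider) => {
    const support = VIDEO_SUPPORT[provider];
    const ratioOk = aspectRatio === undefined || support.aspectRatios.indexOf(aspectRatio) !== -1;
    const resolutionOk = resolution === undefined || support.resolutions.indexOf(resolution) !== -1;
    return ratioOk && resolutionOk;
  });
}

// Square video is available from OpenAI and xAI, but not Gemini.
const squareProviders = videoProvidersFor("1:1");
```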

Audio Formats

Provider	mp3	opus	aac	flac	wav	pcm
OpenAI	✅	✅	✅	✅	✅	✅
Gemini	—	—	—	—	✅	—
ElevenLabs	✅	✅	—	—	—	✅
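
When a workflow needs a particular file type, the table above can be used to choose a fallback. The sketch below picks the first format in a preference list that a provider can produce. Note that the `format` parameter of `generate_audio` is documented as OpenAI only, so for other providers this is a planning aid rather than a request option. `pickAudioFormat` is a hypothetical helper.

```typescript
type AudioProvider = "openai" | "google" | "elevenlabs";
type AudioFormat = "mp3" | "opus" | "aac" | "flac" | "wav" | "pcm";

// Transcribed from the Audio Formats table.
const AUDIO_FORMATS: Record<AudioProvider, AudioFormat[]> = {
  openai: ["mp3", "opus", "aac", "flac", "wav", "pcm"],
  google: ["wav"],
  elevenlabs: ["mp3", "opus", "pcm"],
};

// Returns the first preferred format the provider supports, or undefined
// when none of the preferred formats is available.
function pickAudioFormat(
  provider: AudioProvider,
  preferred: AudioFormat[]
): AudioFormat | undefined {
  for (const format of preferred) {
    if (AUDIO_FORMATS[provider].indexOf(format) !== -1) {
      return format;
    }
  }
  return undefined;
}

// Gemini produces only wav, so the mp3 preference falls through to wav.
const geminiFormat = pickAudioFormat("google", ["mp3", "wav"]);
```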

Troubleshooting

No providers configured

[config] No provider API keys detected

Set at least one of OPENAI_API_KEY, XAI_API_KEY, GEMINI_API_KEY, ELEVENLABS_API_KEY, or BFL_API_KEY in the MCP server's env block.

Provider not available for requested media type

Each provider supports different media types (see Provider Capabilities). If you specify a provider that isn't configured (no API key) or doesn't support the requested media type, you'll receive an error. Omit the provider parameter to auto-select from configured providers.

Video generation timeout

Video generation polls for up to 10 minutes. If your video hasn't completed in that window, the request will fail with a timeout error. Try a shorter duration or a simpler prompt.
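
The timeout behavior can be pictured as a decision taken on every poll. The sketch below is not the server's code: the 10-minute window comes from the text above, while the status names, the 10-second polling interval, and the simulated job are assumptions made for illustration.

```typescript
type JobStatus = "pending" | "completed" | "failed";
type PollDecision = "keep-waiting" | "done" | "failed" | "timed-out";

const TIMEOUT_MS = 10 * 60 * 1000; // 10-minute polling window from the docs

// Decides what to do after one status check, given the time elapsed so far.
function decide(status: JobStatus, elapsedMs: number, timeoutMs: number = TIMEOUT_MS): PollDecision {
  if (status === "completed") return "done";
  if (status === "failed") return "failed";
  return elapsedMs >= timeoutMs ? "timed-out" : "keep-waiting";
}

// Simulates a job that completes after completesAfterMs, polled at a fixed
// interval, without real waiting. The interval is an assumed value.
function simulatePolling(completesAfterMs: number, intervalMs: number): PollDecision {
  let elapsedMs = 0;
  while (true) {
    const status: JobStatus = elapsedMs >= completesAfterMs ? "completed" : "pending";
    const decision = decide(status, elapsedMs);
    if (decision !== "keep-waiting") {
      return decision;
    }
    elapsedMs += intervalMs; // stand-in for sleeping until the next poll
  }
}

const quickJob = simulatePolling(45 * 1000, 10 * 1000);     // finishes inside the window
const slowJob = simulatePolling(15 * 60 * 1000, 10 * 1000); // exceeds the window
```

Here `quickJob` ends as `"done"` and `slowJob` as `"timed-out"`, which corresponds to the timeout error described above.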

xAI image generation returned no data

This indicates the xAI API returned an empty response. Check that your XAI_API_KEY is valid and that your prompt does not violate xAI content policies.

Gemini image/video generation failed: 403

Verify your GEMINI_API_KEY has the Generative Language API enabled in Google Cloud Console.

Development

npm run build      # Compile TypeScript to build/
npm test           # Run tests with Vitest
npm run lint       # Lint and auto-fix with ESLint
npm run typecheck  # Type-check without emitting
npm run dev        # Watch mode for TypeScript compilation

Editor Setup

The examples below use OPENAI_API_KEY. Replace it with the key variable for your provider of choice (XAI_API_KEY, GEMINI_API_KEY, ELEVENLABS_API_KEY, or BFL_API_KEY), or set several keys to enable multiple providers.

Claude Desktop

Add to ~/Library/Application Support/Claude/claude_desktop_config.json (the macOS location; on Windows the file is %APPDATA%\Claude\claude_desktop_config.json):

{
  "mcpServers": {
    "multimodal-mcp": {
      "command": "npx",
      "args": ["@r16t/multimodal-mcp@latest"],
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}

Cursor

Add to .cursor/mcp.json in your project root (or ~/.cursor/mcp.json globally):

{
  "mcpServers": {
    "multimodal-mcp": {
      "command": "npx",
      "args": ["@r16t/multimodal-mcp@latest"],
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}

VS Code (GitHub Copilot)

Add to .vscode/mcp.json in your project root:

{
  "servers": {
    "multimodal-mcp": {
      "command": "npx",
      "args": ["@r16t/multimodal-mcp@latest"],
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}

Windsurf

Add to ~/.codeium/windsurf/mcp_config.json:

{
  "mcpServers": {
    "multimodal-mcp": {
      "command": "npx",
      "args": ["@r16t/multimodal-mcp@latest"],
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}

Cline

Add to ~/Library/Application Support/Code/User/globalStorage/saoudrizwan.claude-dev/settings/cline_mcp_settings.json (macOS path):

{
  "mcpServers": {
    "multimodal-mcp": {
      "command": "npx",
      "args": ["@r16t/multimodal-mcp@latest"],
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}