Teaching OpenClaw to Understand Voice with MAI-Transcribe-1
This post explains how to add Azure Speech transcription to OpenClaw with Microsoft's MAI-Transcribe-1. The goal is simple: turn voice notes from Telegram, LINE, and WhatsApp into normal text input early enough that your existing prompts, tools, and plugins can keep working without a separate voice-only stack.
What This Adds to OpenClaw
- A user can send a voice message in any supported channel, and OpenClaw will transcribe it before generating a reply.
- The system can auto-detect across 25 languages, including Chinese, English, Japanese, and Korean.
- Voice commands such as "Remind me in 10 minutes" can trigger existing tools directly.
- Telegram, LINE, and WhatsApp voice formats are all supported.
- Speech recognition stays inexpensive at about $0.36 USD per hour of audio.
Prerequisites
- You already have OpenClaw running on an Azure VM. If not, start with Building a Family AI Chat Bot on Azure with OpenClaw.
- Azure AI Foundry is already deployed.
- Your Bicep infrastructure can deploy successfully today.
- ffmpeg is installed on the VM. You will use it to normalize audio formats before upload.
Related OpenClaw Guides
- Start with OpenClaw Overview if you want the broader plugin model and article map before adding speech.
- Read Building a Family AI Chat Bot on Azure with OpenClaw if you still need the Azure VM, Key Vault, and Foundry base environment.
- Read Teaching OpenClaw to Draw with MAI-Image-2 on Telegram, LINE, and WhatsApp if you want a parallel media workflow for image generation and delivery.
- Read Building a Natural-Language Reminder & Scheduled Task System for OpenClaw if you want voice commands to trigger something concrete after transcription.
Open-Source Plugin Repository
The plugin code in this article is now available publicly at weijen/openclaw-mai-transcribe-plugin.
Use that repository if you want the standalone OpenClaw plugin runtime, tests, and a minimal configuration example without copying code blocks out of this post. This article still covers the larger production setup around the plugin, including Azure Speech provisioning, VM-side integration, channel-specific audio handling, and how the plugin fits into a real OpenClaw deployment.
Where Voice Messages Go
Telegram / LINE / WhatsApp
     | (voice message)
     v
OpenClaw Gateway
     |
+----+-----------------------+
| Telegram / WhatsApp        | LINE
| tools.media.audio          | media attachment -> agent
| CLI pipeline               | -> mai_transcribe tool
+----+-----------------------+
     |
     v
mai-transcribe.sh (CLI)
     | ffmpeg conversion (M4A -> WAV)
     v
Azure Speech Service (East US)
MAI-Transcribe-1
     |
     v
Transcript text -> agent reply
Voice messages take two different routes depending on the channel:
- Telegram and WhatsApp: OpenClaw's tools.media.audio pipeline calls the CLI script automatically.
- LINE: the voice message arrives as a media attachment, so the plugin prompt instructs the agent to call mai_transcribe explicitly.
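The two routes can be summarized as a small lookup. This is an illustrative sketch only: `routeFor` and the lowercase channel names are hypothetical and are not part of OpenClaw's API.

```javascript
// Illustrative only: channel names and return values are assumptions made
// for this sketch, not identifiers taken from OpenClaw itself.
const AUTO_PIPELINE_CHANNELS = new Set(['telegram', 'whatsapp']);

function routeFor(channel) {
  const name = String(channel).toLowerCase();
  if (AUTO_PIPELINE_CHANNELS.has(name)) {
    // tools.media.audio runs the CLI script before the agent sees the message.
    return 'tools.media.audio';
  }
  if (name === 'line') {
    // The agent receives a media attachment and has to call the tool itself.
    return 'mai_transcribe';
  }
  throw new Error(`No voice route defined for channel: ${channel}`);
}

console.log(routeFor('Telegram')); // tools.media.audio
console.log(routeFor('line'));     // mai_transcribe
```

Keeping the mapping explicit makes it obvious that a new channel needs a deliberate routing decision rather than silently falling into one of the two paths.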
Step 1: Deploy Azure Speech
MAI-Transcribe-1 runs on Azure Speech Service, not Azure AI Foundry. The model is only available in East US right now.
1.1 Add a Bicep module
Create infra/bicep/modules/speech.bicep:
param location string = 'eastus'
param prefix string = 'oc-family'
param skuName string = 'S0'
var unique = uniqueString(resourceGroup().id)
resource speech 'Microsoft.CognitiveServices/accounts@2024-10-01' = {
  name: '${prefix}-speech-${unique}'
  location: location
  kind: 'SpeechServices'
  sku: { name: skuName }
  properties: {
    customSubDomainName: '${prefix}-speech-${unique}'
    publicNetworkAccess: 'Enabled'
    disableLocalAuth: false
    networkAcls: { defaultAction: 'Allow' }
  }
}
output speechEndpoint string = speech.properties.endpoint
output speechResourceName string = speech.name
output speechRegion string = location
#disable-next-line outputs-should-not-contain-secrets
output speechKey string = speech.listKeys().key1
1.2 Wire it into main.bicep
Add parameters and the new module:
param enableSpeech bool = true
param speechLocation string = 'eastus'
module speech './modules/speech.bicep' = if (enableSpeech) {
  name: 'speech'
  params: {
    location: speechLocation
    prefix: prefix
  }
}
Then store the Speech key in Key Vault:
@secure()
param speechApiKey string = ''
resource speechApiKeySecret 'Microsoft.KeyVault/vaults/secrets@2023-07-01' = if (!empty(speechApiKey)) {
  parent: kv
  name: 'speech-api-key'
  properties: { value: speechApiKey }
}
Pass it through in the main.bicep Key Vault module call:
speechApiKey: enableSpeech ? speech.outputs.speechKey : ''
1.3 Deploy and verify
az deployment group create \
  --resource-group oc-family-rg \
  --template-file infra/bicep/main.bicep \
  --parameters infra/bicep/params/prod.bicepparam
Verify the resource:
az cognitiveservices account list -g oc-family-rg \
  --query "[?kind=='SpeechServices'].{name:name, location:location}" -o table
You should see oc-family-speech-xxxxx | eastus in the output.
Step 2: Install ffmpeg
LINE voice messages arrive as M4A with AAC encoding. In practice, MAI-Transcribe-1 can reject some LINE-produced M4A files with HTTP 422 even though M4A is listed as supported. Converting those files to WAV first avoids the format mismatch.
ssh weijen@family-claw.multiagentai.co 'sudo apt-get install -y ffmpeg'
[!NOTE] Telegram and WhatsApp use OGG Opus, which typically works without conversion. Even so, normalizing all non-WAV, non-MP3, and non-FLAC inputs through ffmpeg is the safer operational choice.
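The normalization rule is easy to express as a pure function, which also makes it easy to test. The sketch below assumes the same pass-through list (WAV, MP3, FLAC) and the same ffmpeg flags used elsewhere in this guide (16 kHz, mono, WAV); `needsConversion` and `ffmpegArgs` are hypothetical helper names.

```javascript
const path = require('path');

// Extensions that are uploaded as-is; everything else goes through ffmpeg.
const PASSTHROUGH_EXTENSIONS = new Set(['wav', 'mp3', 'flac']);

function needsConversion(filePath) {
  const ext = path.extname(filePath).slice(1).toLowerCase();
  return !PASSTHROUGH_EXTENSIONS.has(ext);
}

// Argument list for ffmpeg: overwrite output, 16 kHz sample rate, mono, WAV.
function ffmpegArgs(inputPath, outputPath) {
  return ['-y', '-i', inputPath, '-ar', '16000', '-ac', '1', '-f', 'wav', outputPath];
}

console.log(needsConversion('/tmp/openclaw/line-media-1.m4a')); // true
console.log(needsConversion('note.WAV'));                       // false
```

Returning the arguments as an array rather than a single command string means they can be handed to `child_process.execFile` without shell quoting problems.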
Step 3: Build the CLI transcription script
This script is the core of the speech-to-text flow. OpenClaw calls it whenever it needs to transcribe an audio file.
Create scripts/vm/mai-transcribe.sh:
#!/usr/bin/env bash
set -euo pipefail
AUDIO_PATH="${1:?Usage: mai-transcribe.sh <audio-file>}"
REGION="${SPEECH_REGION:-eastus}"
API_VERSION="2025-10-15"
# Read the API key from OpenClaw config if it is not already in the environment.
if [ -z "${SPEECH_API_KEY:-}" ]; then
  SPEECH_API_KEY=$(python3 -c "
import json
cfg = json.load(open('$HOME/.openclaw/openclaw.json'))
mt = cfg.get('plugins', {}).get('entries', {}).get('mai-transcribe', {}).get('config', {})
print(mt.get('apiKey', ''))
" 2>/dev/null || true)
fi
# Convert non-WAV/MP3/FLAC inputs to WAV before upload.
UPLOAD_PATH="${AUDIO_PATH}"
CLEANUP_TMP=""
EXT_LOWER=$(echo "${AUDIO_PATH##*.}" | tr '[:upper:]' '[:lower:]')
if [[ "$EXT_LOWER" != "wav" && "$EXT_LOWER" != "mp3" && "$EXT_LOWER" != "flac" ]]; then
  if command -v ffmpeg &>/dev/null; then
    TMP_WAV=$(mktemp /tmp/mai-transcribe-XXXXXX.wav)
    ffmpeg -y -i "${AUDIO_PATH}" -ar 16000 -ac 1 -f wav "${TMP_WAV}" </dev/null 2>/dev/null
    UPLOAD_PATH="${TMP_WAV}"
    CLEANUP_TMP="${TMP_WAV}"
  fi
fi
ENDPOINT="https://${REGION}.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?api-version=${API_VERSION}"
RESPONSE=$(curl -s -w "\n%{http_code}" -X POST "${ENDPOINT}" \
  -H "Content-Type: multipart/form-data" \
  -H "Ocp-Apim-Subscription-Key: ${SPEECH_API_KEY}" \
  --form "audio=@\"${UPLOAD_PATH}\"" \
  --form 'definition={"enhancedMode":{"enabled":true,"model":"mai-transcribe-1"}}')
[ -n "${CLEANUP_TMP}" ] && rm -f "${CLEANUP_TMP}"
HTTP_CODE=$(echo "${RESPONSE}" | tail -1)
BODY=$(echo "${RESPONSE}" | sed '$d')
if [ "${HTTP_CODE}" -ne 200 ]; then
  echo "ERROR: HTTP ${HTTP_CODE}: ${BODY}" >&2
  exit 1
fi
echo "${BODY}" | python3 -c "
import sys, json
data = json.load(sys.stdin)
phrases = data.get('combinedPhrases', [])
if phrases:
    print(phrases[0].get('text', ''))
else:
    texts = [p.get('text', '') for p in data.get('phrases', [])]
    print(' '.join(texts))
"
Deploy it to the VM:
scp scripts/vm/mai-transcribe.sh weijen@family-claw.multiagentai.co:~/.openclaw/scripts/
ssh weijen@family-claw.multiagentai.co 'chmod +x ~/.openclaw/scripts/mai-transcribe.sh'
Test it manually:
ssh weijen@family-claw.multiagentai.co \
  '~/.openclaw/scripts/mai-transcribe.sh /tmp/openclaw/line-media-*.m4a'
If everything is wired correctly, the command should print the recognized transcript.
Step 4: Build the OpenClaw plugin
The plugin has two jobs:
- Register the mai_transcribe tool so the agent can call it deliberately.
- Use before_prompt_build to tell the agent that voice attachments must be transcribed before reasoning.
If you want to start from the published code instead of rebuilding the files by hand, clone weijen/openclaw-mai-transcribe-plugin and adapt the configuration to your OpenClaw host.
4.1 Suggested plugin structure
extensions/mai-transcribe/
├── index.js
├── lib/api.js
├── openclaw.plugin.json
├── package.json
└── test/api.test.js
4.2 API client in lib/api.js
Use the same low-level multipart technique as the MAI-Image-2 integration, built on Node.js https:
const https = require('https');
function transcribeAudio({ region, apiVersion, model, apiKey, audioBuffer, mimeType, fileName }) {
  return new Promise((resolve, reject) => {
    const boundary = `----FormBoundary${Date.now()}`;
    const definition = JSON.stringify({ enhancedMode: { enabled: true, model } });
    const defPart = `--${boundary}\r\n` +
      `Content-Disposition: form-data; name="definition"\r\n` +
      `Content-Type: application/json\r\n\r\n` + definition + `\r\n`;
    const audioHeader = `--${boundary}\r\n` +
      `Content-Disposition: form-data; name="audio"; filename="${fileName}"\r\n` +
      `Content-Type: ${mimeType}\r\n\r\n`;
    const audioFooter = `\r\n--${boundary}--\r\n`;
    const body = Buffer.concat([
      Buffer.from(defPart + audioHeader), audioBuffer, Buffer.from(audioFooter)
    ]);
    const req = https.request({
      hostname: `${region}.api.cognitive.microsoft.com`,
      path: `/speechtotext/transcriptions:transcribe?api-version=${apiVersion}`,
      method: 'POST',
      headers: {
        'Content-Type': `multipart/form-data; boundary=${boundary}`,
        'Ocp-Apim-Subscription-Key': apiKey,
        'Content-Length': body.length,
      },
    }, (res) => {
      // Sketch of the response handling: collect the JSON body,
      // then extract combinedPhrases[0].text.
      const chunks = [];
      res.on('data', (chunk) => chunks.push(chunk));
      res.on('end', () => {
        const text = Buffer.concat(chunks).toString('utf8');
        if (res.statusCode !== 200) {
          reject(new Error(`HTTP ${res.statusCode}: ${text}`));
          return;
        }
        try {
          const combined = JSON.parse(text).combinedPhrases || [];
          resolve(combined.length > 0 ? combined[0].text || '' : '');
        } catch (err) {
          reject(err);
        }
      });
    });
    // Without this handler, a network failure would leave the promise pending.
    req.on('error', reject);
    req.write(body);
    req.end();
  });
}
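The client expects the caller to supply `mimeType` and `fileName`. One way to derive both from a file path is a small lookup table. This is a sketch under assumptions: `describeAudio` is a hypothetical helper, and the MIME mapping covers only the formats mentioned in this post.

```javascript
const path = require('path');

// Assumed mapping for this sketch; extend it to match the formats you accept.
const MIME_BY_EXTENSION = {
  wav: 'audio/wav',
  mp3: 'audio/mpeg',
  flac: 'audio/flac',
  ogg: 'audio/ogg',
  oga: 'audio/ogg',
  m4a: 'audio/mp4',
};

// Returns the two multipart fields derived from a path on disk.
function describeAudio(filePath) {
  const fileName = path.basename(filePath);
  const ext = path.extname(fileName).slice(1).toLowerCase();
  return {
    fileName,
    mimeType: MIME_BY_EXTENSION[ext] || 'application/octet-stream',
  };
}

console.log(describeAudio('/tmp/openclaw/line-media-1.m4a'));
```

Falling back to `application/octet-stream` keeps the request well-formed for unknown extensions, although converting such files to WAV first is the safer path.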
4.3 Plugin entry in index.js
function register(api) {
  const cfg = Object.assign(
    { region: 'eastus', model: 'mai-transcribe-1', maxFileSize: 26214400 },
    api.pluginConfig || {},
  );
  function resolveApiKey() {
    if (cfg.apiKey) return cfg.apiKey;
    if (api.resolveSecret) {
      const secret = api.resolveSecret('speech-api-key');
      if (secret) return secret;
    }
    return process.env.SPEECH_API_KEY || '';
  }
  api.registerTool({
    name: 'mai_transcribe',
    description: 'Transcribe an audio file into text with MAI-Transcribe-1. Supports 25 languages.',
    parameters: {
      type: 'object',
      required: ['filePath'],
      properties: {
        filePath: { type: 'string', description: 'Path to the audio file' },
      },
    },
    execute: async (_toolCallId, params) => {
      const apiKey = resolveApiKey();
      // Read the file, convert with ffmpeg when needed, call the API, and return the transcript.
    },
  });
  api.on('before_prompt_build', () => ({
    appendSystemContext:
      'You have a mai_transcribe tool that transcribes audio files to text. ' +
      'When you receive a voice message or audio file attachment, ' +
      'ALWAYS use the mai_transcribe tool to transcribe it first. ' +
      'Never tell the user you cannot read audio files. Use the tool instead.',
  }), { priority: 20 });
}
[!IMPORTANT] before_prompt_build matters because LINE sends voice clips as media attachments, not through OpenClaw's automatic audio pipeline. Without the prompt injection, the model can respond with "I can't read this audio format" instead of calling the tool.
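The `execute` handler above is left as a comment. Its first step, reading the file while respecting `maxFileSize`, might look like the sketch below. `loadAudio` is a hypothetical helper and not part of the published plugin; the default limit mirrors the 26214400-byte value in the plugin defaults.

```javascript
const fs = require('fs');

// Hypothetical helper: reads the file only if it is non-empty and fits
// within maxFileSize bytes. Throws with a descriptive message otherwise.
function loadAudio(filePath, maxFileSize = 26214400) {
  const { size } = fs.statSync(filePath); // throws if the file does not exist
  if (size === 0) {
    throw new Error(`Audio file is empty: ${filePath}`);
  }
  if (size > maxFileSize) {
    throw new Error(`Audio file is ${size} bytes, above the ${maxFileSize}-byte limit`);
  }
  return fs.readFileSync(filePath);
}
```

Checking the size before reading avoids loading an oversized attachment into memory just to reject it.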
Step 5: Update OpenClaw configuration
Add two configuration blocks to openclaw.json.
5.1 Automatic transcription pipeline for Telegram and WhatsApp
{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "maxBytes": 26214400
      },
      "models": [
        {
          "type": "cli",
          "command": "/home/weijen/.openclaw/scripts/mai-transcribe.sh",
          "args": ["{{MediaPath}}"],
          "timeoutSeconds": 60,
          "capabilities": ["audio"]
        }
      ]
    }
  }
}
5.2 Plugin configuration
{
  "plugins": {
    "allow": ["...", "mai-transcribe"],
    "entries": {
      "mai-transcribe": {
        "enabled": true,
        "config": {
          "region": "eastus",
          "apiKey": "__KEYVAULT__:speech-api-key"
        }
      }
    },
    "load": {
      "paths": [
        "...",
        "/home/weijen/.openclaw/extensions/mai-transcribe"
      ]
    }
  }
}
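Because the two blocks live in different parts of openclaw.json, it is easy to apply one and forget the other. The following sketch checks a parsed config object for the keys shown above. `checkTranscribeConfig` is a hypothetical helper, and it only looks at the fields used in this post.

```javascript
// Returns a list of human-readable problems; an empty list means this
// sketch found nothing missing. It does not validate the full schema.
function checkTranscribeConfig(cfg) {
  const problems = [];
  const plugins = cfg.plugins || {};
  const entry = (plugins.entries || {})['mai-transcribe'];

  if (!(plugins.allow || []).includes('mai-transcribe')) {
    problems.push('plugins.allow does not list mai-transcribe');
  }
  if (!entry || entry.enabled !== true) {
    problems.push('plugins.entries.mai-transcribe is missing or disabled');
  }
  const loadPaths = (plugins.load || {}).paths || [];
  if (!loadPaths.some((p) => p.endsWith('/mai-transcribe'))) {
    problems.push('plugins.load.paths has no mai-transcribe extension path');
  }

  const media = (cfg.tools || {}).media || {};
  if (!media.audio || media.audio.enabled !== true) {
    problems.push('tools.media.audio.enabled is not true');
  }
  const hasCli = (media.models || []).some(
    (m) => m.type === 'cli' && (m.capabilities || []).includes('audio'),
  );
  if (!hasCli) {
    problems.push('tools.media.models has no CLI model with the audio capability');
  }
  return problems;
}
```

A typical use would be `checkTranscribeConfig(JSON.parse(fs.readFileSync(configPath, 'utf8')))` before restarting the gateway, where `configPath` points at your openclaw.json.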
Step 6: One-command installation
If you are using the project CLI, the plugin can be installed in one command:
oc install-mai-transcribe-plugin family-claw.multiagentai.co
That command handles the following work automatically:
- Copies the plugin to the VM with SCP.
- Deploys the CLI script.
- Merges the required configuration.
- Restarts the gateway.
- Verifies the service health.
Step 7: Test the integration
Automated checks
# Gateway health
oc smoke-test family-claw.multiagentai.co
# Confirm the plugin was loaded
ssh weijen@family-claw.multiagentai.co \
  'journalctl --user -u openclaw-gateway.service --since "1 minute ago" | grep transcri'
You should see a log line similar to mai-transcribe plugin ready: region=eastus, model=mai-transcribe-1.
Manual checks
| Test | Action | Expected result |
|---|---|---|
| LINE Chinese voice | Send a Chinese voice message | Bot replies based on the transcript |
| Telegram English voice | Send an English voice message | Bot replies in English |
| Voice reminder | Say "Remind me in 10 minutes" | Bot creates a reminder |
| Voice search | Say "Check today's weather in Taipei" | Bot searches and replies |
What Broke in Practice
Problem 1: LINE M4A files were rejected by the API
| Symptom | Fix |
|---|---|
| HTTP 422 InvalidAudioFormat even though the docs say M4A is supported | Convert LINE M4A files to WAV first with ffmpeg -ar 16000 -ac 1 -f wav |
Telegram and WhatsApp OGG Opus files worked directly without conversion.
Problem 2: LINE did not use the automatic STT pipeline
| Symptom | Fix |
|---|---|
| The model says it cannot read the audio attachment | LINE voice arrives as a media attachment, not through tools.media.audio, so before_prompt_build must explicitly tell the agent to call mai_transcribe |
Problem 3: tools.media.audio was enabled but no transcription happened
| Symptom | Fix |
|---|---|
| enabled: true is set but the CLI never runs | Configure a CLI model in both tools.media.models and tools.media.audio.models if your OpenClaw build expects both locations |
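If you are unsure which location your build reads, one option is to mirror the same CLI entry into both. The sketch below assumes the config is plain JSON and that both `tools.media.models` and `tools.media.audio.models` are arrays of model entries; `withCliModelInBothLocations` is a hypothetical helper.

```javascript
// Returns a copy of the config with the CLI model present in both
// tools.media.models and tools.media.audio.models. Safe to run twice.
function withCliModelInBothLocations(cfg, cliModel) {
  const next = JSON.parse(JSON.stringify(cfg)); // deep copy of plain JSON
  next.tools = next.tools || {};
  next.tools.media = next.tools.media || {};
  const media = next.tools.media;
  // Assumption: if the audio block is missing, create it with audio enabled.
  media.audio = media.audio || { enabled: true };

  const sameCommand = (m) => m.type === 'cli' && m.command === cliModel.command;
  for (const holder of [media, media.audio]) {
    holder.models = holder.models || [];
    if (!holder.models.some(sameCommand)) {
      holder.models.push({ ...cliModel });
    }
  }
  return next;
}
```

The function leaves the input untouched, so you can compare the before and after objects before writing anything back to disk.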
Problem 4: Azure Speech is not Azure AI Foundry
| Symptom | Fix |
|---|---|
| Calling the Speech API with an AI Foundry key returns 401 | MAI-Transcribe-1 requires an Azure Speech resource with kind: SpeechServices, which is a separate service with its own key and endpoint |
When these failures show up in production, trace and diagnose transcription failures with Azure AI Foundry to inspect the full request payload and the Speech API response.
Voice Format Matrix by Platform
| Platform | Format | Needs ffmpeg | Storage path |
|---|---|---|---|
| Telegram | OGG Opus (.oga) | No | ~/.openclaw/media/inbound/ |
| WhatsApp | OGG Opus (.ogg) | No | ~/.openclaw/media/inbound/ |
| LINE | M4A AAC (.m4a) | Yes | /tmp/openclaw/line-media-*.m4a |
Cost
| Item | Cost |
|---|---|
| Speech resource (S0) | No monthly base fee |
| MAI-Transcribe-1 | $0.36 per audio hour |
| Typical family usage | About 5 to 10 voice notes per day at roughly 10 seconds each, which lands near 1 hour per month |
| Estimated monthly total | Less than $0.36 |
That is roughly on par with OpenAI Whisper pricing, but the operational advantage is that everything stays in Azure with the same Key Vault and deployment story.
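The estimate in the table is straightforward arithmetic. As a worked example, assuming a 30-day month and the $0.36 per audio hour rate quoted above (function names here are illustrative):

```javascript
const RATE_USD_PER_AUDIO_HOUR = 0.36; // rate quoted in this post

// Total audio hours per month for a given number of notes per day.
function monthlyAudioHours(notesPerDay, secondsPerNote, daysPerMonth = 30) {
  return (notesPerDay * secondsPerNote * daysPerMonth) / 3600;
}

function monthlyCostUsd(notesPerDay, secondsPerNote, daysPerMonth = 30) {
  return monthlyAudioHours(notesPerDay, secondsPerNote, daysPerMonth) * RATE_USD_PER_AUDIO_HOUR;
}

// Upper end of the usage estimate: 10 notes per day at 10 seconds each.
console.log(monthlyAudioHours(10, 10).toFixed(2)); // "0.83"
console.log(monthlyCostUsd(10, 10).toFixed(2));    // "0.30"
```

At the upper end of the estimate, 10 notes of 10 seconds over 30 days is 3,000 seconds, or about 0.83 hours, which costs roughly $0.30 and stays under the $0.36 figure in the table.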
Why This Shape Works
The integration comes down to four moving parts:
- Use Bicep to deploy Azure Speech and store the key in Key Vault.
- Use a CLI script to call the REST API and normalize formats with ffmpeg.
- Use a plugin to register the tool and teach the agent to transcribe before reasoning.
- Use two configuration paths: the automatic media pipeline for Telegram and WhatsApp, and the explicit tool path for LINE.
The biggest lesson was that voice is not one feature. Each channel packages and routes it differently. LINE does not enter the same pipeline as Telegram and WhatsApp, and its M4A payloads are not always accepted as-is. Once those two differences were handled directly, the rest of the integration became much more predictable.