Teaching OpenClaw to Understand Voice with MAI-Transcribe-1

This post explains how to add Azure Speech transcription to OpenClaw with Microsoft's MAI-Transcribe-1. The goal is simple: turn voice notes from Telegram, LINE, and WhatsApp into normal text input early enough that your existing prompts, tools, and plugins can keep working without a separate voice-only stack.

What This Adds to OpenClaw

A user can send a voice message in any supported channel, and OpenClaw will transcribe it before generating a reply.
The system can auto-detect across 25 languages, including Chinese, English, Japanese, and Korean.
Voice commands such as "Remind me in 10 minutes" can trigger existing tools directly.
Telegram, LINE, and WhatsApp voice formats are all supported.
Speech recognition stays inexpensive at about $0.36 USD per hour of audio.

Prerequisites

You already have OpenClaw running on an Azure VM. If not, start with Building a Family AI Chat Bot on Azure with OpenClaw.
Azure AI Foundry is already deployed.
Your Bicep infrastructure can deploy successfully today.
ffmpeg is installed on the VM. You will use it to normalize audio formats before upload.

Start with OpenClaw Overview if you want the broader plugin model and article map before adding speech.
Read Building a Family AI Chat Bot on Azure with OpenClaw if you still need the Azure VM, Key Vault, and Foundry base environment.
Read Teaching OpenClaw to Draw with MAI-Image-2 on Telegram, LINE, and WhatsApp if you want a parallel media workflow for image generation and delivery.
Read Building a Natural-Language Reminder & Scheduled Task System for OpenClaw if you want voice commands to trigger something concrete after transcription.

Open-Source Plugin Repository

The plugin code in this article is now available publicly at weijen/openclaw-mai-transcribe-plugin.

Use that repository if you want the standalone OpenClaw plugin runtime, tests, and a minimal configuration example without copying code blocks out of this post. This article still covers the larger production setup around the plugin, including Azure Speech provisioning, VM-side integration, channel-specific audio handling, and how the plugin fits into a real OpenClaw deployment.

Where Voice Messages Go

Telegram / LINE / WhatsApp
        | (voice message)
        v
   OpenClaw Gateway
        |
   +----+-----------------------+
   | Telegram / WhatsApp        | LINE
   | tools.media.audio          | media attachment -> agent
   | CLI pipeline               | -> mai_transcribe tool
   +----+-----------------------+
        |
        v
   mai-transcribe.sh (CLI)
        | ffmpeg conversion (M4A -> WAV)
        v
   Azure Speech Service (East US)
   MAI-Transcribe-1
        |
        v
   Transcript text -> agent reply

Voice messages take two different routes depending on the channel:

Telegram and WhatsApp: OpenClaw's tools.media.audio pipeline calls the CLI script automatically.
LINE: the voice message arrives as a media attachment, so the plugin prompt instructs the agent to call mai_transcribe explicitly.
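
The split above can be written down as a small lookup. This is only an illustrative sketch: the channel keys and the route labels `media-pipeline` and `agent-tool` are hypothetical names for this example, not identifiers that OpenClaw defines.

```javascript
// Hypothetical route labels, used only to illustrate the two paths.
const ROUTES = new Map([
  ['telegram', 'media-pipeline'], // tools.media.audio calls the CLI script
  ['whatsapp', 'media-pipeline'],
  ['line', 'agent-tool'],         // the agent calls mai_transcribe itself
]);

// Return the transcription route for a channel name (case-insensitive).
function routeForChannel(channel) {
  const route = ROUTES.get(String(channel).toLowerCase());
  if (!route) {
    throw new Error(`Unsupported channel: ${channel}`);
  }
  return route;
}

console.log(routeForChannel('Telegram'));
console.log(routeForChannel('LINE'));
```

Keeping the mapping explicit makes it clear that adding a new channel means deciding which of the two routes its voice messages take.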

Step 1: Deploy Azure Speech

MAI-Transcribe-1 runs on Azure Speech Service, not Azure AI Foundry. The model is only available in East US right now.

1.1 Add a Bicep module

Create infra/bicep/modules/speech.bicep:

param location string = 'eastus'
param prefix string = 'oc-family'
param skuName string = 'S0'

var unique = uniqueString(resourceGroup().id)

resource speech 'Microsoft.CognitiveServices/accounts@2024-10-01' = {
  name: '${prefix}-speech-${unique}'
  location: location
  kind: 'SpeechServices'
  sku: { name: skuName }
  properties: {
    customSubDomainName: '${prefix}-speech-${unique}'
    publicNetworkAccess: 'Enabled'
    disableLocalAuth: false
    networkAcls: { defaultAction: 'Allow' }
  }
}

output speechEndpoint string = speech.properties.endpoint
output speechResourceName string = speech.name
output speechRegion string = location

#disable-next-line outputs-should-not-contain-secrets
output speechKey string = speech.listKeys().key1

1.2 Wire it into `main.bicep`

Add parameters and the new module:

param enableSpeech bool = true
param speechLocation string = 'eastus'

module speech './modules/speech.bicep' = if (enableSpeech) {
  name: 'speech'
  params: {
    location: speechLocation
    prefix: prefix
  }
}

Then store the Speech key in Key Vault:

@secure()
param speechApiKey string = ''

resource speechApiKeySecret 'Microsoft.KeyVault/vaults/secrets@2023-07-01' = if (!empty(speechApiKey)) {
  parent: kv
  name: 'speech-api-key'
  properties: { value: speechApiKey }
}

Pass it through in the main.bicep Key Vault module call:

speechApiKey: enableSpeech ? speech.outputs.speechKey : ''

1.3 Deploy and verify

az deployment group create \
  --resource-group oc-family-rg \
  --template-file infra/bicep/main.bicep \
  --parameters infra/bicep/params/prod.bicepparam

Verify the resource:

az cognitiveservices account list -g oc-family-rg \
  --query "[?kind=='SpeechServices'].{name:name, location:location}" -o table

You should see oc-family-speech-xxxxx | eastus in the output.

Step 2: Install `ffmpeg`

LINE voice messages arrive as M4A with AAC encoding. In practice, MAI-Transcribe-1 can reject some LINE-produced M4A files with HTTP 422 even though M4A is listed as supported. Converting those files to WAV first avoids the format mismatch.

ssh weijen@family-claw.multiagentai.co 'sudo apt-get install -y ffmpeg'

[!NOTE] Telegram and WhatsApp use OGG Opus, which typically works without conversion. Even so, normalizing all non-WAV, non-MP3, and non-FLAC inputs through ffmpeg is the safer operational choice.
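
The normalization rule in the note can be sketched in Node.js as follows. The helper names `needsConversion` and `ffmpegArgs` are hypothetical, and the 16 kHz mono WAV settings are the ones this article uses for its ffmpeg conversion.

```javascript
const path = require('path');

// Extensions that are uploaded as-is; everything else goes through ffmpeg.
const PASS_THROUGH = new Set(['.wav', '.mp3', '.flac']);

// True when the file should be converted to WAV before upload.
function needsConversion(filePath) {
  return !PASS_THROUGH.has(path.extname(filePath).toLowerCase());
}

// Argument list for a 16 kHz, mono WAV conversion.
function ffmpegArgs(inputPath, outputPath) {
  return ['-y', '-i', inputPath, '-ar', '16000', '-ac', '1', '-f', 'wav', outputPath];
}

console.log(needsConversion('/tmp/openclaw/line-media-1.m4a'));
console.log(needsConversion('voice.WAV'));
```

The argument array is meant to be passed to something like `child_process.execFile('ffmpeg', args)`, which avoids shell quoting problems with unusual file names.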

Step 3: Build the CLI transcription script

This script is the core of the speech-to-text flow. OpenClaw calls it whenever it needs to transcribe an audio file.

Create scripts/vm/mai-transcribe.sh:

#!/usr/bin/env bash
set -euo pipefail

AUDIO_PATH="${1:?Usage: mai-transcribe.sh <audio-file>}"
REGION="${SPEECH_REGION:-eastus}"
API_VERSION="2025-10-15"

# Read the API key from OpenClaw config if it is not already in the environment.
if [ -z "${SPEECH_API_KEY:-}" ]; then
  SPEECH_API_KEY=$(python3 -c "
import json
cfg = json.load(open('$HOME/.openclaw/openclaw.json'))
mt = cfg.get('plugins', {}).get('entries', {}).get('mai-transcribe', {}).get('config', {})
print(mt.get('apiKey', ''))
" 2>/dev/null || true)
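fi

# Fail early with a clear message instead of sending an empty key to the API.
if [ -z "${SPEECH_API_KEY:-}" ]; then
  echo "ERROR: SPEECH_API_KEY is not set and no apiKey was found in openclaw.json" >&2
  exit 1
fi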

# Convert non-WAV/MP3/FLAC inputs to WAV before upload.
UPLOAD_PATH="${AUDIO_PATH}"
CLEANUP_TMP=""
EXT_LOWER=$(echo "${AUDIO_PATH##*.}" | tr '[:upper:]' '[:lower:]')

if [[ "$EXT_LOWER" != "wav" && "$EXT_LOWER" != "mp3" && "$EXT_LOWER" != "flac" ]]; then
  if command -v ffmpeg &>/dev/null; then
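    TMP_WAV=$(mktemp /tmp/mai-transcribe-XXXXXX.wav)
    # Remove the temporary WAV even if ffmpeg or curl fails under `set -e`.
    trap 'rm -f "${TMP_WAV}"' EXIT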
    ffmpeg -y -i "${AUDIO_PATH}" -ar 16000 -ac 1 -f wav "${TMP_WAV}" </dev/null 2>/dev/null
    UPLOAD_PATH="${TMP_WAV}"
    CLEANUP_TMP="${TMP_WAV}"
  fi
fi

ENDPOINT="https://${REGION}.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?api-version=${API_VERSION}"

RESPONSE=$(curl -s -w "\n%{http_code}" -X POST "${ENDPOINT}" \
  -H "Content-Type: multipart/form-data" \
  -H "Ocp-Apim-Subscription-Key: ${SPEECH_API_KEY}" \
  --form "audio=@\"${UPLOAD_PATH}\"" \
  --form 'definition={"enhancedMode":{"enabled":true,"model":"mai-transcribe-1"}}')

[ -n "${CLEANUP_TMP}" ] && rm -f "${CLEANUP_TMP}"

HTTP_CODE=$(echo "${RESPONSE}" | tail -1)
BODY=$(echo "${RESPONSE}" | sed '$d')

if [ "${HTTP_CODE}" -ne 200 ]; then
  echo "ERROR: HTTP ${HTTP_CODE}: ${BODY}" >&2
  exit 1
fi

echo "${BODY}" | python3 -c "
import sys, json
data = json.load(sys.stdin)
phrases = data.get('combinedPhrases', [])
if phrases:
    print(phrases[0].get('text', ''))
else:
    texts = [p.get('text', '') for p in data.get('phrases', [])]
    print(' '.join(texts))
"

Deploy it to the VM:

scp scripts/vm/mai-transcribe.sh weijen@family-claw.multiagentai.co:~/.openclaw/scripts/
ssh weijen@family-claw.multiagentai.co 'chmod +x ~/.openclaw/scripts/mai-transcribe.sh'

Test it manually:

ssh weijen@family-claw.multiagentai.co \
  '~/.openclaw/scripts/mai-transcribe.sh /tmp/openclaw/line-media-*.m4a'

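If everything is wired correctly, the command prints the recognized transcript. The script reads only its first argument, so if the glob matches several files, only the first one is transcribed.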
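
The last step of the script, pulling the transcript out of the response JSON, can also be sketched in JavaScript. The function name `extractTranscript` is hypothetical, and the sample object below is made-up data that only mirrors the two fields the script reads (`combinedPhrases` and `phrases`).

```javascript
// Prefer combinedPhrases[0].text; fall back to joining the individual phrases.
function extractTranscript(data) {
  const combined = Array.isArray(data.combinedPhrases) ? data.combinedPhrases : [];
  if (combined.length > 0) {
    return combined[0].text || '';
  }
  const phrases = Array.isArray(data.phrases) ? data.phrases : [];
  return phrases.map((p) => p.text || '').join(' ');
}

// Illustrative sample, not a captured API response.
const sample = {
  combinedPhrases: [{ text: 'Remind me in 10 minutes' }],
  phrases: [{ text: 'Remind me' }, { text: 'in 10 minutes' }],
};

console.log(extractTranscript(sample));
```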

Step 4: Build the OpenClaw plugin

The plugin has two jobs:

Register the mai_transcribe tool so the agent can call it deliberately.
Use before_prompt_build to tell the agent that voice attachments must be transcribed before reasoning.

If you want to start from the published code instead of rebuilding the files by hand, clone weijen/openclaw-mai-transcribe-plugin and adapt the configuration to your OpenClaw host.

4.1 Suggested plugin structure

extensions/mai-transcribe/
├── index.js
├── lib/api.js
├── openclaw.plugin.json
├── package.json
└── test/api.test.js

4.2 API client in `lib/api.js`

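Use the same low-level multipart technique as the MAI-Image-2 integration, built on the Node.js https module:

const https = require('https');

// Upload one audio buffer and resolve with the transcript text.
function transcribeAudio({ region, apiVersion, model, apiKey, audioBuffer, mimeType, fileName }) {
  return new Promise((resolve, reject) => {
    const boundary = `----FormBoundary${Date.now()}`;
    const definition = JSON.stringify({ enhancedMode: { enabled: true, model } });

    const defPart = `--${boundary}\r\n` +
      `Content-Disposition: form-data; name="definition"\r\n` +
      `Content-Type: application/json\r\n\r\n` + definition + `\r\n`;
    const audioHeader = `--${boundary}\r\n` +
      `Content-Disposition: form-data; name="audio"; filename="${fileName}"\r\n` +
      `Content-Type: ${mimeType}\r\n\r\n`;
    const audioFooter = `\r\n--${boundary}--\r\n`;

    const body = Buffer.concat([
      Buffer.from(defPart + audioHeader), audioBuffer, Buffer.from(audioFooter)
    ]);

    const req = https.request({
      hostname: `${region}.api.cognitive.microsoft.com`,
      path: `/speechtotext/transcriptions:transcribe?api-version=${apiVersion}`,
      method: 'POST',
      headers: {
        'Content-Type': `multipart/form-data; boundary=${boundary}`,
        'Ocp-Apim-Subscription-Key': apiKey,
        'Content-Length': body.length,
      },
    }, (res) => {
      const chunks = [];
      res.on('data', (chunk) => chunks.push(chunk));
      res.on('end', () => {
        const text = Buffer.concat(chunks).toString('utf8');
        if (res.statusCode !== 200) {
          reject(new Error(`HTTP ${res.statusCode}: ${text}`));
          return;
        }
        try {
          // Prefer combinedPhrases[0].text; fall back to joining phrases, like the CLI script.
          const data = JSON.parse(text);
          const combined = data.combinedPhrases || [];
          resolve(combined.length > 0
            ? (combined[0].text || '')
            : (data.phrases || []).map((p) => p.text || '').join(' '));
        } catch (err) {
          reject(err);
        }
      });
    });
    req.on('error', reject); // Network failures reject instead of leaving the promise pending.
    req.write(body);
    req.end();
  });
}

module.exports = { transcribeAudio };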
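
The client takes `mimeType` and `fileName` as inputs, so the caller has to derive them from the file on disk. One possible way is sketched below. The helper name `describeUpload` is hypothetical, the MIME types are the standard ones for these containers, and whether the Speech API actually inspects the part's content type is an assumption this article does not confirm.

```javascript
const path = require('path');

// Standard MIME types for the audio formats named in this article.
const MIME_BY_EXT = new Map([
  ['.wav', 'audio/wav'],
  ['.mp3', 'audio/mpeg'],
  ['.flac', 'audio/flac'],
  ['.ogg', 'audio/ogg'],
  ['.oga', 'audio/ogg'],
  ['.m4a', 'audio/mp4'],
]);

// Build the fileName and mimeType values for the multipart audio part.
function describeUpload(filePath) {
  const ext = path.extname(filePath).toLowerCase();
  return {
    // Replace characters that would break the quoted filename in Content-Disposition.
    fileName: path.basename(filePath).replace(/["\r\n]/g, '_'),
    mimeType: MIME_BY_EXT.get(ext) || 'application/octet-stream',
  };
}

console.log(describeUpload('/tmp/openclaw/line-media-1.m4a'));
```

Sanitizing the file name matters here because the client interpolates it directly into a header line.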

4.3 Plugin entry in `index.js`

function register(api) {
  const cfg = Object.assign(
    { region: 'eastus', model: 'mai-transcribe-1', maxFileSize: 26214400 },
    api.pluginConfig || {},
  );

  function resolveApiKey() {
    if (cfg.apiKey) return cfg.apiKey;
    if (api.resolveSecret) {
      const secret = api.resolveSecret('speech-api-key');
      if (secret) return secret;
    }
    return process.env.SPEECH_API_KEY || '';
  }

  api.registerTool({
    name: 'mai_transcribe',
    description: 'Transcribe an audio file into text with MAI-Transcribe-1. Supports 25 languages.',
    parameters: {
      type: 'object',
      required: ['filePath'],
      properties: {
        filePath: { type: 'string', description: 'Path to the audio file' },
      },
    },
    execute: async (_toolCallId, params) => {
      const apiKey = resolveApiKey();
      // Read the file, convert with ffmpeg when needed, call the API, and return the transcript.
    },
  });

  api.on('before_prompt_build', () => ({
    appendSystemContext:
      'You have a mai_transcribe tool that transcribes audio files to text. ' +
      'When you receive a voice message or audio file attachment, ' +
      'ALWAYS use the mai_transcribe tool to transcribe it first. ' +
      'Never tell the user you cannot read audio files. Use the tool instead.',
  }), { priority: 20 });
}

[!IMPORTANT] before_prompt_build matters because LINE sends voice clips as media attachments, not through OpenClaw's automatic audio pipeline. Without the prompt injection, the model can respond with "I can't read this audio format" instead of calling the tool.
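
Before the tool reads a file and calls the API, it is worth validating the path against the `maxFileSize` setting. The following is a minimal sketch of such a guard, assuming a hypothetical helper named `checkAudioFile`; it is not taken from the published plugin, and the shape of its return value is an illustration only.

```javascript
const fs = require('fs');

// 25 MiB, the same default as maxFileSize in the plugin config.
const DEFAULT_MAX_BYTES = 26214400;

// Return { ok: true, size } or { ok: false, reason } without throwing.
function checkAudioFile(filePath, maxBytes = DEFAULT_MAX_BYTES) {
  let stat;
  try {
    stat = fs.statSync(filePath);
  } catch (err) {
    return { ok: false, reason: `cannot read ${filePath}: ${err.code || err.message}` };
  }
  if (!stat.isFile()) {
    return { ok: false, reason: `${filePath} is not a regular file` };
  }
  if (stat.size === 0) {
    return { ok: false, reason: `${filePath} is empty` };
  }
  if (stat.size > maxBytes) {
    return { ok: false, reason: `${filePath} is ${stat.size} bytes, limit is ${maxBytes}` };
  }
  return { ok: true, size: stat.size };
}

console.log(checkAudioFile('/tmp/openclaw/does-not-exist.m4a').ok);
```

Returning a reason string instead of throwing lets the tool hand the agent a readable explanation when a clip is missing or too large.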

Step 5: Update OpenClaw configuration

Add two configuration blocks to openclaw.json.

5.1 Automatic transcription pipeline for Telegram and WhatsApp

{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "maxBytes": 26214400
      },
      "models": [
        {
          "type": "cli",
          "command": "/home/weijen/.openclaw/scripts/mai-transcribe.sh",
          "args": ["{{MediaPath}}"],
          "timeoutSeconds": 60,
          "capabilities": ["audio"]
        }
      ]
    }
  }
}

5.2 Plugin configuration

{
  "plugins": {
    "allow": ["...", "mai-transcribe"],
    "entries": {
      "mai-transcribe": {
        "enabled": true,
        "config": {
          "region": "eastus",
          "apiKey": "__KEYVAULT__:speech-api-key"
        }
      }
    },
    "load": {
      "paths": [
        "...",
        "/home/weijen/.openclaw/extensions/mai-transcribe"
      ]
    }
  }
}
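
A quick sanity check over the merged configuration can catch the most common wiring mistakes before a restart. The sketch below only inspects the keys shown in the two blocks above. The function name `checkVoiceConfig` is hypothetical, and it accepts a CLI model under either `tools.media.models` or `tools.media.audio.models`, since builds may differ on where they look.

```javascript
// Return a list of human-readable problems; an empty list means the keys look right.
function checkVoiceConfig(cfg) {
  const problems = [];
  const media = (cfg.tools && cfg.tools.media) || {};
  const audio = media.audio || {};

  if (audio.enabled !== true) {
    problems.push('tools.media.audio.enabled is not true');
  }

  // Accept the CLI model in either location.
  const models = [...(media.models || []), ...(audio.models || [])];
  const hasCliAudioModel = models.some(
    (m) => m.type === 'cli' && Array.isArray(m.capabilities) && m.capabilities.includes('audio'),
  );
  if (!hasCliAudioModel) {
    problems.push('no CLI model with the "audio" capability is configured');
  }

  const plugins = cfg.plugins || {};
  if (!(plugins.allow || []).includes('mai-transcribe')) {
    problems.push('plugins.allow does not include "mai-transcribe"');
  }
  const entry = (plugins.entries || {})['mai-transcribe'];
  if (!entry || entry.enabled !== true) {
    problems.push('plugins.entries["mai-transcribe"] is missing or disabled');
  }
  return problems;
}

console.log(checkVoiceConfig({}));
```

To run it against a real host, parse `openclaw.json` with `JSON.parse` and pass the result in.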

Step 6: One-command installation

If you are using the project CLI, the plugin can be installed in one command:

oc install-mai-transcribe-plugin family-claw.multiagentai.co

That command handles the following work automatically:

Copies the plugin to the VM with SCP.
Deploys the CLI script.
Merges the required configuration.
Restarts the gateway.
Verifies the service health.

Step 7: Test the integration

Automated checks

# Gateway health
oc smoke-test family-claw.multiagentai.co

# Confirm the plugin was loaded
ssh weijen@family-claw.multiagentai.co \
  'journalctl --user -u openclaw-gateway.service --since "1 minute ago" | grep transcri'

You should see a log line similar to mai-transcribe plugin ready: region=eastus, model=mai-transcribe-1.

Manual checks

Test	Action	Expected result
LINE Chinese voice	Send a Chinese voice message	Bot replies based on the transcript
Telegram English voice	Send an English voice message	Bot replies in English
Voice reminder	Say "Remind me in 10 minutes"	Bot creates a reminder
Voice search	Say "Check today's weather in Taipei"	Bot searches and replies

What Broke in Practice

Problem 1: LINE M4A files were rejected by the API

Symptom	Fix
HTTP 422 `InvalidAudioFormat` even though the docs say M4A is supported	Convert LINE M4A files to WAV first with `ffmpeg -ar 16000 -ac 1 -f wav`

Telegram and WhatsApp OGG Opus files worked directly without conversion, although the Step 3 script still converts them to WAV as a precaution.

Problem 2: LINE did not use the automatic STT pipeline

Symptom	Fix
The model says it cannot read the audio attachment	LINE voice arrives as a media attachment, not through `tools.media.audio`, so `before_prompt_build` must explicitly tell the agent to call `mai_transcribe`

Problem 3: `tools.media.audio` was enabled but no transcription happened

Symptom	Fix
`enabled: true` is set but the CLI never runs	Configure a CLI model in both `tools.media.models` and `tools.media.audio.models` if your OpenClaw build expects both locations

Problem 4: Azure Speech is not Azure AI Foundry

Symptom	Fix
Calling the Speech API with an AI Foundry key returns 401	MAI-Transcribe-1 requires an Azure Speech resource with `kind: SpeechServices`, which is a separate service with its own key and endpoint

When these failures show up in production, trace them with Azure AI Foundry so you can inspect the full request payload and the Speech API response.
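
The two HTTP-level failures above map cleanly from status code to fix, so they can be encoded as a small triage helper. This is a sketch built only from the symptoms listed in this section. The function name `suggestFix` is hypothetical, and any status code not covered here falls through to a generic message.

```javascript
// Map a Speech API failure to the fix described in this section.
function suggestFix(statusCode, body = '') {
  if (statusCode === 200) {
    return null; // Nothing to fix.
  }
  if (statusCode === 401) {
    return 'Use the key from the Azure Speech resource (kind: SpeechServices), not an AI Foundry key.';
  }
  if (statusCode === 422 && String(body).includes('InvalidAudioFormat')) {
    return 'Convert the file to WAV first: ffmpeg -ar 16000 -ac 1 -f wav.';
  }
  return `Unrecognized failure (HTTP ${statusCode}); inspect the response body.`;
}

console.log(suggestFix(422, '{"code":"InvalidAudioFormat"}'));
console.log(suggestFix(401));
```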

Voice Format Matrix by Platform

Platform	Format	Needs `ffmpeg`	Storage path
Telegram	OGG Opus (`.oga`)	No	`~/.openclaw/media/inbound/`
WhatsApp	OGG Opus (`.ogg`)	No	`~/.openclaw/media/inbound/`
LINE	M4A AAC (`.m4a`)	Yes	`/tmp/openclaw/line-media-*.m4a`

Cost

Item	Cost
Speech resource (`S0`)	No monthly base fee
MAI-Transcribe-1	$0.36 per audio hour
Typical family usage	About 5 to 10 voice notes per day at roughly 10 seconds each, which is about 25 to 50 minutes of audio per month
Estimated monthly total	About $0.15 to $0.30

That is roughly on par with OpenAI Whisper pricing, but the operational advantage is that everything stays in Azure with the same Key Vault and deployment story.
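
The monthly estimate is simple arithmetic, sketched here so it can be rerun with different usage. The $0.36 per audio hour figure comes from the table above; the 30-day month and the function name `monthlyCost` are assumptions made for this example.

```javascript
const PRICE_PER_AUDIO_HOUR_USD = 0.36;

// Estimate monthly audio hours and cost from daily voice-note usage.
function monthlyCost({ notesPerDay, secondsPerNote, daysPerMonth = 30 }) {
  const hours = (notesPerDay * secondsPerNote * daysPerMonth) / 3600;
  return { hours, usd: hours * PRICE_PER_AUDIO_HOUR_USD };
}

const low = monthlyCost({ notesPerDay: 5, secondsPerNote: 10 });
const high = monthlyCost({ notesPerDay: 10, secondsPerNote: 10 });

console.log(low.usd.toFixed(2));  // prints "0.15"
console.log(high.usd.toFixed(2)); // prints "0.30"
```

At 10 seconds per note, even the upper end of this usage stays under one audio hour per month.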

Why This Shape Works

The integration comes down to four moving parts:

Use Bicep to deploy Azure Speech and store the key in Key Vault.
Use a CLI script to call the REST API and normalize formats with ffmpeg.
Use a plugin to register the tool and teach the agent to transcribe before reasoning.
Use two configuration paths: the automatic media pipeline for Telegram and WhatsApp, and the explicit tool path for LINE.

The biggest lesson was that voice is not one feature. Each channel packages and routes it differently. LINE does not enter the same pipeline as Telegram and WhatsApp, and its M4A payloads are not always accepted as-is. Once those two differences were handled directly, the rest of the integration became much more predictable.