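Teaching OpenClaw to Understand Voice Messages with MAI-Transcribe-1

This article walks you through wiring Azure Speech's MAI-Transcribe-1 into OpenClaw so that voice messages from Telegram, LINE, and WhatsApp are converted to plain text before they enter the agent flow. Your existing prompts, tools, and plugins then do not need a parallel set of logic just for voice.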

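What you get

Users can send a voice message on any connected channel; the AI transcribes it first, then replies as it would to a normal text message.
Automatic language detection across 25 languages, including Chinese, English, Japanese, and Korean.
Voice commands can trigger tools directly. Saying "remind me about the meeting in ten minutes" creates a reminder.
Voice formats from Telegram, LINE, and WhatsApp are all handled.
Speech recognition is cheap: about $0.36 USD per hour of audio.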

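Prerequisites

You already have an Azure VM running OpenClaw (see the infrastructure article first if you do not).
Azure AI Foundry is already deployed.
Your Bicep IaC currently deploys cleanly.
ffmpeg is installed on the VM (Step 2 covers the install); it is used later for audio format compatibility.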

Open Source Plugin Repo

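The plugin code discussed in this article is now public at weijen/openclaw-mai-transcribe-plugin.

If you just want a working OpenClaw plugin runtime, tests, and a minimal configuration example, start from that repo instead of copying code out of this article by hand. This article keeps the fuller production context: deploying the Azure Speech resource, VM-side integration, how audio is handled for each chat channel, and the details to watch for once the plugin sits inside a real OpenClaw architecture.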

How a voice message flows

Telegram / LINE / WhatsApp
        | (voice message)
        v
   OpenClaw Gateway
        |
   +----+-----------------------+
   | Telegram / WhatsApp        | LINE
   | tools.media.audio          | media attachment -> agent
   | CLI pipeline               | -> mai_transcribe tool
   +----+-----------------------+
        |
        v
   mai-transcribe.sh (CLI)
        | ffmpeg conversion (M4A -> WAV)
        v
   Azure Speech Service (East US)
   MAI-Transcribe-1
        |
        v
   Transcript -> agent reply

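Voice messages take one of two paths, depending on the channel:

Telegram and WhatsApp: OpenClaw's tools.media.audio pipeline calls the CLI script automatically to transcribe.
LINE: the voice message reaches the agent as a media attachment, so the plugin's prompt has to steer the agent into calling mai_transcribe itself.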
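
The split between the two paths can be summarized in a few lines. This is only an illustrative sketch: `routeVoiceMessage`, the lowercase channel ids, and the returned labels are hypothetical names for this article, not OpenClaw identifiers.

```javascript
// Hypothetical routing table mirroring the two paths described above.
// Channel ids and labels are illustrative, not part of OpenClaw's API.
const AUTO_PIPELINE_CHANNELS = new Set(['telegram', 'whatsapp']);

function routeVoiceMessage(channel) {
  const id = String(channel).toLowerCase();
  if (AUTO_PIPELINE_CHANNELS.has(id)) {
    // The gateway transcribes before the agent ever sees the message.
    return { path: 'tools.media.audio', transcribedBy: 'mai-transcribe.sh' };
  }
  if (id === 'line') {
    // The agent receives an attachment and must call the tool itself.
    return { path: 'media-attachment', transcribedBy: 'mai_transcribe' };
  }
  throw new Error(`Unknown channel: ${channel}`);
}
```

Keeping the distinction explicit like this makes it easier to remember that LINE depends on the plugin prompt, while the other two channels depend on the CLI configuration.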

Step 1: Deploy the Azure Speech resource

MAI-Transcribe-1 runs on Azure Speech Service, not Azure AI Foundry. At the time of writing, the available region is East US.

1.1 Add a Bicep module

Create infra/bicep/modules/speech.bicep:

param location string = 'eastus'
param prefix string = 'oc-family'
param skuName string = 'S0'

var unique = uniqueString(resourceGroup().id)

resource speech 'Microsoft.CognitiveServices/accounts@2024-10-01' = {
  name: '${prefix}-speech-${unique}'
  location: location
  kind: 'SpeechServices'
  sku: { name: skuName }
  properties: {
    customSubDomainName: '${prefix}-speech-${unique}'
    publicNetworkAccess: 'Enabled'
    disableLocalAuth: false
    networkAcls: { defaultAction: 'Allow' }
  }
}

output speechEndpoint string = speech.properties.endpoint
output speechResourceName string = speech.name
output speechRegion string = location

#disable-next-line outputs-should-not-contain-secrets
output speechKey string = speech.listKeys().key1

1.2 Wire it into `main.bicep`

Add the parameters and the module:

param enableSpeech bool = true
param speechLocation string = 'eastus'

module speech './modules/speech.bicep' = if (enableSpeech) {
  name: 'speech'
  params: { location: speechLocation, prefix: prefix }
}

Then store the Speech key in Key Vault:

@secure()
param speechApiKey string = ''

resource speechApiKeySecret 'Microsoft.KeyVault/vaults/secrets@2023-07-01' = if (!empty(speechApiKey)) {
  parent: kv
  name: 'speech-api-key'
  properties: { value: speechApiKey }
}

And pass it in from the Key Vault module call in main.bicep:

speechApiKey: enableSpeech ? speech.outputs.speechKey : ''

1.3 Deploy and verify

az deployment group create \
  --resource-group oc-family-rg \
  --template-file infra/bicep/main.bicep \
  --parameters infra/bicep/params/prod.bicepparam

Verify that the resource was created:

az cognitiveservices account list -g oc-family-rg \
  --query "[?kind=='SpeechServices'].{name:name, location:location}" -o table

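You should see oc-family-speech-xxxxx | eastus.

Step 2: Install `ffmpeg`

LINE voice messages are usually M4A with AAC encoding. In practice, MAI-Transcribe-1 may return HTTP 422 for some LINE-produced M4A files even though the documentation lists M4A as supported. Converting to WAV first is much more reliable.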

ssh weijen@family-claw.multiagentai.co 'sudo apt-get install -y ffmpeg'

[!NOTE] OGG Opus from Telegram and WhatsApp can usually be processed directly, but it is safer to run every format other than WAV, MP3, and FLAC through ffmpeg first.
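
The conversion rule from the note can be expressed as a small check. The sketch below assumes the same policy (upload WAV, MP3, and FLAC as-is, convert everything else to 16 kHz mono WAV); `needsConversion` and `ffmpegArgs` are hypothetical helper names, and the ffmpeg flags are the ones used elsewhere in this article.

```javascript
const path = require('path');

// Extensions that are uploaded without conversion.
const DIRECT_UPLOAD_EXTENSIONS = new Set(['wav', 'mp3', 'flac']);

// True when the file should go through ffmpeg before upload.
function needsConversion(filePath) {
  const ext = path.extname(filePath).slice(1).toLowerCase();
  return !DIRECT_UPLOAD_EXTENSIONS.has(ext);
}

// Argument list for a 16 kHz mono WAV conversion.
function ffmpegArgs(inputPath, outputPath) {
  return ['-y', '-i', inputPath, '-ar', '16000', '-ac', '1', '-f', 'wav', outputPath];
}
```

Passing the arguments as an array (for example to `child_process.execFile`) avoids shell quoting problems with unusual file names.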

Step 3: Create the CLI transcription script

This script is the core of the speech-to-text flow. OpenClaw calls it whenever it needs audio transcribed.

Create scripts/vm/mai-transcribe.sh:

#!/usr/bin/env bash
set -euo pipefail

AUDIO_PATH="${1:?Usage: mai-transcribe.sh <audio-file>}"
REGION="${SPEECH_REGION:-eastus}"
API_VERSION="2025-10-15"

# If the API key is not in the environment, read it from the OpenClaw config.
if [ -z "${SPEECH_API_KEY:-}" ]; then
  SPEECH_API_KEY=$(python3 -c "
import json
cfg = json.load(open('$HOME/.openclaw/openclaw.json'))
mt = cfg.get('plugins', {}).get('entries', {}).get('mai-transcribe', {}).get('config', {})
print(mt.get('apiKey', ''))
" 2>/dev/null || true)
fi

# Convert anything that is not WAV/MP3/FLAC to 16 kHz mono WAV before upload.
UPLOAD_PATH="${AUDIO_PATH}"
CLEANUP_TMP=""
# Remove the temporary WAV even when ffmpeg or curl fails under `set -e`.
trap '[ -z "${CLEANUP_TMP}" ] || rm -f "${CLEANUP_TMP}"' EXIT
EXT_LOWER=$(echo "${AUDIO_PATH##*.}" | tr '[:upper:]' '[:lower:]')

if [[ "$EXT_LOWER" != "wav" && "$EXT_LOWER" != "mp3" && "$EXT_LOWER" != "flac" ]]; then
  if command -v ffmpeg &>/dev/null; then
    TMP_WAV=$(mktemp /tmp/mai-transcribe-XXXXXX.wav)
    CLEANUP_TMP="${TMP_WAV}"
    ffmpeg -y -i "${AUDIO_PATH}" -ar 16000 -ac 1 -f wav "${TMP_WAV}" </dev/null 2>/dev/null
    UPLOAD_PATH="${TMP_WAV}"
  fi
fi

ENDPOINT="https://${REGION}.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?api-version=${API_VERSION}"

RESPONSE=$(curl -s -w "\n%{http_code}" -X POST "${ENDPOINT}" \
  -H "Content-Type: multipart/form-data" \
  -H "Ocp-Apim-Subscription-Key: ${SPEECH_API_KEY}" \
  --form "audio=@\"${UPLOAD_PATH}\"" \
  --form 'definition={"enhancedMode":{"enabled":true,"model":"mai-transcribe-1"}}')

[ -n "${CLEANUP_TMP}" ] && rm -f "${CLEANUP_TMP}"

HTTP_CODE=$(echo "${RESPONSE}" | tail -1)
BODY=$(echo "${RESPONSE}" | sed '$d')

if [ "${HTTP_CODE}" -ne 200 ]; then
  echo "ERROR: HTTP ${HTTP_CODE}: ${BODY}" >&2
  exit 1
fi

echo "${BODY}" | python3 -c "
import sys, json
data = json.load(sys.stdin)
phrases = data.get('combinedPhrases', [])
if phrases:
    print(phrases[0].get('text', ''))
else:
    texts = [p.get('text', '') for p in data.get('phrases', [])]
    print(' '.join(texts))
"

Deploy it to the VM:

scp scripts/vm/mai-transcribe.sh weijen@family-claw.multiagentai.co:~/.openclaw/scripts/
ssh weijen@family-claw.multiagentai.co 'chmod +x ~/.openclaw/scripts/mai-transcribe.sh'

Test it manually:

ssh weijen@family-claw.multiagentai.co \
  '~/.openclaw/scripts/mai-transcribe.sh /tmp/openclaw/line-media-*.m4a'

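If everything is wired correctly, this command prints the transcript. The script only reads its first argument, so if the glob matches several files, only the first one is transcribed.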
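
The last step of mai-transcribe.sh picks the transcript out of the JSON response. The same logic in JavaScript looks roughly like this. It relies only on the two fields the script already reads (`combinedPhrases` and `phrases`); `extractTranscript` is a hypothetical helper name.

```javascript
// Mirror of the Python snippet at the end of mai-transcribe.sh:
// prefer combinedPhrases[0].text, otherwise join the individual phrases.
function extractTranscript(data) {
  const combined = Array.isArray(data.combinedPhrases) ? data.combinedPhrases : [];
  if (combined.length > 0) {
    return combined[0].text || '';
  }
  const phrases = Array.isArray(data.phrases) ? data.phrases : [];
  return phrases.map((p) => p.text || '').join(' ');
}
```

Having the fallback in one function keeps the CLI script and the plugin consistent if you decide to share the parsing logic.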

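Step 4: Build the OpenClaw plugin

The plugin has two jobs:

Register the mai_transcribe tool so the agent can call it on its own initiative.
Use before_prompt_build to tell the agent to transcribe voice attachments before reasoning about them.

If you would rather not hand-write the files from scratch, clone weijen/openclaw-mai-transcribe-plugin and adjust the configuration for your OpenClaw host.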

4.1 Suggested plugin layout

extensions/mai-transcribe/
├── index.js
├── lib/api.js
├── openclaw.plugin.json
├── package.json
└── test/api.test.js

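4.2 The API client in `lib/api.js`

You can reuse the approach from the MAI-Image-2 integration and assemble multipart/form-data by hand with Node.js's built-in https module:

const https = require('https');

function transcribeAudio({ region, apiVersion, model, apiKey, audioBuffer, mimeType, fileName }) {
  return new Promise((resolve, reject) => {
    const boundary = `----FormBoundary${Date.now()}`;
    const definition = JSON.stringify({ enhancedMode: { enabled: true, model } });

    // Part 1: the JSON "definition" that selects the model.
    const defPart = `--${boundary}\r\n` +
      `Content-Disposition: form-data; name="definition"\r\n` +
      `Content-Type: application/json\r\n\r\n` + definition + `\r\n`;
    // Part 2: the audio bytes, wrapped by a header and the closing boundary.
    const audioHeader = `--${boundary}\r\n` +
      `Content-Disposition: form-data; name="audio"; filename="${fileName}"\r\n` +
      `Content-Type: ${mimeType}\r\n\r\n`;
    const audioFooter = `\r\n--${boundary}--\r\n`;

    const body = Buffer.concat([
      Buffer.from(defPart + audioHeader), audioBuffer, Buffer.from(audioFooter)
    ]);

    const req = https.request({
      hostname: `${region}.api.cognitive.microsoft.com`,
      path: `/speechtotext/transcriptions:transcribe?api-version=${apiVersion}`,
      method: 'POST',
      headers: {
        'Content-Type': `multipart/form-data; boundary=${boundary}`,
        'Ocp-Apim-Subscription-Key': apiKey,
        'Content-Length': body.length,
      },
    }, (res) => {
      // Collect the response, then parse the JSON and return combinedPhrases[0].text.
      const chunks = [];
      res.on('data', (chunk) => chunks.push(chunk));
      res.on('end', () => {
        const text = Buffer.concat(chunks).toString('utf8');
        if (res.statusCode !== 200) {
          return reject(new Error(`HTTP ${res.statusCode}: ${text}`));
        }
        try {
          const combined = JSON.parse(text).combinedPhrases || [];
          resolve(combined.length > 0 ? combined[0].text || '' : '');
        } catch (err) {
          reject(err);
        }
      });
    });
    req.on('error', reject); // network-level failures
    req.write(body);
    req.end();
  });
}

module.exports = { transcribeAudio };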
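
Because the multipart assembly is pure string and buffer work, it can be pulled into its own function and unit-tested without touching the network, which is what a file like test/api.test.js would want. The sketch below is one way to do that; `buildMultipartBody` is a hypothetical helper, not necessarily how the published plugin is structured.

```javascript
// Build a two-part multipart/form-data body: a JSON "definition" part
// followed by the raw audio bytes. Field names match the request above.
function buildMultipartBody({ boundary, definition, audioBuffer, mimeType, fileName }) {
  const head =
    `--${boundary}\r\n` +
    'Content-Disposition: form-data; name="definition"\r\n' +
    'Content-Type: application/json\r\n\r\n' +
    JSON.stringify(definition) + '\r\n' +
    `--${boundary}\r\n` +
    `Content-Disposition: form-data; name="audio"; filename="${fileName}"\r\n` +
    `Content-Type: ${mimeType}\r\n\r\n`;
  const tail = `\r\n--${boundary}--\r\n`;
  return Buffer.concat([Buffer.from(head, 'utf8'), audioBuffer, Buffer.from(tail, 'utf8')]);
}
```

The audio bytes are concatenated as a Buffer rather than a string so that binary data is not re-encoded on the way out.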

4.3 The plugin entry point in `index.js`

function register(api) {
  const cfg = Object.assign(
    { region: 'eastus', model: 'mai-transcribe-1', maxFileSize: 26214400 },
    api.pluginConfig || {},
  );

  function resolveApiKey() {
    if (cfg.apiKey) return cfg.apiKey;
    if (api.resolveSecret) {
      const secret = api.resolveSecret('speech-api-key');
      if (secret) return secret;
    }
    return process.env.SPEECH_API_KEY || '';
  }

  api.registerTool({
    name: 'mai_transcribe',
    description: 'Transcribe an audio file to text with MAI-Transcribe-1. Supports 25 languages.',
    parameters: {
      type: 'object',
      required: ['filePath'],
      properties: {
        filePath: { type: 'string', description: 'Path to the audio file' },
      },
    },
    execute: async (_toolCallId, params) => {
      const apiKey = resolveApiKey();
      // Read the file, convert with ffmpeg if needed, call the API, return the text.
    },
  });

  api.on('before_prompt_build', () => ({
    appendSystemContext:
      'You have a mai_transcribe tool that transcribes audio files to text. ' +
      'When you receive a voice message or audio file attachment, ' +
      'ALWAYS use the mai_transcribe tool to transcribe it first. ' +
      'Never tell the user you cannot read audio files. Use the tool instead.',
  }), { priority: 20 });
}

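[!IMPORTANT] before_prompt_build matters because LINE voice arrives at the agent as a media attachment and never goes through OpenClaw's automatic audio pipeline. Without this prompt, the model will quite likely just reply "I can't read this audio format."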
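
The execute handler in index.js is left as a comment above. One piece it plausibly needs is a size check against the `maxFileSize` default (26214400 bytes, which is 25 MiB) before reading the file. The following is a minimal sketch under that assumption; `readAudioWithinLimit` is a hypothetical helper and the error messages are illustrative.

```javascript
const fs = require('fs');

// Same value as the maxFileSize default in the plugin config (25 MiB).
const DEFAULT_MAX_FILE_SIZE = 26214400;

// Read an audio file into a Buffer, refusing empty or oversized files
// before any bytes are sent to the Speech API.
function readAudioWithinLimit(filePath, maxFileSize = DEFAULT_MAX_FILE_SIZE) {
  const { size } = fs.statSync(filePath);
  if (size === 0) {
    throw new Error(`Audio file is empty: ${filePath}`);
  }
  if (size > maxFileSize) {
    throw new Error(`Audio file is ${size} bytes, over the ${maxFileSize}-byte limit`);
  }
  return fs.readFileSync(filePath);
}
```

Failing early with a clear message gives the agent something it can relay to the user, instead of an opaque HTTP error from the API.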

Step 5: Update the OpenClaw configuration

Add two blocks to openclaw.json.

5.1 Automatic transcription pipeline for Telegram and WhatsApp

{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "maxBytes": 26214400
      },
      "models": [
        {
          "type": "cli",
          "command": "/home/weijen/.openclaw/scripts/mai-transcribe.sh",
          "args": ["{{MediaPath}}"],
          "timeoutSeconds": 60,
          "capabilities": ["audio"]
        }
      ]
    }
  }
}

5.2 Plugin configuration

{
  "plugins": {
    "allow": ["...", "mai-transcribe"],
    "entries": {
      "mai-transcribe": {
        "enabled": true,
        "config": {
          "region": "eastus",
          "apiKey": "__KEYVAULT__:speech-api-key"
        }
      }
    },
    "load": {
      "paths": [
        "...",
        "/home/weijen/.openclaw/extensions/mai-transcribe"
      ]
    }
  }
}

Step 6: Install everything with one command

If you use the project's built-in CLI, you can run:

oc install-mai-transcribe-plugin family-claw.multiagentai.co

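The command does the following automatically:

Copies the plugin to the VM with SCP.
Deploys the CLI script.
Merges the required configuration.
Restarts the gateway.
Verifies service health.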
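
The "merge the required configuration" step has to be idempotent, otherwise running the installer twice would duplicate entries in `plugins.allow` and `plugins.load.paths`. The sketch below shows one way to merge the section 5.2 block into an existing openclaw.json object. It illustrates the idea only and is not the actual `oc` implementation; `mergePluginConfig` is a hypothetical name.

```javascript
// Merge a plugin registration into an existing config object without
// duplicating list entries. Returns a new object; the input is not mutated.
function mergePluginConfig(existing, pluginId, pluginPath, pluginConfig) {
  const config = JSON.parse(JSON.stringify(existing || {})); // deep copy
  const plugins = (config.plugins = config.plugins || {});

  plugins.allow = Array.from(new Set([...(plugins.allow || []), pluginId]));

  plugins.entries = plugins.entries || {};
  const previous = plugins.entries[pluginId] || {};
  plugins.entries[pluginId] = {
    ...previous,
    enabled: true,
    config: { ...(previous.config || {}), ...pluginConfig },
  };

  plugins.load = plugins.load || {};
  plugins.load.paths = Array.from(new Set([...(plugins.load.paths || []), pluginPath]));
  return config;
}
```

Running it a second time with the same arguments returns an equivalent object, so a re-install does not change the file.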

Step 7: Test the integration

Automated checks

# Gateway health check
oc smoke-test family-claw.multiagentai.co

# Confirm the plugin is loaded
ssh weijen@family-claw.multiagentai.co \
  'journalctl --user -u openclaw-gateway.service --since "1 minute ago" | grep transcri'

You should see a log line similar to mai-transcribe plugin ready: region=eastus, model=mai-transcribe-1.

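Manual tests

Test	Action	Expected
LINE voice in Chinese	Send a voice message in Chinese	Bot replies based on the transcript
Telegram voice in English	Send a voice message in English	Bot replies in English
Voice reminder	Say "remind me about the meeting in ten minutes"	Bot creates a reminder
Voice search	Say "look up today's weather in Taipei"	Bot searches and replies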

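Pitfalls I ran into

Pitfall 1: LINE's M4A is rejected by the API

Symptom	Fix
HTTP 422 `InvalidAudioFormat`, even though the documentation lists M4A as supported	Convert LINE M4A to WAV first with `ffmpeg -ar 16000 -ac 1 -f wav`

OGG Opus from Telegram and WhatsApp does not strictly need conversion, although the CLI script in Step 3 converts it anyway because it only uploads WAV, MP3, and FLAC as-is.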

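Pitfall 2: LINE does not go through the automatic STT pipeline

Symptom	Fix
The model says it cannot read audio attachments	LINE voice is a media attachment, not input to the `tools.media.audio` pipeline, so `before_prompt_build` must explicitly tell the agent to call `mai_transcribe`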

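Pitfall 3: `tools.media.audio` is enabled but nothing is transcribed

Symptom	Fix
`enabled: true` is set, but the CLI is never called	Some OpenClaw versions read the CLI model from two locations; if yours does, configure it under both `tools.media.models` and `tools.media.audio.models`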
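
A quick way to see which of the two locations actually contains a CLI audio model is to inspect the parsed openclaw.json. The helper below is a hypothetical diagnostic for this article (`cliAudioModelLocations` is not an OpenClaw function); it only assumes the config shape shown in section 5.1.

```javascript
// Report whether a CLI model with the "audio" capability is configured
// at each of the two locations mentioned in Pitfall 3.
function cliAudioModelLocations(config) {
  const media = (config.tools && config.tools.media) || {};
  const isCliAudio = (m) =>
    Boolean(m) && m.type === 'cli' &&
    Array.isArray(m.capabilities) && m.capabilities.includes('audio');
  const has = (models) => Array.isArray(models) && models.some(isCliAudio);
  return {
    'tools.media.models': has(media.models),
    'tools.media.audio.models': has(media.audio && media.audio.models),
  };
}
```

With the 5.1 config as written, the first entry is true and the second is false, which is the situation to double-check if the CLI is never invoked.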

Pitfall 4: Azure Speech is not Azure AI Foundry

Symptom	Fix
Calling the Speech API with an AI Foundry key returns 401	MAI-Transcribe-1 needs an Azure Speech resource with `kind: SpeechServices`; it is a separate service with its own key and endpoint

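When you hit these issues in production, start with the gateway journal (`journalctl --user -u openclaw-gateway.service`) and the stderr of mai-transcribe.sh, which prints the HTTP status code and the Speech API response body on failure. Because Speech is a separate resource from Azure AI Foundry (Pitfall 4), diagnose transcription failures on the Speech side rather than in Foundry.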

Voice formats by platform

Platform	Format	Needs `ffmpeg`	Storage path
Telegram	OGG Opus (`.oga`)	No	`~/.openclaw/media/inbound/`
WhatsApp	OGG Opus (`.ogg`)	No	`~/.openclaw/media/inbound/`
LINE	M4A AAC (`.m4a`)	Yes	`/tmp/openclaw/line-media-*.m4a`
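
The plugin's API client takes a `mimeType` for the audio part, so the extensions in the table need to be mapped to content types somewhere. The lookup below is a sketch using standard MIME types for these containers; `guessMimeType` is a hypothetical helper, and whether the Speech API inspects this value at all is not something this article verifies.

```javascript
const path = require('path');

// Standard MIME types for the extensions seen on each platform.
const MIME_BY_EXTENSION = {
  '.oga': 'audio/ogg',
  '.ogg': 'audio/ogg',
  '.m4a': 'audio/mp4',
  '.wav': 'audio/wav',
  '.mp3': 'audio/mpeg',
  '.flac': 'audio/flac',
};

// Fall back to a generic binary type for unknown extensions.
function guessMimeType(filePath) {
  const ext = path.extname(filePath).toLowerCase();
  return MIME_BY_EXTENSION[ext] || 'application/octet-stream';
}
```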

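Cost

Item	Cost
Speech resource (`S0`)	No monthly base fee
MAI-Transcribe-1	$0.36 per hour of audio
Typical household usage	About 5 to 10 voice messages a day at roughly 10 seconds each, which adds up to about 25 to 50 minutes a month (under 1 hour)
Estimated total per month	Under $0.36

Pricing is roughly on par with OpenAI Whisper, but the upside is that the whole system stays inside the Azure ecosystem, so Key Vault and the deployment flow remain consistent.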
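
The monthly estimate is simple arithmetic, shown here so you can plug in your own usage. The $0.36 per audio hour rate comes from the table above; the 30-day month and the function name `monthlyCostUsd` are assumptions for illustration.

```javascript
// Estimate monthly audio hours and cost from daily usage.
function monthlyCostUsd({ messagesPerDay, secondsPerMessage, daysPerMonth = 30, usdPerAudioHour = 0.36 }) {
  const audioHours = (messagesPerDay * secondsPerMessage * daysPerMonth) / 3600;
  return { audioHours, costUsd: audioHours * usdPerAudioHour };
}

const low = monthlyCostUsd({ messagesPerDay: 5, secondsPerMessage: 10 });
const high = monthlyCostUsd({ messagesPerDay: 10, secondsPerMessage: 10 });
console.log(low.costUsd.toFixed(2), high.costUsd.toFixed(2));
```

Five messages a day works out to about 0.42 hours and $0.15 a month; ten messages a day is about 0.83 hours and $0.30, both below the one-hour figure used in the table.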

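Summary

The integration breaks down into four core pieces:

Deploy Azure Speech with Bicep and store the key in Key Vault.
Call the REST API from a CLI script, with ffmpeg handling format compatibility.
Register the tool in a plugin and teach the agent to transcribe before reasoning.
Split traffic across two configuration paths: Telegram and WhatsApp use the automatic media pipeline, LINE uses an explicit tool call.

The biggest lesson from the whole process is that channels really do handle voice differently. LINE does not enter the same pipeline as Telegram and WhatsApp, and the M4A it sends is not always accepted as-is. Once those two behaviors are handled separately, the integration becomes much clearer.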