This system integrates with four major Chinese speech engines to provide speech-to-text (ASR) and text-to-speech (TTS) functionality. Each engine has been fully integrated and tested.

Supported Engines

1. Baidu Intelligent Cloud

2. Xunfei Open Platform

3. Volcano Engine

4. Alibaba Cloud


Required Configuration

Baidu Intelligent Cloud

ASR (Speech-to-Text)

The following environment variables need to be configured:
  • SUPERUN_BAIDU_API_KEY - API Key
  • SUPERUN_BAIDU_SECRET_KEY - Secret Key

TTS (Text-to-Speech)

The following environment variables need to be configured:
  • SUPERUN_BAIDU_API_KEY - API Key
  • SUPERUN_BAIDU_SECRET_KEY - Secret Key
Voice Options:
  • 0 - Du Xiaomei (Female)
  • 1 - Du Xiaoyu (Male)
  • 3 - Du Xiaoyao (Male)
  • 4 - Du Yaya (Female)

Xunfei Open Platform

ASR (Speech-to-Text)

The following environment variables need to be configured:
  • SUPERUN_XUNFEI_APP_ID - App ID
  • SUPERUN_XUNFEI_API_KEY - API Key
  • SUPERUN_XUNFEI_API_SECRET - API Secret
Technical Features: Uses WebSocket protocol for real-time speech recognition.

TTS (Text-to-Speech)

The following environment variables need to be configured:
  • SUPERUN_XUNFEI_APP_ID - App ID
  • SUPERUN_XUNFEI_API_KEY - API Key
  • SUPERUN_XUNFEI_API_SECRET - API Secret
Voice Options:
  • xiaoyan - Xunfei Xiaoyan (Female)
  • xiaoyu - Xunfei Xiaoyu (Male)
  • xiaomei - Xunfei Xiaomei (Female)
  • xiaoqi - Xunfei Xiaoqi (Male)
Technical Features: Uses WebSocket protocol for speech synthesis.

Volcano Engine

ASR (Speech-to-Text)

The following environment variables need to be configured:
  • SUPERUN_VOLCANO_APP_ID - App ID
  • SUPERUN_VOLCANO_ACCESS_TOKEN - Access Token
  • SUPERUN_VOLCANO_SECRET_KEY - Secret Key (for WebSocket authentication)
  • SUPERUN_VOLCANO_ASR_CLUSTER - ASR Cluster (optional, default: volcengine_input_common)
Technical Features: Uses WebSocket binary protocol, supports Gzip compression, supports chunked transmission.

TTS (Text-to-Speech)

The following environment variables need to be configured:
  • SUPERUN_VOLCANO_APP_ID - App ID
  • SUPERUN_VOLCANO_ACCESS_TOKEN - Access Token
Voice Options:
  • BV700_V2_streaming - Fresh Female Voice
  • BV001_V2_streaming - General Male Voice
  • BV705_streaming - Sweet Female Voice
  • BV701_V2_streaming - Rich Male Voice

Alibaba Cloud

ASR (Speech-to-Text)

The following environment variables need to be configured:
  • SUPERUN_ALIYUN_ACCESS_KEY_ID - Access Key ID
  • SUPERUN_ALIYUN_ACCESS_KEY_SECRET - Access Key Secret
  • SUPERUN_ALIYUN_APP_KEY - App Key
Technical Features: Uses REST API, supports HMAC-SHA1 signature authentication, uses Token mechanism. Limitation: Single audio recognition length ≤ 60 seconds.

TTS (Text-to-Speech)

The following environment variables need to be configured:
  • SUPERUN_ALIYUN_ACCESS_KEY_ID - Access Key ID
  • SUPERUN_ALIYUN_ACCESS_KEY_SECRET - Access Key Secret
  • SUPERUN_ALIYUN_APP_KEY - App Key
Voice Options:
  • aixia - Aixia (Female)
  • aiwei - Aiwei (Male)
  • aida - Aida (Female)
  • kenny - Kenny (Male)
Technical Features: Uses REST API, supports HMAC-SHA1 signature authentication.

Configuration Method

Supabase Edge Functions (Production Environment)

Configure the environment variables as secrets in your Supabase project:
# Baidu
supabase secrets set SUPERUN_BAIDU_API_KEY=your_api_key
supabase secrets set SUPERUN_BAIDU_SECRET_KEY=your_secret_key

# Xunfei
supabase secrets set SUPERUN_XUNFEI_APP_ID=your_app_id
supabase secrets set SUPERUN_XUNFEI_API_KEY=your_api_key
supabase secrets set SUPERUN_XUNFEI_API_SECRET=your_api_secret

# Volcano Engine
supabase secrets set SUPERUN_VOLCANO_APP_ID=your_app_id
supabase secrets set SUPERUN_VOLCANO_ACCESS_TOKEN=your_access_token
supabase secrets set SUPERUN_VOLCANO_SECRET_KEY=your_secret_key
supabase secrets set SUPERUN_VOLCANO_ASR_CLUSTER=volcengine_input_common

# Alibaba Cloud
supabase secrets set SUPERUN_ALIYUN_ACCESS_KEY_ID=your_access_key_id
supabase secrets set SUPERUN_ALIYUN_ACCESS_KEY_SECRET=your_access_key_secret
supabase secrets set SUPERUN_ALIYUN_APP_KEY=your_app_key
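
Before a request is forwarded to an engine, the Edge Function can confirm that the secrets above are actually set. The following is a minimal sketch of such a check, not the project's own code: the `REQUIRED_SECRETS` table and the `missingSecrets` helper are hypothetical names, and the Volcano entry uses the ASR list (TTS only needs the first two variables). The environment is passed in as a plain object so the check does not depend on a particular runtime API.

```typescript
type Engine = "baidu" | "xunfei" | "volcano" | "aliyun";

// Required secrets per engine, copied from the lists in this document.
// SUPERUN_VOLCANO_ASR_CLUSTER is optional and therefore not listed.
const REQUIRED_SECRETS: Record<Engine, string[]> = {
  baidu: ["SUPERUN_BAIDU_API_KEY", "SUPERUN_BAIDU_SECRET_KEY"],
  xunfei: ["SUPERUN_XUNFEI_APP_ID", "SUPERUN_XUNFEI_API_KEY", "SUPERUN_XUNFEI_API_SECRET"],
  volcano: ["SUPERUN_VOLCANO_APP_ID", "SUPERUN_VOLCANO_ACCESS_TOKEN", "SUPERUN_VOLCANO_SECRET_KEY"],
  aliyun: ["SUPERUN_ALIYUN_ACCESS_KEY_ID", "SUPERUN_ALIYUN_ACCESS_KEY_SECRET", "SUPERUN_ALIYUN_APP_KEY"],
};

// Hypothetical helper: returns the names of required secrets that are unset or empty.
function missingSecrets(engine: Engine, env: Record<string, string | undefined>): string[] {
  return REQUIRED_SECRETS[engine].filter((name) => !env[name]);
}
```

Inside an Edge Function the `env` argument could be `Deno.env.toObject()`; returning a clear error that names the missing variables is usually easier to debug than a failed upstream call.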

Code Implementation Architecture

Frontend Components

ASR Module (Speech-to-Text)

// src/components/mobile/ASRModule.tsx
const ASRModule = ({ engine = "baidu" }: ASRModuleProps) => {
  const callASRAPI = async (audioData: string) => {
    const { data, error } = await supabase.functions.invoke('asr-convert', {
      body: {
        engine: engine,
        audioData: audioData,
      }
    });
    
    // invoke() reports failures through `error`, in which case data is null
    if (error || !data?.success) {
      console.error("ASR request failed:", error ?? data);
      return;
    }
    
    setResult(data.result.text);
    setMetrics({
      time: Math.round(data.result.duration || 0),
      confidence: Math.round((data.result.confidence || 0) * 100),
      rate: "16k"
    });
  };
  
  // ... Recording and file upload logic
};

TTS Module (Text-to-Speech)

// src/components/mobile/TTSModule.tsx
const TTSModule = ({ engine = "baidu" }: TTSModuleProps) => {
  const callTTSAPI = async () => {
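    const { data, error } = await supabase.functions.invoke('tts-convert', {
      body: {
        engine: engine,
        text: text,
        voice: selectedVoice,
        speed: speed[0],
        volume: volume[0],
      }
    });
    
    // invoke() reports failures through `error`, in which case data is null
    if (error || !data?.success) {
      console.error("TTS request failed:", error ?? data);
      return;
    }
    
    setAudioUrl(data.result.audioUrl);
    setStatus("complete");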
  };
  
  // ... Synthesis logic
};

Engine Selector

// src/components/mobile/EngineSelector.tsx
const engines = [
  { id: "baidu", name: "Baidu", shortName: "BD" },
  { id: "xunfei", name: "Xunfei", shortName: "XF" },
  { id: "volcano", name: "Volcano", shortName: "HS" },
  { id: "aliyun", name: "Alibaba Cloud", shortName: "ALI" },
];

Backend Implementation (Supabase Edge Functions)

ASR Conversion Service

File Location: supabase/functions/asr-convert/index.ts
Core Logic:
  1. Select corresponding engine implementation based on engine parameter
  2. Read corresponding API credentials from environment variables
  3. Call each engine’s ASR API
  4. Return standardized recognition results
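
The engine selection in step 1 can be sketched as a lookup in a table of handlers. This is only an illustration under assumptions: `selectHandler` is a hypothetical helper, and the handler table is whatever the function registers for each engine id (for example the per-engine functions shown in this section).

```typescript
// Hypothetical helper: pick the implementation registered for `engine`,
// or fail with a message that lists the supported engine ids.
function selectHandler<T>(engine: string, handlers: Record<string, T>): T {
  if (!Object.prototype.hasOwnProperty.call(handlers, engine)) {
    const supported = Object.keys(handlers).join(", ");
    throw new Error(`Unsupported engine "${engine}". Expected one of: ${supported}`);
  }
  return handlers[engine];
}
```

Rejecting unknown ids up front keeps a typo in the `engine` field from silently falling through to a default engine.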
Baidu Implementation:
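async function callBaiduASR(apiKey: string, secretKey: string, audioData: string) {
  // 1. Get Access Token
  const accessToken = await getBaiduAccessToken(apiKey, secretKey);
  
  // 2. API URL - Don't include any parameters
  const apiUrl = 'https://vop.baidu.com/server_api';
  
  // audioData is the Base64-encoded WAV file; len is its decoded byte length
  const base64Audio = audioData;
  const padding = (base64Audio.match(/=/g) || []).length;
  const audioByteLength = Math.floor((base64Audio.length * 3) / 4) - padding;
  
  // 3. Request Body - token must be here
  const requestBody = {
    format: "wav",           // Audio format
    rate: 16000,             // Sample rate (must be number type)
    channel: 1,              // Number of channels
    cuid: userId,            // User identifier (defined elsewhere)
    token: accessToken,      // ← Key: token in request body
    speech: base64Audio,     // Base64 encoded audio
    len: audioByteLength,    // Actual byte length of WAV file (must be number type)
    // Don't use dev_pid
  };
  
  // 4. Send request
  const response = await fetch(apiUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(requestBody),
  });
  
  // 5. Parse the response; err_no is 0 on success
  const result = await response.json();
  if (result.err_no !== 0) {
    throw new Error(`Baidu ASR error ${result.err_no}: ${result.err_msg}`);
  }
  
  // confidence is a fixed placeholder value
  return { text: result.result[0], confidence: 0.95 };
}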
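
`getBaiduAccessToken` is referenced above but not shown. As a rough sketch of its first half, the helper below builds the token request URL for Baidu's OAuth `client_credentials` grant. The name `buildBaiduTokenUrl` is hypothetical, and the real helper may be structured differently.

```typescript
// Baidu's OAuth endpoint for the client_credentials grant.
const BAIDU_TOKEN_ENDPOINT = "https://aip.baidubce.com/oauth/2.0/token";

// Hypothetical helper: build the URL that exchanges the API Key and
// Secret Key for an access token.
function buildBaiduTokenUrl(apiKey: string, secretKey: string): string {
  const query = [
    "grant_type=client_credentials",
    "client_id=" + encodeURIComponent(apiKey),
    "client_secret=" + encodeURIComponent(secretKey),
  ].join("&");
  return BAIDU_TOKEN_ENDPOINT + "?" + query;
}
```

A POST to this URL is expected to return JSON whose `access_token` field is the value passed as `token` in the request body. Caching the token until shortly before its `expires_in` lifetime ends avoids one extra round trip per recognition.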
Xunfei Implementation:
async function callXunfeiASR(appId: string, apiKey: string, apiSecret: string, audioData: string) {
  // 1. Build WebSocket authentication URL (HMAC-SHA256 signature)
  //    host and path identify the Xunfei endpoint and are defined elsewhere
  const wsUrl = buildWebSocketAuthUrl(host, path, apiKey, apiSecret);
  
  // 2. Establish WebSocket connection
  const ws = new WebSocket(wsUrl);
  
  // 3. Send recognition request once the connection is open
  //    (calling send() before the open event throws)
  ws.onopen = () => {
    ws.send(JSON.stringify({
      common: { app_id: appId },
      business: { language: "zh_cn", domain: "iat", accent: "mandarin" },
      data: { status: 2, format: "audio/L16;rate=16000", audio: audioData }
    }));
  };
  
  // 4. Receive and parse results
  ws.onmessage = (event) => {
    const data = JSON.parse(event.data);
    // Parse recognition results...
  };
}
Volcano Engine Implementation:
// Using WebSocket binary protocol
async function callVolcanoASR(appId: string, accessToken: string, audioData: string) {
  // 1. Build WebSocket URL (cluster falls back to the documented default)
  const cluster = Deno.env.get("SUPERUN_VOLCANO_ASR_CLUSTER") ?? "volcengine_input_common";
  const wsUrl = `wss://openspeech.bytedance.com/api/v2/asr?appid=${appId}&token=${accessToken}&cluster=${cluster}`;
  
  // 2. Establish connection (set binaryType to "arraybuffer")
  const ws = new WebSocket(wsUrl);
  ws.binaryType = "arraybuffer";
  
  // Messages can only be sent after the open event
  ws.onopen = async () => {
    // 3. Send Full Client Request (binary protocol, Gzip compression)
    //    jsonBytes: the serialized request parameters, prepared elsewhere
    const fullRequestMessage = await buildMessage(
      0b0001,  // message_type: full client request
      0b0000,  // flags: not last packet
      0b0001,  // serialization: JSON
      0b0001,  // compression: Gzip
      jsonBytes
    );
    ws.send(fullRequestMessage);
    
    // 4. Send audio data in chunks (one chunk, flagged as the last packet, is shown)
    //    audioChunk: raw audio bytes decoded from audioData
    const audioMessage = await buildMessage(
      0b0010,  // message_type: audio only
      0b0010,  // flags: last packet
      0b0000,  // serialization: none
      0b0001,  // compression: Gzip
      audioChunk
    );
    ws.send(audioMessage);
  };
  
  // 5. Parse binary response
  ws.onmessage = async (event) => {
    const result = await parseServerResponse(event.data);
    // Parse recognition results...
  };
}
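
`buildMessage` is not shown above. The sketch below illustrates the frame layout it presumably produces, based on my understanding of Volcano Engine's binary protocol: a 4-byte header, a 4-byte big-endian payload size, then the payload. `frameMessage` is a hypothetical, synchronous stand-in that assumes the payload has already been serialized and Gzip-compressed.

```typescript
// Hypothetical, synchronous counterpart of buildMessage: the payload is assumed
// to be serialized and (if requested) Gzip-compressed before it gets here.
function frameMessage(
  messageType: number,
  flags: number,
  serialization: number,
  compression: number,
  payload: Uint8Array,
): Uint8Array {
  const frame = new Uint8Array(8 + payload.length);
  frame[0] = (0b0001 << 4) | 0b0001;             // protocol version 1, header size 1 (x4 bytes)
  frame[1] = (messageType << 4) | flags;         // e.g. 0b0001 full client request, 0b0010 audio only
  frame[2] = (serialization << 4) | compression; // e.g. JSON + Gzip
  frame[3] = 0x00;                               // reserved
  // Payload size as a 4-byte big-endian unsigned integer.
  new DataView(frame.buffer).setUint32(4, payload.length, false);
  frame.set(payload, 8);
  return frame;
}
```

For the audio-only message above (type 0b0010, last-packet flag 0b0010, no serialization, Gzip), the header bytes come out as 0x11 0x22 0x01 0x00.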
Alibaba Cloud Implementation:
async function callAliyunASR(accessKeyId: string, accessKeySecret: string, appKey: string, audioData: string) {
  // 1. Get Token (HMAC-SHA1 signature)
  const token = await getAliyunToken(accessKeyId, accessKeySecret);
  
  // Decode the Base64 audio into raw bytes for the request body
  const audioBytes = Uint8Array.from(atob(audioData), (c) => c.charCodeAt(0));
  
  // 2. Send REST API request
  const response = await fetch('https://nls-gateway-cn-shanghai.aliyuncs.com/stream/v1/asr?appkey=...', {
    method: 'POST',
    headers: {
      'X-NLS-Token': token,
      'Content-Type': 'application/octet-stream'
    },
    body: audioBytes  // Binary audio data
  });
  
  // 3. Parse the JSON response
  const result = await response.json();
  
  // confidence is a fixed placeholder value
  return { text: result.result, confidence: 0.94 };
}

TTS Conversion Service

File Location: supabase/functions/tts-convert/index.ts
Core Logic:
  1. Select corresponding engine implementation based on engine parameter
  2. Read corresponding API credentials from environment variables
  3. Map voice parameter to each engine’s voice code
  4. Call each engine’s TTS API
  5. Return base64 encoded audio data
Voice Mapping:
const voiceMapping: Record<string, Record<string, { code: string; name: string }>> = {
  baidu: {
    female_1: { code: "0", name: "Du Xiaomei" },
    male_1: { code: "1", name: "Du Xiaoyu" },
    // ...
  },
  xunfei: {
    female_1: { code: "xiaoyan", name: "Xunfei Xiaoyan" },
    // ...
  },
  volcano: {
    female_1: { code: "BV700_V2_streaming", name: "Fresh Female Voice" },
    // ...
  },
  aliyun: {
    female_1: { code: "aixia", name: "Aixia" },
    // ...
  },
};
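
The lookup in step 3 could be wrapped in a small helper that also handles unknown voice IDs. This is a sketch under assumptions: `resolveVoiceCode` and the fallback to `female_1` are illustrative choices, and the table below contains only the codes shown in this document, with display names left out.

```typescript
// Voice codes per engine, limited to the entries shown in this document.
const VOICE_CODES: Record<string, Record<string, string>> = {
  baidu: { female_1: "0", male_1: "1" },
  xunfei: { female_1: "xiaoyan" },
  volcano: { female_1: "BV700_V2_streaming" },
  aliyun: { female_1: "aixia" },
};

// Hypothetical helper: map a unified voice ID to the engine's own code,
// falling back to a default voice when the ID is not registered.
function resolveVoiceCode(engine: string, voice: string, fallbackVoice = "female_1"): string {
  const voices = VOICE_CODES[engine];
  if (!voices) throw new Error(`Unknown engine: ${engine}`);
  const code = voices[voice] !== undefined ? voices[voice] : voices[fallbackVoice];
  if (code === undefined) throw new Error(`No voice "${voice}" or fallback for engine ${engine}`);
  return code;
}
```

Falling back instead of failing keeps synthesis working when the frontend offers a voice that one engine lacks; throwing would be the stricter alternative.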
Baidu Implementation:
async function callBaiduTTS(apiKey: string, secretKey: string, text: string, voice: string, speed: number, volume: number) {
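  const accessToken = await getBaiduAccessToken(apiKey, secretKey);
  // Map the unified voice ID to Baidu's voice code (see Voice Mapping)
  const voiceCode = voiceMapping.baidu[voice]?.code ?? "0";
  
  const params = new URLSearchParams({
    tex: text,
    tok: accessToken,
    lan: "zh",
    spd: Math.round(speed * 5).toString(),
    vol: Math.round((volume / 100) * 15).toString(),
    per: voiceCode,
    aue: "3",  // MP3 format
  });
  
  const response = await fetch(`https://tsn.baidu.com/text2audio?${params.toString()}`);
  // On failure Baidu returns a JSON error body instead of audio
  if (!(response.headers.get("Content-Type") ?? "").startsWith("audio")) {
    throw new Error(`Baidu TTS failed: ${await response.text()}`);
  }
  const audioBuffer = await response.arrayBuffer();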
  
  // Convert to base64
  const audioBase64 = bufferToBase64(audioBuffer);
  return { audioUrl: `data:audio/mp3;base64,${audioBase64}` };
}
Xunfei Implementation:
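async function callXunfeiTTS(appId: string, apiKey: string, apiSecret: string, text: string, voice: string, speed: number, volume: number) {
  // Use WebSocket protocol (host and path are defined elsewhere)
  const wsUrl = buildWebSocketAuthUrl(host, path, apiKey, apiSecret);
  const voiceCode = voiceMapping.xunfei[voice]?.code ?? "xiaoyan";
  const audioChunks: Uint8Array[] = [];
  
  // Wrap the socket in a Promise so the result reaches the caller;
  // a return inside onmessage would only leave the callback
  return new Promise<{ audioUrl: string }>((resolve, reject) => {
    const ws = new WebSocket(wsUrl);
    ws.onerror = () => reject(new Error("Xunfei TTS WebSocket error"));
    
    // Send the request only after the connection is open
    ws.onopen = () => {
      ws.send(JSON.stringify({
        common: { app_id: appId },
        business: {
          aue: "lame",  // MP3 format
          vcn: voiceCode,
          speed: Math.round(speed * 50),
          volume: Math.round(volume * 100 / 80),
        },
        data: {
          status: 2,
          text: btoa(unescape(encodeURIComponent(text)))
        }
      }));
    };
    
    // Receive and merge audio data chunks
    ws.onmessage = (event) => {
      const data = JSON.parse(event.data);
      if (data.data && data.data.audio) {
        // Each chunk is Base64-encoded separately, so decode before merging
        audioChunks.push(Uint8Array.from(atob(data.data.audio), (c) => c.charCodeAt(0)));
      }
      if (data.data && data.data.status === 2) {
        // Synthesis complete
        const merged = new Uint8Array(audioChunks.reduce((total, chunk) => total + chunk.length, 0));
        let offset = 0;
        for (const chunk of audioChunks) {
          merged.set(chunk, offset);
          offset += chunk.length;
        }
        ws.close();
        resolve({ audioUrl: `data:audio/mp3;base64,${bufferToBase64(merged.buffer)}` });
      }
    };
  });
}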
Volcano Engine Implementation:
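async function callVolcanoTTS(appId: string, accessToken: string, text: string, voice: string, speed: number, volume: number) {
  // Map the unified voice ID to Volcano Engine's voice code (see Voice Mapping)
  const voiceCode = voiceMapping.volcano[voice]?.code ?? "BV700_V2_streaming";
  
  const response = await fetch('https://openspeech.bytedance.com/api/v1/tts', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      // Volcano Engine separates "Bearer" and the token with a semicolon
      'Authorization': `Bearer;${accessToken}`
    },
    body: JSON.stringify({
      app: { appid: appId, token: accessToken, cluster: "volcano_tts" },
      user: { uid: appId },  // Any non-empty identifier
      audio: {
        voice_type: voiceCode,
        encoding: "mp3",
        speed_ratio: speed,
        volume_ratio: volume / 100,
      },
      request: {
        reqid: crypto.randomUUID(),  // Must be unique per request
        text: text,
        text_type: "plain",
        operation: "query",          // Non-streaming HTTP synthesis
      }
    })
  });
  
  const result = await response.json();
  // Return base64 audio
  return { audioUrl: `data:audio/mp3;base64,${result.data}` };
}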
Alibaba Cloud Implementation:
async function callAliyunTTS(accessKeyId: string, accessKeySecret: string, appKey: string, text: string, voice: string, speed: number, volume: number) {
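  const token = await getAliyunToken(accessKeyId, accessKeySecret);
  // Map the unified voice ID to Alibaba Cloud's voice code (see Voice Mapping)
  const voiceCode = voiceMapping.aliyun[voice]?.code ?? "aixia";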
  
  const response = await fetch('https://nls-gateway.cn-shanghai.aliyuncs.com/stream/v1/tts', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'X-NLS-Token': token,
    },
    body: JSON.stringify({
      appkey: appKey,
      text: text,
      voice: voiceCode,
      format: "mp3",
      sample_rate: 16000,
      volume: volume,
      speech_rate: Math.round((speed - 0.5) * 200),
    })
  });
  
  const audioBuffer = await response.arrayBuffer();
  const audioBase64 = bufferToBase64(audioBuffer);
  return { audioUrl: `data:audio/mp3;base64,${audioBase64}` };
}

Baidu ASR Common Errors and Solutions

Error Code 3311: param rate invalid

This is the most common error, usually caused by the following:
Issue - Solution
  • Token placement error - Token must be in request body, not in URL parameters
  • cuid duplication - cuid only in request body, don’t repeat in URL
  • Using dev_pid - Don’t use dev_pid parameter, let Baidu auto-detect language
  • rate type error - Ensure rate is number type, not string
  • len calculation error - len must be actual byte length of WAV file

Correct len Parameter Calculation

Calculate actual byte length from Base64 string:
// Calculate actual byte length from Base64 string
const padding = (base64Audio.match(/=/g) || []).length;
const audioByteLength = Math.floor((base64Audio.length * 3) / 4) - padding;

// Verification: audioByteLength should equal WAV file's blob.size
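
To make the arithmetic concrete, the same formula is wrapped in a function below and checked against three short strings. `base64ByteLength` and `stripDataUrlPrefix` are hypothetical helper names. The second helper covers the case where the audio arrives as a data URL (for example from `FileReader.readAsDataURL`), an assumption about the frontend that may not apply here.

```typescript
// Decoded byte length of a Base64 string: every 4 characters carry 3 bytes,
// and each trailing "=" stands for one byte that is not there.
function base64ByteLength(base64: string): number {
  const padding = (base64.match(/=/g) || []).length;
  return Math.floor((base64.length * 3) / 4) - padding;
}

// Hypothetical helper: drop a leading "data:...;base64," prefix if present,
// since counting it would inflate len.
function stripDataUrlPrefix(value: string): string {
  const comma = value.indexOf(",");
  return value.indexOf("data:") === 0 && comma !== -1 ? value.slice(comma + 1) : value;
}

// "TWFu" encodes 3 bytes, "TWE=" encodes 2, "TQ==" encodes 1.
const exampleLengths = [base64ByteLength("TWFu"), base64ByteLength("TWE="), base64ByteLength("TQ==")];
```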

Frontend Audio Processing Points

1. Recording Format

Browsers usually record in webm/opus:
const mimeType = "audio/webm;codecs=opus";

2. Must Resample to 16kHz (Baidu Requirement)

const offlineContext = new OfflineAudioContext(
  1,                    // Mono channel
  targetLength,         
  16000                 // Target sample rate
);

3. Convert to 16bit PCM

const pcm16 = new Int16Array(samples.length);
for (let i = 0; i < samples.length; i++) {
  const s = Math.max(-1, Math.min(1, samples[i]));
  pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
}
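
The loop above, wrapped in a function so it can be checked in isolation. The name `floatTo16BitPCM` is an assumption; the conversion itself is unchanged. Out-of-range samples are clamped to [-1, 1] first, and fractional results are truncated when stored in the `Int16Array`.

```typescript
// Convert Web Audio float samples (nominally -1..1) to signed 16-bit PCM.
function floatTo16BitPCM(samples: Float32Array): Int16Array {
  const pcm16 = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    // Negative samples scale by 32768, positive ones by 32767,
    // so both ends of the range map onto valid int16 values.
    pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return pcm16;
}
```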

4. Add WAV Header (44 bytes)

const wavHeader = {
  sampleRate: 16000,
  numChannels: 1,
  bitsPerSample: 16,
  byteRate: 32000,      // 16000 * 1 * 16 / 8
  blockAlign: 2,        // 1 * 16 / 8
};
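
The fields above only describe the header. One way to actually write the 44 bytes is sketched below; `encodeWav` is a hypothetical helper that follows the standard little-endian RIFF/WAVE layout for uncompressed PCM and appends the samples after the header.

```typescript
// Hypothetical helper: prepend a 44-byte PCM WAV header to 16-bit samples.
function encodeWav(pcm16: Int16Array, sampleRate = 16000, numChannels = 1): Uint8Array {
  const bitsPerSample = 16;
  const blockAlign = (numChannels * bitsPerSample) / 8; // 2 for mono 16-bit
  const byteRate = sampleRate * blockAlign;             // 32000 at 16 kHz mono
  const dataSize = pcm16.length * 2;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);
  const writeAscii = (offset: number, text: string) => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);           // fmt chunk size for PCM
  view.setUint16(20, 1, true);            // audio format 1 = PCM
  view.setUint16(22, numChannels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);
  for (let i = 0; i < pcm16.length; i++) view.setInt16(44 + i * 2, pcm16[i], true);
  return new Uint8Array(buffer);
}
```

The length of the returned array is the value Baidu expects in `len`, so it can be compared directly with the byte length derived from the Base64 string.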

Environment Variable Configuration

Configure in Supabase Edge Function Secrets:
# Supabase Edge Function Secrets
SUPERUN_BAIDU_API_KEY=your_baidu_api_key
SUPERUN_BAIDU_SECRET_KEY=your_baidu_secret_key
How to Obtain: Baidu Intelligent Cloud Console → Speech Technology → Create Application

Debugging Checklist

When you encounter error 3311, check the following in order:
  1. ✅ Is Token in request body (not URL parameter)
  2. ✅ Is rate number type (typeof rate === 'number')
  3. ✅ Is len equal to WAV file actual size
  4. ✅ Has dev_pid parameter been removed
  5. ✅ Is sample rate in WAV header 16000
  6. ✅ Is audio duration within 0.5-60 seconds range

Complete Request Example

Correct ✓:
{
  format: "wav",
  rate: 16000,          // number type
  channel: 1,
  cuid: "user_001",
  token: "24.xxx...",   // In request body
  speech: "UklGR...",   // Base64
  len: 63404            // number type, actual byte length
}
Incorrect ✗:
{
  format: "wav",
  rate: "16000",        // ← Error: string type
  channel: 1,
  cuid: "user_001",
  dev_pid: 1737,        // ← Error: don't use
  speech: "UklGR...",
  len: "63404"          // ← Error: string type
}
// URL: ?token=xxx      // ← Error: don't put token in URL
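
Several checklist items can be verified in code before the request is sent. The function below is a sketch of such a pre-flight check, not part of the original implementation: `validateBaiduRequest` is a hypothetical name, and it covers only the token, rate, len, and dev_pid items. Sample rate and duration have to be checked against the audio itself.

```typescript
// Hypothetical pre-flight check for a Baidu ASR request.
// Returns a list of problems; an empty list means the checked items pass.
function validateBaiduRequest(body: Record<string, unknown>, url: string): string[] {
  const problems: string[] = [];
  const token = body["token"];
  const rate = body["rate"];
  const len = body["len"];
  const speech = body["speech"];

  if (typeof token !== "string" || token.length === 0) problems.push("token must be in the request body");
  if (url.indexOf("token=") !== -1) problems.push("token must not be in the URL");
  if (typeof rate !== "number") problems.push("rate must be a number");
  if (typeof len !== "number") problems.push("len must be a number");
  if ("dev_pid" in body) problems.push("dev_pid should be removed");

  // len must equal the decoded byte length of the Base64 audio.
  if (typeof speech === "string" && typeof len === "number") {
    const padding = (speech.match(/=/g) || []).length;
    const decodedLength = Math.floor((speech.length * 3) / 4) - padding;
    if (decodedLength !== len) problems.push("len does not match the decoded length of speech");
  }
  return problems;
}
```

Applied to the two examples above, the correct request should produce an empty list, while the incorrect one is flagged for the missing body token, the token in the URL, the string-typed `rate` and `len`, and the `dev_pid` field.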

Technical Points

ASR (Speech-to-Text)

  1. Unified Audio Format: All engines use WAV format, 16kHz sample rate, mono channel
  2. Base64 Encoding: Audio data converted to base64 in frontend before passing to backend
  3. Protocol Differences:
    • Baidu, Alibaba Cloud: REST API
    • Xunfei, Volcano Engine: WebSocket protocol
  4. Standardized Results: Unified return format { text, confidence, duration }

TTS (Text-to-Speech)

  1. Voice Mapping: Frontend uses unified voice IDs (female_1, male_1, etc.), backend maps to each engine’s actual voice codes
  2. Parameter Conversion:
    • Speed: Frontend range 0.5-2.0x, each engine converts to corresponding range
    • Volume: Frontend range 0-100%, each engine converts to corresponding range
  3. Output Format: All engines uniformly return MP3 format base64 encoded audio
  4. Protocol Differences:
    • Baidu, Alibaba Cloud, Volcano Engine: REST API
    • Xunfei: WebSocket protocol (needs to receive multiple audio chunks)
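
The speed and volume conversions can be collected in one place. The sketch below mirrors the formulas used in the engine implementations in this document; the function name `toEngineParams`, the `clamp` helper, and the clamping of inputs to the frontend ranges are assumptions added for illustration.

```typescript
interface EngineTtsParams {
  speed: number;
  volume: number;
}

function clamp(value: number, min: number, max: number): number {
  return Math.max(min, Math.min(max, value));
}

// Hypothetical helper: convert frontend speed (0.5-2.0) and volume (0-100)
// into the values each engine's request uses in this document.
function toEngineParams(engine: string, speed: number, volume: number): EngineTtsParams {
  const s = clamp(speed, 0.5, 2.0);
  const v = clamp(volume, 0, 100);
  switch (engine) {
    case "baidu":   // spd, vol
      return { speed: Math.round(s * 5), volume: Math.round((v / 100) * 15) };
    case "xunfei":  // speed, volume
      return { speed: Math.round(s * 50), volume: Math.round((v * 100) / 80) };
    case "volcano": // speed_ratio, volume_ratio
      return { speed: s, volume: v / 100 };
    case "aliyun":  // speech_rate, volume
      return { speed: Math.round((s - 0.5) * 200), volume: v };
    default:
      throw new Error(`Unknown engine: ${engine}`);
  }
}
```

Keeping the formulas side by side makes it easier to compare what the same slider position means for each engine, for example when checking that 1.0x maps to each engine's normal speed.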

Testing Recommendations

  1. API Credential Testing: Ensure all environment variables are correctly configured
  2. Audio Format Testing: Test different audio file formats (WAV, MP3, M4A)
  3. Duration Limit Testing: Pay special attention to Alibaba Cloud’s 60-second limit
  4. Error Handling Testing: Test network errors, API errors, and other exceptional cases
  5. Concurrency Testing: Test multiple users using different engines simultaneously

Notes

  1. Cost Control: Each engine has its own billing rules, monitor API call volume
  2. Rate Limits: Each engine has call frequency limits, avoid exceeding limits
  3. Audio Size: Recommend limiting uploaded audio file size (e.g., 10MB)
  4. Timeout Settings: Set reasonable timeout for WebSocket connections (e.g., 30 seconds)
  5. Error Logging: Record detailed error information for troubleshooting
