superun & Prompt.to.design Documentation

This system integrates four major Chinese speech engines to provide speech-to-text (ASR) and text-to-speech (TTS) functionality. Each engine has been fully integrated and tested.

Supported Engines

1. Baidu Intelligent Cloud

Engine Code: baidu
Official Website: https://cloud.baidu.com/
Documentation: https://cloud.baidu.com/doc/SPEECH/index.html

2. Xunfei Open Platform

Engine Code: xunfei
Official Website: https://www.xfyun.cn/
Documentation: https://www.xfyun.cn/doc/

3. Volcano Engine

Engine Code: volcano
Official Website: https://www.volcengine.com/
Documentation: https://www.volcengine.com/docs/6561/79817

4. Alibaba Cloud

Engine Code: aliyun
Official Website: https://www.aliyun.com/
Documentation: https://help.aliyun.com/product/30413.html

Required Configuration

Baidu Intelligent Cloud

ASR (Speech-to-Text)

The following environment variables need to be configured:

SUPERUN_BAIDU_API_KEY - API Key
SUPERUN_BAIDU_SECRET_KEY - Secret Key

TTS (Text-to-Speech)

The following environment variables need to be configured:

SUPERUN_BAIDU_API_KEY - API Key
SUPERUN_BAIDU_SECRET_KEY - Secret Key

Voice Options:

0 - Du Xiaomei (Female)
1 - Du Xiaoyu (Male)
3 - Du Xiaoyao (Male)
4 - Du Yaya (Female)

Xunfei Open Platform

ASR (Speech-to-Text)

The following environment variables need to be configured:

SUPERUN_XUNFEI_APP_ID - App ID
SUPERUN_XUNFEI_API_KEY - API Key
SUPERUN_XUNFEI_API_SECRET - API Secret

Technical Features: Uses WebSocket protocol for real-time speech recognition.

TTS (Text-to-Speech)

The following environment variables need to be configured:

SUPERUN_XUNFEI_APP_ID - App ID
SUPERUN_XUNFEI_API_KEY - API Key
SUPERUN_XUNFEI_API_SECRET - API Secret

Voice Options:

xiaoyan - Xunfei Xiaoyan (Female)
xiaoyu - Xunfei Xiaoyu (Male)
xiaomei - Xunfei Xiaomei (Female)
xiaoqi - Xunfei Xiaoqi (Male)

Technical Features: Uses WebSocket protocol for speech synthesis.

Volcano Engine

ASR (Speech-to-Text)

The following environment variables need to be configured:

SUPERUN_VOLCANO_APP_ID - App ID
SUPERUN_VOLCANO_ACCESS_TOKEN - Access Token
SUPERUN_VOLCANO_SECRET_KEY - Secret Key (for WebSocket authentication)
SUPERUN_VOLCANO_ASR_CLUSTER - ASR Cluster (optional, default: volcengine_input_common)

Technical Features: Uses WebSocket binary protocol, supports Gzip compression, supports chunked transmission.

TTS (Text-to-Speech)

The following environment variables need to be configured:

SUPERUN_VOLCANO_APP_ID - App ID
SUPERUN_VOLCANO_ACCESS_TOKEN - Access Token

Voice Options:

BV700_V2_streaming - Fresh Female Voice
BV001_V2_streaming - General Male Voice
BV705_streaming - Sweet Female Voice
BV701_V2_streaming - Rich Male Voice

Alibaba Cloud

ASR (Speech-to-Text)

The following environment variables need to be configured:

SUPERUN_ALIYUN_ACCESS_KEY_ID - Access Key ID
SUPERUN_ALIYUN_ACCESS_KEY_SECRET - Access Key Secret
SUPERUN_ALIYUN_APP_KEY - App Key

Technical Features: Uses REST API with token-based authentication; the token request is signed with HMAC-SHA1.

Limitation: Each recognition request accepts at most 60 seconds of audio.

TTS (Text-to-Speech)

The following environment variables need to be configured:

SUPERUN_ALIYUN_ACCESS_KEY_ID - Access Key ID
SUPERUN_ALIYUN_ACCESS_KEY_SECRET - Access Key Secret
SUPERUN_ALIYUN_APP_KEY - App Key

Voice Options:

aixia - Aixia (Female)
aiwei - Aiwei (Male)
aida - Aida (Female)
kenny - Kenny (Male)

Technical Features: Uses REST API, supports HMAC-SHA1 signature authentication.

Configuration Method

Supabase Edge Functions (Production Environment)

Set the environment variables as secrets in your Supabase project:

# Baidu
supabase secrets set SUPERUN_BAIDU_API_KEY=your_api_key
supabase secrets set SUPERUN_BAIDU_SECRET_KEY=your_secret_key

# Xunfei
supabase secrets set SUPERUN_XUNFEI_APP_ID=your_app_id
supabase secrets set SUPERUN_XUNFEI_API_KEY=your_api_key
supabase secrets set SUPERUN_XUNFEI_API_SECRET=your_api_secret

# Volcano Engine
supabase secrets set SUPERUN_VOLCANO_APP_ID=your_app_id
supabase secrets set SUPERUN_VOLCANO_ACCESS_TOKEN=your_access_token
supabase secrets set SUPERUN_VOLCANO_SECRET_KEY=your_secret_key
supabase secrets set SUPERUN_VOLCANO_ASR_CLUSTER=volcengine_input_common

# Alibaba Cloud
supabase secrets set SUPERUN_ALIYUN_ACCESS_KEY_ID=your_access_key_id
supabase secrets set SUPERUN_ALIYUN_ACCESS_KEY_SECRET=your_access_key_secret
supabase secrets set SUPERUN_ALIYUN_APP_KEY=your_app_key
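
Before calling an engine, it can help to confirm that its secrets are actually set. The following is a minimal sketch, not part of the documented implementation: `REQUIRED_SECRETS` and `findMissingSecrets` are hypothetical names, and the lists simply merge the ASR and TTS variables given above for each engine.

```typescript
// Required secret names per engine (ASR and TTS lists from this page, merged).
// SUPERUN_VOLCANO_ASR_CLUSTER is optional, so it is not listed here.
const REQUIRED_SECRETS: Record<string, string[]> = {
  baidu: ["SUPERUN_BAIDU_API_KEY", "SUPERUN_BAIDU_SECRET_KEY"],
  xunfei: ["SUPERUN_XUNFEI_APP_ID", "SUPERUN_XUNFEI_API_KEY", "SUPERUN_XUNFEI_API_SECRET"],
  volcano: ["SUPERUN_VOLCANO_APP_ID", "SUPERUN_VOLCANO_ACCESS_TOKEN", "SUPERUN_VOLCANO_SECRET_KEY"],
  aliyun: ["SUPERUN_ALIYUN_ACCESS_KEY_ID", "SUPERUN_ALIYUN_ACCESS_KEY_SECRET", "SUPERUN_ALIYUN_APP_KEY"],
};

// Hypothetical helper: returns the required secrets that are unset or empty.
// `getEnv` is injected so the function works with any environment lookup.
function findMissingSecrets(
  engine: string,
  getEnv: (name: string) => string | undefined,
): string[] {
  const required = REQUIRED_SECRETS[engine];
  if (!required) {
    throw new Error(`Unknown engine: ${engine}`);
  }
  return required.filter((name) => !getEnv(name));
}
```

Inside an Edge Function the lookup could be passed as `(name) => Deno.env.get(name)`, and a non-empty result returned to the caller as a configuration error.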

Code Implementation Architecture

Frontend Components

ASR Module (Speech-to-Text)

// src/components/mobile/ASRModule.tsx
const ASRModule = ({ engine = "baidu" }: ASRModuleProps) => {
  const callASRAPI = async (audioData: string) => {
    const { data, error } = await supabase.functions.invoke('asr-convert', {
      body: {
        engine: engine,
        audioData: audioData,   // Base64-encoded audio
      }
    });

    // Stop if the invocation failed or the function reported an error
    if (error || !data?.success) {
      console.error('ASR request failed:', error ?? data);
      return;
    }

    setResult(data.result.text);
    setMetrics({
      time: Math.round(data.result.duration || 0),
      confidence: Math.round((data.result.confidence || 0) * 100),
      rate: "16k"
    });
  };

  // ... Recording and file upload logic
};

TTS Module (Text-to-Speech)

// src/components/mobile/TTSModule.tsx
const TTSModule = ({ engine = "baidu" }: TTSModuleProps) => {
  const callTTSAPI = async () => {
    const { data, error } = await supabase.functions.invoke('tts-convert', {
      body: {
        engine: engine,
        text: text,
        voice: selectedVoice,   // Unified voice ID, e.g. female_1
        speed: speed[0],        // 0.5-2.0
        volume: volume[0],      // 0-100
      }
    });

    // Stop if the invocation failed or the function reported an error
    if (error || !data?.success) {
      console.error('TTS request failed:', error ?? data);
      return;
    }

    setAudioUrl(data.result.audioUrl);
    setStatus("complete");
  };

  // ... Synthesis logic
};

Engine Selector

// src/components/mobile/EngineSelector.tsx
const engines = [
  { id: "baidu", name: "Baidu", shortName: "BD" },
  { id: "xunfei", name: "Xunfei", shortName: "XF" },
  { id: "volcano", name: "Volcano", shortName: "HS" },
  { id: "aliyun", name: "Alibaba Cloud", shortName: "ALI" },
];

Backend Implementation (Supabase Edge Functions)

ASR Conversion Service

File Location: supabase/functions/asr-convert/index.ts

Core Logic:

1. Select the engine implementation based on the engine parameter
2. Read the corresponding API credentials from environment variables
3. Call the selected engine's ASR API
4. Return standardized recognition results
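
The core logic above can be sketched as a small dispatcher. This is an illustration under stated assumptions rather than the actual contents of index.ts: `dispatchASR`, `ASRHandler`, and `ASRResponse` are hypothetical names, each handler is assumed to have its credentials already bound, and `duration` is assumed to be elapsed milliseconds.

```typescript
// Standardized result shape described in this document.
interface ASRResult {
  text: string;
  confidence: number;
  duration?: number;
}

type EngineId = "baidu" | "xunfei" | "volcano" | "aliyun";
type ASRHandler = (audioData: string) => Promise<ASRResult>;
type ASRResponse =
  | { success: true; result: ASRResult }
  | { success: false; error: string };

// Hypothetical dispatcher: selects a handler by engine code, times the call,
// and converts thrown errors into a uniform failure response.
async function dispatchASR(
  engine: string,
  audioData: string,
  handlers: Record<EngineId, ASRHandler>,
): Promise<ASRResponse> {
  if (!Object.prototype.hasOwnProperty.call(handlers, engine)) {
    return { success: false, error: `Unsupported engine: ${engine}` };
  }
  const started = Date.now();
  try {
    const result = await handlers[engine as EngineId](audioData);
    return { success: true, result: { ...result, duration: Date.now() - started } };
  } catch (err) {
    return { success: false, error: err instanceof Error ? err.message : String(err) };
  }
}
```

The response shape matches what the frontend modules read (`data.success`, `data.result.text`, `data.result.confidence`, `data.result.duration`).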

Baidu Implementation:

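async function callBaiduASR(apiKey: string, secretKey: string, audioData: string) {
  // 1. Get Access Token
  const accessToken = await getBaiduAccessToken(apiKey, secretKey);

  // 2. API URL - Don't include any parameters
  const apiUrl = 'https://vop.baidu.com/server_api';

  // audioData is the Base64-encoded WAV file; len is its decoded byte length
  const base64Audio = audioData;
  const padding = (base64Audio.match(/=/g) || []).length;
  const audioByteLength = Math.floor((base64Audio.length * 3) / 4) - padding;

  // 3. Request Body - token must be here
  const requestBody = {
    format: "wav",           // Audio format
    rate: 16000,             // Sample rate (must be number type)
    channel: 1,              // Number of channels
    cuid: userId,            // User identifier (userId is assumed to be defined elsewhere)
    token: accessToken,      // ← Key: token in request body
    speech: base64Audio,     // Base64 encoded audio
    len: audioByteLength,    // Actual byte length of WAV file (must be number type)
    // Don't use dev_pid
  };

  // 4. Send request
  const response = await fetch(apiUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(requestBody),
  });

  // 5. Parse the response; err_no is 0 on success
  const result = await response.json();
  if (result.err_no !== 0) {
    throw new Error(`Baidu ASR error ${result.err_no}: ${result.err_msg}`);
  }

  // confidence is a fixed placeholder value
  return { text: result.result[0], confidence: 0.95 };
}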
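
The snippet above calls `getBaiduAccessToken` without showing it. A minimal sketch of such a helper follows, assuming Baidu's OAuth 2.0 client-credentials endpoint at aip.baidubce.com; `buildBaiduTokenUrl` is a hypothetical name, and the project's real helper may differ (for example, by caching the token until it expires).

```typescript
// Builds the token request URL for the client-credentials grant.
function buildBaiduTokenUrl(apiKey: string, secretKey: string): string {
  const params = new URLSearchParams({
    grant_type: "client_credentials",
    client_id: apiKey,
    client_secret: secretKey,
  });
  return `https://aip.baidubce.com/oauth/2.0/token?${params.toString()}`;
}

// Sketch only: requests a fresh token on every call and does no caching.
async function getBaiduAccessToken(apiKey: string, secretKey: string): Promise<string> {
  const response = await fetch(buildBaiduTokenUrl(apiKey, secretKey), { method: "POST" });
  const data = await response.json();
  if (!response.ok || typeof data.access_token !== "string") {
    throw new Error(`Baidu token request failed: ${data.error_description ?? response.status}`);
  }
  return data.access_token;
}
```

Because the same token is used by both the ASR and TTS calls, caching it between requests would avoid an extra round trip per conversion.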

Xunfei Implementation:

async function callXunfeiASR(appId: string, apiKey: string, apiSecret: string, audioData: string) {
  // 1. Build WebSocket authentication URL (HMAC-SHA256 signature)
  // host and path identify the Xunfei ASR endpoint (defined elsewhere)
  const wsUrl = buildWebSocketAuthUrl(host, path, apiKey, apiSecret);

  return new Promise((resolve, reject) => {
    // 2. Establish WebSocket connection
    const ws = new WebSocket(wsUrl);

    // 3. Send recognition request once the connection is open
    ws.onopen = () => {
      ws.send(JSON.stringify({
        common: { app_id: appId },
        business: { language: "zh_cn", domain: "iat", accent: "mandarin" },
        data: { status: 2, format: "audio/L16;rate=16000", audio: audioData }  // Base64 audio
      }));
    };

    // 4. Receive and parse results
    ws.onmessage = (event) => {
      const data = JSON.parse(event.data);
      // Parse recognition results, then call resolve({ text, confidence })...
    };

    ws.onerror = () => reject(new Error('Xunfei ASR WebSocket error'));
  });
}
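
`buildWebSocketAuthUrl` is referenced above but not shown. The sketch below is one way to write it, assuming Xunfei's documented scheme of signing the host, date, and request line with HMAC-SHA256 and passing the result as query parameters. The optional `date` parameter is an addition for testability, and `node:crypto` is used here for brevity; the real helper may use Web Crypto instead.

```typescript
import { createHmac } from "node:crypto";

// Sketch of a signed WebSocket URL builder; not the project's actual helper.
function buildWebSocketAuthUrl(
  host: string,
  path: string,
  apiKey: string,
  apiSecret: string,
  date: Date = new Date(),
): string {
  // RFC 1123 date, e.g. "Thu, 01 Jan 1970 00:00:00 GMT"
  const dateHeader = date.toUTCString();

  // The string to sign consists of the host, the date, and the request line.
  const signatureOrigin = `host: ${host}\ndate: ${dateHeader}\nGET ${path} HTTP/1.1`;
  const signature = createHmac("sha256", apiSecret)
    .update(signatureOrigin)
    .digest("base64");

  // The authorization parameter is itself Base64-encoded.
  const authorizationOrigin =
    `api_key="${apiKey}", algorithm="hmac-sha256", ` +
    `headers="host date request-line", signature="${signature}"`;
  const authorization = btoa(authorizationOrigin);

  const query = new URLSearchParams({ authorization, date: dateHeader, host });
  return `wss://${host}${path}?${query.toString()}`;
}
```

The signature embeds the current time, so the URL should be built immediately before each connection rather than reused.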

Volcano Engine Implementation:

// Using WebSocket binary protocol
async function callVolcanoASR(appId: string, accessToken: string, audioData: string) {
  // cluster comes from the optional SUPERUN_VOLCANO_ASR_CLUSTER secret
  const cluster = Deno.env.get('SUPERUN_VOLCANO_ASR_CLUSTER') ?? 'volcengine_input_common';

  // 1. Build WebSocket URL
  const wsUrl = `wss://openspeech.bytedance.com/api/v2/asr?appid=${appId}&token=${accessToken}&cluster=${cluster}`;

  // 2. Establish connection (set binaryType to "arraybuffer")
  const ws = new WebSocket(wsUrl);
  ws.binaryType = "arraybuffer";

  // Messages can only be sent after the connection has opened
  ws.onopen = async () => {
    // 3. Send Full Client Request (binary protocol, Gzip compression)
    // jsonBytes: UTF-8 encoded JSON request parameters (construction omitted)
    const fullRequestMessage = await buildMessage(
      0b0001,  // message_type: full client request
      0b0000,  // flags: not last packet
      0b0001,  // serialization: JSON
      0b0001,  // compression: Gzip
      jsonBytes
    );
    ws.send(fullRequestMessage);

    // 4. Send audio data in chunks (a single, final chunk is shown here)
    // audioChunk: a slice of the decoded audio bytes (construction omitted)
    const audioMessage = await buildMessage(
      0b0010,  // message_type: audio only
      0b0010,  // flags: last packet
      0b0000,  // serialization: none
      0b0001,  // compression: Gzip
      audioChunk
    );
    ws.send(audioMessage);
  };

  // 5. Parse binary response
  ws.onmessage = async (event) => {
    const result = await parseServerResponse(event.data);
    // Parse recognition results...
  };
}
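
The framing step inside `buildMessage` can be illustrated as follows. This sketch assumes the binary layout used by Volcano Engine's streaming protocol: a 4-byte header (protocol version and header size, message type and flags, serialization and compression, one reserved byte), then a 4-byte big-endian payload size, then the payload. `buildFrame` is a hypothetical name, and unlike `buildMessage` it expects a payload that has already been serialized and Gzip-compressed.

```typescript
// Hypothetical sketch of the framing step; compression is left to the caller.
function buildFrame(
  messageType: number,
  flags: number,
  serialization: number,
  compression: number,
  payload: Uint8Array,
): Uint8Array {
  const PROTOCOL_VERSION = 0b0001;
  const HEADER_SIZE = 0b0001; // header length in 4-byte units

  const frame = new Uint8Array(8 + payload.length);
  frame[0] = (PROTOCOL_VERSION << 4) | HEADER_SIZE;
  frame[1] = (messageType << 4) | flags;
  frame[2] = (serialization << 4) | compression;
  frame[3] = 0x00; // reserved

  // Payload size as a 4-byte big-endian unsigned integer.
  new DataView(frame.buffer).setUint32(4, payload.length, false);
  frame.set(payload, 8);
  return frame;
}
```

With the arguments used for the full client request above (0b0001, 0b0000, 0b0001, 0b0001), the header bytes come out as 0x11 0x10 0x11 0x00.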

Alibaba Cloud Implementation:

async function callAliyunASR(accessKeyId: string, accessKeySecret: string, appKey: string, audioData: string) {
  // 1. Get Token (HMAC-SHA1 signature)
  const token = await getAliyunToken(accessKeyId, accessKeySecret);

  // Decode the Base64 audio into raw bytes for the request body
  const audioBytes = Uint8Array.from(atob(audioData), (c) => c.charCodeAt(0));

  // 2. Send REST API request
  const response = await fetch('https://nls-gateway-cn-shanghai.aliyuncs.com/stream/v1/asr?appkey=...', {
    method: 'POST',
    headers: {
      'X-NLS-Token': token,
      'Content-Type': 'application/octet-stream'
    },
    body: audioBytes  // Binary audio data
  });

  // 3. Parse the JSON response
  const result = await response.json();

  // confidence is a fixed placeholder value
  return { text: result.result, confidence: 0.94 };
}

TTS Conversion Service

File Location: supabase/functions/tts-convert/index.ts

Core Logic:

1. Select the engine implementation based on the engine parameter
2. Read the corresponding API credentials from environment variables
3. Map the voice parameter to the selected engine's voice code
4. Call the selected engine's TTS API
5. Return Base64-encoded audio data

Voice Mapping:

const voiceMapping: Record<string, Record<string, { code: string; name: string }>> = {
  baidu: {
    female_1: { code: "0", name: "Du Xiaomei" },
    male_1: { code: "1", name: "Du Xiaoyu" },
    // ...
  },
  xunfei: {
    female_1: { code: "xiaoyan", name: "Xunfei Xiaoyan" },
    // ...
  },
  volcano: {
    female_1: { code: "BV700_V2_streaming", name: "Fresh Female Voice" },
    // ...
  },
  aliyun: {
    female_1: { code: "aixia", name: "Aixia" },
    // ...
  },
};

Baidu Implementation:

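async function callBaiduTTS(apiKey: string, secretKey: string, text: string, voice: string, speed: number, volume: number) {
  const accessToken = await getBaiduAccessToken(apiKey, secretKey);

  // Map the unified voice ID to Baidu's voice code; unknown IDs fall back to female_1
  const voiceCode = (voiceMapping.baidu[voice] ?? voiceMapping.baidu.female_1).code;

  const params = new URLSearchParams({
    tex: text,
    tok: accessToken,
    lan: "zh",
    spd: Math.round(speed * 5).toString(),             // 0.5-2.0x → 3-10
    vol: Math.round((volume / 100) * 15).toString(),   // 0-100% → 0-15
    per: voiceCode,
    aue: "3",  // MP3 format
  });

  const response = await fetch(`https://tsn.baidu.com/text2audio?${params.toString()}`);

  // On failure Baidu returns a JSON error body instead of audio
  if (response.headers.get('Content-Type')?.includes('application/json')) {
    throw new Error(`Baidu TTS error: ${await response.text()}`);
  }
  const audioBuffer = await response.arrayBuffer();

  // Convert to base64
  const audioBase64 = bufferToBase64(audioBuffer);
  return { audioUrl: `data:audio/mp3;base64,${audioBase64}` };
}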

Xunfei Implementation:

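async function callXunfeiTTS(appId: string, apiKey: string, apiSecret: string, text: string, voice: string, speed: number, volume: number) {
  // Use WebSocket protocol
  // host and path identify the Xunfei TTS endpoint (defined elsewhere)
  const wsUrl = buildWebSocketAuthUrl(host, path, apiKey, apiSecret);

  // Map the unified voice ID to Xunfei's voice code; unknown IDs fall back to female_1
  const voiceCode = (voiceMapping.xunfei[voice] ?? voiceMapping.xunfei.female_1).code;
  const audioChunks: string[] = [];  // Decoded binary strings, one per chunk

  return new Promise((resolve, reject) => {
    const ws = new WebSocket(wsUrl);

    // Send the synthesis request once the connection is open
    ws.onopen = () => {
      ws.send(JSON.stringify({
        common: { app_id: appId },
        business: {
          aue: "lame",  // MP3 format
          vcn: voiceCode,
          speed: Math.round(speed * 50),
          volume: Math.min(100, Math.round(volume * 100 / 80)),  // Capped at 100
        },
        data: {
          status: 2,
          text: btoa(unescape(encodeURIComponent(text)))
        }
      }));
    };

    // Receive and merge audio data chunks
    ws.onmessage = (event) => {
      const data = JSON.parse(event.data);
      if (data.data && data.data.audio) {
        // Decode each chunk so Base64 padding inside a chunk cannot corrupt the merged audio
        audioChunks.push(atob(data.data.audio));
      }
      if (data.data && data.data.status === 2) {
        // Synthesis complete
        ws.close();
        const audioBase64 = btoa(audioChunks.join(''));
        resolve({ audioUrl: `data:audio/mp3;base64,${audioBase64}` });
      }
    };

    ws.onerror = () => reject(new Error('Xunfei TTS WebSocket error'));
  });
}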

Volcano Engine Implementation:

async function callVolcanoTTS(appId: string, accessToken: string, text: string, voice: string, speed: number, volume: number) {
  // Map the unified voice ID to Volcano Engine's voice code; unknown IDs fall back to female_1
  const voiceCode = (voiceMapping.volcano[voice] ?? voiceMapping.volcano.female_1).code;

  const response = await fetch('https://openspeech.bytedance.com/api/v1/tts', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      // Volcano Engine separates "Bearer" and the token with a semicolon
      'Authorization': `Bearer;${accessToken}`
    },
    body: JSON.stringify({
      app: { appid: appId, token: accessToken, cluster: "volcano_tts" },
      audio: {
        voice_type: voiceCode,
        encoding: "mp3",
        speed_ratio: speed,            // 0.5-2.0, passed through unchanged
        volume_ratio: volume / 100,    // 0-100% → 0.0-1.0
      },
      request: { text: text, text_type: "plain" }
    })
  });

  const result = await response.json();
  // Return base64 audio
  return { audioUrl: `data:audio/mp3;base64,${result.data}` };
}

Alibaba Cloud Implementation:

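async function callAliyunTTS(accessKeyId: string, accessKeySecret: string, appKey: string, text: string, voice: string, speed: number, volume: number) {
  const token = await getAliyunToken(accessKeyId, accessKeySecret);

  // Map the unified voice ID to Alibaba Cloud's voice code; unknown IDs fall back to female_1
  const voiceCode = (voiceMapping.aliyun[voice] ?? voiceMapping.aliyun.female_1).code;

  const response = await fetch('https://nls-gateway.cn-shanghai.aliyuncs.com/stream/v1/tts', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'X-NLS-Token': token,
    },
    body: JSON.stringify({
      appkey: appKey,
      text: text,
      voice: voiceCode,
      format: "mp3",
      sample_rate: 16000,
      volume: volume,
      speech_rate: Math.round((speed - 0.5) * 200),
    })
  });

  // On failure the service returns a JSON error body instead of audio
  if (response.headers.get('Content-Type')?.includes('application/json')) {
    throw new Error(`Aliyun TTS error: ${await response.text()}`);
  }

  const audioBuffer = await response.arrayBuffer();
  const audioBase64 = bufferToBase64(audioBuffer);
  return { audioUrl: `data:audio/mp3;base64,${audioBase64}` };
}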

Baidu ASR Common Errors and Solutions

Error Code 3311: param rate invalid

This is the most common error, usually caused by the following:

Issue	Solution
Token in the wrong place	Put the token in the request body, not in the URL parameters
Duplicated cuid	Send cuid only in the request body; do not repeat it in the URL
dev_pid present	Remove the dev_pid parameter and let Baidu auto-detect the language
rate has the wrong type	Ensure rate is a number, not a string
len miscalculated	Set len to the actual byte length of the WAV file

Correct len Parameter Calculation

Calculate actual byte length from Base64 string:

// Calculate actual byte length from Base64 string
const padding = (base64Audio.match(/=/g) || []).length;
const audioByteLength = Math.floor((base64Audio.length * 3) / 4) - padding;

// Verification: audioByteLength should equal WAV file's blob.size
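
The same calculation can be wrapped in a function and checked against an actual decode. This is a small sketch; `base64ByteLength` is a hypothetical name, and the input is assumed to be plain Base64 with no data-URL prefix or whitespace.

```typescript
// Returns the number of bytes a Base64 string decodes to.
function base64ByteLength(base64: string): number {
  const padding = (base64.match(/=/g) || []).length;
  return Math.floor((base64.length * 3) / 4) - padding;
}

// Cross-check against decoding: atob yields one character per decoded byte.
function matchesDecodedLength(base64: string): boolean {
  return base64ByteLength(base64) === atob(base64).length;
}
```

For example, "AQID" (bytes 1, 2, 3) gives 3, "AQI=" gives 2, and "AQ==" gives 1.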

Frontend Audio Processing Points

1. Recording Format

Browsers usually record in WebM/Opus:

const mimeType = "audio/webm;codecs=opus";

2. Must Resample to 16kHz (Baidu Requirement)

const offlineContext = new OfflineAudioContext(
  1,                    // Mono channel
  targetLength,         
  16000                 // Target sample rate
);

3. Convert to 16bit PCM

const pcm16 = new Int16Array(samples.length);
for (let i = 0; i < samples.length; i++) {
  const s = Math.max(-1, Math.min(1, samples[i]));
  pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
}

4. Add WAV Header (44 bytes)

const wavHeader = {
  sampleRate: 16000,
  numChannels: 1,
  bitsPerSample: 16,
  byteRate: 32000,      // 16000 * 1 * 16 / 8
  blockAlign: 2,        // 1 * 16 / 8
};
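
The header fields above map onto the standard 44-byte RIFF/WAVE PCM header. The following sketch writes that header in front of 16-bit mono samples; `encodeWav` and `writeAscii` are hypothetical names, and the input is assumed to be the `Int16Array` produced in step 3.

```typescript
// Writes an ASCII tag such as "RIFF" into the view at the given offset.
function writeAscii(view: DataView, offset: number, text: string): void {
  for (let i = 0; i < text.length; i++) {
    view.setUint8(offset + i, text.charCodeAt(i));
  }
}

// Hypothetical helper: wraps 16-bit mono PCM samples in a 44-byte WAV header.
function encodeWav(pcm16: Int16Array, sampleRate = 16000): Uint8Array {
  const numChannels = 1;
  const bitsPerSample = 16;
  const blockAlign = (numChannels * bitsPerSample) / 8; // 2
  const byteRate = sampleRate * blockAlign;             // 32000 at 16 kHz
  const dataSize = pcm16.length * 2;

  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);
  writeAscii(view, 0, "RIFF");
  view.setUint32(4, 36 + dataSize, true);  // file size minus the first 8 bytes
  writeAscii(view, 8, "WAVE");
  writeAscii(view, 12, "fmt ");
  view.setUint32(16, 16, true);            // fmt chunk size for PCM
  view.setUint16(20, 1, true);             // audio format 1 = PCM
  view.setUint16(22, numChannels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeAscii(view, 36, "data");
  view.setUint32(40, dataSize, true);

  // Samples follow the header in little-endian order.
  for (let i = 0; i < pcm16.length; i++) {
    view.setInt16(44 + i * 2, pcm16[i], true);
  }
  return new Uint8Array(buffer);
}
```

The length of the returned array is the value to send as `len`, since it is the actual byte length of the WAV file.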

Environment Variable Configuration

Configure in Supabase Edge Function Secrets:

# Supabase Edge Function Secrets
SUPERUN_BAIDU_API_KEY=your_baidu_api_key
SUPERUN_BAIDU_SECRET_KEY=your_baidu_secret_key

How to Obtain: Baidu Intelligent Cloud Console → Speech Technology → Create Application

Debugging Checklist

When you encounter error 3311, check the following in order:

✅ Is the token in the request body (not a URL parameter)?
✅ Is rate a number (typeof rate === 'number')?
✅ Does len equal the actual size of the WAV file in bytes?
✅ Has the dev_pid parameter been removed?
✅ Is the sample rate in the WAV header 16000?
✅ Is the audio duration within the 0.5-60 second range?
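
The checklist can also be run automatically before a request is sent. The sketch below is illustrative only: `checkBaiduASRRequest` is a hypothetical name, and the duration estimate assumes 16-bit mono audio at 16 kHz (32000 bytes per second) behind a 44-byte header.

```typescript
// Hypothetical pre-flight check; returns a list of problems (empty when all checks pass).
function checkBaiduASRRequest(body: Record<string, unknown>, url: string): string[] {
  const problems: string[] = [];

  if (typeof body.token !== "string" || body.token === "") {
    problems.push("token must be a non-empty string in the request body");
  }
  if (new URL(url).searchParams.has("token")) {
    problems.push("token must not be passed as a URL parameter");
  }
  if (typeof body.rate !== "number") problems.push("rate must be a number");
  if (typeof body.len !== "number") problems.push("len must be a number");
  if ("dev_pid" in body) problems.push("dev_pid should be removed");

  const speech = body.speech;
  const len = body.len;
  if (typeof speech === "string" && typeof len === "number") {
    const padding = (speech.match(/=/g) || []).length;
    const byteLength = Math.floor((speech.length * 3) / 4) - padding;
    if (len !== byteLength) {
      problems.push(`len is ${len} but speech decodes to ${byteLength} bytes`);
    }
    if (byteLength >= 44) {
      // The first 60 Base64 characters cover the 44-byte WAV header.
      const header = Uint8Array.from(atob(speech.slice(0, 60)), (c) => c.charCodeAt(0));
      const sampleRate = new DataView(header.buffer).getUint32(24, true);
      if (sampleRate !== 16000) {
        problems.push(`WAV header sample rate is ${sampleRate}, expected 16000`);
      }
      const seconds = (byteLength - 44) / 32000;
      if (seconds < 0.5 || seconds > 60) {
        problems.push(`audio duration ${seconds.toFixed(2)} s is outside 0.5-60 s`);
      }
    }
  }
  return problems;
}
```

An empty array means every item on the checklist passed; any other result can be logged before the request reaches Baidu.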

Complete Request Example

Correct ✓:

{
  format: "wav",
  rate: 16000,          // number type
  channel: 1,
  cuid: "user_001",
  token: "24.xxx...",   // In request body
  speech: "UklGR...",   // Base64
  len: 63404            // number type, actual byte length
}

Incorrect ✗:

{
  format: "wav",
  rate: "16000",        // ← Error: string type
  channel: 1,
  cuid: "user_001",
  dev_pid: 1737,        // ← Error: don't use
  speech: "UklGR...",
  len: "63404"          // ← Error: string type
}
// URL: ?token=xxx      // ← Error: don't put token in URL

Technical Points

ASR (Speech-to-Text)

Unified Audio Format: All engines use WAV format, 16kHz sample rate, mono channel
Base64 Encoding: Audio data converted to base64 in frontend before passing to backend
Protocol Differences:
- Baidu, Alibaba Cloud: REST API
- Xunfei, Volcano Engine: WebSocket protocol
Standardized Results: Unified return format { text, confidence, duration }

TTS (Text-to-Speech)

Voice Mapping: Frontend uses unified voice IDs (female_1, male_1, etc.), backend maps to each engine’s actual voice codes
Parameter Conversion:
- Speed: Frontend range 0.5-2.0x, each engine converts to corresponding range
- Volume: Frontend range 0-100%, each engine converts to corresponding range
Output Format: All engines uniformly return MP3 format base64 encoded audio
Protocol Differences:
- Baidu, Alibaba Cloud, Volcano Engine: REST API
- Xunfei: WebSocket protocol (needs to receive multiple audio chunks)
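
The parameter conversions can be collected in one place. The sketch below reuses the formulas from the TTS implementations on this page; `toEngineParams` is a hypothetical name, and clamping the inputs to the frontend ranges and capping Xunfei's volume at 100 are added safeguards based on the assumption that Xunfei accepts 0-100.

```typescript
type TTSEngine = "baidu" | "xunfei" | "volcano" | "aliyun";

interface EngineParams {
  speed: number;
  volume: number;
}

const clamp = (value: number, min: number, max: number): number =>
  Math.min(max, Math.max(min, value));

// Converts frontend speed (0.5-2.0x) and volume (0-100%) to engine-specific values.
function toEngineParams(engine: TTSEngine, speed: number, volume: number): EngineParams {
  const s = clamp(speed, 0.5, 2.0);
  const v = clamp(volume, 0, 100);
  switch (engine) {
    case "baidu":
      return { speed: Math.round(s * 5), volume: Math.round((v / 100) * 15) };
    case "xunfei":
      return { speed: Math.round(s * 50), volume: Math.min(100, Math.round((v * 100) / 80)) };
    case "volcano":
      return { speed: s, volume: v / 100 };
    case "aliyun":
      return { speed: Math.round((s - 0.5) * 200), volume: v };
  }
}
```

Keeping the conversions in a single function makes it easier to compare how one frontend setting is interpreted by each engine.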

Testing Recommendations

API Credential Testing: Ensure all environment variables are correctly configured
Audio Format Testing: Test different audio file formats (WAV, MP3, M4A)
Duration Limit Testing: Pay special attention to Alibaba Cloud’s 60-second limit
Error Handling Testing: Test network errors, API errors, and other exceptional cases
Concurrency Testing: Test multiple users using different engines simultaneously

Notes

Cost Control: Each engine has its own billing rules, monitor API call volume
Rate Limits: Each engine has call frequency limits, avoid exceeding limits
Audio Size: Recommend limiting uploaded audio file size (e.g., 10MB)
Timeout Settings: Set reasonable timeout for WebSocket connections (e.g., 30 seconds)
Error Logging: Record detailed error information for troubleshooting
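
For the timeout recommendation, one option is a generic wrapper around any promise-based engine call. This is a sketch under assumptions: `withTimeout` is a hypothetical helper, and it only rejects the waiting promise, so the caller remains responsible for closing the underlying WebSocket.

```typescript
// Rejects if `promise` has not settled within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms} ms`)),
      ms,
    );
    promise.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}
```

A WebSocket-based call could then be wrapped as `withTimeout(callXunfeiTTS(...), 30000, "Xunfei TTS")`, with the timeout error recorded in the error log.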

superun Official Website

Browse the official website to learn more features and usage examples.
