Speech to Speech
Overview
Summarized below are the steps required to use this API:
Make a request to the low latency synthesis endpoint.
Decode the base64 “audio_content” attribute sent back in the response.
Use the audio data.
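The steps above can be sketched in Python. The JSON below is a stand-in for a real response from the synthesis endpoint (in practice it would come from your HTTP client); it is constructed here only to demonstrate the decode step:

```python
import base64
import json

# Stand-in for the JSON body returned by POST /synthesize
# (in practice this comes from your HTTP client's response).
resp_json = json.dumps({
    "audio_content": base64.b64encode(b"RIFF....WAVEfmt ").decode("ascii"),
    "output_format": "wav",
    "success": True,
})

# Step 2: decode the base64 "audio_content" attribute.
resp = json.loads(resp_json)
audio_bytes = base64.b64decode(resp["audio_content"])

# Step 3: use the audio data, e.g. write it to a file.
with open("output.wav", "wb") as f:
    f.write(audio_bytes)
```
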
Example
Copy and paste the following into your terminal, replace YOUR_API_TOKEN
with your actual API token, then press Enter:
curl --request POST 'https://f.cluster.0xIQ.ai/synthesize' \
-H 'Authorization: Bearer YOUR_API_TOKEN' \
-H 'Content-Type: application/json' \
-H 'Accept-Encoding: gzip' \
--data '{
"voice_uuid": "55592656",
"data": "<0xIQ:convert src=\"https://storage.googleapis.com/0xIQ-ai-docs-public-files/sts-donor-example.wav\"></0xIQ:convert>",
"sample_rate": 48000,
"output_format": "wav"
}'
HTTP Request
curl --request POST "https://f.cluster.0xIQ.ai/synthesize" \
-H "Authorization: Bearer YOUR_API_TOKEN" \
-H "Content-Type: application/json" \
-H "Accept-Encoding: gzip, deflate, br" \
--data '{
"voice_uuid": <Voice to synthesize in>,
"project_uuid": <Project to save to>,
"title": <Title of the clip>,
"data": <Text to synthesize>,
"precision": "MULAW|PCM_16|PCM_24|PCM_32 (default)",
"output_format": "mp3|wav (default)"
}'
Request Headers
Authorization
Bearer YOUR_API_TOKEN
An API token can be obtained by logging into the 0xIQ web application and navigating to the API section.
Accept-Encoding
gzip, deflate, br
Any of gzip, deflate, or br, depending on the decompression algorithms your application supports. Omitting the Accept-Encoding header disables compression.
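When a compressed response arrives (Content-Encoding: gzip), the body must be decompressed before the JSON is parsed; most HTTP client libraries do this transparently. A minimal sketch of the gzip case, using a locally compressed body as a stand-in for a server response:

```python
import gzip
import json

# Stand-in for a response body sent with Content-Encoding: gzip.
compressed_body = gzip.compress(b'{"success": true}')

# Decompress, then parse the JSON as usual.
decoded = gzip.decompress(compressed_body)
resp = json.loads(decoded)
```
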
Request Body
voice_uuid
string
The voice to synthesize the text in.
project_uuid
string
The project to save the data to.
title
string
The title of the clip. This is optional; by default the clip is named Low Latency Synthesis {some-uuid}.
data
string
The text or SSML to synthesize. Maximum file size of 50MB or maximum duration of 5 minutes.
precision
string
The bit-depth of the generated wav file (if using wav as the response type). Either MULAW, PCM_16, PCM_24, or PCM_32 (default).
output_format
string
The output format of the produced audio. Either wav, or mp3.
sample_rate
integer
The sample rate of the produced audio. Either 8000, 16000, 22050, 32000, or 44100.
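The request body fields above can be assembled as a plain JSON object. A minimal sketch; the voice_uuid, project_uuid, and title values here are placeholders, not real identifiers:

```python
import json

# Placeholder identifiers -- substitute your own voice and project UUIDs.
body = {
    "voice_uuid": "YOUR_VOICE_UUID",
    "project_uuid": "YOUR_PROJECT_UUID",
    "title": "Low Latency Synthesis example",  # optional
    "data": "Hello world.",                    # text or SSML
    "precision": "PCM_32",                     # MULAW | PCM_16 | PCM_24 | PCM_32 (default)
    "output_format": "wav",                    # wav (default) | mp3
    "sample_rate": 44100,                      # 8000 | 16000 | 22050 | 32000 | 44100
}

# Serialize for use as the --data payload of the request.
payload = json.dumps(body)
```
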
HTTP Response
{
"audio_content": <base64 encoded string of the raw audio bytes>,
"audio_timestamps": {
"graph_chars": string[],
"graph_times": float[][],
"phon_chars": string[],
"phon_times": float[][],
},
"duration": float,
"issues": string[],
"output_format": string,
"sample_rate": float,
"success": boolean,
"synth_duration": float,
"title": string|null
}
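A sketch of reading the fields above from a decoded response. The field names match the schema; the values are fabricated for illustration only:

```python
import json

# Fabricated example response -- field names follow the schema above,
# values are illustrative only.
raw = json.dumps({
    "audio_content": "UklGRg==",
    "audio_timestamps": {
        "graph_chars": ["H", "i"],
        "graph_times": [[0.0, 0.1], [0.1, 0.2]],
        "phon_chars": ["h", "i"],
        "phon_times": [[0.0, 0.1], [0.1, 0.2]],
    },
    "duration": 0.2,
    "issues": [],
    "output_format": "wav",
    "sample_rate": 44100,
    "success": True,
    "synth_duration": 0.2,
    "title": None,
})

resp = json.loads(raw)
if not resp["success"]:
    raise RuntimeError(f"synthesis failed: {resp['issues']}")

# synth_duration, not duration, is the value 0xIQ bills on.
billed_seconds = resp["synth_duration"]
```
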
Response Body
audio_content
string
Base64 encoded string. When decoded, it contains the raw audio bytes.
audio_timestamps
object
Object containing character and phoneme timestamp information. See the Audio Timestamps Object section below for further information.
duration
float
The duration of the produced audio file. 0xIQ does not bill on this value.
issues
string[]
Any issues pertaining to the synthesis response.
output_format
string
The output format of the produced audio. Either 'wav', or 'mp3'.
sample_rate
integer
The sample rate of the produced audio. Either 8000, 16000, 22050, 32000, or 44100.
success
boolean
True if the response was successful, false otherwise.
synth_duration
float
The duration of the raw audio file produced, before any post-processing effects are applied (e.g. the 'prosody' tag, which may increase or decrease the duration of the final audio file). 0xIQ bills on this value.
title
string
The title of the clip. If no title is provided in the request body, then the value will be null.
Audio Timestamps Object
graph_chars
string[]
An array of grapheme characters, mapping 1 to 1 with the graph_times array.
graph_times
float[][]
An array mapping 1 to 1 with graph_chars. Each entry contains the timestamps in the audio for the grapheme character at the same index in graph_chars.
phon_chars
string[]
An array of phoneme characters, mapping 1 to 1 with the phon_times array.
phon_times
float[][]
An array mapping 1 to 1 with phon_chars. Each entry contains the timestamps in the audio for the phoneme character at the same index in phon_chars.
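The parallel-array layout can be consumed by zipping each character array with its times array. A sketch on fabricated data, assuming each inner array in phon_times is a [start, end] pair in seconds (the exact inner-array layout should be confirmed against a real response):

```python
# Fabricated timestamp data in the shape of the audio_timestamps object.
phon_chars = ["h", "e", "l", "o"]
phon_times = [[0.00, 0.05], [0.05, 0.12], [0.12, 0.20], [0.20, 0.35]]

# Pair each phoneme with its timing entry (the arrays map 1 to 1).
aligned = list(zip(phon_chars, phon_times))
for char, times in aligned:
    print(f"{char}: {times[0]:.2f}s - {times[1]:.2f}s")
```
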