Streaming (HTTP)

For an example project that uses the 0xIQ.ai API, see the https://github.com/0xIQai/IQ repository on GitHub.

This endpoint streams a new clip and returns the audio data (wav format) through a stream. With streaming, audio is generated sequentially and sent in chunks. The delay before receiving the first bits of audio is shorter than the time it takes to perform a full synchronous synthesis, regardless of the size of your query. Implementing streaming in your app will make it more responsive.

Additionally, streaming synthesis will yield timestamp information (albeit in a different format than synchronous synthesis) and exact audio duration in the first chunk of bytes streamed, prior to any audio bytes.

Alternatively, if you're on a Business Plan or higher, you can also stream your data from a websocket using our websocket API.

Careful

Note that the request is sent to our synthesis servers, not to app.0xIQ.ai.

HTTP Request

POST YOUR_STREAMING_ENDPOINT

Your streaming endpoint can be seen in the Try it out section.

| JSON Body | Type | Description |
| --- | --- | --- |
| project_uuid | string | UUID of the project to which the clip should belong |
| voice_uuid | string | UUID of the voice to use for synthesizing |
| data | string | Content to be synthesized. At the moment, SSML is only partially supported. Maximum length of 3000 characters. |
| precision | (optional) string | The bit depth of the generated audio. One of the following values: PCM_32, PCM_24, PCM_16, or MULAW. Default is PCM_32. |
| sample_rate | (optional) integer | The sample rate of the produced audio. One of 8000, 16000, 22050, 32000, 44100, or 48000. Default is 22050. |
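As a rough illustration of the body described above, the following Python sketch builds (but does not send) the POST request using only the standard library. The function name `build_stream_request`, the `extra_headers` parameter, and the placeholder URL are assumptions made for this example. Authentication headers are not described in this section, so they are left to the caller.

```python
import json
import urllib.request

# Allowed values as listed in the JSON body description.
ALLOWED_PRECISIONS = {"PCM_32", "PCM_24", "PCM_16", "MULAW"}
ALLOWED_SAMPLE_RATES = {8000, 16000, 22050, 32000, 44100, 48000}
MAX_DATA_LENGTH = 3000


def build_stream_request(endpoint, project_uuid, voice_uuid, data,
                         precision=None, sample_rate=None, extra_headers=None):
    """Build a POST request for the streaming endpoint without sending it."""
    if len(data) > MAX_DATA_LENGTH:
        raise ValueError("data must not exceed 3000 characters")
    body = {"project_uuid": project_uuid, "voice_uuid": voice_uuid, "data": data}
    if precision is not None:
        if precision not in ALLOWED_PRECISIONS:
            raise ValueError("unsupported precision: %r" % precision)
        body["precision"] = precision
    if sample_rate is not None:
        if sample_rate not in ALLOWED_SAMPLE_RATES:
            raise ValueError("unsupported sample rate: %r" % sample_rate)
        body["sample_rate"] = sample_rate
    headers = {"Content-Type": "application/json"}
    # Assumption: any authentication headers are supplied by the caller.
    headers.update(extra_headers or {})
    return urllib.request.Request(
        endpoint,
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Hypothetical endpoint and UUIDs; replace with your own values.
request = build_stream_request(
    "https://example.invalid/stream",
    "your-project-uuid",
    "your-voice-uuid",
    "Hello world",
    sample_rate=22050,
)
```

Passing the request to `urllib.request.urlopen` returns a file-like response, and calling `read(n)` on it repeatedly lets you consume the audio as it arrives instead of waiting for the whole clip.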

HTTP Response

A successful response contains bytes which make up a single channel PCM 16 wav file. It can be decoded and played back on the fly.

Wav encoding

We take advantage of the RIFF format to encode additional data in the header of our wavs, such as:

The size (in bytes) of the entire file
The number of audio samples
The sample rate
The times at which characters will be pronounced
The times at which phonemes will be pronounced

Typically, audio libraries will only parse the first three of these values. Therefore, if you wish to obtain the audio timestamps, you will have to either use one of our SDKs or handle the decoding yourself, using our wav specification.

A typical wav file starts with a RIFF chunk, followed by a format chunk and finally a data chunk. The wav files we return additionally contain a cue chunk, a list chunk and ltxt chunks. These are located between the format chunk and the data chunk, and carry the timestamp information. If you are interested in the full specification of the wav format, see here.

Below is the specification of the wav format that we use. Bytes are encoded in little-endian order. Integers are always unsigned, and strings are encoded as ASCII characters (with the exception of the text in the ltxt chunks).

Header & Format chunks

| Size (bytes) | Description | Value |
| --- | --- | --- |
| 4 | RIFF ID | "RIFF" |
| 4 | Remaining file size (in bytes) after this read* | (file size) - 8 |
| 4 | RIFF type | "WAVE" |
| 4 | Format chunk ID | "fmt " (note the space) |
| 4 | Chunk data size | 16 |
| 2 | Compression code | 1 (corresponds to PCM) |
| 2 | Number of channels | 1 |
| 4 | Sample rate | 8,000 - 48,000 |
| 4 | Byte rate | 16,000 - 96,000 |
| 2 | Block align | 2 |
| 2 | Bits per sample | 16 |

* If you have an older model, the file size will be 0xFFFFFFFF instead. If this is your case, please contact us and we will upgrade your model.

After having parsed these chunks, you will know the sample rate and the total size of the audio file in bytes. Do not use this size to approximate the remaining audio duration: the data chunk will give you the exact length of the audio.
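The header and format chunks can be decoded with a few `struct` calls. The sketch below assumes the standard WAV field widths (4-byte IDs and sizes, 2-byte compression code, channel count, block align and bits per sample). The helper names `read_exact` and `read_header_and_format` are made up for this example, and the demo bytes are synthetic rather than real API output.

```python
import io
import struct


def read_exact(stream, n):
    """Read exactly n bytes, looping because a network read may return fewer."""
    buf = b""
    while len(buf) < n:
        part = stream.read(n - len(buf))
        if not part:
            raise EOFError("stream ended before the expected bytes arrived")
        buf += part
    return buf


def read_header_and_format(stream):
    """Parse the RIFF header and the format chunk from a binary stream."""
    riff_id, riff_size, riff_type = struct.unpack("<4sI4s", read_exact(stream, 12))
    if riff_id != b"RIFF" or riff_type != b"WAVE":
        raise ValueError("not a RIFF/WAVE stream")
    fmt_id, fmt_size = struct.unpack("<4sI", read_exact(stream, 8))
    if fmt_id != b"fmt ":
        raise ValueError("expected a format chunk")
    fields = struct.unpack("<HHIIHH", read_exact(stream, fmt_size)[:16])
    compression, channels, sample_rate, byte_rate, block_align, bits = fields
    return {
        # Older models report 0xFFFFFFFF, in which case the size is unknown.
        "file_size": None if riff_size == 0xFFFFFFFF else riff_size + 8,
        "compression": compression,
        "channels": channels,
        "sample_rate": sample_rate,
        "byte_rate": byte_rate,
        "block_align": block_align,
        "bits_per_sample": bits,
    }


# Demo on a synthetic header (not real API output).
demo = struct.pack(
    "<4sI4s4sIHHIIHH",
    b"RIFF", 36, b"WAVE", b"fmt ", 16, 1, 1, 22050, 44100, 2, 16,
)
info = read_header_and_format(io.BytesIO(demo))
print(info["sample_rate"], info["file_size"])
```

`read_exact` matters when the source is an HTTP response rather than a file, since a single `read` call is not guaranteed to return every byte requested.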

Timestamps (cue, list & ltxt chunks)

Cue

| Size (bytes) | Description | Value |
| --- | --- | --- |
| 4 | Cue chunk ID | "cue " |
| 4 | Remaining size of the cue chunk after this read | 4 + n_cue_points * 24 |
| 4 | Number of remaining cue points | n_cue_points |

This chunk is then followed by n_cue_points cue points:

Cue points

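| Size (bytes) | Description | Value |
| --- | --- | --- |
| 4 | Cue point ID | 0 - 0xFFFFFFFF |
| 4 | Unused | |
| 4 | Data chunk ID | "data" |
| 8 | Unused | |
| 4 | Sample offset | 0 - 0xFFFFFFFF |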

Cue points are simply a list of time points (expressed as offsets in number of audio samples) which mark the start of graphemes or phonemes.

The graphemes and phonemes are registered in the list chunk.
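A minimal sketch of reading the cue chunk follows. It relies only on what the tables state: each cue point takes 24 bytes, starts with its ID and ends with its sample offset. Treating both as 4-byte fields is an assumption, and `read_cue_chunk` is a hypothetical helper name. The demo bytes are synthetic.

```python
import io
import struct


def read_exact(stream, n):
    """Read exactly n bytes, looping because a network read may return fewer."""
    buf = b""
    while len(buf) < n:
        part = stream.read(n - len(buf))
        if not part:
            raise EOFError("stream ended before the expected bytes arrived")
        buf += part
    return buf


def read_cue_chunk(stream):
    """Return a dict mapping each cue point ID to its sample offset."""
    chunk_id, chunk_size, n_cue_points = struct.unpack(
        "<4sII", read_exact(stream, 12)
    )
    if chunk_id != b"cue ":
        raise ValueError("expected a cue chunk")
    if chunk_size != 4 + n_cue_points * 24:
        raise ValueError("cue chunk size does not match the cue point count")
    offsets = {}
    for _ in range(n_cue_points):
        point = read_exact(stream, 24)
        # Assumption: the ID is the first 4 bytes, the offset the last 4.
        (cue_id,) = struct.unpack_from("<I", point, 0)
        (sample_offset,) = struct.unpack_from("<I", point, 20)
        offsets[cue_id] = sample_offset
    return offsets


# Demo on two synthetic cue points (not real API output).
points = (
    struct.pack("<II4sIII", 0, 0, b"data", 0, 0, 0)
    + struct.pack("<II4sIII", 1, 0, b"data", 0, 0, 2205)
)
chunk = struct.pack("<4sII", b"cue ", 4 + 2 * 24, 2) + points
cue_offsets = read_cue_chunk(io.BytesIO(chunk))
print(cue_offsets)
```

Keeping the offsets in a dict keyed by cue point ID makes the later lookup from an ltxt chunk a single step.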

List

| Size (bytes) | Description | Value |
| --- | --- | --- |
| 4 | List chunk ID | "list" |
| 4 | Remaining size of the list chunk after this read | 4 + (sum of ltxt chunk sizes) |
| 4 | Type ID | "adtl" |

This chunk is then followed by the ltxt chunks:

LTXT

| Size (bytes) | Description | Value |
| --- | --- | --- |
| 4 | LTXT chunk ID | "ltxt" |
| 4 | Remaining size of this ltxt chunk after this read* | 20 + text_length |
| 4 | Cue point ID | 0 - 0xFFFFFFFF |
| 4 | Length in number of samples | 1 - 0xFFFFFFFF |
| 4 | Character type | "grph" OR "phon" |
| 8 | Unused | |
| text_length | The UTF-8 encoded text with a "\0" termination character | |

* Note

The wav specification requires that all chunks be aligned on multiples of the block align value (always 2 in our encoding). Therefore, if text_length is odd, you must skip one additional byte after having read the chunk.

Each LTXT chunk corresponds to a character or a phoneme (possibly with stress and duration characters). To get the starting position, take the cue point with the same ID as in the LTXT chunk and take its sample offset. To get the ending position, add the length in number of samples from the LTXT chunk to the starting position.

Graphemes are given first, in sequential order. They have the "grph" character type. Then, phonemes follow with the "phon" character type.
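The following sketch reads the list chunk and its ltxt chunks, then joins them with the cue offsets to produce start and end times in seconds. It assumes 4-byte integer fields, 8 unused bytes before the text, and that the ltxt chunks end when a different chunk ID (normally "data") appears. The names `read_ltxt_chunks` and `to_timestamps` are hypothetical, and the demo bytes are synthetic.

```python
import io
import struct


def read_exact(stream, n):
    """Read exactly n bytes, looping because a network read may return fewer."""
    buf = b""
    while len(buf) < n:
        part = stream.read(n - len(buf))
        if not part:
            raise EOFError("stream ended before the expected bytes arrived")
        buf += part
    return buf


def read_ltxt_chunks(stream):
    """Return (entries, next_chunk_id); the next ID has already been consumed."""
    list_id, _size, type_id = struct.unpack("<4sI4s", read_exact(stream, 12))
    # The ID is compared case-insensitively to tolerate "list" or "LIST".
    if list_id.lower() != b"list" or type_id != b"adtl":
        raise ValueError("expected a list chunk of type adtl")
    entries = []
    while True:
        chunk_id = read_exact(stream, 4)
        if chunk_id != b"ltxt":
            return entries, chunk_id
        size, cue_id, length, kind = struct.unpack("<III4s", read_exact(stream, 16))
        read_exact(stream, 8)  # unused bytes
        text_length = size - 20
        raw = read_exact(stream, text_length)
        if text_length % 2 == 1:
            read_exact(stream, 1)  # alignment byte after an odd-length text
        text = raw.rstrip(b"\0").decode("utf-8")
        entries.append({"cue_id": cue_id, "length": length,
                        "kind": kind.decode("ascii"), "text": text})


def to_timestamps(cue_offsets, entries, sample_rate):
    """Convert sample positions into (kind, text, start_s, end_s) tuples."""
    result = []
    for entry in entries:
        start = cue_offsets[entry["cue_id"]]
        end = start + entry["length"]
        result.append((entry["kind"], entry["text"],
                       start / sample_rate, end / sample_rate))
    return result


def _ltxt(cue_id, length, kind, text):
    """Build one synthetic ltxt chunk for the demo below."""
    body = struct.pack("<II4s8x", cue_id, length, kind) + text
    pad = b"\0" if len(text) % 2 else b""
    return b"ltxt" + struct.pack("<I", len(body)) + body + pad


# Demo on synthetic chunks (not real API output).
chunks = _ltxt(0, 2205, b"grph", b"H\0") + _ltxt(1, 2205, b"phon", b"hh\0")
payload = b"list" + struct.pack("<I", 4 + len(chunks)) + b"adtl" + chunks + b"data"
entries, next_id = read_ltxt_chunks(io.BytesIO(payload))
timestamps = to_timestamps({0: 0, 1: 0}, entries, 22050)
```

Because the reader stops on the first ID that is not "ltxt", it does not depend on how the list chunk size is computed; the caller receives the consumed ID and can continue with the data chunk.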

Audio data chunk

Finally, the audio data chunk follows.

Data

| Size (bytes) | Description | Value |
| --- | --- | --- |
| 4 | Data chunk ID | "data" |
| 4 | Number of remaining audio samples * 2 | wav_length * 2 |

The remainder of the wav file is the PCM 16 encoded audio bytes.
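Since the data chunk size is the number of samples times 2, the exact duration is the sample count divided by the sample rate. The sketch below shows this along with block-wise decoding of the samples. It assumes 16-bit samples are signed, as is conventional for PCM 16 wav files. The names `read_data_header` and `iter_samples` are hypothetical, and the demo bytes are synthetic.

```python
import io
import struct


def read_exact(stream, n):
    """Read exactly n bytes, looping because a network read may return fewer."""
    buf = b""
    while len(buf) < n:
        part = stream.read(n - len(buf))
        if not part:
            raise EOFError("stream ended before the expected bytes arrived")
        buf += part
    return buf


def read_data_header(stream, sample_rate, chunk_id=None):
    """Return (n_samples, duration_seconds) from the data chunk header.

    Pass chunk_id if the 4-byte ID was already consumed by earlier parsing.
    """
    if chunk_id is None:
        chunk_id = read_exact(stream, 4)
    if chunk_id != b"data":
        raise ValueError("expected the data chunk")
    (size,) = struct.unpack("<I", read_exact(stream, 4))
    n_samples = size // 2  # 2 bytes per PCM 16 sample, single channel
    return n_samples, n_samples / sample_rate


def iter_samples(stream, n_samples, block=2048):
    """Yield tuples of signed 16-bit samples, at most `block` at a time."""
    remaining = n_samples
    while remaining > 0:
        count = min(block, remaining)
        raw = read_exact(stream, count * 2)
        yield struct.unpack("<%dh" % count, raw)
        remaining -= count


# Demo on four synthetic samples (not real API output).
audio = struct.pack("<4h", 0, 1000, -1000, 32767)
stream = io.BytesIO(b"data" + struct.pack("<I", len(audio)) + audio)
n_samples, duration = read_data_header(stream, 8000)
samples = [s for block in iter_samples(stream, n_samples, block=3) for s in block]
```

Decoding in blocks lets playback start as soon as the first bytes arrive, which is the main benefit of the streaming endpoint.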
