Bixby Developer Center

Using SSML

Bixby's dialog can include a subset of tags from Speech Synthesis Markup Language (SSML), a W3C standard for enriching text-to-speech.

To use SSML, you must observe the following rules:

  • SSML is only valid inside the speech key in dialog templates.
  • Speech must start with the <speak> tag and end with the </speak> closing tag. If these tags are not present, the speech will not be recognized as containing SSML.
  • The speech string must be enclosed in double quotation marks, and any double quotes inside the string must be escaped with a \ character.

template ("The French word for cat is 'chat.'") {
  speech ("<speak>The French word for cat is <lang xml:lang=\"fr-FR\">chat</lang>.</speak>")
}
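
The three rules above can be applied mechanically. The following Python sketch is not part of Bixby's tooling; the helper name to_speech_line is hypothetical. It wraps an SSML fragment in <speak> tags when they are missing and escapes double quotes, so the result can be pasted into a dialog template.

```python
def to_speech_line(ssml: str) -> str:
    """Return a dialog `speech` line for the given SSML fragment.

    Hypothetical helper: adds <speak>...</speak> when the fragment is not
    already wrapped, and escapes double quotes with a backslash.
    """
    body = ssml.strip()
    if not (body.startswith("<speak>") and body.endswith("</speak>")):
        body = f"<speak>{body}</speak>"
    escaped = body.replace('"', '\\"')
    return f'speech ("{escaped}")'


if __name__ == "__main__":
    fragment = 'The French word for cat is <lang xml:lang="fr-FR">chat</lang>.'
    print(to_speech_line(fragment))
```

Run on the fragment shown, the helper reproduces the speech line from the template above.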

Currently, Bixby supports the following SSML tags:

  • <lang>: Specify the natural language of the enclosed content
  • <audio>: Embed audio clips via URL
  • <say-as>: Specify how to interpret a text construction (for example, as a cardinal or ordinal number)
  • <s>: Mark sentences for appropriate breaks within longer speech
  • <p>: Mark paragraphs for appropriate breaks within longer speech
  • <break>: Introduce a variable-length break within longer speech
  • <sub>: Provide an alternate pronunciation for an acronym or a term Bixby has trouble pronouncing
  • <prosody>: Modify the rate, pitch, and volume of the speech

Note

For a working example demonstrating many of Bixby's SSML features, see the SSML Examples Capsule.

Bixby Voices

Bixby supports a nonstandard voice attribute for the <lang> tag that specifies the name or server profile of the Bixby voice used to read the enclosed text. This attribute takes the place of the standard SSML <voice name=""> tag.

The voice attribute is optional, but specifying a voice can improve Bixby's pronunciation.

speech ("<speak>The French word for cat is <lang xml:lang=\"fr-FR\" voice=\"M01\">chat</lang>.</speak>")

Use the name in the Voice column of the following table to specify a voice appropriate to the language and locale. Alternatively, you can specify the server profile from the Profile column. In either case, you must also specify the locale in the xml:lang attribute.

Voice               Locale   Profile
윤정                ko-KR    F01
우호                ko-KR    M01
유리                ko-KR    F04
두리                ko-KR    F05
Stephanie           en-US    F03
John                en-US    M02
Lisa                en-US    F05
Julia               en-US    F04
张喆(Zangzhe)       zh-CN    F02
王聪(Wangcong)      zh-CN    M02
Amy                 en-GB    F02
Chris               en-GB    M02
Marie               de-DE    F01
Jan                 de-DE    M01
Sandra              es-ES    F01
David               es-ES    M01
Louise              fr-FR    F01
Valentin            fr-FR    M01
Angela              it-IT    F01
Andrea              it-IT    M01
Franscis            pt-BR    F01
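
To illustrate how the table maps onto the <lang> tag, here is a small Python sketch. It is not a Bixby API; the VOICES dictionary holds only a subset of the rows above, and the function name lang_tag is hypothetical. It looks up a voice by name and emits the tag with the matching locale and server profile.

```python
# Subset of the voice table in this guide: voice name -> (locale, profile).
VOICES = {
    "Stephanie": ("en-US", "F03"),
    "John": ("en-US", "M02"),
    "Amy": ("en-GB", "F02"),
    "Marie": ("de-DE", "F01"),
    "Sandra": ("es-ES", "F01"),
    "Louise": ("fr-FR", "F01"),
    "Valentin": ("fr-FR", "M01"),
    "Angela": ("it-IT", "F01"),
}


def lang_tag(text: str, voice_name: str) -> str:
    """Wrap text in a <lang> tag using the locale and profile of a voice.

    Hypothetical helper; raises KeyError for names missing from VOICES.
    """
    locale, profile = VOICES[voice_name]
    return f'<lang xml:lang="{locale}" voice="{profile}">{text}</lang>'


if __name__ == "__main__":
    print(lang_tag("chat", "Valentin"))
```

The lookup is keyed by voice name because profiles such as F01 recur across several locales in the table, so a profile alone does not identify a voice.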

Audio Clips

The <audio> SSML tag allows you to include an audio clip that Bixby plays as part of the dialog. The clip is played in serial with any other speech (that is, the clip is played where the <audio> tag occurs, not simultaneously as background audio).

The audio clips must match the following specifications:

  • Format: WAV, MP2 (MPEG-2 Layer II), or MP3 (MPEG-2 Layer III)
  • Bit rate: 48 kbps for MP2/MP3, 24 kbps for WAV
  • Sample rate: 24 kHz
  • Maximum size: 5 MB
  • Maximum duration: 120 seconds
  • Mono (single track)
  • WAV files must be 16-bit little endian
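
Several of these requirements can be checked locally before a WAV clip is uploaded. The sketch below uses Python's standard wave module; the function names check_wav and make_silence are hypothetical, and treating 5 MB as 5 × 1024 × 1024 bytes is an assumption. It does not check the bit rate or the MP2/MP3 formats.

```python
import io
import wave

MAX_BYTES = 5 * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes
MAX_SECONDS = 120
SAMPLE_RATE = 24000


def check_wav(data: bytes) -> list[str]:
    """Return a list of problems; an empty list means the clip passes."""
    problems = []
    if len(data) > MAX_BYTES:
        problems.append("larger than 5 MB")
    with wave.open(io.BytesIO(data), "rb") as clip:
        if clip.getnchannels() != 1:
            problems.append("not mono")
        if clip.getsampwidth() != 2:  # sample width in bytes
            problems.append("not 16-bit")
        if clip.getframerate() != SAMPLE_RATE:
            problems.append("sample rate is not 24 kHz")
        if clip.getnframes() / clip.getframerate() > MAX_SECONDS:
            problems.append("longer than 120 seconds")
    return problems


def make_silence(seconds: float, rate: int = SAMPLE_RATE, channels: int = 1) -> bytes:
    """Build a silent 16-bit WAV in memory, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as out:
        out.setnchannels(channels)
        out.setsampwidth(2)
        out.setframerate(rate)
        out.writeframes(b"\x00" * (2 * channels * int(seconds * rate)))
    return buffer.getvalue()


if __name__ == "__main__":
    print(check_wav(make_silence(1.0)))
    print(check_wav(make_silence(1.0, rate=48000)))
```

Note that 16-bit mono audio at 24 kHz occupies 48,000 bytes per second, so a WAV clip reaches 5 MB after roughly 104 to 109 seconds (depending on how a megabyte is counted), before it reaches the 120-second limit.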

Clips must be specified with an HTTPS URL, hosted on an internet-accessible server with a valid SSL certificate. A single response can have a maximum of 5 clips.

The audio tag has two attributes:

  • src specifies the URL to load the audio file from. This URL must be secure (HTTPS) and must be hosted on a publicly accessible server with a valid SSL certificate.
  • format specifies the format of the audio file, and can be set to wav or mpeg-2. This attribute is nonstandard, and is usually not needed for supported formats.

Caution

If Bixby cannot fetch an audio clip from the specified URL, or if the clip's format is invalid, Bixby will not play the rest of the dialog!

template ("Now Bixby can play animal sounds! Listen to this one.") {
  speech ("<speak>Now Bixby can play animal sounds! <audio src=\"https://example.com/animal.mp3\"></audio></speak>")
}

This is how to specify the format attribute:

template ("The music sounds like this.") {
  speech ("<speak>The music sounds like this. <audio format=\"wav\" src=\"https://example.com/music.wav\"></audio></speak>")
}
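
Because a failed clip prevents the rest of the dialog from playing, it can be worth checking <audio> tags before shipping a template. The following Python sketch is an illustration only; check_audio_tags is a hypothetical name. It takes the unescaped SSML (without the backslashes used in dialog files) and verifies the two limits stated above: HTTPS URLs and at most 5 clips per response. It does not fetch the clips or verify certificates.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

MAX_CLIPS = 5


def check_audio_tags(ssml: str) -> list[str]:
    """Return a list of problems found in the <audio> tags of an SSML string."""
    root = ET.fromstring(ssml)
    problems = []
    clips = list(root.iter("audio"))
    if len(clips) > MAX_CLIPS:
        problems.append(f"{len(clips)} clips; the maximum is {MAX_CLIPS}")
    for clip in clips:
        src = clip.get("src", "")
        if urlparse(src).scheme != "https":
            problems.append(f"not an HTTPS URL: {src!r}")
    return problems


if __name__ == "__main__":
    ssml = '<speak>Listen. <audio src="https://example.com/animal.mp3"></audio></speak>'
    print(check_audio_tags(ssml))
```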

You can convert an existing audio sample to the proper format using FFmpeg with the following options:

# output a 16-bit little-endian mono WAV file sampled at 24 kHz
ffmpeg -i input_file -ac 1 -ar 24000 -c:a pcm_s16le destFile.wav

# output a 48 kbps mono MP3 file sampled at 24 kHz
ffmpeg -i input_file -ac 1 -ar 24000 -b:a 48k destFile.mp3

say-as

You can use the <say-as> SSML tag to provide information about the meaning of the text contained within the tag, which will help Bixby interpret it and speak the text as intended.

The <say-as> tag has one required attribute, interpret-as, which determines how the value is spoken.

speech("<speak>There are <say-as interpret-as=\"cardinal\">12</say-as> options.</speak>")

Bixby supports the following values for interpret-as:

  • cardinal: Interprets the value as a cardinal number. <say-as interpret-as=\"cardinal\">12345</say-as> will be spoken as "twelve thousand three hundred forty-five".
  • ordinal: Interprets the value as an ordinal number. <say-as interpret-as=\"ordinal\">31</say-as> will be spoken as "thirty-first".
  • spell-out: Spells out the value letter by letter, rather than trying to pronounce it as a word or phrase. <say-as interpret-as=\"spell-out\">abc</say-as> will be spoken as "a, b, c".
  • fraction: Interprets the value as a fraction.
    • <say-as interpret-as=\"fraction\">1/2</say-as> will be spoken as "one half".
    • <say-as interpret-as=\"fraction\">3/4</say-as> will be spoken as "three quarters".
    • Mixed fractions using the + symbol as a separator: <say-as interpret-as=\"fraction\">2+1/2</say-as> will be spoken as "two and one half".
  • digits: Spells out a numeric value digit by digit, rather than trying to pronounce it as a cardinal or ordinal number. <say-as interpret-as=\"digits\">12345</say-as> will be spoken as "one, two, three, four, five".
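
As a sketch of how the supported values might be enforced when SSML is generated programmatically, the Python helpers below reject any interpret-as value not listed above. The names say_as and mixed_fraction are hypothetical, and the output is unescaped SSML (quotes would still need escaping inside a dialog speech string).

```python
# interpret-as values listed in this guide.
INTERPRET_AS = {"cardinal", "ordinal", "spell-out", "fraction", "digits"}


def say_as(value: str, interpret_as: str) -> str:
    """Wrap a value in a <say-as> tag, rejecting unsupported interpretations."""
    if interpret_as not in INTERPRET_AS:
        raise ValueError(f"unsupported interpret-as value: {interpret_as!r}")
    return f'<say-as interpret-as="{interpret_as}">{value}</say-as>'


def mixed_fraction(whole: int, numerator: int, denominator: int) -> str:
    """Build a mixed fraction using + as the separator, such as 2+1/2."""
    return say_as(f"{whole}+{numerator}/{denominator}", "fraction")


if __name__ == "__main__":
    print(say_as("12", "cardinal"))
    print(mixed_fraction(2, 1, 2))
```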

Breaks

Bixby supports several SSML tags to make longer speech sound more natural.

You can mark paragraphs and sentences with <p> and <s> tags respectively. Bixby will add breaks of appropriate lengths when it reads that speech aloud.

speech("<speak><s>This is the first sentence.</s><s>This is the second sentence.</s></speak>")

You can also insert a variable-length break in speech with the <break> tag, using either the strength or the time attribute:

  • strength specifies a duration in relative values, from no pause (none) to the equivalent of a paragraph break (x-strong). The following are possible values:
    • none
    • x-weak
    • weak
    • medium (default)
    • strong
    • x-strong
  • time specifies a duration in absolute values, in either seconds (s) or milliseconds (ms). Include the unit with the time, such as 10s.
    • Max value: 10s (10000ms)
    • Default: 0s

speech("<speak>Take a deep breath.<break time=\"200ms\"/>Exhale.<break strength=\"strong\"/>Dance.</speak>")

While both <break> attributes are optional, you must specify one or the other for the tag to have an effect.

Note

The <break> tag is standalone (self-closing), and must end with a trailing (forward) slash after the attribute: <break time=\"1s\"/>. Don't use separate opening and closing tags as you do with <p> and <s>.

Bixby's SSML implementation treats x-weak and none strength values as identical (no pause) and weak and medium values as identical (about a comma-length pause).

prosody

You can use the <prosody> tag to affect the rate, pitch, and volume of Bixby's speech by using the corresponding attribute:

  • rate values:
    • x-slow: 50% of normal speed
    • slow: 75%
    • medium: 100% (default)
    • fast: 125%
    • x-fast: 150%
    • percentage: increase or decrease the speed of speech, from 20% to 200%
  • pitch values:
    • x-low: -25% relative to normal pitch
    • low: -10%
    • medium: normal pitch (default)
    • high: +10%
    • x-high: +25%
    • percentage: increase or decrease the pitch, from -33.3% to +50%
  • volume values:
    • silent: no volume
    • x-soft: -4 dB relative to normal volume
    • soft: -2 dB
    • medium: no change in volume (default)
    • loud: +2 dB
    • x-loud: +4 dB
    • decibels: increase or decrease the volume in decibels, from -6 dB to +6 dB

speech("<speak>Normal volume for the first sentence. <prosody volume=\"x-loud\">
Louder volume for the second sentence</prosody>. When I wake up,
<prosody rate=\"x-slow\">I speak quite slowly</prosody>. I can speak with my
normal pitch, <prosody pitch=\"x-high\">but also with a much higher
pitch</prosody>, and also <prosody pitch=\"low\">with a lower pitch</prosody>.
</speak>")

While all the <prosody> attributes are optional, you must specify at least one for the tag to have an effect.

Alternate Pronunciation

If Bixby has trouble pronouncing some text within a speech element, or you'd like Bixby to expand an acronym when it reads it aloud, use the <sub> tag with its required alias attribute to provide an alternate pronunciation. When Bixby encounters this tag, it speaks the value of alias instead of the text enclosed in the <sub> tag.

speech("<speak>The speed is 50 <sub alias=\"miles per hour\">mph</sub>.</speak>")
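
When many terms need aliases, the substitution can be automated. The following Python sketch is one possible approach, not a Bixby feature; expand_terms is a hypothetical name and the alias dictionary is supplied by the caller. It wraps each whole-word occurrence of a known term in a <sub> tag and returns unescaped SSML.

```python
import re


def expand_terms(text: str, aliases: dict[str, str]) -> str:
    """Wrap each whole-word occurrence of a known term in a <sub> tag."""
    if not aliases:
        return text
    # Try longer terms first so a short term never shadows a longer one.
    terms = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")

    def wrap(match: re.Match) -> str:
        term = match.group(1)
        return f'<sub alias="{aliases[term]}">{term}</sub>'

    return pattern.sub(wrap, text)


if __name__ == "__main__":
    print(expand_terms("The speed is 50 mph.", {"mph": "miles per hour"}))
```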