Bixby Developer Center


Using SSML

Bixby's dialog can include a subset of tags from Speech Synthesis Markup Language (SSML), a W3C standard for enriching text-to-speech.

To use SSML, you must observe the following rules:

  • SSML is only valid inside the speech key in dialog templates.
  • Speech must start with the <speak> tag and end with the </speak> closing tag. If these tags are not present, the speech will not be recognized as containing SSML.
  • The speech string must be enclosed in quote marks, and quotes inside the string must be escaped with a \ character.
template ("The French word for cat is 'chat.'") {
  speech ("<speak>The French word for cat is <lang xml:lang=\"fr-FR\">chat</lang>.</speak>")
}
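
The wrapping and escaping rules above are easy to get wrong by hand. The following is a minimal Python sketch of an authoring-side helper that applies them; it is not part of Bixby, and the name `speech_line` is hypothetical. It assumes only double quotes need escaping, as the rules state.

```python
def speech_line(ssml):
    """Turn raw SSML into a speech (...) line for a dialog template.

    Hypothetical helper: checks the <speak> wrapper, then escapes
    double quotes with a backslash, as the rules above require.
    """
    ssml = ssml.strip()
    if not (ssml.startswith("<speak>") and ssml.endswith("</speak>")):
        raise ValueError("speech must start with <speak> and end with </speak>")
    escaped = ssml.replace('"', '\\"')
    return f'speech ("{escaped}")'


if __name__ == "__main__":
    raw = '<speak>The French word for cat is <lang xml:lang="fr-FR">chat</lang>.</speak>'
    print(speech_line(raw))
```

Raising an error on a missing wrapper is a design choice for this sketch: without the wrapper the speech would not be recognized as SSML, so failing early seems safer than emitting it.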

Currently, Bixby supports the following SSML tags:

  • <lang>: Specify the natural language of the enclosed content
  • <audio>: Embed audio clips via URL
  • <say-as>: Specify how to interpret a text construction (for example, as a cardinal or ordinal number)
  • <s>: Mark sentences for appropriate breaks within longer speech
  • <p>: Mark paragraphs for appropriate breaks within longer speech
  • <sub>: Provide an alternate pronunciation for an acronym or a term Bixby has trouble pronouncing
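
Because only the tags listed above are supported, it can help to scan SSML for anything else before adding it to a template. This is a rough Python sketch using the standard library XML parser; the function name `unsupported_tags` is hypothetical, and the input is raw SSML (before quotes are escaped for the speech string).

```python
import xml.etree.ElementTree as ET

# Tags taken from the list above, plus the required <speak> wrapper.
SUPPORTED_TAGS = {"speak", "lang", "audio", "say-as", "s", "p", "sub"}


def unsupported_tags(ssml):
    """Return a sorted list of tag names in the SSML that are not in SUPPORTED_TAGS."""
    root = ET.fromstring(ssml)  # raises ParseError if the SSML is not well-formed
    return sorted({el.tag for el in root.iter() if el.tag not in SUPPORTED_TAGS})


if __name__ == "__main__":
    sample = '<speak>Wait <break time="1s"/> for <lang xml:lang="fr-FR">chat</lang>.</speak>'
    print(unsupported_tags(sample))
```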

Bixby Voices

Bixby supports a nonstandard voice= attribute for the <lang> tag that specifies the name or server profile of a Bixby voice to read the enclosed text in. This takes the place of the standard SSML <voice name=""> tag.

Specifying a voice is optional, but it can improve Bixby's pronunciation of the enclosed text.

speech ("<speak>The French word for cat is <lang xml:lang=\"fr-FR\" voice=\"M01\">chat</lang>.</speak>")

Use the name in the "Voice" column to specify a voice appropriate to the language and locale. Alternatively, you can specify the server profile from the "Profile" column. You must also specify the locale in the xml:lang attribute.
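
As an illustration of how the locale and the optional voice fit together, here is a small Python sketch that builds a <lang> element. The helper name `lang_tag` is hypothetical; the voice value `M01` is the one used in the example above. The result is raw SSML, so its quotes still need escaping inside a speech string.

```python
from xml.sax.saxutils import escape, quoteattr


def lang_tag(text, locale, voice=None):
    """Build a <lang> element; the locale is always required, the voice is optional."""
    attrs = f"xml:lang={quoteattr(locale)}"
    if voice is not None:
        attrs += f" voice={quoteattr(voice)}"
    return f"<lang {attrs}>{escape(text)}</lang>"


if __name__ == "__main__":
    print(lang_tag("chat", "fr-FR", voice="M01"))
```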


Audio Clips

The <audio> SSML tag allows you to include an audio clip that Bixby plays as part of the dialog. The clip is played in sequence with any other speech (that is, the clip is played where the <audio> tag occurs, not simultaneously as background audio).

  • The audio clips must match the following specifications:
    • WAV format
    • Mono PCM encoding
    • 16-bit (little endian)
    • 24 kHz sample rate
  • Clips must be specified with an HTTPS URL, hosted on an internet-accessible server with a valid SSL certificate.
  • Clips must be smaller than 5 MB and shorter than 120 seconds.
  • A single response can include at most 5 clips.

If Bixby cannot fetch an audio clip from the specified URL, it will not play the rest of the dialog.

You can convert an existing audio sample to the proper format using FFmpeg with the following options:

ffmpeg -i input_file -acodec pcm_s16le -ac 1 -ar 24000 destFile.wav
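
Since a clip that fails to load stops the rest of the dialog, it may be worth checking a file against the listed specifications before hosting it. The following is a minimal sketch using Python's standard `wave` module; the function name `check_clip` is hypothetical, and it assumes "5 MB" means 5 × 1024 × 1024 bytes, which the source does not specify.

```python
import os
import wave

MAX_BYTES = 5 * 1024 * 1024  # assumption: 5 MB interpreted as 5 MiB
MAX_SECONDS = 120


def check_clip(path):
    """Return a list of problems; an empty list means the clip matches the listed specs."""
    problems = []
    if os.path.getsize(path) >= MAX_BYTES:
        problems.append("file is not smaller than 5 MB")
    try:
        with wave.open(path, "rb") as clip:
            if clip.getnchannels() != 1:
                problems.append("clip is not mono")
            if clip.getsampwidth() != 2:
                problems.append("samples are not 16-bit")
            rate = clip.getframerate()
            if rate != 24000:
                problems.append("sample rate is not 24 kHz")
            duration = clip.getnframes() / rate if rate else 0.0
            if duration >= MAX_SECONDS:
                problems.append("clip is not shorter than 120 seconds")
    except (wave.Error, EOFError) as exc:
        # The wave module only reads PCM WAV files, so other formats end up here.
        problems.append(f"not a readable PCM WAV file: {exc}")
    return problems


if __name__ == "__main__":
    import sys
    for name in sys.argv[1:]:
        print(name, check_clip(name) or "OK")
```

WAV files store samples in little-endian order by definition, so the byte order does not need a separate check here.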


template ("Now Bixby can play animal sounds! Listen to this one.") {
  speech ("<speak>Now Bixby can play animal sounds! <audio src=\"\"></audio></speak>")
}

The <audio> tag only supports the src attribute.
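
The tag-level constraints on <audio> (HTTPS URLs, at most 5 clips per response, only the src attribute) can be checked mechanically. This is a rough Python sketch under those stated rules; `audio_problems` is a hypothetical name, the input is raw SSML before quote escaping, and the URL in the usage example is a placeholder.

```python
import xml.etree.ElementTree as ET

MAX_CLIPS = 5


def audio_problems(ssml):
    """Return a list of problems found in the <audio> tags of a raw SSML string."""
    problems = []
    clips = list(ET.fromstring(ssml).iter("audio"))
    if len(clips) > MAX_CLIPS:
        problems.append(f"{len(clips)} clips found, at most {MAX_CLIPS} allowed")
    for clip in clips:
        src = clip.get("src", "")
        if not src.startswith("https://"):
            problems.append(f"src is not an HTTPS URL: {src!r}")
        extra = sorted(set(clip.attrib) - {"src"})
        if extra:
            problems.append(f"unsupported attributes: {', '.join(extra)}")
    return problems


if __name__ == "__main__":
    # example.com is a placeholder host, not a real clip location.
    sample = '<speak>Listen. <audio src="https://example.com/cat.wav"></audio></speak>'
    print(audio_problems(sample))
```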


You can use the <say-as> SSML tag to provide information about the meaning of the text contained within the tag, which helps Bixby interpret the text and speak it as intended.

The <say-as> tag has one required attribute, interpret-as, which determines how the value is spoken.

speech("<speak>There are <say-as interpret-as=\"cardinal\">12</say-as> options.</speak>")

Bixby supports the following values for interpret-as:

  • cardinal: Interprets the value as a cardinal number. <say-as interpret-as=\"cardinal\">12345</say-as> will be spoken as "twelve thousand three hundred forty-five".
  • ordinal: Interprets the value as an ordinal number. <say-as interpret-as=\"ordinal\">31</say-as> will be spoken as "thirty-first".
  • spell-out: Spells out the value letter by letter, rather than pronouncing it as a word or phrase. <say-as interpret-as=\"spell-out\">abc</say-as> will be spoken as "a, b, c".
  • fraction: Interprets the value as a fraction.
    • <say-as interpret-as=\"fraction\">1/2</say-as> will be spoken as "one half".
    • <say-as interpret-as=\"fraction\">3/4</say-as> will be spoken as "three quarters".
    • Mixed fractions use the + symbol as a separator: <say-as interpret-as=\"fraction\">2+1/2</say-as> will be spoken as "two and one half".
  • digits: Speaks a numeric value digit by digit, rather than pronouncing it as a cardinal or ordinal number. <say-as interpret-as=\"digits\">12345</say-as> will be spoken as "1, 2, 3, 4, 5".
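
When speech is generated from data, a small builder can guard against misspelled interpret-as values. The sketch below is in Python and accepts only the values listed above; the name `say_as` is hypothetical, and the output is raw SSML whose quotes still need escaping inside a speech string.

```python
from xml.sax.saxutils import escape

# Values taken from the list above.
INTERPRET_AS = {"cardinal", "ordinal", "spell-out", "fraction", "digits"}


def say_as(value, interpret_as):
    """Wrap a value in a <say-as> element, rejecting unlisted interpret-as values."""
    if interpret_as not in INTERPRET_AS:
        raise ValueError(f"unsupported interpret-as value: {interpret_as!r}")
    return f'<say-as interpret-as="{interpret_as}">{escape(str(value))}</say-as>'


if __name__ == "__main__":
    print(f"<speak>There are {say_as(12, 'cardinal')} options.</speak>")
```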


To make longer speech sound more natural, you can mark paragraphs and sentences with <p> and <s> tags respectively. Bixby will add breaks of appropriate lengths when it reads that speech aloud.

speech("<speak><s>This is the first sentence.</s><s>This is the second sentence.</s></speak>")
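
For longer text, the markup can be generated from a list of paragraphs rather than written by hand. This Python sketch nests <s> inside <p>, as standard SSML allows; whether Bixby treats nested tags differently from <s> alone is not stated here, so take the nesting as an assumption. The name `mark_up` is hypothetical.

```python
from xml.sax.saxutils import escape


def mark_up(paragraphs):
    """Build SSML from a list of paragraphs, each given as a list of sentence strings."""
    body = "".join(
        "<p>" + "".join(f"<s>{escape(sentence)}</s>" for sentence in sentences) + "</p>"
        for sentences in paragraphs
    )
    return f"<speak>{body}</speak>"


if __name__ == "__main__":
    print(mark_up([
        ["This is the first sentence.", "This is the second sentence."],
        ["This starts a new paragraph."],
    ]))
```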

Alternate Pronunciation

If Bixby has trouble pronouncing some text within a speech element, or you'd like to have Bixby expand an acronym when it reads it aloud, use the <sub> tag to provide an alternate pronunciation using the required alias attribute. When Bixby encounters this tag, it will use the value of alias as the speech to read rather than the text within the <sub> tag.

speech("<speak>The speed is 50 <sub alias=\"miles per hour\">mph</sub>.</speak>")
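
If the same abbreviations recur across many dialogs, the substitutions can be applied from a lookup table. The following Python sketch wraps a term in a <sub> element with its alias; the name `sub` and the table contents are illustrative only, and the output is raw SSML that still needs its quotes escaped in a speech string.

```python
from xml.sax.saxutils import escape, quoteattr

# Illustrative table of written forms and the aliases to speak instead.
ALIASES = {"mph": "miles per hour"}


def sub(text, alias):
    """Wrap text in a <sub> element whose alias is read aloud in its place."""
    return f"<sub alias={quoteattr(alias)}>{escape(text)}</sub>"


if __name__ == "__main__":
    unit = "mph"
    print(f"<speak>The speed is 50 {sub(unit, ALIASES[unit])}.</speak>")
```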