Bixby's dialog can include a subset of tags from Speech Synthesis Markup Language (SSML), a W3C standard for enriching text-to-speech.
To use SSML, you must observe the following rules:
- SSML is specified in the speech key in dialog templates.
- The speech text must begin with the <speak> tag and end with the </speak> closing tag. If these tags are not present, the speech will not be recognized as containing SSML.
- Quotation marks within the speech text must be escaped with the \ character.
template ("The French word for cat is 'chat.'") {
speech ("<speak>The French word for cat is <lang xml:lang=\"fr-FR\">chat</lang>.</speak>")
}
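
The rules above can also be checked mechanically before SSML is pasted into a template. The following is a minimal Python sketch, not part of Bixby or its tooling; the function name to_speech_literal is hypothetical, and it assumes the only escaping needed is a backslash before each double quotation mark.

```python
def to_speech_literal(ssml):
    """Return ssml escaped for use inside a double-quoted speech ("...") value."""
    body = ssml.strip()
    # Rule: the text must begin with <speak> and end with </speak>.
    if not (body.startswith("<speak>") and body.endswith("</speak>")):
        raise ValueError("SSML must be wrapped in <speak>...</speak>")
    # Rule: escape each double quotation mark with a backslash.
    return body.replace('"', '\\"')


ssml = '<speak>The French word for cat is <lang xml:lang="fr-FR">chat</lang>.</speak>'
print(to_speech_literal(ssml))
```

The printed string can be placed between the quotation marks of a speech key.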
Currently, Bixby supports the following SSML tags:
- <lang>: Specify the natural language of the enclosed content
- <audio>: Embed audio clips via URL
- <say-as>: Specify how to interpret a text construction (for example, as a cardinal or ordinal number)
- <s>: Mark sentences for appropriate breaks within longer speech
- <p>: Mark paragraphs for appropriate breaks within longer speech
- <break>: Introduce a variable-length break within longer speech
- <sub>: Provide an alternate pronunciation for an acronym or a term Bixby has trouble pronouncing
- <prosody>: Modify the rate, pitch, and volume of the speech
For a working example demonstrating many of Bixby's SSML features, see the SSML Examples Capsule.
Bixby supports a nonstandard voice attribute for the <lang> tag that specifies the name or server profile of a Bixby voice to read the enclosed text in. This takes the place of the standard SSML <voice name=""> tag.
Voice names can be used in the <lang> tag to specify a voice. This is optional, but can aid Bixby's pronunciation.
speech ("<speak>The French word for cat is <lang xml:lang=\"fr-FR\" voice=\"M01\">chat</lang>.</speak>")
Use the name in the Voice column of the following table to specify a voice appropriate to the language and locale. Alternatively, you can specify the server profile from the Profile column. In either case, you must also specify the locale in the xml:lang attribute.
Voice | Locale | Profile |
---|---|---|
윤정 | ko-KR | F01 |
우호 | ko-KR | M01 |
유리 | ko-KR | F04 |
두리 | ko-KR | F05 |
Stephanie | en-US | F03 |
John | en-US | M02 |
Lisa | en-US | F05 |
Julia | en-US | F04 |
张喆(Zangzhe) | zh-CN | F02 |
王聪(Wangcong) | zh-CN | M02 |
Amy | en-GB | F02 |
Chris | en-GB | M02 |
Marie | de-DE | F01 |
Jan | de-DE | M01 |
Sandra | es-ES | F01 |
David | es-ES | M01 |
Louise | fr-FR | F01 |
Valentin | fr-FR | M01 |
Angela | it-IT | F01 |
Andrea | it-IT | M01 |
Franscis | pt-BR | F01 |
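
Because a voice only makes sense for its own locale, it can help to check the pairing before building the tag. The sketch below is a hypothetical Python helper (lang_tag and VOICES are illustrative names, not Bixby APIs) that copies a few rows from the table above and accepts either the voice name or the profile.

```python
# A few rows copied from the table above: voice name -> (locale, profile).
VOICES = {
    "Stephanie": ("en-US", "F03"),
    "John": ("en-US", "M02"),
    "Louise": ("fr-FR", "F01"),
    "Valentin": ("fr-FR", "M01"),
}


def lang_tag(text, locale, voice=None):
    """Build a <lang> tag, optionally with Bixby's nonstandard voice attribute."""
    if voice is None:
        return f'<lang xml:lang="{locale}">{text}</lang>'
    # Accept either a voice name or a server profile listed for this locale.
    names = {name for name, (loc, _) in VOICES.items() if loc == locale}
    profiles = {prof for loc, prof in VOICES.values() if loc == locale}
    if voice not in names | profiles:
        raise ValueError(f"{voice!r} is not a known voice or profile for {locale}")
    return f'<lang xml:lang="{locale}" voice="{voice}">{text}</lang>'


print(lang_tag("chat", "fr-FR", voice="M01"))
```

Only four rows are included here to keep the sketch short; a real check would carry the whole table.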
The <audio> SSML tag allows you to include an audio clip that Bixby plays as part of the dialog. The clip is played in sequence with any other speech (that is, the clip is played where the <audio> tag occurs, not simultaneously as background audio).
The audio clips must match the following specifications:
Clips must be specified with an HTTPS URL, hosted on an internet-accessible server with a valid SSL certificate. A single response can have a maximum of 5 clips.
The <audio> tag has two attributes:
- src specifies the URL to load the audio file from. This URL must be secure (HTTPS), and must be hosted on a publicly accessible server with a valid SSL certificate.
- format specifies the format of the audio file, and can be set to wav or mpeg-2. This attribute is nonstandard, and is not usually needed for supported formats.
If there is an error in fetching an audio clip from the specified link or the format is invalid, Bixby will not play the rest of the dialog.
template ("Now Bixby can play animal sounds! Listen to this one.") {
speech ("<speak>Now Bixby can play animal sounds! <audio src=\"https://example.com/animal.mp3\"></audio></speak>")
}
This is how to specify the format attribute:
template ("The music sounds like this.") {
speech ("<speak>The music sounds like this. <audio format=\"wav\" src=\"https://example.com/music.wav\"></audio></speak>")
}
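
Since a failed clip stops the rest of the dialog, it may be worth checking the <audio> tags before publishing. The following Python sketch is an offline check only, assuming the SSML is given unescaped (before quotation marks are escaped for the speech key); check_audio is a hypothetical name, and the sketch does not fetch the clip or verify the certificate or file format.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

MAX_CLIPS = 5  # limit on clips per response stated in this section


def check_audio(ssml):
    """Return a list of problems found in the <audio> tags of an SSML string."""
    root = ET.fromstring(ssml)  # raises ParseError if the SSML is not well-formed
    clips = list(root.iter("audio"))
    problems = []
    if len(clips) > MAX_CLIPS:
        problems.append(f"{len(clips)} clips found, maximum is {MAX_CLIPS}")
    for clip in clips:
        src = clip.get("src", "")
        # Only the URL scheme is checked here; reachability is not.
        if urlparse(src).scheme != "https":
            problems.append(f"src is not an HTTPS URL: {src!r}")
    return problems


print(check_audio('<speak>Listen. <audio src="http://example.com/animal.mp3"></audio></speak>'))
```

An empty list means no problems were found by these two checks.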
You can convert an existing audio sample to the proper format using FFmpeg with the following options:
# output a 24 kHz, 16-bit mono WAV file
ffmpeg -i input_file -ar 24000 -ac 1 -c:a pcm_s16le destFile.wav
# output a 48 kHz mono MP3 file
ffmpeg -i input_file -ar 48000 -ac 1 destFile.mp3
You can use the <say-as> SSML tag to provide information about the meaning of the text contained within the tag, which will help Bixby interpret it and speak the text as intended.
The <say-as> tag has one required attribute, interpret-as, which determines how the value is spoken.
speech("<speak>There are <say-as interpret-as=\"cardinal\">12</say-as> options.</speak>")
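
A misspelled interpret-as value is easy to miss by eye. As a minimal sketch, the hypothetical Python helper below (say_as is not a Bixby API) builds the tag and rejects any value outside the five this section documents for Bixby.

```python
from xml.sax.saxutils import escape

# The interpret-as values this section documents for Bixby.
SUPPORTED = ("cardinal", "ordinal", "spell-out", "fraction", "digits")


def say_as(value, interpret_as):
    """Build a <say-as> tag, rejecting unsupported interpret-as values."""
    if interpret_as not in SUPPORTED:
        raise ValueError(f"unsupported interpret-as value: {interpret_as!r}")
    # escape() protects &, < and > in the spoken value.
    return f'<say-as interpret-as="{interpret_as}">{escape(value)}</say-as>'


print("<speak>There are " + say_as("12", "cardinal") + " options.</speak>")
```

The result is unescaped SSML; quotation marks still need a backslash before it goes into a speech key.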
Bixby supports the following values for interpret-as:
- cardinal: Interprets the value as a cardinal number. <say-as interpret-as=\"cardinal\">12345</say-as> will be spoken as "twelve thousand three hundred forty-five".
- ordinal: Interprets the value as an ordinal number. <say-as interpret-as=\"ordinal\">31</say-as> will be spoken as "thirty-first".
- spell-out: Spells out the value letter by letter, rather than trying to pronounce it as a word or phrase. <say-as interpret-as=\"spell-out\">abc</say-as> will be spoken as "a, b, c".
- fraction: Interprets the value as a fraction. <say-as interpret-as=\"fraction\">1/2</say-as> will be spoken as "one half", and <say-as interpret-as=\"fraction\">3/4</say-as> will be spoken as "three quarters". For a mixed number, use the + symbol as a separator: <say-as interpret-as=\"fraction\">2+1/2</say-as> will be spoken as "two and one half".
- digits: Spells out a numeric value digit by digit, rather than trying to pronounce it as a cardinal or ordinal number. <say-as interpret-as=\"digits\">12345</say-as> will be spoken as "1, 2, 3, 4, 5".
Bixby supports several SSML tags to make longer speech sound more natural.
You can mark paragraphs and sentences with <p> and <s> tags respectively. Bixby will add breaks of appropriate lengths when it reads that speech aloud.
speech("<speak><s>This is the first sentence.</s><s>This is the second sentence.</s></speak>")
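
When the speech text is generated, sentence and paragraph markup can be added programmatically. This Python sketch uses hypothetical helpers (paragraph and speak); nesting <s> inside <p> follows standard SSML, and whether Bixby's subset treats the nesting the same way is an assumption.

```python
from xml.sax.saxutils import escape


def paragraph(sentences):
    """Wrap each sentence in <s> and the group in <p>."""
    inner = "".join(f"<s>{escape(s.strip())}</s>" for s in sentences)
    return f"<p>{inner}</p>"


def speak(*paragraphs):
    """Wrap already-built paragraph markup in the required <speak> tags."""
    return "<speak>" + "".join(paragraphs) + "</speak>"


print(speak(paragraph(["This is the first sentence.", "This is the second sentence."])))
```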
You can also specify a variable-length break in speech using the <break> tag, which lets you specify a break with either the strength or time attribute:
- strength specifies a duration in relative values, from no pause (none) to the equivalent of a paragraph break (x-strong). The following are possible values:
  - none
  - x-weak
  - weak
  - medium (default)
  - strong
  - x-strong
- time specifies a duration in absolute values, in either seconds (s) or milliseconds (ms). Include the unit with the time, such as 10s. Values can range from 0s to 10s (10000ms).
speech("<speak>Take a deep breath.<break time=\"200ms\"/>Exhale.<break strength=\"strong\"/>Dance.</speak>")
While both <break> attributes are optional, you must specify one or the other for the tag to have an effect.
The <break> tag is standalone (self-closing), and must end with a trailing (forward) slash after the attribute: <break time=\"1s\"/>. Don't use opening and closing tags as you do with <p> and <s> breaks.
Bixby's SSML implementation treats x-weak and none strength values as identical (no pause) and weak and medium values as identical (about a comma-length pause).
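
The constraints on <break> can be collected into one small builder. The Python sketch below is illustrative only: break_tag is a hypothetical name, it requires exactly one attribute as a deliberate simplification, and the 0 to 10000 ms bounds are an assumption read from the figures given for the time attribute on this page.

```python
STRENGTHS = ("none", "x-weak", "weak", "medium", "strong", "x-strong")
MAX_MS = 10000  # assumed upper bound (10s), inferred from this page


def break_tag(time_ms=None, strength=None):
    """Build a self-closing <break/> tag from exactly one attribute."""
    if (time_ms is None) == (strength is None):
        raise ValueError("specify exactly one of time_ms or strength")
    if strength is not None:
        if strength not in STRENGTHS:
            raise ValueError(f"unknown strength: {strength!r}")
        return f'<break strength="{strength}"/>'
    if not 0 <= time_ms <= MAX_MS:
        raise ValueError(f"time must be between 0 and {MAX_MS} ms")
    # Always emit the unit, and always close the tag with a trailing slash.
    return f'<break time="{int(time_ms)}ms"/>'


print("<speak>Take a deep breath." + break_tag(time_ms=200) + "Exhale.</speak>")
```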
You can use the <prosody> tag to affect the rate, pitch, and volume of Bixby's speech by using the corresponding attribute:
- rate values:
  - x-slow: 50% of normal speed
  - slow: 75%
  - medium: 100% (default)
  - fast: 125%
  - x-fast: 150%
  - 20% to 200%
- pitch values:
  - x-low: -25% (below normal pitch)
  - low: -10%
  - medium: normal pitch (default)
  - high: +10%
  - x-high: +25%
  - -33.3% to +50%
- volume values:
  - silent: no volume
  - x-soft: -4db (below normal volume)
  - soft: -2db
  - medium: no change in volume (default)
  - loud: +2db
  - x-loud: +4db
  - -6db to +6db
speech("<speak>Normal volume for the first sentence. <prosody volume=\"x-loud\">
Louder volume for the second sentence</prosody>. When I wake up,
<prosody rate=\"x-slow\">I speak quite slowly</prosody>. I can speak with my
normal pitch, <prosody pitch=\"x-high\">but also with a much higher
pitch</prosody>, and also <prosody pitch=\"low\">with a lower pitch</prosody>.
</speak>")
While all the <prosody> attributes are optional, you must specify at least one for the tag to have an effect.
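
The named values and the at-least-one-attribute rule lend themselves to a small guard. This Python sketch uses a hypothetical prosody helper and handles only the named values listed in this section; numeric values such as percentages are left out for brevity.

```python
from xml.sax.saxutils import escape

# Named values listed in this section for each attribute.
ALLOWED = {
    "rate": ("x-slow", "slow", "medium", "fast", "x-fast"),
    "pitch": ("x-low", "low", "medium", "high", "x-high"),
    "volume": ("silent", "x-soft", "soft", "medium", "loud", "x-loud"),
}


def prosody(text, rate=None, pitch=None, volume=None):
    """Build a <prosody> tag from named values; at least one is required."""
    attrs = []
    for name, value in (("rate", rate), ("pitch", pitch), ("volume", volume)):
        if value is None:
            continue
        if value not in ALLOWED[name]:
            raise ValueError(f"unknown {name} value: {value!r}")
        attrs.append(f'{name}="{value}"')
    if not attrs:
        raise ValueError("specify at least one of rate, pitch, or volume")
    return f"<prosody {' '.join(attrs)}>{escape(text)}</prosody>"


print(prosody("I speak quite slowly", rate="x-slow"))
```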
If Bixby has trouble pronouncing some text within a speech element, or you'd like to have Bixby expand an acronym when it reads it aloud, use the <sub> tag to provide an alternate pronunciation using the required alias attribute. When Bixby encounters this tag, it will use the value of alias as the speech to read rather than the text within the <sub> tag.
speech("<speak>The speed is 50 <sub alias=\"miles per hour\">mph</sub>.</speak>")
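
If aliases come from data rather than being typed by hand, the attribute value needs XML escaping of its own. The Python sketch below uses a hypothetical sub helper built on the standard library's xml.sax.saxutils; it produces unescaped SSML, so quotation marks still need a backslash before the result is placed in a speech key.

```python
from xml.sax.saxutils import escape, quoteattr


def sub(text, alias):
    """Build a <sub> tag whose alias is read aloud in place of text."""
    if not alias:
        raise ValueError("the alias attribute is required")
    # quoteattr() adds the surrounding quotes and escapes the attribute value.
    return f"<sub alias={quoteattr(alias)}>{escape(text)}</sub>"


print("<speak>The speed is 50 " + sub("mph", "miles per hour") + ".</speak>")
```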