Modifying speech synthesis characteristics
The IBM Watson® Text to Speech service includes query parameters that you can use to globally modify the characteristics of speech synthesis for an entire request: rate_percentage
, pitch_percentage
, and spell_out_mode
.
These parameters are available for both the HTTP and WebSocket interfaces.
Each of these parameters interacts with elements of the Speech Synthesis Markup Language (SSML). The descriptions of the parameters include information about how they interact with related SSML elements and attributes.
Modifying the speaking rate
The rate_percentage
parameter is beta functionality.
The speaking rate is the speed at which the service speaks the text that it synthesizes into speech. A higher rate causes the text to be spoken more quickly; a lower rate causes the text to be spoken more slowly.
You can use the optional rate_percentage
query parameter to modify the rate of the synthesized speech for a voice. Each voice has a default speaking rate that is optimized to represent a normal rate of speech. The parameter accepts
an integer that represents the percentage change from the voice's default:
- Specify a signed negative integer to reduce the speaking rate by that percentage. For example,
-10
reduces the rate by ten percent. - Specify an unsigned or signed positive integer to increase the speaking rate by that percentage. For example,
10
and+10
increase the rate by ten percent. - Specify
0
or omit the parameter to get the default speaking rate for the voice.
The best way to determine how the parameter affects the rate of a given voice is to experiment with different values. Increase or decrease the rate incrementally, by five or ten percent, before trying more significant changes. When decreasing
the rate, values less than -50
can begin to yield inadequate pronunciation. When increasing the rate, values greater than 100
might produce results that are not discernibly different from 100
.
Table 1 provides examples of how the service changes the speaking rate with different values for the rate_percentage
parameter. The examples use the en-US_AllisonV3Voice
.
Specification of the rate_percentage parameter |
Audio sample |
---|---|
rate_percentage=0 The service uses the default speaking rate. You can also omit the parameter to get the default behavior. |
|
rate_percentage=-20 The service uses a speaking rate that is 20 percent slower. |
|
rate_percentage=20 The service uses a speaking rate that is 20 percent faster. |
Interaction with SSML <prosody>
element
The rate_percentage
parameter changes the speaking rate for an entire request. You can also use rate
attribute of the the SSML <prosody>
element to modify the rate of specific words of text. Although
the rate_percentage
parameter accepts only a percentage value, the rate
attribute accepts values of different forms. For more information, see The rate
attribute.
If you use the rate_percentage
parameter and also use the rate
attribute of the <prosody>
tag to modify the rate for specific text of the same request, the effects are multiplicative. For example,
suppose you specify the rate_percentage
parameter with a value of -10
, which slows the speaking rate by 10 percent. If you also use the <prosody>
element to specify an equivalent rate
of -10%
for specific text of the request, that text is spoken at a rate that is 81 percent of the default rate for the voice.
Speaking rate examples
The following HTTP POST /v1/synthesize
method uses the rate_percentage
query parameter with a value of -5
. The resulting audio is spoken at a rate that is five percent slower than it is by default. The
example uses the US English voice en-US_AllisonV3Voice
.
IBM Cloud
curl -X POST -u "apikey:{apikey}" \
--header "Content-Type: application/json" \
--header "Accept: audio/wav" \
--output rate-percentage.wav \
--data "{\"text\":\"This text is spoken five percent slower than the default.\"}" \
"{url}/v1/synthesize?voice=en-US_AllisonV3Voice&rate_percentage=-5"
IBM Cloud Pak for Data
curl -X POST \
--header "Authorization: Bearer {token}" \
--header "Content-Type: application/json" \
--header "Accept: audio/wav" \
--output rate-percentage.wav \
--data "{\"text\":\"This text is spoken five percent slower than the default.\"}" \
"{url}/v1/synthesize?voice=en-US_AllisonV3Voice&rate_percentage=-5"
The following WebSocket example shows inclusion of the parameter with the same value on the request to establish a connection:
var access_token = '{access_token}';
var wsURI = '{ws_url}/v1/synthesize'
+ '?access_token=' + access_token
+ '&voice=en-US_AllisonV3Voice'
+ '&rate_percentage=-5';
var websocket = new WebSocket(wsURI);
Modifying the speaking pitch
The pitch_percentage
parameter is beta functionality.
The speaking pitch represents the pitch, or tone, of the speech that the service synthesizes, which also affects the intonation. It represents how high or low the tone of the voice is perceived by the listener. A higher pitch results in speech that is spoken at a higher tone and is perceived as a higher voice; a lower pitch results in speech that is spoken in a lower tone and is perceived as a lower, deeper voice.
You can use the optional pitch_percentage
query parameter to modify the pitch of the synthesized speech for a voice. Each voice has a default, baseline pitch that reflects the intended tone of that voice. The parameter accepts an
integer that represents the percentage change from the voice's default:
- Specify a signed negative integer to reduce the pitch by that percentage. For example,
-10
reduces the pitch by ten percent. - Specify an unsigned or signed positive integer to increase the pitch by that percentage. For example,
10
and+10
increase the pitch by ten percent. - Specify
0
or omit the parameter to get the default pitch for the voice.
The best means of determining how the parameter affects the pitch of a given voice is to experiment with different values. Increase or decrease the rate incrementally, by five or ten percent, before trying more drastic changes. When decreasing
the pitch, a value of -50
halves the baseline pitch; lower values might not yield appreciable or positive results. Similarly, when increasing the pitch, a value of 100
doubles the pitch; higher values might produce
results that are not discernibly different from 100
.
Table 2 provides examples of how the service changes the speaking pitch with different values for the pitch_percentage
parameter. The examples use the en-US_AllisonV3Voice
.
Specification of the pitch_percentage parameter |
Audio sample |
---|---|
pitch_percentage=0 The service uses the default speaking pitch. You can also omit the parameter to get the default behavior. |
|
pitch_percentage=-20 The service uses a speaking pitch that is 20 percent lower. |
|
pitch-percentage=20 The service uses a speaking pitch that is 20 percent higher. |
Interaction with SSML <prosody>
element
The pitch_percentage
parameter changes the pitch for an entire request. You can also use pitch
attribute of the the SSML <prosody>
element to modify the pitch of specific words of text. Although
the pitch_percentage
parameter accepts only a percentage value, the pitch
attribute accepts values of different forms. For more information, see The pitch
attribute.
If you use the pitch_percentage
parameter and also use the pitch
attribute of the <prosody>
tag to modify the pitch for specific text of the same request, the effects are multiplicative. For example,
suppose you specify the pitch_percentage
parameter with a value of 10
, which increases the pitch by 10 percent. If you also use the <prosody>
element to specify an equivalent pitch
of +10%
for specific text of the request, that text is spoken with a pitch that is 121 percent of the default baseline pitch for the voice.
Speaking pitch examples
The following HTTP POST /v1/synthesize
method uses the pitch_percentage
query parameter with a value of 5
. The resulting audio is spoken with a pitch that is five percent higher than it is by default.
The example uses the US English voice en-US_AllisonV3Voice
.
IBM Cloud
curl -X POST -u "apikey:{apikey}" \
--header "Content-Type: application/json" \
--header "Accept: audio/wav" \
--output pitch-percentage.wav \
--data "{\"text\":\"This text is spoken five percent higher than the default.\"}" \
"{url}/v1/synthesize?voice=en-US_AllisonV3Voice&pitch_percentage=5"
IBM Cloud Pak for Data
curl -X POST \
--header "Authorization: Bearer {token}" \
--header "Content-Type: application/json" \
--header "Accept: audio/wav" \
--output pitch-percentage.wav \
--data "{\"text\":\"This text is spoken five percent higher than the default.\"}" \
"{url}/v1/synthesize?voice=en-US_AllisonV3Voice&pitch_percentage=5"
The following WebSocket example shows inclusion of the parameter with the same value on the request to establish a connection:
var access_token = '{access_token}';
var wsURI = '{ws_url}/v1/synthesize'
+ '?access_token=' + access_token
+ '&voice=en-US_AllisonV3Voice'
+ '&pitch_percentage=5';
var websocket = new WebSocket(wsURI);
Specifying how strings are spelled out
The service's default pronunciation of alphabetic, numeric, and alphanumeric strings varies by language, with each language having its own rules. For any language, the pronunciation also depends on the length and context of the string, as well its composition in terms of the positions and grouping of its characters
- Alphabetic strings are pronounced as words if they are pronounceable in the language of the voice. Otherwise, the characters are spelled out individually. Whether a string is pronounced or spelled out can also depend on the case of the letters.
- Numeric strings can be pronounced as cardinal numbers, ordinal numbers, dates, or in other ways, including as individual digits.
- Alphanumeric strings that contain both letters and numbers can be pronounced in different ways depending on the characteristics of the string. For example, they might be pronounced as combinations of words, cardinal numbers, or digits. But as mentioned previously, the pronunciation depends largely on the string and its composition.
The best means of controlling how a specific string is pronounced is to use the <say-as>
element of the SSML. The element allows you to specify exactly how a string is synthesized. You use the interpret-as
attribute
of the <say-as>
element to indicate that a string of characters is to be pronounced as letters
, digits
, or in some other way. The available values differ by language.
If you use the letters
or digits
attributes, all letters and numbers of the string are pronounced individually. This is especially useful for spelling out the alphanumeric characters of a hexadecimal string or an identification
or account number. For more information, including language-specific differences, see The say-as
element.
Specifying the pace at which strings are spelled out
The spell_out_mode
parameter is beta functionality that is supported only for German voices.
By default, the service reads characters that it spells out at the same rate at which it reads text for the language being synthesized. For German voices, however, you can use the optional spell_out_mode
query parameter to indicate
how the service is to spell out individual characters: one, two, or three at a time. The parameter applies globally to control the pace at which the service reads all strings that it spells out from a request. Use the spell_out_mode
parameter only for longer strings. The parameter has no effect on a string that contains fewer than three characters.
Table 3 provides examples of how the service pronounces the following text with different values for the spell_out_mode
parameter:
Die Nummer ist <say-as interpret-as=‘digits’>AB7234987FFA</say-as>.
Specification of the spell_out_mode parameter |
Audio sample |
---|---|
spell_out_mode=default The service reads the characters in succession, typically with no pause between them, at the rate at which it synthesizes speech for the request. You can also omit the parameter to get the default behavior. |
|
spell_out_mode=singles The service reads the characters one at a time, with a brief pause between each character. |
|
spell_out_mode=pairs The service reads the characters two at a time, with a brief pause between each pair. |
|
spell_out_mode=triplets The service reads the characters three at a time, with a brief pause between each triplet. |
Alphanumeric strings that mix letters and digits can yield different groupings of characters. If you do not use the <say-as>
element, groups of letters and digits can be pronounced in ways that do not necessarily conform
to the requested mode.
Spell-out mode examples
The following HTTP POST /v1/synthesize
method uses the spell_out_mode
query parameter with a value of singles
. The example uses the German voice de-DE_ErikaV3Voice
. The alphanumeric text
to be synthesized is a hexadecimal string that is wrapped in a <say-as>
element and interpreted as digits
.
IBM Cloud
curl -X POST -u "apikey:{apikey}" \
--header "Content-Type: application/json" \
--header "Accept: audio/wav" \
--output spell-out-mode.wav \
--data "{\"text\":\"Die Nummer ist <say-as interpret-as=‘digits’>AB7234987FFA</say-as>.\"}" \
"{url}/v1/synthesize?voice=de-DE_ErikaV3Voice&spell_out_mode=singles"
IBM Cloud Pak for Data
curl -X POST \
--header "Authorization: Bearer {token}" \
--header "Content-Type: application/json" \
--header "Accept: audio/wav" \
--output spell-out-mode.wav \
--data "{\"text\":\"Die Nummer ist <say-as interpret-as=‘digits’>AB7234987FFA</say-as>.\"}" \
"{url}/v1/synthesize?voice=de-DE_ErikaV3Voice&spell_out_mode=singles"
The following WebSocket example shows inclusion of the parameter with the same value on the request to establish a connection:
var access_token = '{access_token}';
var wsURI = '{ws_url}/v1/synthesize'
+ '?access_token=' + access_token
+ '&voice=de-DE_ErikaV3Voice'
+ '&spell_out_mode=singles';
var websocket = new WebSocket(wsURI);