About Text to Speech

The IBM Watson® Text to Speech service provides APIs that use IBM's speech-synthesis capabilities to convert written text to natural-sounding speech. The service streams the synthesized audio back to the client with minimal delay. The audio uses appropriate cadence and intonation for its language and dialect to provide voices that are smooth and natural.

The service can be used in voice-automated chatbots and in a variety of other voice-driven and screenless applications, such as assistive tools for people with disabilities or visual impairments, video narration and voice-over, and educational and home-automation solutions. It is appropriate for any application where audio is the preferred method of output.

Product versions

Text to Speech can be deployed as a managed cloud service or installed on premises. This documentation describes how to use both versions of the product. Information that applies exclusively to one version, such as topics, paragraphs, and examples, is clearly denoted.

Speech synthesis

The Text to Speech service supports both HTTP and WebSocket interfaces for speech synthesis. Both interfaces accept plain text and text that is marked up with the XML-based Speech Synthesis Markup Language (SSML). The WebSocket interface can also produce timing information about the words of the audio. For more information, see the service features.
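
As a rough illustration of the HTTP interface, the following Python sketch builds (but does not send) a synthesis request for either plain text or SSML input. The `/v1/synthesize` path, the `voice` query parameter, the JSON `text` field, the example voice name, and the placeholder service URL are assumptions that are not stated on this page; check the API reference for the exact values for your instance.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical placeholder; substitute the URL of your own service instance.
SERVICE_URL = "https://example.invalid/text-to-speech"


def build_synthesis_request(text, voice="en-US_MichaelV3Voice", accept="audio/wav"):
    """Build (but do not send) an HTTP POST request for speech synthesis.

    `text` can be plain text or an SSML document; both travel in the same
    JSON field in this sketch. The path, parameter, and field names are
    assumptions to verify against the API reference.
    """
    query = urllib.parse.urlencode({"voice": voice})
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{SERVICE_URL}/v1/synthesize?{query}",
        data=body,
        headers={"Content-Type": "application/json", "Accept": accept},
        method="POST",
    )


# Plain-text input.
plain = build_synthesis_request("Hello world")

# SSML input: the same request shape, with markup in the text.
ssml = build_synthesis_request(
    '<speak>Hello <break time="500ms"/> world</speak>'
)

print(plain.full_url)
print(ssml.data.decode("utf-8"))
```

Authentication is deliberately left out of the sketch; actually sending the request would also require the credentials for your service instance.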

Customization

The service provides a customization interface that you can use to specify how the service pronounces unusual words that occur in your input text. You can define custom models to include dictionaries of words for your application's lexicon. For more information, see Customizing the service in the service features.

With the Tune by Example feature, you can also add custom prompts to your custom models. Custom prompts let you dictate the prosody with which the service speaks user-specified prompts. For more information, see Using Tune by Example in the service features.
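
To make the idea of a custom dictionary more concrete, the sketch below models a small lexicon as pairs of words and pronunciations and serializes it to JSON. The `words`, `word`, and `translation` field names and the sounds-like spellings are illustrative assumptions; this page does not define the payload format, so consult the customization documentation before relying on it.

```python
import json

# Hypothetical lexicon: unusual words mapped to sounds-like spellings.
LEXICON = {
    "IEEE": "I triple E",
    "NCAA": "N C double A",
}


def lexicon_to_payload(lexicon):
    """Convert a word -> pronunciation mapping to a JSON-ready structure."""
    return {
        "words": [
            {"word": word, "translation": translation}
            # Sort so that the output is stable across runs.
            for word, translation in sorted(lexicon.items())
        ]
    }


payload = lexicon_to_payload(LEXICON)
print(json.dumps(payload, indent=2))
```

Keeping the lexicon as a plain mapping in application code makes it easy to review and version alongside the text that the application sends for synthesis.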

Language support

The service offers neural voices to synthesize text to speech in many languages and dialects:

  • Dutch (Netherlands)
  • English (Australian, United Kingdom, and United States dialects)
  • French (Canadian and France dialects)
  • German
  • Italian
  • Japanese
  • Korean
  • Portuguese (Brazilian)
  • Spanish (Castilian, Latin American, and North American dialects)

Depending on the language, the service offers female voices, male voices, or both. For more information about the supported languages and voices, the types of voices that the service provides for each language, and their status for both versions of the service, see Languages and voices.

Audio support

The service produces audio in many popular formats:

  • A-law
  • Basic audio
  • Free Lossless Audio Codec (FLAC)
  • Linear 16-bit Pulse-Code Modulation (PCM)
  • MP3 (or MPEG)
  • Mu-law (or u-law)
  • Ogg or Web Media (WebM) audio with the Opus or Vorbis codec
  • Waveform Audio File Format (WAV)

Different formats support different sampling rates and other characteristics. For more information, see Using audio formats.
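
One way to see why formats differ in their characteristics: raw linear 16-bit PCM is only a stream of samples with no header, so its sampling rate must be communicated separately, whereas a WAV file records the rate and sample width in its header. The following sketch uses only the Python standard library and a locally generated tone rather than output from the service; the 22,050 Hz rate is an arbitrary choice for illustration.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 22050  # samples per second; an arbitrary choice for this sketch


def tone_as_pcm(freq_hz=440.0, seconds=0.5, rate=SAMPLE_RATE):
    """Return a sine tone as raw little-endian 16-bit mono PCM bytes."""
    n = int(rate * seconds)
    samples = (
        int(0.3 * 32767 * math.sin(2 * math.pi * freq_hz * i / rate))
        for i in range(n)
    )
    return b"".join(struct.pack("<h", s) for s in samples)


def wrap_pcm_in_wav(pcm, rate=SAMPLE_RATE):
    """Wrap raw 16-bit mono PCM in a WAV container that records its rate."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)     # mono
        wav.setsampwidth(2)     # 2 bytes = 16 bits per sample
        wav.setframerate(rate)
        wav.writeframes(pcm)
    return buf.getvalue()


pcm = tone_as_pcm()
wav_bytes = wrap_pcm_in_wav(pcm)

# Read the header back: the WAV container carries the rate and sample width.
with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
    print(wav.getframerate(), wav.getsampwidth(), wav.getnframes())
```

The PCM bytes are identical in both cases; the WAV version simply adds a header, which is why a player can open it without being told the sampling rate.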

Beta features

IBM occasionally releases features and language support that are classified as beta. Such features are provided so that you can evaluate their functionality. They might be unstable and are subject to change or removal on short notice. They are not intended for use in a production environment.

Beta features might not provide the same level of performance or compatibility as generally available features. Generally available features are ready for use in a production environment.

Pricing

IBM Cloud

The service offers multiple pricing plans to suit your usage and application needs. For more information about the pricing plans or to purchase a plan, see the Text to Speech service in the IBM Cloud® Catalog.