Release notes for Speech to Text for IBM Cloud
IBM Cloud
The following features and changes were included for each release and update of managed instances of IBM Watson® Speech to Text that are hosted on IBM Cloud or for instances that are hosted on IBM Cloud Pak for Data as a Service. Unless otherwise noted, all changes are compatible with earlier releases and are automatically and transparently available to all new and existing applications.
For information about known limitations of the service, see Known limitations.
For information about releases and updates of the service for IBM Cloud Pak for Data, see Release notes for Speech to Text for IBM Cloud Pak for Data.
19 November 2024
- New large speech model for German is now generally available
-
The large speech model for German is now generally available.
- For more information about large speech models, see Large speech languages and models.
- For more information about the features that are supported for large speech models, see Supported features for large speech models.
23 August 2024
- All Large Speech Models are now generally available
-
The large speech models for all languages are now generally available (GA). They are supported for use in production environments and applications.
- For more information about large speech models, see Large speech languages and models.
- For more information about the features that are supported for large speech models, see Supported features for large speech models.
18 June 2024
- New large speech models for Brazilian Portuguese and Spanish are now in open beta
-
The large speech models for Brazilian Portuguese and Spanish are now in open beta. Spanish includes the Castilian, Argentinian, Chilean, Colombian, Mexican, and Peruvian dialects.
- For more information about large speech models, see Large speech languages and models.
- For more information about the features that are supported for large speech models, see Supported features for large speech models.
15 May 2024
- Large Speech Model for English is now generally available
-
The large speech model for English, which includes the United States, Australian, Indian, and United Kingdom dialects, is now generally available (GA). It is supported for use in production environments and applications.
- For more information about large speech models, see Large speech languages and models.
- For more information about the features that are supported for large speech models, see Supported features for large speech models.
07 March 2024
- Large Speech Model for US English in Open Beta
- The new large speech model for US English is in open beta. See Large speech languages and models for more details about the supported features (beta).
30 November 2023
- Speech to Text parameter speech_begin_event
-
This parameter enables the client application to know that some words or speech have been detected and that Speech to Text is in the process of decoding. For more details, see Using speech recognition parameters.
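As an illustration only, the following is a minimal sketch of passing the speech_begin_event parameter on a recognition request with Python's requests library. The service URL, API key, audio file, and model name are placeholders, and you should confirm in Using speech recognition parameters which interfaces accept the parameter.

```python
# Hypothetical sketch: pass speech_begin_event on a synchronous recognition request.
import requests

STT_URL = "https://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/YOUR_INSTANCE_ID"  # placeholder
API_KEY = "YOUR_API_KEY"  # placeholder

with open("audio.wav", "rb") as audio_file:
    response = requests.post(
        f"{STT_URL}/v1/recognize",
        params={"model": "en-US_Telephony", "speech_begin_event": "true"},
        headers={"Content-Type": "audio/wav"},
        data=audio_file,
        auth=("apikey", API_KEY),
    )
print(response.json())
```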
- Parameter 'mapping_only' for custom words
-
By using the 'mapping_only' parameter, you can use custom words to map a 'sounds_like' (or word) value directly to a 'display_as' value as a post-processing step, instead of training the model. For more information, see The words resource.
-
See the guidelines for Non-Japanese and Japanese.
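For illustration, here is a minimal sketch of adding a custom word that uses this mapping behavior through the words resource. The service URL, API key, customization ID, and word values are placeholders, and the exact shape of the mapping_only field (assumed here to be a per-word boolean) should be confirmed in The words resource.

```python
# Hypothetical sketch: add a word whose sounds_like maps directly to display_as.
import requests

STT_URL = "https://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/YOUR_INSTANCE_ID"  # placeholder
API_KEY = "YOUR_API_KEY"  # placeholder
CUSTOMIZATION_ID = "YOUR_CUSTOMIZATION_ID"  # placeholder

word = {
    "sounds_like": ["eye triple e"],   # placeholder example
    "display_as": "IEEE",
    "mapping_only": True,              # assumption: boolean flag; confirm in the docs
}
response = requests.put(
    f"{STT_URL}/v1/customizations/{CUSTOMIZATION_ID}/words/IEEE",
    json=word,
    auth=("apikey", API_KEY),
)
print(response.status_code)
```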
- Support for Brazilian Portuguese and French Canadian on the new, improved next-generation language model customization
-
Language model customization for Brazilian Portuguese and French Canadian next-generation models was recently added. This service update includes further internal improvements.
- New Smart Formatting Feature
-
A new smart formatting feature for next-generation models is supported for US English, Brazilian Portuguese, French, and German. See Smart formatting Version for details.
- Support for Castilian Spanish and LATAM Spanish on new improved next-generation language model customization
-
Language model customization for Castilian Spanish and LATAM Spanish next-generation models has been added. This service update includes further internal improvements.
- Large Speech Models for English, Japanese, and French - for early access
-
As an early access feature, Large Speech Models are available for English, Japanese, and French in IBM Watson Speech to Text and IBM watsonx Assistant. The feature set for these Large Speech Models is limited, but they are more accurate than next-generation models, and they are faster and cheaper to run because of their smaller size and better streaming-mode capability.
If you are interested in testing these base models, and sharing results and feedback, contact our Product Management team by filling out this form.
28 July 2023
- Important: All previous-generation models are discontinued starting August 1, 2023
- Important: All previous-generation models are now discontinued from the service. New clients must now only use the next-generation models. All existing clients must now migrate to the equivalent next-generation model. For more information about all next-generation models, see Next-generation languages and models. For more information on how to migrate to the next-generation models, see Migrating to next-generation models.
9 June 2023
- Defect fix: Creating and training a custom Language Model is now optimal for both standard and low-latency Next-Generation models
Defect fix: When creating and training a custom Language Model with corpora text files and/or custom words by using a next-generation low-latency model, performance is now the same as with a standard model. Previously, performance was suboptimal only when a next-generation low-latency model was used.
- Defect fix: STT Websockets sessions no longer fail due to tensor error message
- Defect fix: When using STT websockets, sessions no longer fail due to an error message “STT returns the error: Sizes of tensors must match except in dimension 0”.
18 May 2023
- Updates to English next-generation Medical telephony model
-
The English next-generation Medical telephony model has been updated for improved speech recognition:
en-WW_Medical_Telephony
- Added support for French and German on new improved next-generation language model customization
-
Language model customization for French and German next-generation models was recently added. This service update includes further internal improvements.
For more information about improved next-generation customization, see
- Defect fix: Custom words containing half-width Katakana characters now return a clear error message with Japanese Telephony model
-
Defect fix: Per the documentation, only full-width Katakana characters are accepted in custom words, and the next-generation models now return an error message to explain that half-width characters are not supported. Previously, when custom words that contained half-width Katakana characters were created, no error message was provided.
- Defect fix: Japanese Telephony language model no longer fails due to long training time
-
Defect fix: When training a custom language model with Japanese Telephony, the service now effectively handles large numbers of custom words without failing.
2 May 2023
- New procedure for upgrading a custom model that is based on an improved next-generation model
-
Two approaches are now available to upgrade a custom language model to an improved next-generation base model. You can still modify and then retrain the custom model, as already documented. But now, you can also upgrade the custom model by including the query parameter force=true with the POST /v1/customizations/{customization_id}/train request. The force parameter upgrades the custom model regardless of whether it contains changes (is in the ready or available state).
For more information, see Upgrading a custom language model based on an improved next-generation model.
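For illustration, a minimal sketch of the forced upgrade with Python's requests library follows; the service URL, API key, and customization ID are placeholders.

```python
# Hypothetical sketch: force retraining to upgrade a custom model to the new base model.
import requests

STT_URL = "https://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/YOUR_INSTANCE_ID"  # placeholder
API_KEY = "YOUR_API_KEY"  # placeholder
CUSTOMIZATION_ID = "YOUR_CUSTOMIZATION_ID"  # placeholder

response = requests.post(
    f"{STT_URL}/v1/customizations/{CUSTOMIZATION_ID}/train",
    params={"force": "true"},  # upgrade even if the model has no pending changes
    auth=("apikey", API_KEY),
)
print(response.status_code)
```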
- Guidance for adding words to custom models that are based on improved next-generation models
-
The documentation now offers more guidance about adding words to custom models that are based on improved next-generation models. For performance reasons during training, the guidance encourages the use of corpora rather than the direct addition of custom words whenever possible.
For more information, see Guidelines for adding words to custom models based on improved next-generation models.
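As a sketch of the recommended corpus-based approach, the following adds a plain-text corpus to a custom model with Python's requests library; the service URL, API key, customization ID, corpus name, and file are placeholders.

```python
# Hypothetical sketch: add a corpus instead of many individual custom words.
import requests

STT_URL = "https://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/YOUR_INSTANCE_ID"  # placeholder
API_KEY = "YOUR_API_KEY"  # placeholder
CUSTOMIZATION_ID = "YOUR_CUSTOMIZATION_ID"  # placeholder

with open("domain-sentences.txt", "rb") as corpus_file:
    response = requests.post(
        f"{STT_URL}/v1/customizations/{CUSTOMIZATION_ID}/corpora/domain-sentences",
        params={"allow_overwrite": "true"},
        headers={"Content-Type": "text/plain"},
        data=corpus_file,
        auth=("apikey", API_KEY),
    )
print(response.status_code)
```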
- Japanese custom words for custom models that are based on improved next-generation models are handled differently
-
For Japanese custom models that are based on next-generation models, custom words are handled differently from other languages. For Japanese, you can add a custom word or sounds-like that does not exceed 25 characters in length. If your custom word or sounds-like exceeds that limit, the service adds the word to the custom model as if it were added by a corpus. The word does not appear as a custom word for the model.
For more information, see Guidelines for adding words to Japanese models based on improved next-generation models.
12 April 2023
- Defect fix: The WebSocket interface now times out as expected when using next-generation models
Defect fix: When used for speech recognition with next-generation models, the WebSocket interface now times out as expected after long periods of silence. Previously, when used for speech recognition of short audio files, the WebSocket session could fail to time out. When the session failed to time out, the service did not return a final hypothesis to the waiting client application, and the client instead timed out while waiting for the results.
6 April 2023
- Defect fix: Limits to allow completion of training for next-generation Japanese custom models
-
Defect fix: Successful training of a next-generation Japanese custom language model requires that custom words and sounds-likes added to the model each contain no more than 25 characters. For the most effective training, it is recommended that custom words and sounds-likes contain no more than 20 characters. Training of Japanese custom models with longer custom words and sounds-likes fails to complete after multiple hours of training.
If you need to add the equivalent of a long word or sounds-like to a next-generation Japanese custom model, take these steps:
- Add a shorter word or sounds-like that captures the essence of the longer word or sounds-like to the custom model.
- Add one or more sentences that use the longer word or sounds-like to a corpus.
- Consider adding sentences to the corpus that provide more context for the word or sounds-like. Greater context gives the service more information with which to recognize the word and apply the correct sounds-like.
- Add the corpus to the custom model.
- Retrain the custom model on the combination of the shorter word or sounds-like and the corpus that contains the longer string.
The limits and steps just described allow next-generation Japanese custom models to complete training. Keep in mind that adding large numbers of new custom words to a custom language model increases the training time of the model. But the increased training time occurs only when the custom model is initially trained on the new words. Once the custom model has been trained on the new words, training time returns to normal.
For more information, see
- Further improvements to updated next-generation language model customization
-
Language model customization for English and Japanese next-generation models was recently improved. This service update includes further internal improvements. For more information about improved next-generation customization, see
13 March 2023
- Defect fix: Smart formatting for US English dates is now correct
- Defect fix: Smart formatting now correctly includes days of the week and dates when both are present in the spoken audio, for example, Tuesday February 28. Previously, in some cases the day of the week was omitted and the date was presented incorrectly. Note that smart formatting is beta functionality.
- Defect fix: Update documentation for speech hesitation words for next-generation models
- Defect fix: Documentation for speech hesitation words for next-generation models has been updated. More details are provided about US English and Japanese hesitation words. Next-generation models include the actual hesitation words in transcription results, unlike previous-generation models, which include only hesitation markers. For more information, see Speech hesitations and hesitation markers.
27 February 2023
- New Japanese next-generation telephony model
-
The service now offers a next-generation telephony model for Japanese: ja-JP_Telephony. The new model supports low latency and is generally available. It also supports language model customization and grammars. For more information about next-generation models and low latency, see
- Improved language model customization for next-generation English and Japanese models
-
The service now provides improved language model customization for next-generation English and Japanese models:
en-AU_Multimedia
en-AU_Telephony
en-IN_Telephony
en-GB_Multimedia
en-GB_Telephony
en-US_Multimedia
en-US_Telephony
ja-JP_Multimedia
ja-JP_Telephony
Visible improvements to the models: The new technology improves the default behavior of the new English and Japanese models. Among other changes, the new technology optimizes the default behavior for the following parameters:
- The default customization_weight for custom models that are based on the new versions of these models changes from 0.2 to 0.1.
- The default character_insertion_bias for custom models that are based on the new versions of these models remains 0.0, but the models have changed in a manner that makes use of the parameter for speech recognition less necessary.
Upgrading to the new models: To take advantage of the improved technology, you must upgrade any custom language models that are based on the new models. To upgrade to the new version of one of these base models, do the following:
- Change your custom model by adding or modifying a custom word, corpus, or grammar that the model contains. Any change that you make moves the model to the ready state.
- Use the POST /v1/customizations/{customization_id}/train method to retrain the model. Retraining upgrades the custom model to the new technology and moves the model to the available state.
Known issue: At this time, you cannot use the POST /v1/customizations/{customization_id}/upgrade_model method to upgrade a custom model to one of the new base models. This issue will be addressed in a future release.
Using the new models: Following the upgrade to the new base model, you are advised to evaluate the performance of the upgraded custom model by paying special attention to the customization_weight and character_insertion_bias parameters for speech recognition. When you retrain your custom model:
- The custom model uses the new default customization_weight of 0.1 for your custom model. A non-default customization_weight that you had associated with your custom model is removed.
- The custom model might no longer require use of the character_insertion_bias parameter for optimal speech recognition.
Improvements to language model customization render these parameters less important for high-quality speech recognition:
- If you use the default values for these parameters, continue to do so after the upgrade. The default values will likely continue to offer the best results for speech recognition.
- If you specify non-default values for these parameters, experiment with the default values following upgrade. Your custom model might work well for speech recognition with the default values.
If you feel that using different values for these parameters might improve speech recognition with your custom model, experiment with incremental changes to determine whether the parameters are needed to improve speech recognition.
Note: At this time, the improvements to language model customization apply only to custom models that are based on the next-generation English or Japanese base language models listed earlier. Over time, the improvements will be made available for other next-generation language models.
More information: For more information about upgrading and about speech recognition with these parameters, see
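For illustration, here is a minimal sketch of a recognition request that sets customization_weight explicitly while using a custom model; the service URL, API key, customization ID, and audio file are placeholders.

```python
# Hypothetical sketch: recognition with a custom model and an explicit customization_weight.
import requests

STT_URL = "https://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/YOUR_INSTANCE_ID"  # placeholder
API_KEY = "YOUR_API_KEY"  # placeholder
CUSTOMIZATION_ID = "YOUR_CUSTOMIZATION_ID"  # placeholder

with open("audio.wav", "rb") as audio_file:
    response = requests.post(
        f"{STT_URL}/v1/recognize",
        params={
            "model": "en-US_Telephony",
            "language_customization_id": CUSTOMIZATION_ID,
            "customization_weight": 0.1,  # new default for upgraded English and Japanese custom models
        },
        headers={"Content-Type": "audio/wav"},
        data=audio_file,
        auth=("apikey", API_KEY),
    )
print(response.json())
```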
- Defect fix: Grammar files now handle strings of digits correctly
-
Defect fix: When grammars are used, the service now handles longer strings of digits correctly. Previously, the service failed to complete recognition or returned incorrect results.
15 February 2023
- Important: All previous-generation models are deprecated and will reach end of service on 31 July 2023
-
Important: All previous-generation models are deprecated and will reach end of service effective 31 July 2023. On that date, all previous-generation models will be removed from the service and the documentation. The previous deprecation date was 3 March 2023. The new date allows users more time to migrate to the appropriate next-generation models. But users must migrate to the equivalent next-generation model by 31 July 2023.
Most previous-generation models were deprecated on 15 March 2022. Previously, the Arabic and Japanese models were not deprecated. Deprecation now applies to all previous-generation models.
- For more information about the next-generation models to which you can migrate from each of the deprecated models, see Previous-generation languages and models
- For more information about migrating from previous-generation to next-generation models, see Migrating to next-generation models.
- For more information about all next-generation models, see Next-generation languages and models
Note: When the previous-generation en-US_BroadbandModel is removed from service, the next-generation en-US_Multimedia model will become the default model for speech recognition requests.
- Defect fix: Improved training time for next-generation custom language models
-
Defect fix: Training time for next-generation custom language models is now significantly improved. Previously, training time took much longer than necessary, as reported for training of Japanese custom language models. The problem was corrected by an internal fix.
- Defect fix: Dynamically generated grammar files now work properly
-
Defect fix: Dynamically generated grammar files now work properly. Previously, dynamic grammar files could cause internal failures, as reported for integration of Speech to Text with IBM® watsonx™ Assistant. The problem was corrected by an internal fix.
20 January 2023
- Deprecated Arabic and United Kingdom model names are no longer available
-
The following Arabic and United Kingdom model names are no longer accepted by the service:
- ar-AR_BroadbandModel - Use ar-MS_BroadbandModel instead.
- en-UK_NarrowbandModel - Use en-GB_NarrowbandModel instead.
- en-UK_BroadbandModel - Use en-GB_BroadbandModel instead.
The Arabic model name was deprecated on 2 December 2020. The UK English model names were deprecated on 14 July 2017.
- Cloud Foundry deprecation and migration to resource groups
-
IBM announced the deprecation of IBM Cloud Foundry on 31 May 2022. As of 30 November 2022, new IBM Cloud Foundry applications cannot be created and only existing users are able to deploy applications. IBM Cloud Foundry reaches end of support on 1 June 2023. At that time, any IBM Cloud Foundry application runtime instances running IBM Cloud Foundry applications will be permanently disabled, deprovisioned, and deleted.
To continue to use your IBM Cloud applications beyond 1 June 2023, you must migrate to resource groups before that date. Resource groups are conceptually similar to Cloud Foundry spaces. They include several extra benefits, such as finer-grained access control by using IBM Cloud Identity and Access Management (IAM), the ability to connect service instances to apps and service across different regions, and an easy way to view usage per group.
- The max_alternatives parameter is now available for use with next-generation models
-
The max_alternatives parameter is now available for use with all next-generation models. The parameter is generally available for all next-generation models. For more information, see Maximum alternatives.
- Defect fix: Allow use of both max_alternatives and end_of_phrase_silence_time parameters with next-generation models
-
Defect fix: When you use both the max_alternatives and end_of_phrase_silence_time parameters in the same request with next-generation models, the service now returns multiple alternative transcripts while also respecting the indicated pause interval. Previously, use of the two parameters in a single request generated a failure. (Use of the max_alternatives parameter with next-generation models was previously available as an experimental feature to a limited number of customers.)
- Defect fix: Update French Canadian next-generation telephony model (upgrade required)
-
Defect fix: The French Canadian next-generation telephony model, fr-CA_Telephony, was updated to address an internal inconsistency that could cause an error during speech recognition. You need to upgrade any custom models that are based on the fr-CA_Telephony model. For more information about upgrading custom models, see
- Defect fix: Add documentation guidelines for creating Japanese sounds-likes based on next-generation models
-
Defect fix: In sounds-likes for Japanese custom language models that are based on next-generation models, the character-sequence ウー is ambiguous in some left contexts. Do not use characters (syllables) that end with the phoneme /o/, such as ロ and ト. In such cases, use ウウ or just ウ instead of ウー. For example, use ロウウマン or ロウマン instead of ロウーマン. For more information, see Guidelines for Japanese.
- Adding words directly to custom models that are based on next-generation models increases the training time
-
Adding custom words directly to a custom model that is based on a next-generation model causes training of a model to take a few minutes longer than it otherwise would. If you are training a model with custom words that you added by using the POST /v1/customizations/{customization_id}/words or PUT /v1/customizations/{customization_id}/words/{word_name} method, allow for some minutes of extra training time for the model. For more information, see
- Maximum hours of audio resources for custom acoustic models in the Tokyo location has been increased
-
The maximum hours of audio resources that you can add to custom acoustic models in the Tokyo location is again 200 hours. Previously, the maximum was reduced to 50 hours for the Tokyo region. That reduction has been rescinded and postponed until next year. For more information, see Maximum hours of audio.
5 December 2022
- New Netherlands Dutch next-generation multimedia model
- The service now offers a next-generation multimedia model for Netherlands Dutch: nl-NL_Multimedia. The new model supports low latency and is generally available. It also supports language model customization and grammars. For more information about next-generation models and low latency, see
- Defect fix: Correct custom word recognition in transcription results for next-generation models
- Defect fix: For language model customization with next-generation models, custom words are now recognized and used in all transcripts. Previously, custom words sometimes failed to be recognized and used in transcription results.
- Defect fix: Correct use of display_as field in transcription results for next-generation models
- Defect fix: For language model customization with next-generation models, the value of the display_as field for a custom word now appears in all transcripts. Previously, the value of the word field sometimes appeared in transcription results.
- Defect fix: Update custom model naming documentation
- Defect fix: The documentation now provides detailed rules for naming custom language models and custom acoustic models. For more information, see
20 October 2022
- Updates to English next-generation telephony models
-
The English next-generation telephony models have been updated for improved speech recognition:
en-AU_Telephony
en-GB_Telephony
en-IN_Telephony
en-US_Telephony
All of these models continue to support low latency. You do not need to upgrade custom models that are based on these models. For more information about all available next-generation models, see Next-generation languages and models.
- Defect fix: Update Japanese next-generation multimedia model (upgrade required)
-
Defect fix: The Japanese next-generation multimedia model, ja-JP_Multimedia, was updated to address an internal inconsistency that could cause an error during speech recognition with low latency. You need to upgrade any custom models that are based on the ja-JP_Multimedia model. For more information about upgrading custom models, see
7 October 2022
- New Swedish next-generation telephony model
-
The service now offers a next-generation telephony model for Swedish: sv-SE_Telephony. The new model supports low latency and is generally available. It also supports language model customization and grammars. For more information about next-generation models and low latency, see
- Updates to English next-generation telephony models
-
The English next-generation telephony models have been updated for improved speech recognition:
en-AU_Telephony
en-GB_Telephony
en-IN_Telephony
en-US_Telephony
All of these models continue to support low latency. You do not need to upgrade custom models that are based on these models. For more information about all available next-generation models, see Next-generation languages and models.
21 September 2022
- New Activity Tracker event for GDPR deletion of user information
-
The service now returns an Activity Tracker event when you use the DELETE /v1/user_data method to delete all information about a user. The event is named speech-to-text.gdpr-user-data.delete. For more information, see Activity Tracker events.
- Defect fix: Update some next-generation models to improve low-latency response time
-
Defect fix: The following next-generation models were updated to improve their response time when the low_latency parameter is used:
en-IN_Telephony
hi-IN_Telephony
it-IT_Multimedia
nl-NL_Telephony
Previously, these models did not return recognition results as quickly as expected when the low_latency parameter was used. You do not need to upgrade custom models that are based on these models. For more information about all available next-generation models, see Next-generation languages and models.
19 August 2022
- Important: Deprecation date for most previous-generation models is now 3 March 2023
-
Superseded: This deprecation notice is superseded by the 15 February 2023 service update. The end of service date for all previous-generation models is now 31 July 2023.
On 15 March 2022, the previous-generation models for all languages other than Arabic and Japanese were deprecated. At that time, the deprecated models were to remain available until 15 September 2022. To allow users more time to migrate to the appropriate next-generation models, the deprecated models will now remain available until 3 March 2023. As with the initial deprecation notice, the Arabic and Japanese previous-generation models are not deprecated. For a complete list of all deprecated models, see the 15 March 2022 service update.
On 3 March 2023, the deprecated models will be removed from the service and the documentation. If you use any of the deprecated models, you must migrate to the equivalent next-generation model by 3 March 2023.
- For more information about the next-generation models to which you can migrate from each of the deprecated models, see Previous-generation languages and models
- For more information about the next-generation models, see Next-generation languages and models
- For more information about migrating from previous-generation to next-generation models, see Migrating to next-generation models.
Note: When the previous-generation en-US_BroadbandModel is removed from service, the next-generation en-US_Multimedia model will become the default model for speech recognition requests.
15 August 2022
- New French Canadian next-generation multimedia model
-
The service now offers a next-generation multimedia model for French Canadian: fr-CA_Multimedia. The new model supports low latency and is generally available. It also supports language model customization and grammars. For more information about next-generation models and low latency, see
- Updates to English next-generation telephony models
-
The English next-generation telephony models have been updated for improved speech recognition:
en-AU_Telephony
en-GB_Telephony
en-IN_Telephony
en-US_Telephony
All of these models continue to support low latency. You do not need to upgrade custom models that are based on these models. For more information about all available next-generation models, see Next-generation languages and models.
- Italian next-generation multimedia model now supports low latency
-
The Italian next-generation multimedia model, it-IT_Multimedia, now supports low latency. For more information about next-generation models and low latency, see
- Important: Maximum hours of audio data being reduced for custom acoustic models
-
Important: The maximum amount of audio data that you can add to a custom acoustic model is being reduced from 200 hours to 50 hours. This change is being phased into different locations from August to September 2022. For information about the schedule for the limit reduction and what it means for existing custom acoustic models that contain more than 50 hours of audio, see Maximum hours of audio.
3 August 2022
- Defect fix: Update speech hesitations and hesitation markers documentation
-
Defect fix: Documentation for speech hesitations and hesitation markers has been updated. Previous-generation models include hesitation markers in place of speech hesitations in transcription results for most languages; smart formatting removes hesitation markers from US English final transcripts. Next-generation models include the actual speech hesitations in transcription results; smart formatting has no effect on their inclusion in final transcription results.
For more information, see:
1 June 2022
- Updates to multiple next-generation telephony models
-
The following next-generation telephony models have been updated for improved speech recognition:
en-AU_Telephony
en-GB_Telephony
en-IN_Telephony
en-US_Telephony
ko-KR_Telephony
You do not need to upgrade custom models that are based on these models. For more information about all available next-generation models, see Next-generation languages and models.
25 May 2022
- New beta character_insertion_bias parameter for next-generation models
-
All next-generation models now support a new beta parameter, character_insertion_bias, which is available with all speech recognition interfaces. By default, the service is optimized for each individual model to balance its recognition of candidate strings of different lengths. The model-specific bias is equivalent to 0.0. Each model's default bias is sufficient for most speech recognition requests.
However, certain use cases might benefit from favoring hypotheses with shorter or longer strings of characters. The parameter accepts values between -1.0 and 1.0 that represent a change from a model's default. Negative values instruct the service to favor shorter strings of characters. Positive values direct the service to favor longer strings of characters. For more information, see Character insertion bias.
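For illustration, a minimal sketch of setting character_insertion_bias on a recognition request follows; the service URL, API key, audio file, and chosen bias value are placeholders.

```python
# Hypothetical sketch: bias the recognizer toward shorter candidate strings.
import requests

STT_URL = "https://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/YOUR_INSTANCE_ID"  # placeholder
API_KEY = "YOUR_API_KEY"  # placeholder

with open("audio.wav", "rb") as audio_file:
    response = requests.post(
        f"{STT_URL}/v1/recognize",
        params={
            "model": "en-US_Telephony",
            "character_insertion_bias": -0.5,  # negative favors shorter strings, positive favors longer
        },
        headers={"Content-Type": "audio/wav"},
        data=audio_file,
        auth=("apikey", API_KEY),
    )
print(response.json())
```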
19 May 2022
- New Italian it-IT_Multimedia next-generation model
-
The service now offers a next-generation multimedia model for Italian: it-IT_Multimedia. The new model is generally available. It does not support low latency, but it does support language model customization and grammars. For more information about all available next-generation models, see Next-generation languages and models.
- Updated Korean telephony and multimedia next-generation models
-
The existing Korean next-generation models have been updated:
- The ko-KR_Telephony model has been updated for improved low-latency support for speech recognition.
- The ko-KR_Multimedia model has been updated for improved speech recognition. The model now also supports low latency.
Both models are generally available, and both support language model customization and grammars. You do not need to upgrade custom language models that are based on these models. For more information about all available next-generation models, see Next-generation languages and models.
- Defect fix: Confidence scores are now reported for all transcription results
-
Defect fix: Confidence scores are now reported for all transcription results. Previously, when the service returned multiple transcripts for a single speech recognition request, confidence scores might not be returned for all transcripts.
11 April 2022
- New Brazilian Portuguese pt-BR_Multimedia next-generation model
-
The service now offers a next-generation multimedia model for Brazilian Portuguese: pt-BR_Multimedia. The new model supports low latency and is generally available. It also supports language model customization and grammars. For more information about next-generation models and low latency, see
- Update to German de-DE_Multimedia next-generation model to support low latency
-
The next-generation German model, de-DE_Multimedia, now supports low latency. You do not need to upgrade custom models that are based on the updated German base model. For more information about the next-generation models and low latency, see
- Support for sounds-like is now documented for custom models based on next-generation models
-
For custom language models that are based on next-generation models, support is now documented for sounds-like specifications for custom words. Support for sounds-likes has been available since late 2021.
Differences exist between the use of the sounds_like field for custom models that are based on next-generation and previous-generation models. For more information about using the sounds_like field with custom models that are based on next-generation models, see Working with custom words for next-generation models.
- Important: Deprecated customization_id parameter removed from the documentation
-
Important: On 9 October 2018, the customization_id parameter of all speech recognition requests was deprecated and replaced by the language_customization_id parameter. The customization_id parameter has now been removed from the documentation for the speech recognition methods:
/v1/recognize for WebSocket requests
POST /v1/recognize for synchronous HTTP requests (including multipart requests)
POST /v1/recognitions for asynchronous HTTP requests
Note: If you use the Watson SDKs, make sure that you have updated any application code to use the language_customization_id parameter instead of the customization_id parameter. The customization_id parameter will no longer be available from the equivalent methods of the SDKs as of their next major release. For more information about the speech recognition methods, see the API & SDK reference.
17 March 2022
- Grammar support for next-generation models is now generally available
-
Grammar support is now generally available (GA) for next-generation models that meet the following conditions:
- The models are generally available.
- The models support language model customization.
For more information, see the following topics:
- For more information about the status of grammar support for next-generation models, see Customization support for next-generation models.
- For more information about grammars, see Grammars.
- New German next-generation multimedia model
-
The service now offers a next-generation multimedia model for German: de-DE_Multimedia. The new model is generally available. It does not support low latency. It does support language model customization (generally available) and grammars (beta).
For more information about all available next-generation models and their customization support, see
- Beta next-generation en-WW_Medical_Telephony model now supports low latency
-
The beta next-generation en-WW_Medical_Telephony model now supports low latency. For more information about all next-generation models and low latency, see
15 March 2022
- Important: Deprecation of most previous-generation models
-
Superseded: This deprecation notice is superseded by the 15 February 2023 service update. The end of service date for all previous-generation models is now 31 July 2023.
Effective 15 March 2022, previous-generation models for all languages other than Arabic and Japanese are deprecated. The deprecated models remain available until 15 September 2022, when they will be removed from the service and the documentation. The Arabic and Japanese previous-generation models are not deprecated.
The following previous-generation models are now deprecated:
- Chinese (Mandarin): zh-CN_NarrowbandModel and zh-CN_BroadbandModel
- Dutch (Netherlands): nl-NL_NarrowbandModel and nl-NL_BroadbandModel
- English (Australian): en-AU_NarrowbandModel and en-AU_BroadbandModel
- English (United Kingdom): en-GB_NarrowbandModel and en-GB_BroadbandModel
- English (United States): en-US_NarrowbandModel, en-US_BroadbandModel, and en-US_ShortForm_NarrowbandModel
- French (Canadian): fr-CA_NarrowbandModel and fr-CA_BroadbandModel
- French (France): fr-FR_NarrowbandModel and fr-FR_BroadbandModel
- German: de-DE_NarrowbandModel and de-DE_BroadbandModel
- Italian: it-IT_NarrowbandModel and it-IT_BroadbandModel
- Korean: ko-KR_NarrowbandModel and ko-KR_BroadbandModel
- Portuguese (Brazilian): pt-BR_NarrowbandModel and pt-BR_BroadbandModel
- Spanish (Argentinian): es-AR_NarrowbandModel and es-AR_BroadbandModel
- Spanish (Castilian): es-ES_NarrowbandModel and es-ES_BroadbandModel
- Spanish (Chilean): es-CL_NarrowbandModel and es-CL_BroadbandModel
- Spanish (Colombian): es-CO_NarrowbandModel and es-CO_BroadbandModel
- Spanish (Mexican): es-MX_NarrowbandModel and es-MX_BroadbandModel
- Spanish (Peruvian): es-PE_NarrowbandModel and es-PE_BroadbandModel
If you use any of these deprecated models, you must migrate to the equivalent next-generation model by the end of service date.
- For more information about the next-generation models to which you can migrate from each of the deprecated models, see Previous-generation languages and models
- For more information about the next-generation models, see Next-generation languages and models
- For more information about migrating from previous-generation to next-generation models, see Migrating to next-generation models.
Note: When the previous-generation en-US_BroadbandModel is removed from service on 15 September, the next-generation en-US_Multimedia model will become the default model for speech recognition requests.
- Next-generation models now support audio-parsing parameters
-
All next-generation models now support the following audio-parsing parameters as generally available features:
- end_of_phrase_silence_time specifies the duration of the pause interval at which the service splits a transcript into multiple final results. For more information, see End of phrase silence time.
- split_transcript_at_phrase_end directs the service to split the transcript into multiple final results based on semantic features of the input. For more information, see Split transcript at phrase end.
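For illustration, here is a minimal sketch of a recognition request that uses both audio-parsing parameters together; the service URL, API key, audio file, and chosen values are placeholders.

```python
# Hypothetical sketch: split final results on silence and on semantic phrase ends.
import requests

STT_URL = "https://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/YOUR_INSTANCE_ID"  # placeholder
API_KEY = "YOUR_API_KEY"  # placeholder

with open("audio.wav", "rb") as audio_file:
    response = requests.post(
        f"{STT_URL}/v1/recognize",
        params={
            "model": "en-US_Multimedia",
            "end_of_phrase_silence_time": 1.5,         # seconds of silence before a split
            "split_transcript_at_phrase_end": "true",  # also split at semantic phrase boundaries
        },
        headers={"Content-Type": "audio/wav"},
        data=audio_file,
        auth=("apikey", API_KEY),
    )
for result in response.json().get("results", []):
    print(result["alternatives"][0]["transcript"])
```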
- Defect fix: Correct speaker labels documentation
-
Defect fix: Documentation of speaker labels included the following erroneous statement in multiple places: For next-generation models, speaker labels are not supported for use with interim results or low latency. Speaker labels are supported for use with interim results and low latency for next-generation models. For more information, see Speaker labels.
28 February 2022
- Updates to English and French next-generation multimedia models to support low latency
-
The following multimedia models have been updated to support low latency:
- Australian English:
en-AU_Multimedia
- UK English:
en-GB_Multimedia
- US English:
en-US_Multimedia
- French:
fr-FR_Multimedia
You do not need to upgrade custom language models that are built on these base models. For more information about the next-generation models and low latency, see
- New Castilian Spanish next-generation multimedia model
-
The service now offers a next-generation multimedia model for Castilian Spanish: es-ES_Multimedia. The new model supports low latency and is generally available. It also supports language model customization (generally available) and grammars (beta).
For more information about all available next-generation models and their customization support, see
11 February 2022
- Defect fix: Correct custom model upgrade and base model version documentation
-
Defect fix: The documentation that describes the upgrade of custom models and the version strings that are used for different versions of base models has been updated. The documentation now states that upgrade for language model customization also applies to next-generation models. Also, the version strings that represent different versions of base models have been updated. And the base_model_version parameter can also be used with upgraded next-generation models.
For more information about custom model upgrade, when upgrade is necessary, and how to use older versions of custom models, see
- Defect fix: Update capitalization documentation
-
Defect fix: The documentation that describes the service's automatic capitalization of transcripts has been updated. The service capitalizes appropriate nouns only for the following languages and models:
- All previous-generation US English models
- The next-generation German model
For more information, see Capitalization.
2 February 2022
- New beta en-WW_Medical_Telephony model is now available
-
A new beta next-generation en-WW_Medical_Telephony model is now available. The new model understands terms from the medical and pharmacological domains. Use the model in situations where you need to transcribe common medical terminology such as medicine names, product brands, medical procedures, illnesses, types of doctor, or COVID-19-related terminology. Common use cases include conversations between a patient and a medical provider (for example, a doctor, nurse, or pharmacist).
The new model is available for all supported English dialects: Australian, Indian, UK, and US. The new model supports language model customization and grammars as beta functionality. It supports most of the same parameters as the en-US_Telephony model, including smart_formatting for US English audio. It does not support the following parameters: low_latency, profanity_filter, redaction, and speaker_labels.
For more information, see The English medical telephony model.
- Update to Chinese zh-CN_Telephony model
-
The next-generation Chinese model zh-CN_Telephony has been updated for improved speech recognition. The model continues to support low latency. By default, the service automatically uses the updated model for all speech recognition requests. For more information about all available next-generation models, see Next-generation languages and models.
If you have custom language models that are based on the updated model, you must upgrade your existing custom models to take advantage of the updates by using the POST /v1/customizations/{customization_id}/upgrade_model method. For more information, see Upgrading custom models.
- Update to Japanese ja-JP_Multimedia next-generation model to support low latency
-
The next-generation Japanese model ja-JP_Multimedia now supports low latency. You can use the low_latency parameter with speech recognition requests that use the model. You do not need to upgrade custom models that are based on the updated Japanese base model. For more information about the next-generation models and low latency, see
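For illustration, a minimal sketch of the upgrade call mentioned above (upgrading an existing custom model after its base model is updated) follows; the service URL, API key, and customization ID are placeholders.

```python
# Hypothetical sketch: upgrade a custom language model to the updated base model.
import requests

STT_URL = "https://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/YOUR_INSTANCE_ID"  # placeholder
API_KEY = "YOUR_API_KEY"  # placeholder
CUSTOMIZATION_ID = "YOUR_CUSTOMIZATION_ID"  # placeholder

response = requests.post(
    f"{STT_URL}/v1/customizations/{CUSTOMIZATION_ID}/upgrade_model",
    auth=("apikey", API_KEY),
)
print(response.status_code)
```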
3 December 2021
- New Latin American Spanish next-generation telephony model
-
The service now offers a next-generation telephony model for Latin American Spanish: es-LA_Telephony. The new model supports low latency and is generally available.
The es-LA_Telephony model applies to all Latin American dialects. It is the equivalent of the previous-generation models that are available for the Argentinian, Chilean, Colombian, Mexican, and Peruvian dialects. If you used a previous-generation model for any of these specific dialects, use the es-LA_Telephony model to migrate to the equivalent next-generation model.
For more information about all available next-generation models, see Next-generation languages and models.
- Important: Custom language models based on certain next-generation models must be re-created
-
Important: If you created custom language models based on certain next-generation models, you must re-create the custom models. Until you re-create the custom language models, speech recognition requests that attempt to use the custom models fail with HTTP error code 400.
You need to re-create custom language models that you created based on the following versions of next-generation models:
- For the en-AU_Telephony model, custom models that you created from en-AU_Telephony.v2021-03-03 to en-AU_Telephony.v2021-10-04.
- For the en-GB_Telephony model, custom models that you created from en-GB_Telephony.v2021-03-03 to en-GB_Telephony.v2021-10-04.
- For the en-US_Telephony model, custom models that you created from en-US_Telephony.v2021-06-17 to en-US_Telephony.v2021-10-04.
- For the en-US_Multimedia model, custom models that you created from en-US_Multimedia.v2021-03-03 to en-US_Multimedia.v2021-10-04.
To identify the version of a model on which a custom language model is based, use the GET /v1/customizations method to list all of your custom language models or the GET /v1/customizations/{customization_id} method to list a specific custom language model. The versions field of the output shows the base model for a custom language model. For more information, see Listing custom language models.
To re-create a custom language model, first create a new custom model. Then add all of the previous custom model's corpora and custom words to the new model. You can then delete the previous custom model. For more information, see Creating a custom language model.
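For illustration, a minimal sketch of listing your custom models to check which base-model version each one uses follows; the service URL and API key are placeholders.

```python
# Hypothetical sketch: inspect the base model and versions of each custom language model.
import requests

STT_URL = "https://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/YOUR_INSTANCE_ID"  # placeholder
API_KEY = "YOUR_API_KEY"  # placeholder

response = requests.get(f"{STT_URL}/v1/customizations", auth=("apikey", API_KEY))
for custom_model in response.json().get("customizations", []):
    print(
        custom_model["customization_id"],
        custom_model["base_model_name"],
        custom_model.get("versions"),  # base-model version(s) the custom model is based on
    )
```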
28 October 2021
- New Chinese next-generation telephony model
-
The service now offers a next-generation telephony model for Mandarin Chinese: zh-CN_Telephony. The new model supports low latency and is generally available. For more information about all available next-generation models, see Next-generation languages and models.
- New Australian English and UK English next-generation multimedia models
-
The service now offers the following next-generation multimedia models. The new models are generally available, and neither model supports low latency.
- Australian English: en-AU_Multimedia
- UK English: en-GB_Multimedia
For more information about all available next-generation models, see Next-generation languages and models.
- Updates to multiple next-generation models for improved speech recognition
-
The following next-generation models have been updated for improved speech recognition:
- Australian English telephony model (en-AU_Telephony)
- UK English telephony model (en-GB_Telephony)
- US English multimedia model (en-US_Multimedia)
- US English telephony model (en-US_Telephony)
- Castilian Spanish telephony model (es-ES_Telephony)
For more information about all available next-generation models, see Next-generation languages and models.
- Grammar support for previous-generation models is now generally available
-
Grammar support is now generally available (GA) for previous-generation models that meet the following conditions:
- The models are generally available.
- The models support language model customization.
For more information, see the following topics:
- For more information about the status of grammar support for previous-generation models, see Customization support for previous-generation models.
- For more information about grammars, see Grammars.
- New beta grammar support for next-generation models
-
Grammar support is now available as beta functionality for all next-generation models. All next-generation models are generally available (GA) and support language model customization. For more information, see the following topics:
- For more information about the status of grammar support for next-generation models, see Customization support for next-generation models.
- For more information about grammars, see Grammars.
Note: Beta support for grammars by next-generation models is available for the Speech to Text service on IBM Cloud only. Grammars are not yet supported for next-generation models on IBM Cloud Pak for Data.
- New custom_acoustic_model field for supported features
-
The GET /v1/models and GET /v1/models/{model_id} methods now report whether a model supports acoustic model customization. The SupportedFeatures object now includes an additional field, custom_acoustic_model, a boolean that is true for a model that supports acoustic model customization and false otherwise. Currently, the field is true for all previous-generation models and false for all next-generation models.
- For more information about these methods, see Listing information about models.
- For more information about support for acoustic model customization, see Language support for customization.
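For illustration, a minimal sketch of reading this flag from the models listing follows; the service URL and API key are placeholders.

```python
# Hypothetical sketch: print whether each model supports acoustic model customization.
import requests

STT_URL = "https://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/YOUR_INSTANCE_ID"  # placeholder
API_KEY = "YOUR_API_KEY"  # placeholder

models = requests.get(f"{STT_URL}/v1/models", auth=("apikey", API_KEY)).json()["models"]
for model in models:
    features = model["supported_features"]
    print(model["name"], features.get("custom_acoustic_model"))
```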
22 October 2021
- Defect fix: Address asynchronous HTTP failures
- Defect fix: The asynchronous HTTP interface failed to transcribe some audio. In addition, the callback for the request returned status recognitions.completed_with_results instead of recognitions.failed. This error has been resolved.
6 October 2021
- Updates to Czech and Dutch next-generation models
-
The following next-generation language models have changed as indicated:
- The Czech telephony model, cs-CZ_Telephony, is now generally available (GA). The model continues to support low latency.
- The Belgian Dutch telephony model, nl-BE_Telephony, has been updated for improved speech recognition. The model continues to support low latency.
- The Netherlands Dutch telephony model, nl-NL_Telephony, is now GA. In addition, the model now supports low latency.
For more information about all available next-generation language models, see Next-generation languages and models.
- New US HIPAA support for Premium plans in Dallas location
-
US Health Insurance Portability and Accountability Act (HIPAA) support is now available for Premium plans that are hosted in the Dallas (us-south) location. For more information, see Health Insurance Portability and Accountability Act (HIPAA).
16 September 2021
- New beta Czech and Netherlands Dutch next-generation models
-
The service now supports the following new next-generation language models. Both new models are beta functionality.
- Czech: cs-CZ_Telephony. The new model supports low latency.
- Netherlands Dutch: nl-NL_Telephony. The new model does not support low latency.
For more information about all available next-generation language models, see Next-generation languages and models.
- Updates to Korean and Brazilian Portuguese next-generation models
-
The following next-generation models have been updated:
- The Korean model ko-KR_Telephony now supports low latency.
- The Brazilian Portuguese model pt-BR_Telephony has been updated for improved speech recognition.
- Defect fix: Correct interim results and low-latency documentation
-
Defect fix: Documentation that describes the interim results and low-latency features with next-generation models has been rewritten for clarity and correctness. For more information, see the following topics:
- Defect fix: Improve speaker labels results
-
Defect fix: When you use speaker labels with next-generation models, the service now identifies the speaker for all words of the input audio, including very short words that have the same start and end timestamps.
31 August 2021
- All next-generation models are now generally available
-
All existing next-generation language models are now generally available (GA). They are supported for use in production environments and applications.
- For more information about all available next-generation language models, see Next-generation languages and models.
- For more information about the features that are supported for each next-generation model, see Supported features for next-generation models.
- Language model customization for next-generation models is now generally available
-
Language model customization is now generally available (GA) for all available next-generation languages and models. Language model customization for next-generation models is supported for use in production environments and applications.
You use the same commands to create, manage, and use custom language models, corpora, and custom words for next-generation models as you do for previous-generation models. But customization for next-generation models works differently from customization for previous-generation models. For custom models that are based on next-generation models:
- The custom models have no concept of out-of-vocabulary (OOV) words.
- Words from corpora are not added to the words resource.
- You cannot currently use the sounds-like feature for custom words.
- You do not need to upgrade custom models when base language models are updated.
- Grammars are not currently supported.
For more information about using language model customization for next-generation models, see
- Understanding customization
- Language support for customization
- Creating a custom language model
- Using a custom language model for speech recognition
- Working with corpora and custom words for next-generation models
Additional topics describe managing custom language models, corpora, and custom words. These operations are the same for custom models based on previous- and next-generation models.
16 August 2021
- New beta Indian English, Indian Hindi, Japanese, and Korean next-generation models
-
The service now supports the following new next-generation language models. All of the new models are beta functionality.
- Indian English: en-IN_Telephony. The model supports low latency.
- Indian Hindi: hi-IN_Telephony. The model supports low latency.
- Japanese: ja-JP_Multimedia. The model does not support low latency.
- Korean: ko-KR_Multimedia and ko-KR_Telephony. The models do not support low latency.
For more information about the next-generation models and low latency, see Next-generation languages and models and Low latency.
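For illustration, a minimal sketch of requesting low-latency results with one of these models follows; the service URL, API key, audio file, and model choice are placeholders.

```python
# Hypothetical sketch: request low-latency recognition with a next-generation model.
import requests

STT_URL = "https://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/YOUR_INSTANCE_ID"  # placeholder
API_KEY = "YOUR_API_KEY"  # placeholder

with open("audio.wav", "rb") as audio_file:
    response = requests.post(
        f"{STT_URL}/v1/recognize",
        params={"model": "en-IN_Telephony", "low_latency": "true"},
        headers={"Content-Type": "audio/wav"},
        data=audio_file,
        auth=("apikey", API_KEY),
    )
print(response.json())
```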
16 July 2021
- New beta French next-generation model
- The French next-generation language model fr-FR_Multimedia is now available. The new model does not support low latency. The model is beta functionality.
- Updates to beta US English next-generation model for improved speech recognition
- The next-generation US English en-US_Telephony model has been updated for improved speech recognition. The updated model continues to be beta functionality.
- Defect fix: Update documentation for hesitation markers
- Defect fix: The documentation failed to state that next-generation models do not produce hesitation markers. The documentation has been updated to note that only previous-generation models produce hesitation markers. Next-generation models include the actual hesitations in transcription results. For more information, see Speech hesitations and hesitation markers.
15 June 2021
- New beta Belgian Dutch next-generation model
-
The Belgian Dutch (Flemish) next-generation language model nl-BE_Telephony is now available. The new model supports low latency. The model is beta functionality. For more information about the next-generation models and about low latency, see Next-generation languages and models and Low latency.
- New beta low-latency support for Arabic, Canadian French, and Italian next-generation models
-
The following existing beta next-generation language models now support low latency:
- Arabic ar-MS_Telephony model
- Canadian French fr-CA_Telephony model
- Italian it-IT_Telephony model
For more information about the next-generation models and about low latency, see Next-generation languages and models and Low latency.
- Updates to beta Arabic and Brazilian Portuguese next-generation models for improved speech recognition
-
The following existing beta next-generation language models have been updated for improved speech recognition:
- Arabic ar-MS_Telephony model
- Brazilian Portuguese pt-BR_Telephony model
For more information about the next-generation models and about low latency, see Next-generation languages and models and Low latency.
26 May 2021
- New beta support for audio_metrics parameter for next-generation models
- The audio_metrics parameter is now supported as beta functionality for use with all next-generation languages and models. For more information, see Audio metrics.
- New beta support for word_confidence parameter for next-generation models
- The word_confidence parameter is now supported as beta functionality for use with all next-generation languages and models. For more information, see Word confidence.
- Defect fix: Update documentation for next-generation models
- Defect fix: The documentation has been updated to correct the following information:
- When you use a next-generation model for speech recognition, final transcription results now include the confidence field. The field was always included in final transcription results when you use a previous-generation model. This fix addresses a limitation that was reported for the 12 April 2021 release of the next-generation models.
- The documentation incorrectly stated that using the smart_formatting parameter causes the service to remove hesitation markers from final transcription results for Japanese. Smart formatting does not remove hesitation markers from final results for Japanese, only for US English. For more information, see What results does smart formatting affect?
27 April 2021
- New beta Arabic and Brazilian Portuguese next-generation models
-
The service supports two new beta next-generation models:
- The Brazilian Portuguese pt-BR_Telephony model, which supports low latency.
- The Arabic (Modern Standard) ar-MS_Telephony model, which does not support low latency.
For more information, see Next-generation languages and models.
- Updates to beta Castilian Spanish next-generation model for improved speech recognition
-
The beta next-generation Castilian Spanish es-ES_Telephony model now supports the low_latency parameter. For more information, see Low latency.
- New beta support for speaker labels with next-generation models
-
The speaker_labels parameter is now supported as beta functionality for use with the following next-generation models:
- Australian English en-AU_Telephony model
- UK English en-GB_Telephony model
- US English en-US_Multimedia and en-US_Telephony models
- German de-DE_Telephony model
- Castilian Spanish es-ES_Telephony model
With the next-generation models, the speaker_labels parameter is not supported for use with the interim_results or low_latency parameters at this time. For more information, see Speaker labels.
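As an illustration (not from the release notes), a minimal Python sketch that uses the requests library might request speaker labels with a next-generation model over the synchronous HTTP interface; the URL, API key, and audio file are placeholders:
import requests

URL = "https://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/your-instance-id"

with open("audio.flac", "rb") as audio:
    response = requests.post(
        f"{URL}/v1/recognize",
        auth=("apikey", "your-api-key"),
        params={"model": "en-US_Telephony", "speaker_labels": "true"},
        headers={"Content-Type": "audio/flac"},
        data=audio,
    )

# Each element of speaker_labels pairs a word timestamp with a numeric speaker ID.
print(response.json().get("speaker_labels", []))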
- New HTTP error code for use of word_confidence with next-generation models
-
The word_confidence parameter is not supported for use with next-generation models. The service now returns the following 400 error code if you use the word_confidence parameter with a next-generation model for speech recognition:
{ "error": "word_confidence is not a supported feature for model {model}", "code": 400, "code_description": "Bad Request" }
12 April 2021
- New beta next-generation language models and low_latency parameter
-
The service now supports a growing number of next-generation language models. The next-generation multimedia and telephony models improve upon the speech recognition capabilities of the service's previous generation of broadband and narrowband models. The new models leverage deep neural networks and bidirectional analysis to achieve both higher throughput and greater transcription accuracy. At this time, the next-generation models support only a limited number of languages and speech recognition features. The supported languages, models, and features will increase with future releases. The next-generation models are beta functionality.
Many of the next-generation models also support a new low_latency parameter that lets you request faster results at the possible expense of reduced transcription quality. When low latency is enabled, the service curtails its analysis of the audio, which can reduce the accuracy of the transcription. This trade-off might be acceptable if your application requires lower response time more than it does the highest possible accuracy. The low_latency parameter is beta functionality.
The low_latency parameter impacts your use of the interim_results parameter with the WebSocket interface. Interim results are available only for those next-generation models that support low latency, and only if both the interim_results and low_latency parameters are set to true.
- For more information about the next-generation models and their capabilities, see Next-generation languages and models.
- For more information about language support for next-generation models and about which next-generation models support low latency, see Supported next-generation language models.
- For more information about feature support for next-generation models, see Supported features for next-generation models.
- For more information about the low_latency parameter, see Low latency.
- For more information about the interaction between the low_latency and interim_results parameters for next-generation models, see Requesting interim results and low latency.
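For illustration only, a minimal Python sketch (requests library; placeholder URL, API key, and audio file) shows a synchronous recognition request that enables low_latency with a next-generation telephony model:
import requests

URL = "https://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/your-instance-id"

with open("audio.wav", "rb") as audio:
    response = requests.post(
        f"{URL}/v1/recognize",
        auth=("apikey", "your-api-key"),
        params={"model": "en-US_Telephony", "low_latency": "true"},
        headers={"Content-Type": "audio/wav"},
        data=audio,
    )

# Faster results, possibly at some cost in transcription accuracy.
print(response.json()["results"][0]["alternatives"][0]["transcript"])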
17 March 2021
- Defect fix: Fix limitation for asynchronous HTTP interface
- Defect fix: The limitation that was reported with the asynchronous HTTP interface in the Dallas (us-south) location on 16 December 2020 has been addressed. Previously, a small percentage of jobs were entering infinite loops that prevented their execution. Asynchronous HTTP requests in the Dallas data center no longer experience this limitation.
2 December 2020
- Arabic model renamed to ar-MS_BroadbandModel
- The Arabic language broadband model is now named ar-MS_BroadbandModel. The former name, ar-AR_BroadbandModel, is deprecated. It will continue to function for at least one year but might be removed at a future date. You are encouraged to migrate to the new name at your earliest convenience.
2 November 2020
- Canadian French models now generally available
-
The Canadian French models, fr-CA_BroadbandModel and fr-CA_NarrowbandModel, are now generally available (GA). They were previously beta. They also now support language model and acoustic model customization.
- For more information about supported languages and models, see Previous-generation languages and models.
- For more information about language support for customization, see Language support for customization.
22 October 2020
- Australian English models now generally available
-
The Australian English models, en-AU_BroadbandModel and en-AU_NarrowbandModel, are now generally available (GA). They were previously beta. They also now support language model and acoustic model customization.
- For more information about supported languages and models, see Previous-generation languages and models.
- For more information about language support for customization, see Language support for customization.
- Updates to Brazilian Portuguese models for improved speech recognition
-
The Brazilian Portuguese models, pt-BR_BroadbandModel and pt-BR_NarrowbandModel, have been updated for improved speech recognition. By default, the service automatically uses the updated models for all speech recognition requests. If you have custom language or custom acoustic models that are based on the models, you must upgrade your existing custom models to take advantage of the updates by using the following methods:
POST /v1/customizations/{customization_id}/upgrade_model
POST /v1/acoustic_customizations/{customization_id}/upgrade_model
For more information, see Upgrading custom models.
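As an illustration (not from the release notes), a minimal Python sketch with the requests library and placeholder customization IDs, URL, and API key might issue both upgrade requests:
import requests

URL = "https://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/your-instance-id"
AUTH = ("apikey", "your-api-key")

# The upgrade runs asynchronously; check each model's status afterward to see when it completes.
requests.post(f"{URL}/v1/customizations/your-language-customization-id/upgrade_model", auth=AUTH)
requests.post(f"{URL}/v1/acoustic_customizations/your-acoustic-customization-id/upgrade_model", auth=AUTH)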
- The split_transcript_at_phrase_end parameter now generally available for all languages
-
The speech recognition parameter split_transcript_at_phrase_end is now generally available (GA) for all languages. Previously, it was generally available only for US and UK English. For more information, see Split transcript at phrase end.
7 October 2020
- Updates to Japanese broadband model for improved speech recognition
-
The ja-JP_BroadbandModel model has been updated for improved speech recognition. By default, the service automatically uses the updated model for all speech recognition requests. If you have custom language or custom acoustic models that are based on this model, you must upgrade your existing custom models to take advantage of the updates by using the following methods:
POST /v1/customizations/{customization_id}/upgrade_model
POST /v1/acoustic_customizations/{customization_id}/upgrade_model
For more information, see Upgrading custom models.
30 September 2020
- Updates to pricing plans for the service
-
The pricing plans for the service have changed:
- The service continues to offer a Lite plan that provides basic no-charge access to limited minutes of speech recognition per month.
- The service offers a new Plus plan that provides a simple tiered pricing model and access to the service's customization capabilities.
- The service offers a new Premium plan that provides significantly greater capacity and enhanced features.
The Plus plan replaces the Standard plan. The Standard plan continues to be available for purchase for a short time. It also continues to be available indefinitely to existing users of the plan with no change in their pricing. Existing users can upgrade to the Plus plan at any time.
For more information about the available pricing plans, see the following resources:
- For general information about the pricing plans and answers to common questions, see the Pricing FAQs.
- For more information about the pricing plans or to purchase a plan, see the Speech to Text service in the IBM Cloud® Catalog.
20 August 2020
- New Canadian French models
-
The service now offers beta broadband and narrowband models for Canadian French:
fr-CA_BroadbandModel
fr-CA_NarrowbandModel
The new models do not support language model or acoustic model customization, speaker labels, or smart formatting. For more information about these and all supported models, see Supported previous-generation language models.
5 August 2020
- New Australian English models
-
The service now offers beta broadband and narrowband models for Australian English:
en-AU_BroadbandModel
en-AU_NarrowbandModel
The new models do not support language model or acoustic model customization, or smart formatting. The new models do support speaker labels. For more information, see
- Updates to multiple models for improved speech recognition
-
The following models have been updated for improved speech recognition:
- French broadband model (fr-FR_BroadbandModel)
- German broadband (de-DE_BroadbandModel) and narrowband (de-DE_NarrowbandModel) models
- UK English broadband (en-GB_BroadbandModel) and narrowband (en-GB_NarrowbandModel) models
- US English short-form narrowband (en-US_ShortForm_NarrowbandModel) model
By default, the service automatically uses the updated models for all speech recognition requests. If you have custom language or custom acoustic models that are based on these models, you must upgrade your existing custom models to take advantage of the updates by using the following methods:
POST /v1/customizations/{customization_id}/upgrade_model
POST /v1/acoustic_customizations/{customization_id}/upgrade_model
For more information, see Upgrading custom models.
- Hesitation marker for German changed
-
The hesitation marker that is used for the updated German broadband and narrowband models has changed from [hesitation] to %HESITATION. For more information, see Speech hesitations and hesitation markers.
4 June 2020
- Defect fix: Improve latency for custom language models with many grammars
- Defect fix: The latency issue for custom language models that contain a large number of grammars has been resolved. When initially used for speech recognition, such custom models could take multiple seconds to load. The custom models now load much faster, greatly reducing latency when they are used for recognition.
28 April 2020
- Updates to Italian models for improved speech recognition
-
The Italian broadband (it-IT_BroadbandModel) and narrowband (it-IT_NarrowbandModel) models have been updated for improved speech recognition. By default, the service automatically uses the updated models for all speech recognition requests. If you have custom language or custom acoustic models that are based on these models, you must upgrade your existing custom models to take advantage of the updates by using the following methods:
POST /v1/customizations/{customization_id}/upgrade_model
POST /v1/acoustic_customizations/{customization_id}/upgrade_model
For more information, see Upgrading custom models.
- Dutch and Italian models now generally available
-
The Dutch and Italian language models are now generally available (GA) for speech recognition and for language model and acoustic model customization:
- Dutch broadband model (nl-NL_BroadbandModel)
- Dutch narrowband model (nl-NL_NarrowbandModel)
- Italian broadband model (it-IT_BroadbandModel)
- Italian narrowband model (it-IT_NarrowbandModel)
For more information about all available language models, see
1 April 2020
- Acoustic model customization now generally available
-
Acoustic model customization is now generally available (GA) for all supported languages. As with custom language models, IBM does not charge for creating or hosting a custom acoustic model. You are charged only for using a custom model with a speech recognition request.
Using a custom language model, a custom acoustic model, or both types of model for transcription incurs an add-on charge of $0.03 (USD) per minute. This charge is in addition to the standard usage charge of $0.02 (USD) per minute, and it applies to all languages supported by the customization interface. So the total charge for using one or more custom models for speech recognition is $0.05 (USD) per minute.
- For more information about support for individual language models, see Language support for customization.
- For more information about pricing, see the pricing page for the Speech to Text service or the Pricing FAQs.
16 March 2020
- Speaker labels now supported for German and Korean
- The service now supports speaker labels (the speaker_labels parameter) for German and Korean language models. Speaker labels identify which individuals spoke which words in a multi-participant exchange. For more information, see Speaker labels.
- Activity Tracker now supported for asynchronous HTTP interface
- The service now supports the use of Activity Tracker events for all operations of the asynchronous HTTP interface. IBM Cloud Activity Tracker records user-initiated activities that change the state of a service in IBM Cloud®. For more information, see Activity Tracker events.
24 February 2020
- Updates to multiple models for improved speech recognition
-
The following models have been updated for improved speech recognition:
- Dutch broadband model (nl-NL_BroadbandModel)
- Dutch narrowband model (nl-NL_NarrowbandModel)
- Italian broadband model (it-IT_BroadbandModel)
- Italian narrowband model (it-IT_NarrowbandModel)
- Japanese narrowband model (ja-JP_NarrowbandModel)
- US English broadband model (en-US_BroadbandModel)
By default, the service automatically uses the updated models for all speech recognition requests. If you have custom language or custom acoustic models that are based on the models, you must upgrade your existing custom models to take advantage of the updates by using the following methods:
POST /v1/customizations/{customization_id}/upgrade_model
POST /v1/acoustic_customizations/{customization_id}/upgrade_model
For more information, see Upgrading custom models.
- Language model customization now available for Dutch and Italian
-
Language model customization is now supported for Dutch and Italian with the new versions of the following models:
- Dutch broadband model (nl-NL_BroadbandModel)
- Dutch narrowband model (nl-NL_NarrowbandModel)
- Italian broadband model (it-IT_BroadbandModel)
- Italian narrowband model (it-IT_NarrowbandModel)
For more information, see
- Parsing of Dutch, English, French, German, Italian, Portuguese, and Spanish
- Guidelines for Dutch, French, German, Italian, Portuguese, and Spanish
Because the Dutch and Italian models are beta, their support for language model customization is also beta.
- Japanese narrowband model now includes some multigram word units
-
The Japanese narrowband model (ja-JP_NarrowbandModel) now includes some multigram word units for digits and decimal fractions. The service returns these multigram units regardless of whether you enable smart formatting. The smart formatting feature understands and returns the multigram units that the model generates. If you apply your own post-processing to transcription results, you need to handle these units appropriately. For more information, see Japanese in the smart formatting documentation.
- New speech activity detection and background audio suppression parameters for speech recognition
-
The service now offers two new optional parameters for controlling the level of speech activity detection. The parameters can help ensure that only relevant audio is processed for speech recognition.
- The speech_detector_sensitivity parameter adjusts the sensitivity of speech activity detection. You can use the parameter to suppress word insertions from music, coughing, and other non-speech events.
- The background_audio_suppression parameter suppresses background audio based on its volume to prevent it from being transcribed or otherwise interfering with speech recognition. You can use the parameter to suppress side conversations or background noise.
You can use the parameters individually or together. They are available for all interfaces and for most language models. For more information about the parameters, their allowable values, and their effect on the quality and latency of speech recognition, see Speech activity detection.
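As an illustration only (the values, URL, API key, and audio file are placeholders, not service defaults), a minimal Python sketch with the requests library might set both parameters on a synchronous recognition request:
import requests

URL = "https://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/your-instance-id"

with open("audio.mp3", "rb") as audio:
    response = requests.post(
        f"{URL}/v1/recognize",
        auth=("apikey", "your-api-key"),
        params={
            "model": "en-US_NarrowbandModel",
            "speech_detector_sensitivity": 0.4,   # lower values suppress more insertions from non-speech events
            "background_audio_suppression": 0.5,  # higher values suppress more background audio
        },
        headers={"Content-Type": "audio/mp3"},
        data=audio,
    )
print(response.json())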
- Activity Tracker now supported for customization interfaces
-
The service now supports the use of Activity Tracker events for all customization operations. IBM Cloud Activity Tracker records user-initiated activities that change the state of a service in IBM Cloud. You can use this service to investigate abnormal activity and critical actions and to comply with regulatory audit requirements. In addition, you can be alerted about actions as they happen. For more information, see Activity Tracker events.
- Defect fix: Correct generation of processing metrics with WebSocket interface
-
Defect fix: The WebSocket interface now works seamlessly when generating processing metrics. Previously, processing metrics could continue to be delivered after the client sent a stop message to the service.
18 December 2019
- New beta Italian models available
-
The service now offers beta broadband and narrowband models for the Italian language:
it-IT_BroadbandModel
it-IT_NarrowbandModel
These language models support acoustic model customization. They do not support language model customization. Because they are beta, these models might not be ready for production use and are subject to change. They are initial offerings that are expected to improve in quality with time and usage.
For more information, see the following sections:
- New end_of_phrase_silence_time parameter for speech recognition
-
For speech recognition, the service now supports the end_of_phrase_silence_time parameter. The parameter specifies the duration of the pause interval at which the service splits a transcript into multiple final results. Each final result indicates a pause or extended silence that exceeds the pause interval. For most languages, the default pause interval is 0.8 seconds; for Chinese the default interval is 0.6 seconds.
You can use the parameter to effect a trade-off between how often a final result is produced and the accuracy of the transcription. Increase the interval when accuracy is more important than latency. Decrease the interval when the speaker is expected to say short phrases or single words.
For more information, see End of phrase silence time.
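For illustration only, a minimal Python sketch (requests library; placeholder URL, API key, and audio file) might lengthen the pause interval to 1.5 seconds so that short pauses do not split the transcript into separate final results:
import requests

URL = "https://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/your-instance-id"

with open("audio.flac", "rb") as audio:
    response = requests.post(
        f"{URL}/v1/recognize",
        auth=("apikey", "your-api-key"),
        params={"model": "en-US_BroadbandModel", "end_of_phrase_silence_time": 1.5},
        headers={"Content-Type": "audio/flac"},
        data=audio,
    )

# A longer pause interval generally produces fewer, longer final results.
print(len(response.json()["results"]))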
- New split_transcript_at_phrase_end parameter for speech recognition
-
For speech recognition, the service now supports the split_transcript_at_phrase_end parameter. The parameter directs the service to split the transcript into multiple final results based on semantic features of the input, such as at the conclusion of sentences. The service bases its understanding of semantic features on the base language model that you use with a request. Custom language models and grammars can also influence how and where the service splits a transcript.
The parameter causes the service to add an end_of_utterance field to each final result to indicate the motivation for the split: full_stop, silence, end_of_data, or reset.
For more information, see Split transcript at phrase end.
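As an illustration (not from the release notes), a minimal Python sketch with the requests library and placeholder URL, API key, and audio file might enable the parameter and read the end_of_utterance field of each final result:
import requests

URL = "https://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/your-instance-id"

with open("audio.flac", "rb") as audio:
    response = requests.post(
        f"{URL}/v1/recognize",
        auth=("apikey", "your-api-key"),
        params={"model": "en-US_BroadbandModel", "split_transcript_at_phrase_end": "true"},
        headers={"Content-Type": "audio/flac"},
        data=audio,
    )

for result in response.json()["results"]:
    # end_of_utterance is one of full_stop, silence, end_of_data, or reset.
    print(result.get("end_of_utterance"), result["alternatives"][0]["transcript"])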
12 December 2019
- Full support for IBM Cloud IAM
-
The Speech to Text service now supports the full implementation of IBM Cloud Identity and Access Management (IAM). API keys for IBM Watson® services are no longer limited to a single service instance. You can create access policies and API keys that apply to more than one service, and you can grant access between services. For more information about IAM, see Authenticating to Watson services.
To support this change, the API service endpoints use a different domain and include the service instance ID. The pattern is api.{location}.speech-to-text.watson.cloud.ibm.com/instances/{instance_id}.
-
Example HTTP URL for an instance hosted in the Dallas location:
https://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/6bbda3b3-d572-45e1-8c54-22d6ed9e52c2
-
Example WebSocket URL for an instance hosted in the Dallas location:
wss://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/6bbda3b3-d572-45e1-8c54-22d6ed9e52c2
For more information about the URLs, see the API & SDK reference.
These URLs do not constitute a breaking change. The new URLs work for both your existing service instances and for new instances. The original URLs continue to work on your existing service instances for at least one year, until December 2020.
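For illustration only, a minimal Python sketch (requests library; placeholder instance ID and API key) might call one of the new-style URLs with IAM API-key authentication passed as HTTP basic auth:
import requests

URL = "https://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/your-instance-id"

# An IBM Cloud API key can be passed with HTTP basic authentication as user 'apikey'.
response = requests.get(f"{URL}/v1/models", auth=("apikey", "your-api-key"))
print([model["name"] for model in response.json()["models"]])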
- New network and data security features available
-
Support for the following new network and data security features is now available:
-
Support for private network endpoints
Users of Premium plans can create private network endpoints to connect to the Speech to Text service over a private network. Connections to private network endpoints do not require public internet access. For more information, see Public and private network endpoints.
-
Support for data encryption with customer-managed keys
Users of new Premium and Dedicated instances can integrate IBM® Key Protect for IBM Cloud® with the Speech to Text service to encrypt your data and manage encryption keys. For more information, see Protecting sensitive information in your Watson service.
10 December 2019
- New beta Netherlands Dutch models available
-
The service now offers beta broadband and narrowband models for the Netherlands Dutch language:
nl-NL_BroadbandModel
nl-NL_NarrowbandModel
These language models support acoustic model customization. They do not support language model customization. Because they are beta, these models might not be ready for production use and are subject to change. They are initial offerings that are expected to improve in quality with time and usage.
For more information, see the following sections:
25 November 2019
- Updates to speaker labels for improved identification of individual speakers
- Speaker labels are updated to improve the identification of individual speakers for further analysis of your audio sample. For more information about the speaker labels feature, see Speaker labels. For more information about the improvements to the feature, see IBM Research AI Advances Speaker Diarization in Real Use Cases.
12 November 2019
- New Seoul location now available
- The Speech to Text service is now available in the IBM Cloud Seoul location (kr-seo). As with other locations, the Seoul location uses token-based IAM authentication. All new service instances that you create in this location use IAM authentication.
1 November 2019
- New limits on maximum number of custom models
- You can create no more than 1024 custom language models and no more than 1024 custom acoustic models per owning credentials. For more information, see Maximum number of custom models.
1 October 2019
- New US HIPAA support for Premium plans in Washington, DC, location
- US HIPAA support is available for Premium plans that are hosted in the Washington, DC (us-east), location and are created on or after 1 April 2019. For more information, see US Health Insurance Portability and Accountability Act (HIPAA).
22 August 2019
- Defect fix: Multiple small improvements
- The service was updated for small defect fixes and improvements.
30 July 2019
- New models for Spanish dialects now available
-
The service now offers broadband and narrowband language models in six Spanish dialects:
- Argentinian Spanish (es-AR_BroadbandModel and es-AR_NarrowbandModel)
- Castilian Spanish (es-ES_BroadbandModel and es-ES_NarrowbandModel)
- Chilean Spanish (es-CL_BroadbandModel and es-CL_NarrowbandModel)
- Colombian Spanish (es-CO_BroadbandModel and es-CO_NarrowbandModel)
- Mexican Spanish (es-MX_BroadbandModel and es-MX_NarrowbandModel)
- Peruvian Spanish (es-PE_BroadbandModel and es-PE_NarrowbandModel)
The Castilian Spanish models are not new. They are generally available (GA) for speech recognition and language model customization, and beta for acoustic model customization.
The other five dialects are new and are beta for all uses. Because they are beta, these additional dialects might not be ready for production use and are subject to change. They are initial offerings that are expected to improve in quality with time and usage.
For more information, see the following sections:
24 June 2019
- Updates to Brazilian Portuguese and US English models for improved speech recognition
-
The following narrowband models have been updated for improved speech recognition:
- Brazilian Portuguese narrowband model (pt-BR_NarrowbandModel)
- US English narrowband model (en-US_NarrowbandModel)
By default, the service automatically uses the updated models for all speech recognition requests. If you have custom language or custom acoustic models that are based on the models, you must upgrade your existing custom models to take advantage of the updates by using the following methods:
POST /v1/customizations/{customization_id}/upgrade_model
POST /v1/acoustic_customizations/{customization_id}/upgrade_model
For more information, see Upgrading custom models.
- New support for concurrent requests to update different custom acoustic models
-
The service now allows you to submit multiple simultaneous requests to add different audio resources to a custom acoustic model. Previously, the service allowed only one request at a time to add audio to a custom model.
- New updated field for methods that list custom models
-
The output of the HTTP GET methods that list information about custom language and custom acoustic models now includes an updated field. The field indicates the date and time in Coordinated Universal Time (UTC) at which the custom model was last modified.
- Change to schema for warnings associated with custom model training
-
The schema changed for a warning that is generated by a custom model training request when the strict parameter is set to false. The names of the fields changed from warning_id and description to code and message, respectively. For more information, see the API & SDK reference.
10 June 2019
- Processing metrics not available with synchronous HTTP interface
- Processing metrics are available only with the WebSocket and asynchronous HTTP interfaces. They are not supported with the synchronous HTTP interface. For more information, see Processing metrics.
17 May 2019
- New processing metrics and audio metrics features for speech recognition
-
The service now offers two types of optional metrics with speech recognition requests:
- Processing metrics provide detailed timing information about the service's analysis of the input audio. The service returns the metrics at specified intervals and with transcription events, such as interim and final results. Use the metrics to gauge the service's progress in transcribing the audio.
- Audio metrics provide detailed information about the signal characteristics of the input audio. The results provide aggregated metrics for the entire input audio at the conclusion of speech processing. Use the metrics to determine the characteristics and quality of the audio.
You can request both types of metrics with any speech recognition request. By default, the service returns no metrics for a request.
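As an illustration (not from the release notes), a minimal Python sketch with the requests library and placeholder URL, API key, and audio file might request both types of metrics on an asynchronous recognition job:
import requests

URL = "https://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/your-instance-id"

with open("audio.flac", "rb") as audio:
    job = requests.post(
        f"{URL}/v1/recognitions",
        auth=("apikey", "your-api-key"),
        params={
            "model": "en-US_BroadbandModel",
            "processing_metrics": "true",
            "processing_metrics_interval": 1.0,  # timing information for each second of audio
            "audio_metrics": "true",
        },
        headers={"Content-Type": "audio/flac"},
        data=audio,
    ).json()

# Poll GET /v1/recognitions/{id} until the job status is 'completed', then read the metrics from the results.
print(job["id"], job["status"])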
- Updates to Japanese broadband model for improved speech recognition
-
The Japanese broadband model (ja-JP_BroadbandModel) has been updated for improved speech recognition. By default, the service automatically uses the updated model for all speech recognition requests. If you have custom language or custom acoustic models that are based on the model, you must upgrade your existing custom models to take advantage of the updates by using the following methods:
POST /v1/customizations/{customization_id}/upgrade_model
POST /v1/acoustic_customizations/{customization_id}/upgrade_model
For more information, see Upgrading custom models.
10 May 2019
- Updates to Spanish models for improved speech recognition
-
The Spanish language models have been updated for improved speech recognition:
es-ES_BroadbandModel
es-ES_NarrowbandModel
By default, the service automatically uses the updated models for all speech recognition requests. If you have custom language or custom acoustic models that are based on the models, you must upgrade your existing custom models to take advantage of the updates by using the following methods:
POST /v1/customizations/{customization_id}/upgrade_model
POST /v1/acoustic_customizations/{customization_id}/upgrade_model
For more information, see Upgrading custom models.
19 April 2019
- New strict parameter for custom model training now available
- The training methods of the customization interface now include a strict query parameter that indicates whether training is to proceed if a custom model contains a mix of valid and invalid resources. By default, training fails if a custom model contains one or more invalid resources. Set the parameter to false to allow training to proceed as long as the model contains at least one valid resource. The service excludes invalid resources from the training.
- For more information about using the strict parameter with the POST /v1/customizations/{customization_id}/train method, see Train the custom language model and Training failures.
- For more information about using the strict parameter with the POST /v1/acoustic_customizations/{customization_id}/train method, see Train the custom acoustic model and Training failures.
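For illustration only, a minimal Python sketch (requests library; placeholder customization ID, URL, and API key) might start training with strict set to false:
import requests

URL = "https://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/your-instance-id"

response = requests.post(
    f"{URL}/v1/customizations/your-customization-id/train",
    auth=("apikey", "your-api-key"),
    params={"strict": "false"},  # proceed as long as at least one resource is valid
)
# Any warnings about excluded invalid resources may be reported in the response body.
print(response.status_code, response.json())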
- New limits on maximum number of out-of-vocabulary words for custom language models
- You can now add a maximum of 90 thousand out-of-vocabulary (OOV) words to the words resource of a custom language model. The previous maximum was 30 thousand OOV words. This figure includes OOV words from all sources (corpora, grammars, and individual custom words that you add directly). You can add a maximum of 10 million total words to a custom model from all sources. For more information, see How much data do I need?.
3 April 2019
- New limits on maximum amount of audio for custom acoustic models
- Custom acoustic models now accept a maximum of 200 hours of audio. The previous maximum limit was 100 hours of audio.
21 March 2019
- Visibility of service credentials now restricted by role
-
Users can now see only service credential information that is associated with the role that has been assigned to their IBM Cloud account. For example, if you are assigned a reader role, any writer or higher levels of service credentials are no longer visible.
This change does not affect API access for users or applications with existing service credentials. The change affects only the viewing of credentials within IBM Cloud.
15 March 2019
- New support for A-law audio format
- The service now supports audio in the A-law (audio/alaw) format. For more information, see audio/alaw format.
11 March 2019
- Change to passing value of 0 for max_alternatives parameter
- For the max_alternatives parameter, the service again accepts a value of 0. If you specify 0, the service automatically uses the default value, 1. A change made for the March 4 service update caused a value of 0 to return an error. (The service returns an error if you specify a negative value.)
- Change to passing value of 0 for word_alternatives_threshold parameter
- For the word_alternatives_threshold parameter, the service again accepts a value of 0. A change made for the March 4 service update caused a value of 0 to return an error. (The service returns an error if you specify a negative value.)
- The service now returns all confidence scores with a maximum precision of two decimal places. This change includes confidence scores for transcripts, word confidence, word alternatives, keyword results, and speaker labels.
4 March 2019
- Updates to Brazilian Portuguese, French, and Spanish narrowband models for improved speech recognition
-
The following narrowband language models have been updated for improved speech recognition:
- Brazilian Portuguese narrowband model (pt-BR_NarrowbandModel)
- French narrowband model (fr-FR_NarrowbandModel)
- Spanish narrowband model (es-ES_NarrowbandModel)
By default, the service automatically uses the updated models for all speech recognition requests. If you have custom language or custom acoustic models that are based on the models, you must upgrade your existing custom models to take advantage of the updates by using the following methods:
POST /v1/customizations/{customization_id}/upgrade_model
POST /v1/acoustic_customizations/{customization_id}/upgrade_model
For more information, see Upgrading custom models.
28 January 2019
- New support for IBM Cloud IAM by WebSocket interface
-
The WebSocket interface now supports token-based Identity and Access Management (IAM) authentication from browser-based JavaScript code. The limitation to the contrary has been removed. To establish an authenticated connection with the WebSocket /v1/recognize method:
- If you use IAM authentication, include the access_token query parameter.
- If you use Cloud Foundry service credentials, include the watson-token query parameter.
For more information, see Open a connection.
20 December 2018
- New beta grammars feature for custom language models now available
-
The service now supports grammars for speech recognition. Grammars are available as beta functionality for all languages that support language model customization. You can add grammars to a custom language model and use them to restrict the set of phrases that the service can recognize from audio. You can define a grammar in Augmented Backus-Naur Form (ABNF) or XML Form.
The following four methods are available for working with grammars:
POST /v1/customizations/{customization_id}/grammars/{grammar_name} adds a grammar file to a custom language model.
GET /v1/customizations/{customization_id}/grammars lists information about all grammars for a custom model.
GET /v1/customizations/{customization_id}/grammars/{grammar_name} returns information about a specified grammar for a custom model.
DELETE /v1/customizations/{customization_id}/grammars/{grammar_name} removes an existing grammar from a custom model.
You can use a grammar for speech recognition with the WebSocket and HTTP interfaces. Use the language_customization_id and grammar_name parameters to identify the custom model and the grammar that you want to use. Currently, you can use only a single grammar with a speech recognition request.
For more information about grammars, see the following documentation:
- Using grammars with custom language models
- Understanding grammars
- Adding a grammar to a custom language model
- Using a grammar for speech recognition
- Managing grammars
- Example grammars
For information about all methods of the interface, see the API & SDK reference.
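As an illustration (not from the release notes), a minimal Python sketch with the requests library might add a small ABNF grammar to a custom model, train the model, and then reference the grammar on a recognition request; the customization ID, grammar name, URL, API key, audio file, and the application/srgs content type for ABNF are placeholders or assumptions:
import requests

URL = "https://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/your-instance-id"
AUTH = ("apikey", "your-api-key")
CUSTOM_ID = "your-customization-id"

# A small ABNF grammar that restricts recognition to 'yes' or 'no'.
grammar = b"#ABNF 1.0 ISO-8859-1;\nlanguage en-US;\nmode voice;\nroot $yesno;\n$yesno = yes | no;\n"

requests.post(
    f"{URL}/v1/customizations/{CUSTOM_ID}/grammars/yesno-grammar",
    auth=AUTH,
    headers={"Content-Type": "application/srgs"},  # assumed content type for ABNF; XML Form would differ
    data=grammar,
)
requests.post(f"{URL}/v1/customizations/{CUSTOM_ID}/train", auth=AUTH)

# After training completes, reference the custom model and grammar on a recognition request.
with open("audio.wav", "rb") as audio:
    requests.post(
        f"{URL}/v1/recognize",
        auth=AUTH,
        params={"language_customization_id": CUSTOM_ID, "grammar_name": "yesno-grammar"},
        headers={"Content-Type": "audio/wav"},
        data=audio,
    )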
- New numeric redaction feature for US English, Japanese, and Korean now available
-
A new numeric redaction feature is now available to mask numbers that have three or more consecutive digits. Redaction is intended to remove sensitive personal information, such as credit card numbers, from transcripts. You enable the feature by setting the redaction parameter to true on a recognition request. The feature is beta functionality that is available for US English, Japanese, and Korean only. For more information, see Numeric redaction.
- New French and German narrowband models now available
-
The following new German and French language models are now available with the service:
- French narrowband model (fr-FR_NarrowbandModel)
- German narrowband model (de-DE_NarrowbandModel)
Both new models support language model customization (GA) and acoustic model customization (beta). For more information, see Language support for customization.
- New US English en-US_ShortForm_NarrowbandModel now available
-
A new US English language model, en-US_ShortForm_NarrowbandModel, is now available. The new model is intended for use in Interactive Voice Response and Automated Customer Support solutions. The model supports language model customization (GA) and acoustic model customization (beta). For more information, see The US English short-form model.
- Updates to UK English and Spanish narrowband models for improved speech recognition
-
The following language models have been updated for improved speech recognition:
- UK English narrowband model (en-GB_NarrowbandModel)
- Spanish narrowband model (es-ES_NarrowbandModel)
By default, the service automatically uses the updated models for all speech recognition requests. If you have custom language or custom acoustic models that are based on the models, you must upgrade your existing custom models to take advantage of the updates by using the following methods:
POST /v1/customizations/{customization_id}/upgrade_model
POST /v1/acoustic_customizations/{customization_id}/upgrade_model
For more information, see Upgrading custom models.
- New support for G.729 audio format
-
The service now supports audio in the G.729 (audio/g729) format. The service supports only G.729 Annex D for narrowband audio. For more information, see audio/g729 format.
- Speaker labels feature now available for UK English narrowband model
-
The speaker labels feature is now available for the narrowband model for UK English (en-GB_NarrowbandModel). The feature is beta functionality for all supported languages. For more information, see Speaker labels.
- New limits on maximum amount of audio for custom acoustic models
-
The maximum amount of audio that you can add to a custom acoustic model has increased from 50 hours to 100 hours.
13 December 2018
- New London location now available
- The Speech to Text service is now available in the IBM Cloud London location (eu-gb). Like all locations, London uses token-based IAM authentication. All new service instances that you create in this location use IAM authentication.
12 November 2018
- New support for smart formatting for Japanese speech recognition
- The service now supports smart formatting for Japanese speech recognition. Previously, the service supported smart formatting for US English and Spanish only. The feature is beta functionality for all supported languages. For more information, see Smart formatting.
7 November 2018
- New Tokyo location now available
- The Speech to Text service is now available in the IBM Cloud Tokyo location (jp-tok). Like all locations, Tokyo uses token-based IAM authentication. All new service instances that you create in this location use IAM authentication.
30 October 2018
- New support for token-based IBM Cloud IAM
-
The Speech to Text service has migrated to token-based IAM authentication for all locations. All IBM Cloud services now use IAM authentication. The Speech to Text service migrated in each location on the following dates:
- Dallas (us-south): October 30, 2018
- Frankfurt (eu-de): October 30, 2018
- Washington, DC (us-east): June 12, 2018
- Sydney (au-syd): May 15, 2018
The migration to IAM authentication affects new and existing service instances differently:
- All new service instances that you create in any location now use IAM authentication to access the service. You can pass either a bearer token or an API key: Tokens support authenticated requests without embedding service credentials in every call; API keys use HTTP basic authentication. When you use any of the Watson SDKs, you can pass the API key and let the SDK manage the lifecycle of the tokens.
- Existing service instances that you created in a location before the indicated migration date continue to use the {username} and {password} from their previous Cloud Foundry service credentials for authentication until you migrate them to use IAM authentication.
For more information, see the following documentation:
- To learn which authentication mechanism your service instance uses, view your service credentials by clicking the instance on the IBM Cloud dashboard.
- For more information about using IAM tokens with Watson services, see Authenticating to Watson services.
- For examples that use IAM authentication, see the API & SDK reference.
9 October 2018
- Important updates to pricing charges for speech recognition requests
-
As of 1 October 2018, you are now charged for all audio that you pass to the service for speech recognition. The first one thousand minutes of audio that you send each month are no longer free. For more information about the pricing plans for the service, see the Speech to Text service in the IBM Cloud Catalog.
- The Content-Type header now optional for most speech recognition requests
-
The Content-Type header is now optional for most speech recognition requests. The service now automatically detects the audio format (MIME type) of most audio. You must continue to specify the content type for the following formats:
audio/basic
audio/l16
audio/mulaw
Where indicated, the content type that you specify for these formats must include the sampling rate and can optionally include the number of channels and the endianness of the audio. For all other audio formats, you can omit the content type or specify a content type of application/octet-stream to have the service auto-detect the format.
When you use the curl command to make a speech recognition request with the HTTP interface, you must specify the audio format with the Content-Type header, specify "Content-Type: application/octet-stream", or specify "Content-Type:". If you omit the header entirely, curl uses a default value of application/x-www-form-urlencoded. Most of the examples in this documentation continue to specify the format for speech recognition requests regardless of whether it's required.
This change applies to the following methods:
- /v1/recognize for WebSocket requests. The content-type field of the text message that you send to initiate a request over an open WebSocket connection is now optional.
- POST /v1/recognize for synchronous HTTP requests. The Content-Type header is now optional. (For multipart requests, the part_content_type field of the JSON metadata is also now optional.)
- POST /v1/recognitions for asynchronous HTTP requests. The Content-Type header is now optional.
For more information, see Audio formats.
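For illustration only, a minimal Python sketch (requests library; placeholder URL, API key, and audio file) might rely on format auto-detection by sending application/octet-stream; audio/basic, audio/l16, and audio/mulaw would still require an explicit content type:
import requests

URL = "https://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/your-instance-id"

with open("audio.flac", "rb") as audio:
    response = requests.post(
        f"{URL}/v1/recognize",
        auth=("apikey", "your-api-key"),
        headers={"Content-Type": "application/octet-stream"},  # let the service detect the audio format
        data=audio,
    )
print(response.json())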
- Updates to Brazilian Portuguese broadband model for improved speech recognition
-
The Brazilian Portuguese broadband model, pt-BR_BroadbandModel, was updated for improved speech recognition. By default, the service automatically uses the updated model for all recognition requests. If you have custom language or custom acoustic models that are based on this model, you must upgrade your existing custom models to take advantage of the updates by using the following methods:
POST /v1/customizations/{customization_id}/upgrade_model
POST /v1/acoustic_customizations/{customization_id}/upgrade_model
For more information, see Upgrading custom models.
- The customization_id parameter renamed to language_customization_id
-
The customization_id parameter of the speech recognition methods is deprecated and will be removed in a future release. To specify a custom language model for a speech recognition request, use the language_customization_id parameter instead. This change applies to the following methods:
- /v1/recognize for WebSocket requests
- POST /v1/recognize for synchronous HTTP requests (including multipart requests)
- POST /v1/recognitions for asynchronous HTTP requests
10 September 2018
- New German broadband model
-
The service now supports a German broadband model, de-DE_BroadbandModel. The new German model supports language model customization (generally available) and acoustic model customization (beta).
- For information about how the service parses corpora for German, see Parsing of Dutch, English, French, German, Italian, Portuguese, and Spanish.
- For more information about creating sounds-like pronunciations for custom words in German, see Guidelines for Dutch, French, German, Italian, Portuguese, and Spanish.
- Language model customization now available for Brazilian Portuguese
-
The existing Brazilian Portuguese models, pt-BR_BroadbandModel and pt-BR_NarrowbandModel, now support language model customization (generally available). The models were not updated to enable this support, so no upgrade of existing custom acoustic models is required.
- For information about how the service parses corpora for Brazilian Portuguese, see Parsing of Dutch, English, French, German, Italian, Portuguese, and Spanish.
- For more information about creating sounds-like pronunciations for custom words in Brazilian Portuguese, see Guidelines for Dutch, French, German, Italian, Portuguese, and Spanish.
- Updates to US English and Japanese models for improved speech recognition
-
New versions of the US English and Japanese broadband and narrowband models are available:
- US English broadband model (en-US_BroadbandModel)
- US English narrowband model (en-US_NarrowbandModel)
- Japanese broadband model (ja-JP_BroadbandModel)
- Japanese narrowband model (ja-JP_NarrowbandModel)
The new models offer improved speech recognition. By default, the service automatically uses the updated models for all recognition requests. If you have custom language or custom acoustic models that are based on these models, you must upgrade your existing custom models to take advantage of the updates by using the following methods:
POST /v1/customizations/{customization_id}/upgrade_model
POST /v1/acoustic_customizations/{customization_id}/upgrade_model
For more information, see Upgrading custom models.
- Keyword spotting and word alternatives features now generally available
-
The keyword spotting and word alternatives features are now generally available (GA) rather than beta functionality for all languages. For more information, see
- Defect fix: Improve documentation for customization interface
-
Defect fix: The following known issues that were associated with the customization interface have been resolved and are fixed in production. The following information is preserved for users who may have encountered the problems in the past.
-
If you add data to a custom language or custom acoustic model, you must retrain the model before using it for speech recognition. The problem shows up in the following scenario:
-
The user creates a new custom model (language or acoustic) and trains the model.
-
The user adds additional resources (words, corpora, or audio) to the custom model but does not retrain the model.
-
The user cannot use the custom model for speech recognition. The service returns an error of the following form when used with a speech recognition request:
{ "code_description": "Bad Request", "code": 400, "error": "Requested custom language model is not available. Please make sure the custom model is trained." }
To work around this issue, the user must retrain the custom model on its latest data. The user can then use the custom model with speech recognition.
-
-
Before training an existing custom language or custom acoustic model, you must upgrade it to the latest version of its base model. The problem shows up in the following scenario:
- The user has an existing custom model (language or acoustic) that is based on a model that has been updated.
- The user trains the existing custom model against the old version of the base model without first upgrading to the latest version of the base model.
- The user cannot use the custom model for speech recognition.
To work around this issue, the user must use the POST /v1/customizations/{customization_id}/upgrade_model or POST /v1/acoustic_customizations/{customization_id}/upgrade_model method to upgrade the custom model to the latest version of its base model. The user can then use the custom model with speech recognition.
-
7 September 2018
- Session-based interface no longer available
-
The session-based HTTP REST interface is no longer supported. All information related to sessions is removed from the documentation. The following methods are no longer available:
POST /v1/sessions
POST /v1/sessions/{session_id}/recognize
GET /v1/sessions/{session_id}/recognize
GET /v1/sessions/{session_id}/observe_result
DELETE /v1/sessions/{session_id}
If your application uses the sessions interface, you must migrate to one of the remaining HTTP REST interfaces or to the WebSocket interface. For more information, see the service update for 8 August 2018.
8 August 2018
- Deprecation notice for session-based speech recognition interface
-
The session-based HTTP REST interface is deprecated as of August 8, 2018. All methods of the sessions API will be removed from service as of September 7, 2018, after which you will no longer be able to use the session-based interface. This notice of immediate deprecation and 30-day removal applies to the following methods:
POST /v1/sessions
POST /v1/sessions/{session_id}/recognize
GET /v1/sessions/{session_id}/recognize
GET /v1/sessions/{session_id}/observe_result
DELETE /v1/sessions/{session_id}
If your application uses the sessions interface, you must migrate to one of the following interfaces by September 7:
- For stream-based speech recognition (including live-use cases), use the WebSocket interface, which provides access to interim results and the lowest latency.
- For file-based speech recognition, use one of the following interfaces:
- For shorter files of up to a few minutes of audio, use either the synchronous HTTP interface (POST /v1/recognize) or the asynchronous HTTP interface (POST /v1/recognitions).
- For longer files of more than a few minutes of audio, use the asynchronous HTTP interface. The asynchronous HTTP interface accepts as much as 1 GB of audio data with a single request.
The WebSocket and HTTP interfaces provide the same results as the sessions interface (only the WebSocket interface provides interim results). You can also use one of the Watson SDKs, which simplify application development with any of the interfaces. For more information, see the API & SDK reference.
13 July 2018
- Updates to Spanish narrowband model for improved speech recognition
-
The Spanish narrowband model, es-ES_NarrowbandModel, was updated for improved speech recognition. By default, the service automatically uses the updated model for all recognition requests. If you have custom language or custom acoustic models that are based on this model, you must upgrade your custom models to take advantage of the updates by using the following methods:
POST /v1/customizations/{customization_id}/upgrade_model
POST /v1/acoustic_customizations/{customization_id}/upgrade_model
For more information, see Upgrading custom models.
As of this update, the following two versions of the Spanish narrowband model are available:
es_ES.8kHz.general.lm20180522235959.am20180522235959 (current)
es_ES.8kHz.general.lm20180308235959.am20180308235959 (previous)
The following version of the model is no longer available:
es_ES.8kHz.general.lm20171031235959.am20171031235959
A recognition request that attempts to use a custom model that is based on the now unavailable base model uses the latest base model without any customization. The service returns the following warning message:
Using non-customized default base model, because your custom {type} model has been built with a version of the base model that is no longer supported.
To resume using a custom model that is based on the unavailable model, you must first upgrade the model by using the appropriate upgrade_model method described previously.
12 June 2018
- New features for applications hosted in Washington, DC, location
-
The following features are enabled for applications that are hosted in Washington, DC (us-east):
- The service now supports a new API authentication process. For more information, see the 30 October 2018 service update.
- The service now supports the X-Watson-Metadata header and the DELETE /v1/user_data method. For more information, see Information security.
15 May 2018
- New features for applications hosted in Sydney location
-
The following features are enabled for applications that are hosted in Sydney (au-syd):
- The service now supports a new API authentication process. For more information, see the 30 October 2018 service update.
- The service now supports the X-Watson-Metadata header and the DELETE /v1/user_data method. For more information, see Information security.
26 March 2018
- Language model customization now available for French broadband model
-
The service now supports language model customization for the French broadband language model, fr-FR_BroadbandModel. The French model is generally available (GA) for production use with language model customization.
- For more information about how the service parses corpora for French, see Parsing of Dutch, English, French, German, Italian, Portuguese, and Spanish.
- For more information about creating sounds-like pronunciations for custom words in French, see Guidelines for Dutch, French, German, Italian, Portuguese, and Spanish.
- Updates to French, Korean, and Spanish models for improved speech recognition
-
The following models were updated for improved speech recognition:
- Korean narrowband model (ko-KR_NarrowbandModel)
- Spanish narrowband model (es-ES_NarrowbandModel)
- French broadband model (fr-FR_BroadbandModel)
By default, the service automatically uses the updated models for all recognition requests. If you have custom language or custom acoustic models that are based on any of these models, you must upgrade your custom models to take advantage of the updates by using the following methods:
POST /v1/customizations/{customization_id}/upgrade_model
POST /v1/acoustic_customizations/{customization_id}/upgrade_model
For more information, see Upgrading custom models.
- The version parameter renamed to base_model_version
-
The version parameter of the following methods is now named base_model_version:
- /v1/recognize for WebSocket requests
- POST /v1/recognize for sessionless HTTP requests
- POST /v1/sessions for session-based HTTP requests
- POST /v1/recognitions for asynchronous HTTP requests
The base_model_version parameter specifies the version of a base model that is to be used for speech recognition. For more information, see Using upgraded custom models for speech recognition and Making speech recognition requests with upgraded custom models.
- New support for smart formatting for Spanish speech recognition
-
Smart formatting is now supported for Spanish as well as US English. For US English, the feature also now converts keyword strings into punctuation symbols for periods, commas, question marks, and exclamation points. For more information, see Smart formatting.
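As an illustration (not from the release notes), a minimal Python sketch with the requests library and placeholder URL, API key, and audio file might enable smart formatting for a Spanish request:
import requests

URL = "https://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/your-instance-id"

with open("audio.wav", "rb") as audio:
    response = requests.post(
        f"{URL}/v1/recognize",
        auth=("apikey", "your-api-key"),
        params={"model": "es-ES_BroadbandModel", "smart_formatting": "true"},
        headers={"Content-Type": "audio/wav"},
        data=audio,
    )

# Dates, times, numbers, and similar strings are returned in conventional written form.
print(response.json()["results"][0]["alternatives"][0]["transcript"])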
1 March 2018
- Updates to French and Spanish broadband models for improved speech recognition
-
The French and Spanish broadband models, fr-FR_BroadbandModel and es-ES_BroadbandModel, have been updated for improved speech recognition. By default, the service automatically uses the updated models for all recognition requests. If you have custom language or custom acoustic models that are based on either of these models, you must upgrade your custom models to take advantage of the updates by using the following methods:
POST /v1/customizations/{customization_id}/upgrade_model
POST /v1/acoustic_customizations/{customization_id}/upgrade_model
For more information, see Upgrading custom models. The section presents rules for upgrading custom models, the effects of upgrading, and approaches for using upgraded models.
1 February 2018
- New Korean models
-
The service now offers language models for Korean:
ko-KR_BroadbandModel for audio that is sampled at a minimum of 16 kHz, and ko-KR_NarrowbandModel for audio that is sampled at a minimum of 8 kHz. For more information, see Previous-generation languages and models.
For language model customization, the Korean models are generally available (GA) for production use; for acoustic model customization, they are beta functionality. For more information, see Language support for customization.
- For more information about how the service parses corpora for Korean, see Parsing of Korean.
- For more information about creating sounds-like pronunciations for custom words in Korean, see Guidelines for Korean.
14 December 2017
- Language model customization now generally available
-
Language model customization and all associated parameters are now generally available (GA) for all supported languages: Japanese, Spanish, UK English, and US English.
- Beta acoustic model customization now available for all languages
-
The service now supports acoustic model customization as beta functionality for all available languages. You can create custom acoustic models for broadband or narrowband models for all languages. For an introduction to customization, including acoustic model customization, see Understanding customization.
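As a minimal sketch, a custom acoustic model can be created with a request like the following; the model name and description are hypothetical values:
curl -X POST -u "{username}:{password}" \
--header "Content-Type: application/json" \
--data '{"name": "Example acoustic model", "base_model_name": "en-US_BroadbandModel", "description": "Example custom acoustic model"}' \
"{url}/v1/acoustic_customizations"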
- New version parameter for speech recognition
-
The various methods for making recognition requests now include a new
version
parameter that you can use to initiate requests that use either the older or upgraded versions of base and custom models. Although it is intended primarily for use with custom models that have been upgraded, the version parameter can also be used without custom models. For more information, see Making speech recognition requests with upgraded custom models.
- Updates to US English models for improved speech recognition
-
The US English models, en-US_BroadbandModel and en-US_NarrowbandModel, have been updated for improved speech recognition. By default, the service automatically uses the updated models for all recognition requests. If you have custom language or custom acoustic models that are based on the US English models, you must upgrade your custom models to take advantage of the updates by using the following methods:
POST /v1/customizations/{customization_id}/upgrade_model
POST /v1/acoustic_customizations/{customization_id}/upgrade_model
For more information about the procedure, see Upgrading custom models. The section presents rules for upgrading custom models, the effects of upgrading, and approaches for using upgraded models. Currently, the methods apply only to the new US English base models. But the same information will apply to upgrades of other base models as they become available.
- Language model customization now available for UK English
-
The service now supports language model customization for the UK English models, en-GB_BroadbandModel and en-GB_NarrowbandModel. Although the service handles UK and US English corpora and custom words in a generally similar fashion, some important differences exist:
- For more information about how the service parses corpora for UK English, see Parsing of Dutch, English, French, German, Italian, Portuguese, and Spanish.
- For more information about creating sounds-like pronunciations for custom words in UK English, see Guidelines for English. Specifically, for UK English, you cannot use periods or dashes in sounds-like pronunciations.
2 October 2017
- New beta acoustic model customization interface for US English, Japanese, and Spanish
-
The customization interface now offers acoustic model customization. You can create custom acoustic models that adapt the service's base models to your environment and speakers. You populate and train a custom acoustic model on audio that more closely matches the acoustic signature of the audio that you want to transcribe. You then use the custom acoustic model with recognition requests to increase the accuracy of speech recognition.
Custom acoustic models complement custom language models. You can train a custom acoustic model with a custom language model, and you can use both types of model during speech recognition. Acoustic model customization is a beta interface that is available only for US English, Japanese, and Spanish.
- For more information about the languages that are supported by the customization interface and the level of support that is available for each language, see Language support for customization.
- For more information about the service's customization interface, see Understanding customization.
- For more information about creating a custom acoustic model, see Creating a custom acoustic model.
- For more information about using a custom acoustic model, see Using a custom acoustic model for speech recognition.
- For more information about all methods of the customization interface, see the API & SDK reference.
- New beta customization_weight parameter for custom language models
-
For language model customization, the service now includes a beta feature that sets an optional customization weight for a custom language model. A customization weight specifies the relative weight to be given to words from a custom language model versus words from the service's base vocabulary. You can set a customization weight during both training and speech recognition. For more information, see Using customization weight.
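For illustration, a sketch of a recognition request that applies a custom language model with an explicit customization weight; the weight value is arbitrary:
curl -X POST -u "{username}:{password}" \
--header "Content-Type: audio/flac" \
--data-binary @audio.flac \
"{url}/v1/recognize?customization_id={customization_id}&customization_weight=0.5"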
- Updates to Japanese broadband model for improved speech recognition
-
The ja-JP_BroadbandModel language model was upgraded to capture improvements in the base model. The upgrade does not affect existing custom models that are based on the model.
- New endianness parameter for audio/l16 audio format
-
The service now includes a parameter to specify the endianness of audio that is submitted in audio/l16 (Linear 16-bit Pulse-Code Modulation (PCM)) format. In addition to specifying rate and channels parameters with the format, you can now also specify big-endian or little-endian with the endianness parameter. For more information, see audio/l16 format.
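For example, a sketch of a request that submits raw PCM audio and declares the sampling rate, channels, and endianness in the format specification:
curl -X POST -u "{username}:{password}" \
--header "Content-Type: audio/l16;rate=16000;channels=1;endianness=little-endian" \
--data-binary @audio.raw \
"{url}/v1/recognize"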
14 July 2017
- New support for MP3 (MPEG) audio format
-
The service now supports the transcription of audio in the MP3 or Motion Picture Experts Group (MPEG) format. For more information, see audio/mp3 and audio/mpeg formats.
- Beta language model customization now available for Spanish
-
The language model customization interface now supports Spanish as beta functionality. You can create a custom model based on either of the base Spanish language models: es-ES_BroadbandModel or es-ES_NarrowbandModel. For more information, see Creating a custom language model. Pricing for recognition requests that use Spanish custom language models is the same as for requests that use US English and Japanese models.
- New dialect field for method that creates a custom language model
-
The JSON CreateLanguageModel object that you pass to the POST /v1/customizations method to create a new custom language model now includes a dialect field. The field specifies the dialect of the language that is to be used with the custom model. By default, the dialect matches the language of the base model. The parameter is meaningful only for Spanish models, for which the service can create a custom model that is suited for speech in one of the following dialects:
- es-ES for Castilian Spanish (the default)
- es-LA for Latin-American Spanish
- es-US for North-American (Mexican) Spanish
The GET /v1/customizations and GET /v1/customizations/{customization_id} methods of the customization interface include the dialect of a custom model in their output. For more information, see Creating a custom language model and Listing custom language models.
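As an illustration, a sketch of a request that creates a Spanish custom model for the North-American dialect; the model name and description are hypothetical:
curl -X POST -u "{username}:{password}" \
--header "Content-Type: application/json" \
--data '{"name": "Example Spanish model", "base_model_name": "es-ES_BroadbandModel", "dialect": "es-US", "description": "Example custom model"}' \
"{url}/v1/customizations"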
- New names for UK English models
-
The names of the language models en-UK_BroadbandModel and en-UK_NarrowbandModel have been deprecated. The models are now available with the names en-GB_BroadbandModel and en-GB_NarrowbandModel.
The deprecated en-UK_{model} names continue to function, but the GET /v1/models method no longer returns the names in the list of available models. You can still query the names directly with the GET /v1/models/{model_id} method.
1 July 2017
- Language model customization now generally available for US English and Japanese
-
The language model customization interface of the service is now generally available (GA) for both of its supported languages, US English and Japanese. IBM does not charge for creating, hosting, or managing custom language models. As described in the next bullet, IBM now charges an extra $0.03 (USD) per minute of audio for recognition requests that use custom models.
- Updates to pricing plans for the service
-
IBM updated the pricing for the service by
- Eliminating the add-on price for using narrowband models
- Providing Graduated Tiered Pricing for high-volume customers
- Charging an additional $0.03 (USD) per minute of audio for recognition requests that use US English or Japanese custom language models
For more information about the pricing updates, see
- The Speech to Text service in the IBM Cloud Catalog
- The Pricing FAQs
- Empty body no longer required for HTTP POST requests
-
You no longer need to pass an empty data object as the body for the following POST requests:
POST /v1/sessions
POST /v1/register_callback
POST /v1/customizations/{customization_id}/train
POST /v1/customizations/{customization_id}/reset
POST /v1/customizations/{customization_id}/upgrade_model
For example, you now call the POST /v1/sessions method with curl as follows:
curl -X POST -u "{username}:{password}" \
--cookie-jar cookies.txt \
"{url}/v1/sessions"
You no longer need to pass the following curl option with the request: --data "{}". If you experience any problems with one of these POST requests, try passing an empty data object with the body of the request. Passing an empty object does not change the nature or meaning of the request in any way.
22 May 2017
- The continuous parameter removed from all methods
-
The
continuous
parameter is removed from all methods that initiate recognition requests. The service now transcribes an entire audio stream until it ends or times out, whichever occurs first. This behavior is equivalent to setting the formercontinuous
parameter totrue
. By default, the service previously stopped transcription at the first half-second of non-speech (typically silence) if the parameter was omitted or set tofalse
.Existing applications that set the parameter to
true
will see no change in behavior. Applications that set the parameter tofalse
or that relied on the default behavior will likely see a change. If a request specifies the parameter, the service now responds by returning a warning message for the unknown parameter:"warnings": [ "Unknown arguments: continuous." ]
The request succeeds despite the warning, and an existing session or WebSocket connection is not affected.
IBM removed the parameter to respond to overwhelming feedback from the developer community that specifying
continuous=false
added little value and could reduce overall transcription accuracy.
- Sending audio required to avoid session timeout
-
It is no longer possible to avoid a session timeout without sending audio:
- When you use the WebSocket interface, the client can no longer keep a connection alive by sending a JSON text message with the
action
parameter set tono-op
. Sending ano-op
message does not generate an error, but it has no effect. - When you use sessions with the HTTP interface, the client can no longer extend the session by sending a
GET /v1/sessions/{session_id}/recognize
request. The method still returns the status of an active session, but it does not keep the session active.
You can now do the following to keep a session alive:
- Set the
inactivity_timeout
parameter to-1
to avoid the 30-second inactivity timeout. - Send any audio data, including just silence, to the service to avoid the 30-second session timeout. You are charged for the duration of any data that you send to the service, including the silence that you send to extend a session.
For more information, see Timeouts. Ideally, you would establish a session just before you obtain audio for transcription and maintain the session by sending audio at a rate that is close to real time. Also, make sure your application recovers gracefully from closed sessions or connections.
IBM removed this functionality to ensure that it continues to offer all users a best-in-class, low-latency speech recognition service.
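As a sketch, a WebSocket client can disable the inactivity timeout in the JSON start message for a recognition request; the content type shown is illustrative:
{"action": "start", "content-type": "audio/flac", "inactivity_timeout": -1}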
10 April 2017
- Speaker labels now supported for US English, Spanish, and Japanese
-
The service now supports the speaker labels feature for the following broadband models:
- US English broadband model (
en-US_BroadbandModel
) - Spanish broadband model (
es-ES_BroadbandModel
) - Japanese broadband model (
ja-JP_BroadbandModel
)
For more information, see Speaker labels.
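For example, a sketch of a recognition request that enables speaker labels for US English broadband audio:
curl -X POST -u "{username}:{password}" \
--header "Content-Type: audio/flac" \
--data-binary @audio.flac \
"{url}/v1/recognize?model=en-US_BroadbandModel&speaker_labels=true"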
- New support for Web Media (WebM) audio format
-
The service now supports the Web Media (WebM) audio format with the Opus or Vorbis codec. The service now also supports the Ogg audio format with the Vorbis codec in addition to the Opus codec. For more information about supported audio formats, see audio/webm format.
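For illustration, a sketch of a request that submits WebM audio that uses the Opus codec:
curl -X POST -u "{username}:{password}" \
--header "Content-Type: audio/webm;codecs=opus" \
--data-binary @audio.webm \
"{url}/v1/recognize"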
- New support for Cross-Origin Resource Sharing
-
The service now supports Cross-Origin Resource Sharing (CORS) to allow browser-based clients to call the service directly. For more information, see CORS support.
- New method to unregister a callback URL with asynchronous HTTP interface
-
The asynchronous HTTP interface now offers a
POST /v1/unregister_callback
method that removes the registration for an allowlisted callback URL. For more information, see Unregistering a callback URL.
- Defect fix: Eliminate timeouts for long audio with WebSocket interface
-
Defect fix: The WebSocket interface no longer times out for recognition requests for especially long audio files. You no longer need to request interim results with the JSON
start
message to avoid the timeout. (This issue was described in the update for 10 March 2016.)
- New HTTP error codes
-
The following language model customization methods can now return the following HTTP error codes:
- The
DELETE /v1/customizations/{customization_id}
method now returns HTTP response code 401 if you attempt to delete a nonexistent custom model. - The
DELETE /v1/customizations/{customization_id}/corpora/{corpus_name}
method now returns HTTP response code 400 if you attempt to delete a nonexistent corpus.
- The
8 March 2017
- Asynchronous HTTP interface is now generally available
- The asynchronous HTTP interface is now generally available (GA). Prior to this date, it was beta functionality.
1 December 2016
- New beta speaker labels feature
-
The service now offers a beta speaker labels feature for narrowband audio in US English, Spanish, or Japanese. The feature identifies which words were spoken by which speakers in a multi-person exchange. The sessionless, session-based, asynchronous, and WebSocket recognition methods each include a
speaker_labels
parameter that accepts a boolean value to indicate whether speaker labels are to be included in the response. For more information about the feature, see Speaker labels.
- Beta language model customization now available for Japanese
-
The beta language model customization interface is now supported for Japanese in addition to US English. All methods of the interface support Japanese. For more information, see the following sections:
- For more information, see Creating a custom language model and Using a custom language model for speech recognition.
- For general and Japanese-specific considerations for adding a corpus text file, see Preparing a corpus text file and What happens when I add a corpus file?
- For Japanese-specific considerations when specifying the
sounds_like
field for a custom word, see Guidelines for Japanese. - For more information about all methods of the customization interface, see the API & SDK reference.
- New method for listing information about a corpus
-
The language model customization interface now includes a
GET /v1/customizations/{customization_id}/corpora/{corpus_name}
method that lists information about a specified corpus. The method is useful for monitoring the status of a request to add a corpus to a custom model. For more information, see Listing corpora for a custom language model. - New
count
field for methods that list words for custom language models -
The JSON response that is returned by the
GET /v1/customizations/{customization_id}/words
andGET /v1/customizations/{customization_id}/words/{word_name}
methods now includes acount
field for each word. The field indicates the number of times the word is found across all corpora. If you add a custom word to a model before it is added by any corpora, the count begins at1
. If the word is added from a corpus first and later modified, the count reflects only the number of times it is found in corpora. For more information, see Listing custom words from a custom language model.For custom models that were created prior to the existence of the
count
field, the field always remains at0
. To update the field for such models, add the model's corpora again and include theallow_overwrite
parameter with thePOST /v1/customizations/{customization_id}/corpora/{corpus_name}
method. - New
sort
parameter for methods that list words for custom language models -
The
GET /v1/customizations/{customization_id}/words
method now includes asort
query parameter that controls the order in which the words are to be listed. The parameter accepts two arguments,alphabetical
orcount
, to indicate how the words are to be sorted. You can prepend an optional+
or-
to an argument to indicate whether the results are to be sorted in ascending or descending order. By default, the method displays the words in ascending alphabetical order. For more information, see Listing custom words from a custom language model.For custom models created prior to the introduction of the
count
field, use of thecount
argument with thesort
parameter is meaningless. Use the defaultalphabetical
argument with such models. - New
error
field format for methods that list words for custom language models -
The
error
field that can be returned as part of the JSON response from theGET /v1/customizations/{customization_id}/words
andGET /v1/customizations/{customization_id}/words/{word_name}
methods is now an array. If the service discovered one or more problems with a custom word's definition, the field lists each problem element from the definition and provides a message that describes the problem. For more information, see Listing custom words from a custom language model. - The
keywords_threshold
andword_alternatives_threshold
parameters no longer accept a null value -
The
keywords_threshold
andword_alternatives_threshold
parameters of the recognition methods no longer accept a null value. To omit keywords and word alternatives from the response, omit the parameters. A specified value must be a float.
22 September 2016
- New beta language model customization interface
- The service now offers a new beta language model customization interface for US English. You can use the interface to tailor the service's base vocabulary and language models via the creation of custom language models that include domain-specific
terminology. You can add custom words individually or have the service extract them from corpora. To use your custom models with the speech recognition methods that are offered by any of the service's interfaces, pass the
customization_id
query parameter. For more information, see Creating a custom language model.
- New support for audio/mulaw audio format
- The list of supported audio formats now includes
audio/mulaw
, which provides single-channel audio encoded using the u-law (or mu-law) data algorithm. When you use this format, you must also specify the sampling rate at which the audio is captured. For more information, see audio/mulaw format.
- New supported_features identified when listing models
- The GET /v1/models and GET /v1/models/{model_id} methods now return a supported_features field as part of their output for each language model. The additional information describes whether the model supports customization. For more information, see the API & SDK reference.
30 June 2016
- Beta asynchronous HTTP interface now supports all available languages
- The beta asynchronous HTTP interface now supports all languages that are supported by the service. The interface was previously available for US English only. For more information, see The asynchronous HTTP interface and the API & SDK reference.
23 June 2016
- New beta asynchronous HTTP interface now available
- A beta asynchronous HTTP interface is now available. The interface provides full recognition capabilities for US English transcription via non-blocking HTTP calls. You can register callback URLs and provide user-specified secret strings to achieve authentication and data integrity with digital signatures. For more information, see The asynchronous HTTP interface and the API & SDK reference.
- New beta smart_formatting parameter for speech recognition
- A beta smart formatting feature converts dates, times, series of digits and numbers, phone numbers, currency values, and Internet addresses into more conventional representations in final transcripts. You enable the feature by setting the smart_formatting parameter to true on a recognition request. The feature is beta functionality that is available for US English only. For more information, see Smart formatting.
- New French broadband model
- The list of supported models for speech recognition now includes
fr-FR_BroadbandModel
for audio in the French language that is sampled at a minimum of 16 kHz. For more information, see Previous-generation languages and models.
- New support for audio/basic audio format
- The list of supported audio formats now includes
audio/basic
. The format provides single-channel audio that is encoded by using 8-bit u-law (or mu-law) data that is sampled at 8 kHz. For more information, see audio/basic format.
- Speech recognition methods now return warnings for invalid parameters
- The various recognition methods can return a warnings response that includes messages about invalid query parameters or JSON fields that are included with a request. The format of the warnings changed. For example, "warnings": "Unknown arguments: [u'{invalid_arg_1}', u'{invalid_arg_2}']." is now "warnings": "Unknown arguments: {invalid_arg_1}, {invalid_arg_2}."
- Empty body required for HTTP POST methods that pass no data
- For HTTP POST requests that do not otherwise pass data to the service, you must include an empty request body of the form {}. With the curl command, you use the --data option to pass the empty data.
10 March 2016
- New maximum limits on audio transmitted for speech recognition
- Both forms of data transmission (one-shot delivery and streaming) now impose a size limit of 100 MB on the audio data, as does the WebSocket interface. Formerly, the one-shot approach had a maximum limit of 4 MB of data. For more information, see Audio transmission (for all interfaces) and Send audio and receive recognition results (for the WebSocket interface). The WebSocket section also discusses the maximum frame or message size of 4 MB enforced by the WebSocket interface.
- HTTP and WebSocket interfaces can now return warnings
- The JSON response for a recognition request can now include an array of warning messages for invalid query parameters or JSON fields that are included with a request. Each element of the array is a string that describes the nature of the warning
followed by an array of invalid argument strings. For example,
"warnings": [ "Unknown arguments: [u'{invalid_arg_1}', u'{invalid_arg_2}']." ]
. For more information, see the API & SDK reference.
- Beta Apple iOS SDK is deprecated
- The beta Watson Speech Software Development Kit (SDK) for the Apple® iOS operating system is deprecated. Use the Watson SDK for the Apple® iOS operating system instead. The new SDK is available from the ios-sdk repository in the
watson-developer-cloud
namespace on GitHub.
- WebSocket interface can produce delayed results
- The WebSocket interface can take minutes to produce final results for a recognition request for an especially long audio file. For the WebSocket interface, the underlying TCP connection remains idle while the service prepares the response.
Therefore, the connection can close due to a timeout. To avoid the timeout with the WebSocket interface, request interim results ("interim_results": "true") in the JSON for the start message to initiate the request. You can discard the interim results if you do not need them. This issue will be resolved in a future update.
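As a sketch, the JSON start message for such a request might look like the following; the content type is illustrative:
{"action": "start", "content-type": "audio/flac", "interim_results": true}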
19 January 2016
- New profanity filtering feature
- The service was updated to include a new profanity filtering feature on January 19, 2016. By default, the service censors profanity from its transcription results for US English audio. For more information, see Profanity filtering.
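The filtering can be turned off for an individual request; as a sketch, assuming the profanity_filter query parameter (the parameter is not named in this note):
curl -X POST -u "{username}:{password}" \
--header "Content-Type: audio/flac" \
--data-binary @audio.flac \
"{url}/v1/recognize?profanity_filter=false"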
17 December 2015
- New keyword spotting feature
- The service now offers a keyword spotting feature. You can specify an array of keyword strings that are to be matched in the input audio. You must also specify a user-defined confidence level that a word must meet to be considered a match for a keyword. For more information, see Keyword spotting. The keyword spotting feature is beta functionality.
- New word alternatives feature
- The service now offers a word alternatives feature. The feature returns alternative hypotheses for words in the input audio that meet a user-defined confidence level. For more information, see Word alternatives. The word alternatives feature is beta functionality.
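For illustration, a single sketch that exercises both beta features on a sessionless recognition request; the keyword strings and threshold values are arbitrary:
curl -X POST -u "{username}:{password}" \
--header "Content-Type: audio/flac" \
--data-binary @audio.flac \
"{url}/v1/recognize?keywords=hello,goodbye&keywords_threshold=0.5&word_alternatives_threshold=0.7"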
- New UK English and Arabic models
- The service supports more languages with its transcription models: en-UK_BroadbandModel and en-UK_NarrowbandModel for UK English, and ar-AR_BroadbandModel for Modern Standard Arabic. For more information, see Previous-generation languages and models.
- New session_closed field for session-based methods
- In the JSON responses that it returns for errors with session-based methods, the service now also includes a new session_closed field. The field is set to true if the session is closed as a result of the error. For more information about possible return codes for any method, see the API & SDK reference.
- HTTP platform timeout no longer applies
- HTTP recognition requests are no longer subject to a 10-minute platform timeout. The service now keeps the connection alive by sending a space character in the response JSON object every 20 seconds while recognition is ongoing. For more information, see Timeouts.
- Rate limiting with curl command is no longer needed
- When you use the curl command to transcribe audio with the service, you no longer need to use the --limit-rate option to transfer data at a rate no faster than 40,000 bytes per second.
- Changes to HTTP error codes
- The service no longer returns HTTP status code 490 for the session-based HTTP methods GET /v1/sessions/{session_id}/observe_result and POST /v1/sessions/{session_id}/recognize. The service now responds with HTTP status code 400 instead.
21 September 2015
- New mobile SDKs available
-
Two new beta mobile SDKs are available for the speech services. The SDKs enable mobile applications to interact with both the Speech to Text and Text to Speech services.
- The Watson Speech SDK for the Google Android™ platform supports streaming audio to the Speech to Text service in real time and receiving a transcript of the audio as you speak. The project includes an example application that
showcases interaction with both of the speech services. The SDK is available from the speech-android-sdk repository in the
watson-developer-cloud
namespace on GitHub. - The Watson Speech SDK for the Apple® iOS operating system supports streaming audio to the Speech to Text service and receiving a transcript of the audio in response. The SDK is available from the speech-ios-sdk repository in the
watson-developer-cloud
namespace on GitHub.
Both SDKs support authenticating with the speech services by using either your IBM Cloud service credentials or an authentication token. Because the SDKs are beta, they are subject to change in the future.
- The Watson Speech SDK for the Google Android™ platform supports streaming audio to the Speech to Text service in real time and receiving a transcript of the audio as you speak. The project includes an example application that
showcases interaction with both of the speech services. The SDK is available from the speech-android-sdk repository in the
- New Brazilian Portuguese and Mandarin Chinese models
-
The service supports two new languages, Brazilian Portuguese and Mandarin Chinese, with the following models:
- Brazilian Portuguese broadband model (
pt-BR_BroadbandModel
) - Brazilian Portuguese narrowband model (
pt-BR_NarrowbandModel
) - Mandarin Chinese broadband model (
zh-CN_BroadbandModel
) - Mandarin Chinese narrowband model (
zh-CN_NarrowbandModel
)
For more information, see Previous-generation languages and models.
- Brazilian Portuguese broadband model (
- New support for
audio/ogg;codecs=opus
audio format -
The HTTP
POST
requests/v1/sessions/{session_id}/recognize
and/v1/recognize
, as well as the WebSocket/v1/recognize
request, support transcription of a new media type:audio/ogg;codecs=opus
for Ogg format files that use the Opus codec. In addition, theaudio/wav
format for the methods now supports any encoding. The restriction about the use of linear PCM encoding is removed. For more information, see audio/ogg format. - New
sequence_id
parameter for long polling of sessions -
The service now supports overcoming timeouts when you transcribe long audio files with the HTTP interface. When you use sessions, you can employ a long polling pattern by specifying sequence IDs with the
GET /v1/sessions/{session_id}/observe_result
andPOST /v1/sessions/{session_id}/recognize
methods for long-running recognition tasks. By using the newsequence_id
parameter of these methods, you can request results before, during, or after you submit a recognition request. - New capitalization feature for US English transcription
-
For the US English language models,
en_US_BroadbandModel
anden_US_NarrowbandModel
, the service now correctly capitalizes many proper nouns. For example, the service would return new text that reads "Barack Obama graduated from Columbia University" instead of "barack obama graduated from columbia university". This change might be of interest to you if your application is sensitive in any way to the case of proper nouns. - New HTTP error code
-
The HTTP
DELETE /v1/sessions/{session_id}
request does not return status code 415 "Unsupported Media Type". This return code is removed from the documentation for the method.
1 July 2015
- The Speech to Text service is now generally available
-
The service moved from beta to general availability (GA) on July 1, 2015. The following differences exist between the beta and GA versions of the Speech to Text APIs. The GA release requires that users upgrade to the new version of the service.
The GA version of the HTTP API is compatible with the beta version. You need to change your existing application code only if you explicitly specified a model name. For example, the sample code available for the service from GitHub included the following line of code in the file demo.js:
model: 'WatsonModel'
This line specified the default model
WatsonModel
, for the beta version of the service. If your application also specified this model, you need to change it to use one of the new models that are supported by the GA version. For more information, see the next bullet.
- New token-based programming model
-
The service now supports a new programming model for direct interaction between a client and the service over a WebSocket connection. By using this model, a client can obtain an authentication token for communicating directly with the service. The token bypasses the need for a server-side proxy application in IBM Cloud to call the service on the client's behalf. Tokens are the preferred means for clients to interact with the service.
The service continues to support the old programming model that relied on a server-side proxy to relay audio and messages between the client and the service. But the new model is more efficient and provides higher throughput.
- New
model
parameter for speech recognition -
The
POST /v1/sessions
and POST /v1/recognize methods, along with the WebSocket /v1/recognize method, now support a model query parameter. You use the parameter to specify information about the audio:
- The language: English, Japanese, or Spanish
- The minimum sampling rate: broadband (16 kHz) or narrowband (8 kHz)
For more information, see Previous-generation languages and models.
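For example, a sketch of a request that names the Spanish broadband model for 16 kHz audio:
curl -X POST -u "{username}:{password}" \
--header "Content-Type: audio/flac" \
--data-binary @audio.flac \
"{url}/v1/recognize?model=es-ES_BroadbandModel"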
- New
inactivity_timeout
parameter for speech recognition -
The
inactivity_timeout
parameter sets the timeout value in seconds after which the service closes the connection if it detects silence (no speech) in streaming mode. By default, the service terminates the session after 30 seconds of silence. The POST /v1/recognize and WebSocket /v1/recognize methods support the parameter. For more information, see Timeouts.
- New
max_alternatives
parameter for speech recognition -
The
max_alternatives
parameter instructs the service to return the n-best alternative hypotheses for the audio transcription. The POST /v1/recognize and WebSocket /v1/recognize methods support the parameter. For more information, see Maximum alternatives.
- New
word_confidence
parameter for speech recognition -
The
word_confidence
parameter instructs the service to return a confidence score for each word of the transcription. The POST /v1/recognize and WebSocket /v1/recognize methods support the parameter. For more information, see Word confidence.
- New
timestamps
parameter for speech recognition -
The
timestamps
parameter instructs the service to return the beginning and ending time relative to the start of the audio for each word of the transcription. The POST /v1/recognize and WebSocket /v1/recognize methods support the parameter. For more information, see Word timestamps.
- Renamed sessions method for observing results
-
The
GET /v1/sessions/{session_id}/observeResult
method is now namedGET /v1/sessions/{session_id}/observe_result
. The nameobserveResult
is still supported for backward compatibility. - New support for Waveform Audio File (WAV) audio format
-
The
Content-Type
header of therecognize
methods now supportsaudio/wav
for Waveform Audio File (WAV) files, in addition toaudio/flac
andaudio/l16
. For more information, see audio/wav format. - Limits on maximum amount of audio for speech recognition
-
The service now has a limit of 100 MB of data per session in streaming mode. You can specify streaming mode by specifying the value
chunked
with the headerTransfer-Encoding
. One-shot delivery of an audio file still imposes a size limit of 4 MB on the data that is sent. For more information, see Audio transmission. - New header to opt out of contributing to service improvements
-
The
GET /v1/sessions/{session_id}/observe_result
,POST /v1/sessions/{session_id}/recognize
, andPOST /v1/recognize
methods now include the header parameterX-WDC-PL-OPT-OUT
to control whether the service uses the audio and transcription data from a request to improve future results. The WebSocket interface includes an equivalent query parameter. Specify a value of1
to prevent the service from using the audio and transcription results. The parameter applies only to the current request. The new header replaces theX-logging
header from the beta API. See Controlling request logging for Watson services. - Changes to HTTP error codes
-
The service can now respond with the following HTTP error codes:
- For the
/v1/models
,/v1/models/{model_id}
,/v1/sessions
,/v1/sessions/{session_id}
,/v1/sessions/{session_id}/observe_result
,/v1/sessions/{session_id}/recognize
, and/v1/recognize
methods, error code 415 ("Unsupported Media Type") is added. - For
POST
andGET
requests to the/v1/sessions/{session_id}/recognize
method, the following error codes are modified:- Error code 404 ("Session_id not found") has a more descriptive message (
POST
andGET
). - Error code 503 ("Session is already processing a request. Concurrent requests are not allowed on the same session. Session remains alive after this error.") has a more descriptive message (
POST
only). - For HTTP
POST
requests to the/v1/sessions
and/v1/recognize
methods, error code 503 ("Service Unavailable") can be returned. The error code can also be returned when you create a WebSocket connection with the/v1/recognize
method.
- Error code 404 ("Session_id not found") has a more descriptive message (
- For the