Creating a custom acoustic model
Acoustic model customization is available only for previous-generation models. It is not available for next-generation and large speech models.
Follow these steps to create a custom acoustic model for the IBM Watson® Speech to Text service:
- Create a custom acoustic model. You can create multiple custom models for the same or different domains or environments. The process is the same for any model that you create. Acoustic model customization is available for all languages that are supported by previous-generation models. For more information about which languages are generally available and beta, see Language support for customization.
- Add audio to the custom acoustic model. The service accepts the same audio file formats for acoustic modeling that it accepts for speech recognition. It also accepts archive files that contain multiple audio files. Archive files are the preferred means of adding audio resources. You can repeat the method to add more audio or archive files to a custom model.
- Train the custom acoustic model. Once you add audio resources to the custom model, you must train the model. Training prepares the custom acoustic model for use in speech recognition. The training time depends on the cumulative amount of audio data that the model contains.
You can specify a helper custom language model during training of your custom acoustic model. A custom language model that includes transcriptions of your audio files or OOV words from the domain of your audio files can improve the quality of the custom acoustic model. For more information, see Training a custom acoustic model with a custom language model.
- After you train your custom model, you can use it with recognition requests. If the audio passed for transcription has acoustic qualities that are similar to the audio of the custom model, the results reflect the service's enhanced understanding. You can use only one custom acoustic model at a time with a speech recognition request. For more information, see Using a custom acoustic model for speech recognition.
You can pass both a custom acoustic model and a custom language model in the same recognition request to further improve recognition accuracy. For more information, see Using custom language and custom acoustic models for speech recognition.
The steps for creating a custom acoustic model are iterative. You can add or delete audio and retrain a model as often as needed. You must retrain a model for any changes to its audio to take effect.
Create a custom acoustic model
You use the POST /v1/acoustic_customizations
method to create a new custom acoustic model. The method accepts a JSON object that defines the attributes of the new custom model as the body of the request. The new custom model is
owned by the instance of the service whose credentials are used to create it. For more information, see Ownership of custom models.
You can create a maximum of 1024 custom acoustic models per owning credentials. For more information, see Maximum number of custom models.
A new custom acoustic model has the following attributes:
name
(required string) - A user-defined name for the new custom acoustic model. Use a localized name that matches the language of the custom model and reflects the acoustic environment of the model, such as Mobile custom model or Noisy car custom model.
- Include a maximum of 256 characters in the name.
- Do not use backslashes, slashes, colons, equal signs, ampersands, or question marks in the name.
- Use a name that is unique among all custom acoustic models that you own.
base_model_name
(required string) - The name of the base language model that is to be customized by the new model. You must use the name of a model that is returned by the GET /v1/models method. The new custom model can be used only with the base model that it customizes.
description
(optional string) - A recommended description of the new custom model.
- Use a localized description that matches the language of the custom model.
- Include a maximum of 128 characters in the description.
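As a sketch, the naming and length rules above can be enforced client-side before the request is sent. The following Python helper is illustrative only; the function name and validation approach are not part of the service API:

```python
# Characters the documentation forbids in a custom model name:
# backslash, slash, colon, equal sign, ampersand, question mark.
FORBIDDEN_NAME_CHARS = set('\\/:=&?')

def create_model_body(name, base_model_name, description=None):
    """Build the JSON body for POST /v1/acoustic_customizations,
    enforcing the documented rules for name and description."""
    if not 1 <= len(name) <= 256:
        raise ValueError("name must be 1-256 characters")
    if any(c in FORBIDDEN_NAME_CHARS for c in name):
        raise ValueError("name must not contain \\ / : = & ?")
    body = {"name": name, "base_model_name": base_model_name}
    if description is not None:
        if len(description) > 128:
            raise ValueError("description must be at most 128 characters")
        body["description"] = description
    return body

body = create_model_body(
    "Example acoustic model",
    "en-US_BroadbandModel",
    "Example custom acoustic model",
)
```

The returned dictionary can then be serialized with `json.dumps` and sent as the request body.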
The following example creates a new custom acoustic model named Example acoustic model. The model is created for the base model en-US_BroadbandModel and has the description Example custom acoustic model.
The Content-Type header specifies that JSON data is being passed to the method.
IBM Cloud
curl -X POST -u "apikey:{apikey}" \
--header "Content-Type: application/json" \
--data "{\"name\": \"Example acoustic model\", \
\"base_model_name\": \"en-US_BroadbandModel\", \
\"description\": \"Example custom acoustic model\"}" \
"{url}/v1/acoustic_customizations"
IBM Cloud Pak for Data
curl -X POST \
--header "Authorization: Bearer {token}" \
--header "Content-Type: application/json" \
--data "{\"name\": \"Example acoustic model\", \
\"base_model_name\": \"en-US_BroadbandModel\", \
\"description\": \"Example custom acoustic model\"}" \
"{url}/v1/acoustic_customizations"
The example returns the customization ID of the new model. Each custom model is identified by a unique customization ID, which is a Globally Unique Identifier (GUID). You specify a custom model's GUID with the customization_id
parameter
of calls that are associated with the model.
{
"customization_id": "74f4807e-b5ff-4866-824e-6bba1a84fe96"
}
Add audio to the custom acoustic model
Once you create your custom acoustic model, the next step is to add audio resources to it. You use the POST /v1/acoustic_customizations/{customization_id}/audio/{audio_name}
method to add an audio resource to a custom model. You
can add
- An individual audio file in any format that is supported for speech recognition. For more information, see Supported audio formats.
- An archive file (a .zip or .tar.gz file) that includes multiple audio files. Gathering multiple audio files into a single archive file and loading that single file is significantly more efficient than adding audio files individually.
You pass the audio resource as the body of the request and assign the resource an audio_name. For more information, see Working with audio resources.
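When scripting these uploads, the Content-Type header must match the resource being added. A minimal sketch of choosing the header from the file extension; the extension-to-type table below covers only a few common cases and is an assumption, so see Supported audio formats for the authoritative list:

```python
# Hypothetical helper: pick the Content-Type header for an add-audio
# request from the file extension. Only a few common formats are shown;
# .tar.gz is listed first so it is matched before any shorter suffix.
CONTENT_TYPES = {
    ".tar.gz": "application/gzip",
    ".zip": "application/zip",
    ".wav": "audio/wav",
    ".flac": "audio/flac",
    ".mp3": "audio/mp3",
}

def content_type_for(path):
    lowered = path.lower()
    for ext, ctype in CONTENT_TYPES.items():
        if lowered.endswith(ext):
            return ctype
    raise ValueError(f"unrecognized audio or archive extension: {path}")
```

The returned value is what you would pass in the Content-Type header of the POST request, as the curl examples below show.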
The following examples show the addition of both audio- and archive-type resources:
- This example adds an audio-type resource to the custom acoustic model with the specified customization_id. The Content-Type header identifies the type of the audio as audio/wav. The audio file, audio1.wav, is passed as the body of the request, and the resource is given the name audio1.
IBM Cloud
curl -X POST -u "apikey:{apikey}" \
--header "Content-Type: audio/wav" \
--data-binary @audio1.wav \
"{url}/v1/acoustic_customizations/{customization_id}/audio/audio1"
IBM Cloud Pak for Data
curl -X POST \
--header "Authorization: Bearer {token}" \
--header "Content-Type: audio/wav" \
--data-binary @audio1.wav \
"{url}/v1/acoustic_customizations/{customization_id}/audio/audio1"
- This example adds an archive-type resource to the specified custom acoustic model. The Content-Type header identifies the type of the archive as application/zip. The Contained-Content-Type header indicates that all files that are contained in the archive have the format audio/l16 and are sampled at a rate of 16 kHz. The archive file, audio2.zip, is passed as the body of the request, and the resource is given the name audio2.
IBM Cloud
curl -X POST -u "apikey:{apikey}" \
--header "Content-Type: application/zip" \
--header "Contained-Content-Type: audio/l16;rate=16000" \
--data-binary @audio2.zip \
"{url}/v1/acoustic_customizations/{customization_id}/audio/audio2"
IBM Cloud Pak for Data
curl -X POST \
--header "Authorization: Bearer {token}" \
--header "Content-Type: application/zip" \
--header "Contained-Content-Type: audio/l16;rate=16000" \
--data-binary @audio2.zip \
"{url}/v1/acoustic_customizations/{customization_id}/audio/audio2"
The method also accepts an optional allow_overwrite
query parameter to overwrite an existing audio resource for a custom model. Use the parameter if you need to update an audio resource after you add it to a model.
The method is asynchronous. It can take several seconds or minutes to complete depending on the duration of the audio. For an archive file, the length of the operation depends on the duration of all audio files contained in the archive. The duration of the operation also depends on the current load on the service. For more information about checking the status of a request to add an audio resource, see Monitoring the add audio request.
You can add any number of audio resources to a custom model by calling the method once for each audio or archive file. You can make multiple requests to add different audio resources simultaneously.
You must add a minimum of 10 minutes of audio that includes speech, not silence, to a custom acoustic model before you can train it. No audio- or archive-type resource can be larger than 100 MB.
- For general guidance about adding audio to a custom acoustic model, see Guidelines for adding audio.
- For more information about the maximum amount of audio that you can add to a custom acoustic model, see Maximum hours of audio.
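Before uploading, you can estimate locally whether your files satisfy the minimum-audio and per-resource size limits. A sketch using Python's standard wave module; note that this handles uncompressed WAV files only, and the function names are illustrative:

```python
import os
import tempfile
import wave

def wav_duration_seconds(path):
    """Duration of an uncompressed WAV file, in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / float(w.getframerate())

def meets_training_limits(paths, min_seconds=600, max_bytes=100 * 1024 * 1024):
    """Check the documented limits: at least 10 minutes of audio in
    total, and no single resource larger than 100 MB. The service
    counts only audio that includes speech, which a local check like
    this cannot verify."""
    total = 0.0
    for path in paths:
        if os.path.getsize(path) > max_bytes:
            return False
        total += wav_duration_seconds(path)
    return total >= min_seconds

# Demo with a synthetic 5-second silent WAV file.
demo_path = os.path.join(tempfile.mkdtemp(), "audio1.wav")
with wave.open(demo_path, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000 * 5)

duration = wav_duration_seconds(demo_path)   # 5.0 seconds
enough = meets_training_limits([demo_path])  # False: well under 10 minutes
```

A real pre-flight check would pass the list of files you intend to add to the custom model.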
Monitoring the add audio request
The service returns a 201 response code if the audio is valid. It then asynchronously analyzes the contents of the audio file or files and automatically extracts information about the audio such as length, sampling rate, and encoding. You cannot train the custom model until the service's analysis of all audio resources for current requests completes.
To determine the status of the request, use the GET /v1/acoustic_customizations/{customization_id}/audio/{audio_name}
method to poll the status of the audio. The method accepts the customization ID of the custom model and the
name of the audio resource. Its response includes the status
of the resource, which has one of the following values:
- ok indicates that the audio is acceptable and analysis is complete.
- being_processed indicates that the service is still analyzing the audio.
- invalid indicates that the audio file is not acceptable for processing. It might have the wrong format or the wrong sampling rate, or it might not be an audio file. For an archive file, if any of the audio files that it contains are invalid, the entire archive is invalid.
The content of the response and location of the status
field depend on the type of the resource, audio or archive.
- For an audio-type resource, the status field is located in the top-level (AudioListing) object.
IBM Cloud
curl -X GET -u "apikey:{apikey}" \
"{url}/v1/acoustic_customizations/{customization_id}/audio/audio1"
IBM Cloud Pak for Data
curl -X GET \
--header "Authorization: Bearer {token}" \
"{url}/v1/acoustic_customizations/{customization_id}/audio/audio1"
The status of the audio-type resource is ok:
{
  "duration": 131,
  "name": "audio1",
  "details": {
    "codec": "pcm_s16le",
    "type": "audio",
    "frequency": 22050
  },
  "status": "ok"
}
- For an archive-type resource, the status field is located in the second-level (AudioResource) object that is nested in the container field.
IBM Cloud
curl -X GET -u "apikey:{apikey}" \
"{url}/v1/acoustic_customizations/{customization_id}/audio/audio2"
IBM Cloud Pak for Data
curl -X GET \
--header "Authorization: Bearer {token}" \
"{url}/v1/acoustic_customizations/{customization_id}/audio/audio2"
The status of the archive-type resource is ok:
{
  "container": {
    "duration": 556,
    "name": "audio2",
    "details": {
      "type": "archive",
      "compression": "zip"
    },
    "status": "ok"
  },
  . . .
}
Use a loop to check the status of the audio resource every few seconds until it becomes ok. For more information about other fields that are returned by the method, see Listing audio resources for a custom acoustic model.
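Such a polling loop can be sketched in Python. The `get_status` callable is a stand-in for whatever issues the GET request and extracts the status field; it is injected so the loop itself stays transport-agnostic:

```python
import time

def wait_for_audio_analysis(get_status, interval=5.0, max_polls=120):
    """Poll until an audio resource's status becomes 'ok'.

    get_status: zero-argument callable returning the current status
    string ('being_processed', 'ok', or 'invalid').
    """
    for _ in range(max_polls):
        status = get_status()
        if status == "ok":
            return status
        if status == "invalid":
            raise ValueError("audio resource failed analysis")
        time.sleep(interval)
    raise TimeoutError("audio analysis did not complete in time")

# Demo with a fake status sequence instead of live API calls.
fake_statuses = iter(["being_processed", "being_processed", "ok"])
result = wait_for_audio_analysis(lambda: next(fake_statuses), interval=0.0)
```

In practice `get_status` would call GET /v1/acoustic_customizations/{customization_id}/audio/{audio_name} and read the status field at the location described above for the resource type.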
Train the custom acoustic model
Once you populate a custom acoustic model with audio resources, you must train the model on the new data. Training prepares a custom model for use in speech recognition. A model cannot be used for recognition requests until you train it on the new data. Also, updates to a custom model in the form of new or changed audio resources are not reflected by the model until you train it with the changes.
You use the POST /v1/acoustic_customizations/{customization_id}/train
method to train a custom model. You pass the method the customization ID of the model that you want to train, as in the following example.
IBM Cloud
curl -X POST -u "apikey:{apikey}" \
"{url}/v1/acoustic_customizations/{customization_id}/train"
IBM Cloud Pak for Data
curl -X POST \
--header "Authorization: Bearer {token}" \
"{url}/v1/acoustic_customizations/{customization_id}/train"
The method is asynchronous. Training time depends on the cumulative amount of audio data that the custom acoustic model contains and the current load on the service. When you train or retrain a model, the service uses all of the model's audio data (not just new data) in the training. So training time is commensurate with the total amount of audio that the model contains.
A general guideline is that training a custom acoustic model takes approximately as long as the length of its cumulative audio data. For example, it takes approximately 2 hours to train a model that contains a total of 2 hours of audio. For more information about checking the status of a training operation, see Monitoring the train model request.
The method includes the following optional query parameters:
- The custom_language_model_id parameter specifies a separately created custom language model that is to be used during training. You can train with a custom language model that contains transcriptions of your audio files or that contains corpora or OOV words that are relevant to the contents of the audio files. For training to succeed, the custom language model must be fully trained and available, and the custom acoustic and custom language models must be based on the same version of the same base model. For more information, see Training a custom acoustic model with a custom language model.
- The strict parameter indicates whether training is to proceed if the custom model contains a mix of valid and invalid audio resources. By default, training fails if the model contains one or more invalid resources. Set the parameter to false to allow training to proceed as long as the model contains at least one valid resource. The service excludes invalid resources from the training. For more information, see Training failures for custom acoustic models.
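Both optional parameters are passed as a query string on the train request. A small sketch that assembles the URL; the helper name is illustrative, and the base URL and IDs below are placeholders just as `{url}` and `{customization_id}` are in the curl examples:

```python
from urllib.parse import urlencode

def train_request_url(base_url, customization_id,
                      custom_language_model_id=None, strict=None):
    """Assemble the URL for POST .../train, appending the optional
    custom_language_model_id and strict query parameters."""
    url = f"{base_url}/v1/acoustic_customizations/{customization_id}/train"
    params = {}
    if custom_language_model_id is not None:
        params["custom_language_model_id"] = custom_language_model_id
    if strict is not None:
        params["strict"] = "true" if strict else "false"
    return url + ("?" + urlencode(params) if params else "")

# Placeholder IDs for illustration only.
url = train_request_url("https://api.example.com", "74f4807e",
                        custom_language_model_id="1df7e7a0", strict=False)
```

The assembled URL is what you would POST to with your service credentials.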
Monitoring the train model request
The service returns a 200 response code if the training process is successfully initiated. The service cannot accept subsequent training requests, or requests to add more audio resources, until the existing training request completes.
To determine the status of a training request, use the GET /v1/acoustic_customizations/{customization_id}
method to poll the model's status. The method accepts the customization ID of the acoustic model, as in the following example:
IBM Cloud
curl -X GET -u "apikey:{apikey}" \
"{url}/v1/acoustic_customizations/{customization_id}"
IBM Cloud Pak for Data
curl -X GET \
--header "Authorization: Bearer {token}" \
"{url}/v1/acoustic_customizations/{customization_id}"
The response includes the status of the model, which is training:
{
"customization_id": "74f4807e-b5ff-4866-824e-6bba1a84fe96",
"created": "2016-06-01T18:42:25.324Z",
"updated": "2016-06-01T22:11:13.298Z",
"language": "en-US",
"owner": "297cfd08-330a-22ba-93ce-1a73f454dd98",
"name": "Example model",
"description": "Example custom acoustic model",
"base_model_name": "en-US_BroadbandModel",
"status": "training",
"progress": 0
}
The response includes status
and progress
fields that report the current state of the model. The meaning of the progress
field depends on the model's status. The status
field can have one
of the following values:
- pending indicates that the model was created but is waiting either for valid training data to be added or for the service to finish analyzing data that was added. The progress field is 0.
- ready indicates that the model contains valid data and is ready to be trained. The progress field is 0.
  If the model contains a mix of valid and invalid audio resources, training of the model fails unless you set the strict query parameter to false. For more information, see Training failures for custom acoustic models.
- training indicates that the model is being trained. The progress field changes from 0 to 100 when training is complete.
- available indicates that the model is trained and ready to use. The progress field is 100.
- upgrading indicates that the model is being upgraded. The progress field is 0.
- failed indicates that training of the model failed. The progress field is 0. For more information, see Training failures for custom acoustic models.
Use a loop to check the status of the training once a minute until the model becomes available. For more information about other fields that are returned by the method, see Listing custom acoustic models.
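The once-a-minute polling loop can be sketched the same way as the audio-status check, this time watching for a terminal model status. The `get_model` callable is a stand-in for the GET request returning the parsed JSON:

```python
import time

def wait_for_training(get_model, interval=60.0, max_polls=240):
    """Poll GET /v1/acoustic_customizations/{customization_id}
    (abstracted as get_model) until the model's status is terminal,
    i.e. either 'available' or 'failed'."""
    for _ in range(max_polls):
        model = get_model()
        if model["status"] in ("available", "failed"):
            return model["status"]
        time.sleep(interval)
    raise TimeoutError("training did not finish in time")

# Demo with a fake response sequence instead of live API calls.
fake_models = iter([
    {"status": "training", "progress": 0},
    {"status": "training", "progress": 0},
    {"status": "available", "progress": 100},
])
final_status = wait_for_training(lambda: next(fake_models), interval=0.0)
```

A caller would check the returned status and, on failed, proceed to the recovery steps described in the next section.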
Training failures for custom acoustic models
Training fails to start if the service is handling another request for the custom acoustic model. A conflicting request could be another training request or a request to add audio resources to the model. The service returns a status code of 409.
Training also fails to start for the following reasons:
- The custom model contains less than 10 minutes of audio data.
- The custom model contains more than 200 hours of audio data.
- One or more of the custom model's audio resources is invalid.
- You passed a custom language model with the custom_language_model_id query parameter that is not in the available state. A custom language model must be fully trained and available to be used to train a custom acoustic model.
- You passed an incompatible custom language model with the custom_language_model_id query parameter. Both custom models must be based on the same version of the same base model.
The service returns a status code of 400 and sets the custom model's status to failed. Take one of the following actions:
- Use the GET /v1/acoustic_customizations/{customization_id}/audio and GET /v1/acoustic_customizations/{customization_id}/audio/{audio_name} methods to examine the model's audio resources. For more information, see Listing audio resources for a custom acoustic model.
  For each invalid audio resource, do one of the following:
  - Correct the audio resource and use the allow_overwrite parameter of the POST /v1/acoustic_customizations/{customization_id}/audio/{audio_name} method to add the corrected audio to the model. For more information, see Add audio to the custom acoustic model.
  - Use the DELETE /v1/acoustic_customizations/{customization_id}/audio/{audio_name} method to delete the audio resource from the model. For more information, see Deleting an audio resource from a custom acoustic model.
- Set the strict parameter of the POST /v1/acoustic_customizations/{customization_id}/train method to false to exclude invalid audio resources from the training. The model must contain at least one valid audio resource for training to succeed. The strict parameter is useful for training a custom model that contains a mix of valid and invalid audio resources.
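To drive that cleanup from a script, you can filter the audio listing for invalid entries. The sketch below assumes the parsed JSON of the GET .../audio response has an audio array whose entries carry name and status fields; see Listing audio resources for a custom acoustic model for the actual response schema:

```python
def invalid_audio_names(audio_listing):
    """Return the names of audio resources whose analysis failed,
    given the parsed JSON of the GET .../audio listing (assumed
    shape: {'audio': [{'name': ..., 'status': ...}, ...]})."""
    return [entry["name"]
            for entry in audio_listing.get("audio", [])
            if entry.get("status") == "invalid"]

# Demo with a hand-written listing in the assumed shape.
listing = {
    "total_minutes_of_audio": 11.45,
    "audio": [
        {"name": "audio1", "status": "ok"},
        {"name": "audio2", "status": "invalid"},
    ],
}
to_fix = invalid_audio_names(listing)   # ["audio2"]
```

Each returned name can then be corrected and re-added with allow_overwrite=true, or deleted with the DELETE method, before retraining.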