Generating data from a taxonomy
Complete the following steps to generate data from your taxonomy.
Data cannot be augmented, curated, or manually uploaded to train the model. Use this task to generate the data.
Prerequisites
Generating data by using the console
- In the console, open the Red Hat AI InstructLab service.
- Click Projects > your project > Training data > Generate.
- Provide an alphanumeric name for the training data, select the taxonomy to use, and click Generate. The state is queued, then running. Wait for the state to be completed. When the data is generated, a synthetic_data directory is created in the COS bucket with logs for troubleshooting.
Generating data by using the CLI
- List your taxonomies and make a note of the taxonomy you want to use.
ibmcloud ilab taxonomy list
Example output.
id                         name       taxonomy_path
669a88c9488ee7b95ce8fe05   test-tax   taxonomy.tar.gz
- Generate data from your taxonomy. Use alphanumeric characters in the name. Note the ID of the data to use in the next step.
ibmcloud ilab data generate [--name NAME] [--taxonomy-id TAXONOMY-ID]
Example command.
ibmcloud ilab data generate --name testdata --taxonomy-id 669a88c9488ee7b95ce8fe05
Example output.
id            66a268c170dcb21150050e8e
name          test-data
state         queued
status
created_at    2024-07-19T15:40:29.000Z
taxonomy_id   669a88c9488ee7b95ce8fe05
- Check the details of your data generation. Include the ID for the data. The state is queued, then running. Wait for the state to be completed.
ibmcloud ilab data get --id DATA_ID
Example data get command.
ibmcloud ilab data get --id 66a268c170dcb21150050e8e
Example output.
id            66a268c170dcb21150050e8e
name          test-data
state         running
status        Generating data for taxonomy path compositional_skills->STEM->math->area: 12% 12/100 (total qna processed 1/147)
created_at    2024-07-19T15:40:29.000Z
taxonomy_id   669a88c9488ee7b95ce8fe05
Example data get command with the --output json option, which includes metrics.
ibmcloud ilab data get --id 66a268c170dcb21150050e8e --output json
Example JSON output.
{
  "created_at": "2024-07-19T15:40:29.000Z",
  "data_metrics": {
    "samples": {
      "knowledge": 30,
      "skills": 70,
      "total": 100
    }
  },
  "id": "66a268c170dcb21150050e8e",
  "name": "test-data",
  "state": "completed",
  "status": "completed",
  "taxonomy_id": "669a88c9488ee7b95ce8fe05"
}
When the state is completed, a synthetic_data directory is created in the COS bucket with logs for troubleshooting.
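The "check, then wait for completed" step above can be automated with a small polling loop. The following is a minimal sketch, not an official client. It assumes that the ibmcloud CLI with the ilab plug-in is installed and logged in, and that the --output json option prints only the JSON record to standard output. The function names, the polling interval, and the timeout are illustrative choices and do not come from the service documentation.

```python
import json
import subprocess
import time


def fetch_data_record(data_id):
    """Run `ibmcloud ilab data get --id ID --output json` and parse the result.

    Assumption: stdout contains only the JSON record shown in the example output.
    """
    result = subprocess.run(
        ["ibmcloud", "ilab", "data", "get", "--id", data_id, "--output", "json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


def wait_until_completed(fetch, data_id, interval=60.0, timeout=6 * 3600.0,
                         sleep=time.sleep):
    """Call fetch(data_id) repeatedly until the record's state is "completed".

    interval and timeout (seconds) are illustrative defaults only.
    Raises TimeoutError if the state is not "completed" once timeout has elapsed.
    """
    waited = 0.0
    while True:
        record = fetch(data_id)
        if record.get("state") == "completed":
            return record
        if waited >= timeout:
            raise TimeoutError(
                f"data {data_id} is still in state {record.get('state')!r}"
            )
        sleep(interval)
        waited += interval


# Example use (requires the CLI; the ID is the one from the example output):
# record = wait_until_completed(fetch_data_record, "66a268c170dcb21150050e8e")
# print(record["data_metrics"]["samples"])
```

Passing the fetch function as a parameter keeps the loop independent of the CLI, so the same loop can be reused with a different way of retrieving the record.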
Generating data by using the API
- List your taxonomies and make a note of the taxonomy you want to use.
Example command.
curl -X 'GET' \
  'https://us-east.instructlab.ibm.com/v1/taxonomies' \
  -H 'accept: application/json'
Example output.
{
  "taxonomies": [
    {
      "id": "202a03c4-dcf1-432a-82b7-abecb2e019f7",
      "name": "example-taxonomy-name-1",
      "taxonomy_path_cos": "taxonomies/taxonomy.tar.gz",
      "created_at": "2024-10-23T02:58:50.000Z"
    }
  ]
}
- Generate data from your taxonomy. Use alphanumeric characters in the name. Note the ID of the data to use in the next step.
Example command.
curl -X 'POST' \
  'https://us-east.instructlab.ibm.com/v1/data' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{ "name": "example-data-1", "taxonomy_id": "202a03c4-dcf1-432a-82b7-abecb2e019f7" }'
Example output.
{
  "id": "add785e6-a8c3-4f5f-ab89-c506a3f115da",
  "name": "example-data-1",
  "state": "",
  "status": "queued",
  "created_at": "2024-10-23T02:58:50.000Z",
  "taxonomy_id": "202a03c4-dcf1-432a-82b7-abecb2e019f7",
  "data_metrics": {
    "samples": {
      "additionalProp1": 1,
      "additionalProp2": 2,
      "additionalProp3": 3
    }
  }
}
- Check the details of your data generation. Include the ID for the data. The state is queued, then running. Wait for the state to be completed.
Example command.
curl -X 'GET' \
  'https://us-east.instructlab.ibm.com/v1/data/add785e6-a8c3-4f5f-ab89-c506a3f115da' \
  -H 'accept: application/json'
Example output.
{
  "id": "add785e6-a8c3-4f5f-ab89-c506a3f115da",
  "name": "example-data-1",
  "state": "",
  "status": "queued",
  "created_at": "2024-10-23T02:58:50.000Z",
  "taxonomy_id": "202a03c4-dcf1-432a-82b7-abecb2e019f7",
  "data_metrics": {
    "samples": {
      "additionalProp1": 1,
      "additionalProp2": 2,
      "additionalProp3": 3
    }
  }
}
When the state is completed, a synthetic_data directory is created in the COS bucket with logs for troubleshooting.
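If you prefer to call the API from a script instead of curl, the generate and status requests above could be built as in the following sketch. It uses only the Python standard library and mirrors the URL, headers, and body of the curl examples. The function names are hypothetical. The curl examples show no authentication header, so none is added here; treat that as an assumption and add whatever credentials your account requires.

```python
import json
import urllib.request

# Base URL taken from the curl examples in this section.
BASE_URL = "https://us-east.instructlab.ibm.com/v1"


def build_generate_request(name, taxonomy_id):
    """Build the POST /data request that starts data generation."""
    body = json.dumps({"name": name, "taxonomy_id": taxonomy_id}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/data",
        data=body,
        method="POST",
        headers={"accept": "application/json", "Content-Type": "application/json"},
    )


def build_status_request(data_id):
    """Build the GET /data/{id} request that returns the generation details."""
    return urllib.request.Request(
        f"{BASE_URL}/data/{data_id}",
        method="GET",
        headers={"accept": "application/json"},
    )


def send(request):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Example use (performs real network calls):
# created = send(build_generate_request(
#     "example-data-1", "202a03c4-dcf1-432a-82b7-abecb2e019f7"))
# details = send(build_status_request(created["id"]))
```

Separating request construction from sending makes it possible to inspect a request before it goes out.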
What's in my COS bucket after generating data?
After you generate data, your COS bucket contains a synthetic_data directory with the following files.
- Artifacts: These files contain the samples on each leaf node. They are not used for training the model, but are provided for readability and can be used to check whether a QNA is generating the expected number of samples.
- Logs: These files contain the Red Hat AI InstructLab execution logs and system details.
- knowledge_train_msgs.jsonl and skills_train_msgs.jsonl: These are the Phase 1 and Phase 2 training files and contain the samples used for training the model.
To understand why and how your data gets generated, see the SDG FAQs community doc.
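As a quick sanity check on the training files, you could count the records in a downloaded copy. This is a minimal sketch that assumes the .jsonl files follow the JSON Lines convention (one JSON object per line) and that you have already copied them from the COS bucket to a local directory. The function name and the local path in the usage comment are hypothetical, and the sketch makes no assumption about the fields inside each record.

```python
import json
from pathlib import Path


def count_jsonl_records(path):
    """Count the non-empty lines in a JSON Lines file.

    Each non-empty line is parsed to confirm it is valid JSON.
    Raises ValueError naming the first line that fails to parse.
    """
    count = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(
                    f"{path}: line {line_number} is not valid JSON"
                ) from exc
            count += 1
    return count


# Example use with a hypothetical local copy of the training files:
# for name in ("knowledge_train_msgs.jsonl", "skills_train_msgs.jsonl"):
#     print(name, count_jsonl_records(Path("synthetic_data") / name))
```

Whether these counts match the data_metrics values reported by the service is not stated in this topic, so treat any comparison as informal.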
Next steps
After you've generated data from your taxonomy, you can begin training your model.