IBM Cloud Docs
Using the External enrichment API

The external enrichment feature is not supported in the Analyze API.

The external enrichment feature allows you to annotate documents with a model of your choice. Through a webhook interface, you can use custom models, advanced foundation models, or other third-party models to enrich the documents in a collection. Your external application enriches the documents, and the enriched documents are then merged into a collection in a Discovery project.

IBM Cloud Pak for Data: When you run Discovery in an air-gapped environment, you must connect to the external application through an HTTP proxy. For more information, see Setting up HTTP proxy in air-gapped environments.

To use the external enrichment feature, complete the following steps:

  1. Set up the external application that can receive webhook notifications from Discovery and annotate documents.

    To do so, you must register your external app as a webhook endpoint on a project by using the create enrichment method. For more information, see Create enrichment in the API reference.

    After you set up the external enrichment for a project, it becomes available to all collections in the project. The external application also receives a webhook ping event, which notifies it that an external enrichment was created.

  2. Specify the collection in which you want to apply the external enrichment. You can use the API to apply the external enrichment to a collection. For more information, see Using the API to manage enrichments.

    Alternatively, on the user interface, you can browse to the Manage collections page, and choose the collection where you want to apply the external enrichment. Then, open the Enrichments tab, and apply your external enrichment to a field in the collection.

    When documents are processed or uploaded to this collection, Discovery creates a batch of documents with a unique batch_id. The external application also receives a webhook enrichment.batch.created event, which notifies that batches are ready to be pulled. Your external application can then pull batches from Discovery for external enrichment.

    If the external application shuts down or restarts in the meantime, you can use the List batches method to get the following:

    • Notified batches that are not yet pulled by the external enrichment application.
    • Batches that are pulled, but not yet pushed to Discovery by the external enrichment application.

    For more information, see List batches in the API reference.

  3. Specify the batch_id provided by Discovery in the pull batches method to pull the documents from Discovery for enrichment by your external application. For more information, see Pull batches in the API reference.

    The pull batches method returns a binary file attachment from Discovery. For more information about the binary attachment, see Binary attachment from the pull batches method.

  4. Specify the same batch_id in the push batches method after your external enrichment annotates the documents in the batch. For more information, see Push batches in the API reference.

    The documents are pushed to Discovery as a binary attachment. For more information, see Binary attachment in the push batches method.

  5. Verify that the documents are merged and indexed in the collection. The indexed documents should contain the annotations that your external application applied.
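
The pull, enrich, and push steps above can be sketched as a single function. This is a minimal sketch, not part of any Discovery SDK: pullBatch, pushBatch, and enrichDocuments are hypothetical callbacks that you would implement with the Pull batches method, the Push batches method, and your own model.

```javascript
// Sketch of steps 3 and 4: pull a batch, enrich it, push it back.
// pullBatch, enrichDocuments, and pushBatch are hypothetical callbacks
// supplied by the caller; they are not Discovery SDK functions.
async function processBatch(batchId, { pullBatch, enrichDocuments, pushBatch }) {
  // Step 3: pull the documents for the batch_id that Discovery sent.
  const pulled = await pullBatch(batchId);
  // Annotate the documents with the external model.
  const enriched = await enrichDocuments(pulled);
  // Step 4: push the annotated documents back with the same batch_id.
  await pushBatch(batchId, enriched);
  return { batchId, documentCount: enriched.length };
}
```

Passing the I/O functions in as arguments keeps the flow testable without a running Discovery instance.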

Authenticating the request for webhook security

To authenticate the webhook request, verify the JSON Web Token (JWT) that is sent with the request. The webhook microservice automatically generates a JWT and sends it in the Authorization header with each webhook call. It is your responsibility to add code to the external service that verifies the JWT.

The system generates the JWT from the secret that you specify and passes it to the external application in the Authorization header. If you specify your own value for the Authorization header, the webhook microservice sends that value to the external application instead of the JWT.

For example, if you specify "sample secret" in the Secret field of the Webhooks object in the Create collection or update collection APIs, you might add Node.js code such as the following to the external application:

const jwt = require('jsonwebtoken');
...
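const token = request.headers.authorization; // grab the "Authorization" header (Node.js lowercases header names)
try {
  const decoded = jwt.verify(token, 'sample secret');
  // token is valid; decoded contains the verified payload
} catch(err) {
  // error thrown if token is invalid; reject the request
}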
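
To make explicit what a JWT verification checks, the following sketch verifies a token by using only the Node.js crypto module. It assumes that the JWT is signed with HS256 (HMAC-SHA256), which this page does not state, and verifyHs256Jwt is a hypothetical helper name. The sketch does not check claims such as expiry, so a maintained library is usually the better choice in production.

```javascript
const crypto = require('crypto');

// Hypothetical helper: verifies an HS256-signed JWT and returns its payload.
// Assumption: the webhook JWT uses HS256; any other algorithm is rejected.
// The 'base64url' encoding requires a recent Node.js release (15.7 or later).
function verifyHs256Jwt(token, secret) {
  const parts = String(token).split('.');
  if (parts.length !== 3) throw new Error('malformed token');
  const [headerB64, payloadB64, signatureB64] = parts;

  const header = JSON.parse(Buffer.from(headerB64, 'base64url').toString('utf8'));
  if (header.alg !== 'HS256') throw new Error('unexpected algorithm');

  // Recompute the signature over "<header>.<payload>" with the shared secret.
  const expected = crypto
    .createHmac('sha256', secret)
    .update(`${headerB64}.${payloadB64}`)
    .digest();
  const actual = Buffer.from(signatureB64, 'base64url');

  // Constant-time comparison; the length check avoids a timingSafeEqual throw.
  if (actual.length !== expected.length || !crypto.timingSafeEqual(actual, expected)) {
    throw new Error('invalid signature');
  }
  return JSON.parse(Buffer.from(payloadB64, 'base64url').toString('utf8'));
}
```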

Data model of the ping event

Following are the ping event parameters:

Ping event
| Parameter | Description |
| --- | --- |
| event | The event name is ping. |
| instance_id | The Discovery instance ID. |
| version | The Discovery API version in the format yyyy-mm-dd. |
| data | An object with the event information. url: the configured webhook endpoint (URL). events: an array of event string values; the events in this array are sent to the webhook URL. metadata: an object with information that is specific to the created webhook. |
| created_at | The date and time the event was created. |

Data model of the enrichment.batch.created event

Following are the enrichment.batch.created event parameters:

Enrichment.batch.created
| Parameter | Description |
| --- | --- |
| event | The event name is enrichment.batch.created. |
| instance_id | The UUID of the Discovery instance, which is also known as the tenant ID. |
| version | The webhook event version date in the yyyy-mm-dd format. |
| data | An object with the event-specific information. project_id: the Universally Unique Identifier (UUID) of a project. collection_id: the UUID of a collection. enrichment_id: the UUID of an enrichment. batch_id: the UUID of a batch. |
| created_at | The date and time the event was created. |
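
The two event data models above can be handled by one small dispatcher. The following is a minimal sketch: handleWebhookEvent and the action objects that it returns are hypothetical conventions for this example, and only the payload property names come from the tables above.

```javascript
// Sketch of a dispatcher for the ping and enrichment.batch.created events.
// The returned action objects are a hypothetical convention for this example.
function handleWebhookEvent(payload) {
  switch (payload.event) {
    case 'ping':
      // Sent when the external enrichment is created.
      return { action: 'acknowledge', url: payload.data.url };
    case 'enrichment.batch.created': {
      // A batch is ready; keep the identifiers that are needed to pull it.
      const { project_id, collection_id, enrichment_id, batch_id } = payload.data;
      return { action: 'pull', project_id, collection_id, enrichment_id, batch_id };
    }
    default:
      // Unknown events are ignored rather than treated as errors.
      return { action: 'ignore', event: payload.event };
  }
}
```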

External enrichment limits

| Plan | Maximum number of webhook enrichments per collection | Maximum number of webhook enrichments per tenant |
| --- | --- | --- |
| Enterprise | 1 | 100 |
| Plus | 1 | 10 |
| Premium | 1 | 100 |

Binary attachment from the pull batches method

The pull batches method returns a binary attachment file from Discovery.

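The returned file is a compressed newline-delimited JSON (NDJSON) file. This file contains structured data that represents the document properties. For example, the following is a JSON value included in the NDJSON file (formatted across multiple lines here for readability; in an NDJSON file, each JSON value occupies a single line):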

{
    "document_id": "3bafc09abfaacd90d66f57181b50d041",
    "location_encoding": "utf-16",
    "language": "en",
    "artifact": "{\"text_positions\":[0,21],\"space_above\":93.07864284515381,\"space_below\":32.53530788421631,\"is_start_of_block\":true,\"image_id\":-1}{\"text_positions\":[22,63],\"space_above\":32.53530788421631,\"space_below\":13.935576438903809,\"is_start_of_block\":true,\"image_id\":-1}{\"parent_document_id\":\"3bafc09abfaacd90d66f57181b50d041\",\"source\":{\"ListId\":\"f0ac1d32-b9e5-41af-b9da-e1e37e965d99\",\"UniqueId\":\"357d7a48-4460-442c-be56-d8bdd40a8c36\",\"ServerRelativeUrl\":\"/Lists/list1/Attachments/1/addattachments.csv\",\"FileNameAsPath\":{\"DecodedUrl\":\"addattachments.csv\"},\"ListItemId\":\"284dcb51-8021-56d0-9213-7f4eb134e083\",\"FileName\":\"addattachments.csv\",\"ServerRelativePath\":{\"DecodedUrl\":\"/Lists/list1/Attachments/1/addattachments.csv\"},\"WebId\":\"ad5bf592-3b4e-4dd1-bd3e-abc0ef179b03\"},\"ingest_datetime\":\"2023-06-26T09:24:02.573Z\",\"application_id\":\"sharepoint\",\"application_sub_type\":\"ListItemAttachmentCollection\"}0.51vanilla ice creamcontamination_tamperingotherchange_of_propertiesI love the ads for the new milk chocolate. Could you tell me the name of the actor in the commercial?{\"metadata\":{\"numPages\":\"54\",\"title\":\"\",\"publicationdate\":\"2010-06-03\"},\"info\":{\"histogram\":{\"mean-char-height\":{},\"mean-char-width\":{},\"number-of-chars\":{}},\"styles\":[]}}1451692800000",
    "features": [
        {
            "type": "field",
            "location": {
                "begin": 0,
                "end": 128
            },
            "properties": {
                "field_name": "multi_nested",
                "field_index": 0,
                "field_type": "json"
            }
        },
        {
            "type": "field",
            "location": {
                "begin": 128,
                "end": 258
            },
            "properties": {
                "field_name": "multi_nested",
                "field_index": 1,
                "field_type": "json"
            }
        },
        {
            "type": "field",
            "location": {
                "begin": 258,
                "end": 889
            },
            "properties": {
                "field_name": "metadata",
                "field_index": 0,
                "field_type": "json"
            }
        },
        {
            "type": "field",
            "location": {
                "begin": 889,
                "end": 892
            },
            "properties": {
                "field_name": "claim_score",
                "field_index": 0,
                "field_type": "double"
            }
        },
        {
            "type": "field",
            "location": {
                "begin": 892,
                "end": 893
            },
            "properties": {
                "field_name": "claim_id",
                "field_index": 0,
                "field_type": "long"
            }
        },
        {
            "type": "field",
            "location": {
                "begin": 893,
                "end": 910
            },
            "properties": {
                "field_name": "claim_product",
                "field_index": 0,
                "field_type": "string"
            }
        },
        {
            "type": "field",
            "location": {
                "begin": 910,
                "end": 933
            },
            "properties": {
                "field_name": "label",
                "field_index": 0,
                "field_type": "string"
            }
        },
        {
            "type": "field",
            "location": {
                "begin": 933,
                "end": 938
            },
            "properties": {
                "field_name": "label",
                "field_index": 1,
                "field_type": "string"
            }
        },
        {
            "type": "field",
            "location": {
                "begin": 938,
                "end": 958
            },
            "properties": {
                "field_name": "label",
                "field_index": 2,
                "field_type": "string"
            }
        },
        {
            "type": "field",
            "location": {
                "begin": 958,
                "end": 1059
            },
            "properties": {
                "field_name": "body",
                "field_index": 0,
                "field_type": "string"
            }
        },
        {
            "type": "field",
            "location": {
                "begin": 1059,
                "end": 1230
            },
            "properties": {
                "field_name": "nested",
                "field_index": 0,
                "field_type": "json"
            }
        },
        {
            "type": "field",
            "location": {
                "begin": 1230,
                "end": 1243
            },
            "properties": {
                "field_name": "claim_date",
                "field_index": 0,
                "field_type": "date"
            }
        }
    ]
}
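
The feature locations in the example above can be resolved against the artifact string as sketched below. The sketch assumes that the attachment is gzip-compressed, which this page does not state, and parsePulledBatch and extractFields are hypothetical helper names.

```javascript
const zlib = require('zlib');

// Decompresses a pulled attachment and parses one JSON document per line.
// Assumption: the attachment is gzip-compressed.
function parsePulledBatch(compressedBuffer) {
  return zlib
    .gunzipSync(compressedBuffer)
    .toString('utf8')
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line));
}

// Resolves the text of every "field" feature from the artifact string.
// JavaScript strings are indexed in UTF-16 code units, so slice() matches
// location_encoding "utf-16" directly; the end location is exclusive.
function extractFields(doc) {
  if (doc.location_encoding !== 'utf-16') {
    throw new Error(`unsupported location_encoding: ${doc.location_encoding}`);
  }
  return doc.features
    .filter((feature) => feature.type === 'field')
    .map((feature) => ({
      name: feature.properties.field_name,
      index: feature.properties.field_index,
      type: feature.properties.field_type,
      text: doc.artifact.slice(feature.location.begin, feature.location.end),
    }));
}
```

For the example document, this approach would yield the text "vanilla ice cream" for the claim_product field, because that field spans locations 893 to 910 of the artifact.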

Following are the binary file properties:

Pull method binary file properties
| Property | Type | Description |
| --- | --- | --- |
| document_id | string | The identifier of the document. |
| location_encoding | string | The encoding type used to calculate the location of each feature. The supported types are: utf-8, utf-16, and utf-32. The external enrichment application must calculate the location of each feature based on the location_encoding of the corresponding document from Discovery. The location of features in a string representation of data varies depending on the encoding type of the programming language that is used for implementing the external enrichment. For example, C++ and Go use UTF-8, Java and JavaScript use UTF-16, and Python uses UTF-32. |
| language | string | The content language of the document. |
| artifact | string | The package of all the text values. |
| features | array | The list of features in a document. For more information, see Feature types. |
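
When the location_encoding of a document differs from the native string encoding of your language, the locations must be converted before you slice the artifact. The following sketch converts a location into a JavaScript string index (UTF-16 code units); toUtf16Index is a hypothetical helper name, and it assumes that utf-8 locations count bytes and utf-32 locations count Unicode code points.

```javascript
// Converts a feature location expressed in the given location_encoding into
// an index that is usable with JavaScript string methods (UTF-16 code units).
// Assumption: utf-8 locations count bytes, utf-32 locations count code points.
function toUtf16Index(text, offset, locationEncoding) {
  if (locationEncoding === 'utf-16') return offset;
  if (locationEncoding !== 'utf-8' && locationEncoding !== 'utf-32') {
    throw new Error(`unsupported location_encoding: ${locationEncoding}`);
  }
  let units = 0; // position measured in the requested encoding
  let index = 0; // position measured in UTF-16 code units
  for (const ch of text) { // for...of iterates by Unicode code point
    if (units >= offset) break;
    units += locationEncoding === 'utf-8' ? Buffer.byteLength(ch, 'utf8') : 1;
    index += ch.length; // 1, or 2 for a surrogate pair
  }
  return index;
}
```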

Binary attachment in the push batches method

After external enrichment, the documents can be pushed to Discovery as a binary attachment in the push batches method.

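The file must be a compressed NDJSON file with structured data that represents the document properties. For example, the following is a JSON value for inclusion in the NDJSON file (formatted across multiple lines here for readability; in an NDJSON file, each JSON value occupies a single line):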

{
  "document_id": "3bafc09abfaacd90d66f57181b50d041",
  "features": [
    {
      "type": "annotation",
      "location": {
        "begin": 958,
        "end": 1000
      },
      "properties": {
        "type": "element_classes",
        "class_name": "expression",
        "confidence": 0.7905777096748352
      }
    },
    {
      "type": "annotation",
      "location": {
        "begin": 1001,
        "end": 1059
      },
      "properties": {
        "type": "element_classes",
        "class_name": "question",
        "confidence": 0.9507029056549072
      }
    },
    {
      "type": "annotation",
      "location": {
        "begin": 1035,
        "end": 1040
      },
      "properties": {
        "type": "entities",
        "entity_type": "JobTitle",
        "entity_text": "actor",
        "confidence": 0.70953685
      }
    },
    {
      "type": "annotation",
      "properties": {
        "type": "document_classes",
        "class_name": "amount.shortage",
        "confidence": 0.43297016620635986
      }
    },
    {
      "type": "notice",
      "properties": {
        "description": "something wrong happened"
      }
    },
    {
      "type": "notice",
      "properties": {
        "description": "something wrong happened again",
        "created": 1689076276402
      }
    }
  ]
}

Following are the binary file properties:

Push method binary file properties
| Property | Type | Description |
| --- | --- | --- |
| document_id | string | The identifier of the document. |
| features | array | The list of features in a document. For more information, see Feature types. |
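
A push attachment with these two properties can be assembled as sketched below. The sketch assumes gzip as the compression format, which this page does not state, and buildPushDocument and toCompressedNdjson are hypothetical helper names.

```javascript
const zlib = require('zlib');

// Builds one document entry for the push batches attachment.
function buildPushDocument(documentId, features) {
  return { document_id: documentId, features };
}

// Serializes documents as NDJSON (one JSON value per line) and compresses them.
// Assumption: gzip is the expected compression format.
function toCompressedNdjson(documents) {
  const ndjson = documents.map((doc) => JSON.stringify(doc)).join('\n') + '\n';
  return zlib.gzipSync(Buffer.from(ndjson, 'utf8'));
}
```

JSON.stringify without an indentation argument emits each document on one line, which is what NDJSON requires.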

Feature types

A feature type can be one of the following in a binary file:

Feature types
| Feature | Type | Description |
| --- | --- | --- |
| field | string | Represents a specific field value of the document. |
| annotation | string | Represents a specific annotation that can enrich the document. |
| notice | string | Represents any error that might occur in the external application during document enrichment. The information in notice is used to generate a message on the Discovery UI. |

The following are the other properties in the binary file:

Other properties in the binary file
| Property | Type | Description |
| --- | --- | --- |
| location | object | Location information to get the text value from the artifact by using the begin and end values. The begin value is an offset that represents the begin location in the artifact. The end value is an offset that represents an exclusive end location in the artifact. This property is null when a feature represents document-level information, for example, when type=annotation and properties.type=document_classes. |
| properties | object | The properties of a feature in the document. Supported properties vary depending on the type of feature. For more information, see Field type properties, Annotation type properties, and Notice type properties. |

Field type properties

For field type, the following properties represent a certain field of the document that was converted by Discovery from an original file:

Field type properties
| Property | Type | Description |
| --- | --- | --- |
| field_name | string | The name of the field. |
| field_index | int | The index of a field value. This value is 0 for a single-valued field, but can be greater than 0 when a field is multi-valued, such as an array of values. |
| field_type | string (enum: string, long, double, date, json) | The data type of the feature. This value determines how to parse the text representation of the feature in a programming language. |
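
Because field_type determines how to parse the text of a field, a parser can switch on it. The following is a minimal sketch with a hypothetical parseFieldValue helper. It assumes that date text is Unix epoch time in milliseconds, as the claim_date value in the pull example suggests, and it treats string as a field type because the pull example uses it.

```javascript
// Parses the text of a field feature according to its field_type.
// Assumptions: "date" text is Unix epoch milliseconds (inferred from the
// claim_date value in the pull example), and "string" is returned unchanged.
function parseFieldValue(text, fieldType) {
  switch (fieldType) {
    case 'long':
      return BigInt(text); // avoids precision loss above 2^53 - 1
    case 'double':
      return Number(text);
    case 'date':
      return new Date(Number(text));
    case 'json':
      return JSON.parse(text);
    case 'string':
      return text;
    default:
      throw new Error(`unknown field_type: ${fieldType}`);
  }
}
```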

Annotation type properties

For annotation type, the following properties represent an annotation that can enrich a document:

Annotation type properties
| Property | Type | Description |
| --- | --- | --- |
| type | string (enum: entities, element_classes, document_classes) | The type of enriched annotation that a feature represents. Annotations of type entities are merged into the entities of enriched fields. Annotations of type element_classes are merged into the element classes of enriched fields. Annotations of type document_classes are merged into the classes of the document-level enrichment field. |
| confidence | double | Optional. The confidence score from the external model. The value is between 0 and 1, and is 0 by default. |
| entity_type | string | The type of entity that an external model assigns to a thing. Required for the entities type. |
| entity_text | string | The representative text of an entity that the external application extracts. Required for the entities type. |
| class_name | string | The name of a class that the external application assigns to a thing. Required for the element_classes and document_classes types. |
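
The required properties above can be checked before a batch is pushed. The following is a minimal sketch with a hypothetical validateAnnotation helper; the error messages are illustrative, and the rules are only the ones listed in the table.

```javascript
// Required properties for each annotation type, taken from the table above.
const REQUIRED_BY_ANNOTATION_TYPE = new Map([
  ['entities', ['entity_type', 'entity_text']],
  ['element_classes', ['class_name']],
  ['document_classes', ['class_name']],
]);

// Returns a list of problems found in an annotation feature.
// An empty list means that the feature passed these checks.
function validateAnnotation(feature) {
  const errors = [];
  const props = feature.properties || {};
  if (feature.type !== 'annotation') errors.push('type must be "annotation"');
  const required = REQUIRED_BY_ANNOTATION_TYPE.get(props.type);
  if (!required) {
    errors.push(`unsupported properties.type: ${props.type}`);
    return errors;
  }
  for (const name of required) {
    if (typeof props[name] !== 'string' || props[name] === '') {
      errors.push(`missing ${name}`);
    }
  }
  // confidence is optional, but must be a number from 0 to 1 when present.
  const confidence = props.confidence;
  if (confidence !== undefined &&
      (typeof confidence !== 'number' || confidence < 0 || confidence > 1)) {
    errors.push('confidence must be between 0 and 1');
  }
  return errors;
}
```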

Notice type properties

For notice type, the following properties represent errors and exceptions that occurred in the external application while enriching a document:

Notice type properties
| Property | Type | Description |
| --- | --- | --- |
| description | string | The message that describes an error that occurred during external enrichment. |
| created | long | Unix epoch time in milliseconds when an error occurred during external enrichment. |
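
An error that is raised while enriching a document can be turned into a notice feature with these two properties. The following is a minimal sketch; noticeFromError is a hypothetical helper name.

```javascript
// Builds a notice feature from an error raised while enriching a document.
// created is Unix epoch time in milliseconds, as described in the table above.
function noticeFromError(error, now = Date.now()) {
  const description = error instanceof Error ? error.message : String(error);
  return {
    type: 'notice',
    properties: { description, created: now },
  };
}
```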