Building a custom crawler plug-in

In an upcoming release, the bundled JVM for the crawler plug-in and custom connector features will be transitioned to IBM Semeru Runtimes, Version 21. If your crawler plug-in or custom connectors use any features that are incompatible between IBM SDK, Java Technology Edition, Version 8 and IBM Semeru Runtimes, Version 21, you must revise your code to ensure compatibility with future releases, such as Discovery 5.2.x and later.

For more information about the JVM migration, see the JVM migration documentation.


Discovery offers the option to build your own crawler plug-in with a Java SDK. By using crawler plug-ins, you can quickly develop solutions for your use cases. You can download the SDK from your installed Discovery cluster. For more information, see Obtaining the crawler plug-in SDK package.

This information applies only to installed deployments (IBM Cloud Pak for Data and IBM Software Hub).

Any custom code that you use with IBM Watson® Discovery is the responsibility of the developer; IBM Support does not cover any custom code that the developer creates.

The crawler plug-ins support the following functions, some of which are illustrated in the sketch after these lists:

  • Update the metadata list of a crawled document
  • Update the content of a crawled document
  • Exclude a crawled document
  • Reference crawler configuration settings (password values are masked)
  • Show notice messages in the Discovery user interface
  • Output log messages to the crawler pod console

However, the crawler plug-ins cannot support the following functions:

  • Split a crawled document into multiple documents
  • Combine content from multiple documents into a single document
  • Modify access control lists
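
For example, a plug-in might add a metadata field, redact part of a document's content, and exclude oversized documents. The following is a minimal sketch only: the class and method names (SampleCrawlerPlugin, Document, updateMetadata, updateContent, accept) are illustrative assumptions, not the SDK's actual interfaces. Base a real implementation on the sample source that ships in the SDK package.

    import java.util.HashMap;
    import java.util.Map;

    public class SampleCrawlerPlugin {

        // Minimal stand-in for the SDK's crawled-document type; an assumption
        // for this sketch, not the real API.
        public static class Document {
            public final Map<String, String> metadata = new HashMap<>();
            public String content = "";
        }

        // Update the metadata list of a crawled document.
        public void updateMetadata(Document doc) {
            doc.metadata.put("processed_by", "sample-plugin");
        }

        // Update the content of a crawled document, for example to redact a marker.
        public void updateContent(Document doc) {
            doc.content = doc.content.replace("CONFIDENTIAL", "[REDACTED]");
        }

        // Return false to exclude a crawled document, for example by content size.
        public boolean accept(Document doc) {
            return doc.content.length() <= 1_000_000;
        }
    }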

Crawler plug-in requirements

Make sure that the following software is installed on the development server that you plan to use to develop a crawler plug-in with this SDK:

  • Java SE Development Kit (JDK) 1.8 or higher
  • Gradle
  • cURL
  • sed (stream editor)

Obtaining the crawler plug-in SDK package

  1. Log in to your Discovery cluster.

  2. Enter the following command to obtain your crawler pod name:

    oc get pods | grep crawler
    

    The following example shows sample output.

    wd-discovery-crawler-57985fc5cf-rxk89     1/1     Running     0          85m
    
  3. Enter the following command to obtain the SDK package name, replacing {crawler-pod-name} with the crawler pod name that you obtained in step 2:

    oc exec {crawler-pod-name} -- ls -l /opt/ibm/wex/zing/resources/ | grep wd-crawler-plugin-sdk
    

    The following example shows sample output.

    -rw-r--r--. 1 dadmin dadmin 35575 Oct  1 16:51 wd-crawler-plugin-sdk-${build-version}.zip
    
  4. Enter the following command to copy the SDK package to the host server, replacing ${build-version} with the build version number that is shown in the output from the previous step:

    oc cp {crawler-pod-name}:/opt/ibm/wex/zing/resources/wd-crawler-plugin-sdk-${build-version}.zip wd-crawler-plugin-sdk.zip
    
  5. If necessary, copy the SDK package to the development server.

Building a crawler plug-in package

  1. Extract the SDK compressed file.
  2. Implement the plug-in logic in src/. Make sure that any dependencies are declared in build.gradle. You can exercise the plug-in logic locally first, as shown in the sketch after these steps.
  3. Enter gradle packageCrawlerPlugin to create the plug-in package. The package is generated as build/distributed/wd-crawler-plugin-sample.zip.
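
Before you package the plug-in, you can exercise the logic locally with plain Java. The following sketch drives the hypothetical SampleCrawlerPlugin class from the earlier example; run it with java -ea so that the assertions are checked.

    public class SampleCrawlerPluginTest {
        public static void main(String[] args) {
            SampleCrawlerPlugin plugin = new SampleCrawlerPlugin();
            SampleCrawlerPlugin.Document doc = new SampleCrawlerPlugin.Document();
            doc.content = "CONFIDENTIAL quarterly numbers";

            plugin.updateMetadata(doc);
            plugin.updateContent(doc);

            assert doc.metadata.containsKey("processed_by");
            assert !doc.content.contains("CONFIDENTIAL");
            assert plugin.accept(doc); // small document, so it is indexed

            System.out.println("Plug-in logic OK: " + doc.content);
        }
    }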