Building a Cloud Pak for Data custom crawler plug-in
Discovery lets you build your own crawler plug-in with a Java SDK. With crawler plug-ins, you can quickly develop solutions that are tailored to your use cases. You can download the SDK from your installed Discovery cluster. For more information, see Obtaining the crawler plug-in SDK package.
IBM Cloud Pak for Data only
This information applies only to installed deployments.
Any custom code that you use with IBM Watson® Discovery is the responsibility of the developer; IBM Support does not cover any custom code that the developer creates.
The crawler plug-ins support the following functions:
- Update the metadata list of a crawled document
- Update the content of a crawled document
- Exclude a crawled document
- Reference crawler configurations, masking password values
- Show notice messages in the Discovery user interface
- Output log messages to the crawler pod console
However, the crawler plug-ins cannot support the following functions:
- Split a crawled document into multiple documents
- Combine content from multiple documents into a single document
- Modify access control lists
Crawler plug-in requirements
Make sure that the following items are installed on the development server that you plan to use to develop a crawler plug-in by using this SDK:
- Java SE Development Kit (JDK) 1.8 or higher
- Gradle
- cURL
- sed (stream editor)
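The requirements above can be checked before you start development. The following is a minimal sketch, not part of the SDK: the function names `missing_tools` and `check_prerequisites` are hypothetical, and the check only confirms that each command is on the PATH. It does not verify that the installed JDK is version 1.8 or higher.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every listed tool that is not
# found on PATH, one name per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Hypothetical helper: check the tools named in the requirements list.
# Returns 1 and reports the missing tools on stderr if any are absent.
check_prerequisites() {
    missing=$(missing_tools java gradle curl sed)
    if [ -n "$missing" ]; then
        printf 'Missing required tools:\n%s\n' "$missing" >&2
        return 1
    fi
    printf 'All required tools were found on PATH.\n'
}
```

Calling `check_prerequisites` on the development server reports any tool that still needs to be installed.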
Obtaining the crawler plug-in SDK package
- Log in to your Discovery cluster.
- Enter the following command to obtain your crawler pod name:
  oc get pods | grep crawler
  The following example shows sample output.
  wd-discovery-crawler-57985fc5cf-rxk89 1/1 Running 0 85m
- Enter the following command to obtain the SDK package name, replacing {crawler-pod-name} with the crawler pod name that you obtained in step 2:
  oc exec {crawler-pod-name} -- ls -l /opt/ibm/wex/zing/resources/ | grep wd-crawler-plugin-sdk
  The following example shows sample output.
  -rw-r--r--. 1 dadmin dadmin 35575 Oct 1 16:51 wd-crawler-plugin-sdk-{build-version}.zip
- Enter the following command to copy the SDK package to the host server, replacing {build-version} with the build version number from the previous step:
  oc cp {crawler-pod-name}:/opt/ibm/wex/zing/resources/wd-crawler-plugin-sdk-{build-version}.zip wd-crawler-plugin-sdk.zip
- If necessary, copy the SDK package to the development server.
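The lookup steps above can be scripted so that the pod name and package name do not have to be copied by hand. The following is a sketch under stated assumptions: the function names are hypothetical, `awk` is assumed to be available in addition to the tools listed in the requirements, and the first matching line is used if more than one crawler pod or SDK package is listed.

```shell
#!/bin/sh
# Hypothetical helper: read `oc get pods` output on stdin and print the
# name (first column) of the first pod whose line contains "crawler".
crawler_pod_name() {
    awk '/crawler/ { print $1; exit }'
}

# Hypothetical helper: read `ls -l` output on stdin and print the file
# name (last column) of the first SDK package entry.
sdk_package_name() {
    awk '/wd-crawler-plugin-sdk/ { print $NF; exit }'
}

# Hypothetical helper: run the same oc commands as the steps above and
# copy the SDK package to wd-crawler-plugin-sdk.zip in the current directory.
fetch_sdk() {
    pod=$(oc get pods | crawler_pod_name)
    [ -n "$pod" ] || { echo 'No crawler pod found.' >&2; return 1; }

    pkg=$(oc exec "$pod" -- ls -l /opt/ibm/wex/zing/resources/ | sdk_package_name)
    [ -n "$pkg" ] || { echo 'No SDK package found.' >&2; return 1; }

    oc cp "$pod:/opt/ibm/wex/zing/resources/$pkg" wd-crawler-plugin-sdk.zip
}
```

After you log in to the cluster, calling `fetch_sdk` performs the lookup and the copy in one pass. The two parsing helpers only process text, so they can be tried against the sample output shown in the steps without cluster access.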
Building a crawler plug-in package
- Extract the SDK compressed file.
- Implement the plug-in logic in src/. Ensure that any dependencies that your code needs are declared in build.gradle.
- Enter gradle packageCrawlerPlugin to create the plug-in package. The package is generated as build/distributed/wd-crawler-plugin-sample.zip.
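The build step can be wrapped in a small script that also confirms that the package was produced. The following is a minimal sketch: the function name `build_plugin` is hypothetical, the SDK is assumed to be already extracted into the directory that is passed as the first argument, and the output path is taken as stated in the steps above.

```shell
#!/bin/sh
# Hypothetical helper: run the packaging task inside an extracted SDK
# directory, then confirm that the expected archive exists.
build_plugin() {
    sdk_dir=$1
    # Output path as stated in the steps above, relative to the SDK directory.
    package=build/distributed/wd-crawler-plugin-sample.zip

    [ -d "$sdk_dir" ] || { echo "Not a directory: $sdk_dir" >&2; return 1; }

    # Run Gradle in a subshell so the caller's working directory is unchanged.
    (cd "$sdk_dir" && gradle packageCrawlerPlugin) || return 1

    if [ -f "$sdk_dir/$package" ]; then
        echo "Plug-in package created: $sdk_dir/$package"
    else
        echo "Build finished, but $package was not found." >&2
        return 1
    fi
}
```

For example, `build_plugin ./wd-crawler-plugin-sdk` builds from a directory of that name. The directory name here is only an illustration.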