Overview of IBM Cloud data sources

You can use IBM Watson® Discovery on IBM Cloud® to connect to remote data sources and crawl the documents that they contain.

IBM Cloud only

This information applies only to managed deployments. For more information about IBM Cloud Pak for Data data sources, see Overview of Cloud Pak for Data data sources.

Connect to an external data source so that you can pull documents into Discovery on a schedule. Discovery pulls documents from the data source by crawling it. Crawling is the process of systematically browsing and retrieving documents from a starting location that you specify. When the crawler first processes a data source, it performs a full crawl. Each time the crawler runs after the initial crawl, it performs a refresh, in which it checks for new and changed files only.
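The difference between a full crawl and a refresh can be illustrated with a small conceptual sketch. It is not how Discovery is implemented; the `SourceDoc` type, the `full_crawl` and `refresh` functions, and the use of a last-modified timestamp to detect changes are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceDoc:
    """Hypothetical stand-in for a document in a remote data source."""
    path: str
    modified: float  # last-modified timestamp (assumed change signal)


def full_crawl(source):
    """Retrieve every document and remember when each one was last modified."""
    seen = {doc.path: doc.modified for doc in source}
    return list(source), seen


def refresh(source, seen):
    """Retrieve only documents that are new or changed since the last run."""
    changed = [doc for doc in source if seen.get(doc.path) != doc.modified]
    for doc in changed:
        seen[doc.path] = doc.modified
    return changed


# First run: everything is retrieved.
first = [SourceDoc("a.pdf", 1.0), SourceDoc("b.pdf", 1.0)]
docs, seen = full_crawl(first)

# Later run: b.pdf changed and c.pdf is new, so only those two are retrieved.
later = [SourceDoc("a.pdf", 1.0), SourceDoc("b.pdf", 2.0), SourceDoc("c.pdf", 1.0)]
updated = refresh(later, seen)
```

In this sketch, a second `refresh` call against the same source state returns nothing, because no document is new or changed.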

All Discovery data source connectors are read-only. Regardless of the permissions that are granted to the crawl account, Discovery never writes, updates, or deletes any content in the original data source.

You can use Discovery to crawl from the following data sources:

Your data source isn't listed? Check whether IBM® App Connect has a connector to the data source. You can use a default connector that is built for App Connect to send data from a data source to Discovery. For a list of the data sources supported by App Connect default connectors, see Connectors A-Z. For more information about integrating App Connect with Discovery, see How to use IBM App Connect with IBM Watson® Discovery.

To use an App Connect connector, you must create a separate App Connect instance. Costs that are incurred from a paid App Connect instance are not included with the cost of using Discovery. Except for indexing, Discovery does not support any integration with App Connect that you perform on your own.

Data source requirements

The following requirements and limitations are specific to Discovery on IBM Cloud:

- A collection can connect to only one data source.
- For more information about size limits, which can differ per plan, see the following topics:
  - Collection limits
  - Document limits

Data source connection and data isolation

When you connect to external data sources, you reduce the data isolation of your service instance because data in transit between the source and the service cannot be isolated. All other forms of data isolation (at-rest, administration, and query) remain fully in effect.

All in-flight communication among services and data sources is encrypted with TLS v1.2. The private keys for the TLS certificates are encrypted at rest with AES-256-GCM encryption. The service certificates expire every three years and the certificate revocation lists are updated monthly. All credentials are sent over an encrypted connection that uses TLS v1.2 and are encrypted at rest with AES-256 encryption. Connections to data sources use the secure protocols that are supported by the data sources.
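If you want to confirm that a data source endpoint that you control accepts TLS v1.2 connections, a check along the following lines might help. This is a minimal sketch that uses only the Python standard library; it is not part of Discovery, and the function names and the `example.com` host in the usage note are placeholders.

```python
import socket
import ssl


def tls12_context():
    """Build a client context that negotiates TLS 1.2 and nothing else."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.maximum_version = ssl.TLSVersion.TLSv1_2
    return ctx


def negotiated_tls_version(host, port=443, timeout=5.0):
    """Connect to host:port and return the negotiated protocol name.

    Raises ssl.SSLError if the server does not accept TLS 1.2.
    """
    ctx = tls12_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls_sock:
            return tls_sock.version()
```

For example, `negotiated_tls_version("example.com")` opens a connection to the placeholder host and reports the protocol that was agreed on. Replace the host with the address of your own data source.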

Connecting to data sources with IP restrictions

Some data sources allow crawlers from only a limited number of trusted network addresses or domains to access and process their data. If one of the data sources that you want to connect to limits access in this way, you can add IBM-managed IP addresses to the allowlist of the data source.

Network addresses are subject to change from time to time. You can monitor for updates to these addresses by subscribing to the repo notifications for this page. Click Edit Topic and then select Watching in the Notifications dialog of the repo.

For service instances that are hosted in a US-based data center and that were created on or after 1 May 2020, add the following IP addresses:
```
150.238.21.0/28
169.48.255.224/28
174.36.69.128/28
```
For service instances that are hosted in non-US data centers and that were created on or after 21 February 2021, add the following IP addresses:
```
159.122.203.64/28
158.175.114.128/28
158.176.107.48/28
```
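Each entry in the lists above is a /28 CIDR block, which covers 16 addresses. If your allowlist accepts only individual addresses, or if you want to check whether an address from your access logs belongs to one of these blocks, a short script such as the following might help. It is a sketch that uses Python's standard `ipaddress` module; the constant and function names are illustrative, and the ranges are copied from this page, so update them if the published addresses change.

```python
import ipaddress

# CIDR blocks copied from the lists on this page.
US_RANGES = ["150.238.21.0/28", "169.48.255.224/28", "174.36.69.128/28"]
NON_US_RANGES = ["159.122.203.64/28", "158.175.114.128/28", "158.176.107.48/28"]


def in_ranges(ip, cidrs):
    """Return True if the address falls inside any of the CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)


def expand(cidrs):
    """List every individual address that the CIDR blocks cover."""
    return [str(addr) for cidr in cidrs for addr in ipaddress.ip_network(cidr)]


if __name__ == "__main__":
    # Print one address per line for allowlists that do not accept CIDR notation.
    for address in expand(US_RANGES):
        print(address)
```

Calling `in_ranges("150.238.21.5", US_RANGES)` returns `True`, because `150.238.21.0/28` spans `150.238.21.0` through `150.238.21.15`.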