Web crawl
Crawl a website. You can crawl public websites and websites that require authentication.
IBM Cloud Pak for Data only
This information applies only to installed deployments. For more information about crawling a website from a managed deployment, see Web crawl.
What documents are crawled
- The website content is processed as HTML files.
- The web crawler does not crawl dynamic websites that use JavaScript to render content. You can confirm whether a website uses JavaScript to render content by viewing its source code in your browser: if text that is visible on the rendered page is missing from the source, the content is rendered by JavaScript. (See the sketch after this list.)
- When a source is recrawled, new documents are added, changed documents are updated to the current version, and deleted documents are removed from the collection's index during the refresh.
- All Discovery data source connectors are read-only. Regardless of the permissions that are granted to the crawl account, Discovery never writes, updates, or deletes any content in the original data source.
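If you are not sure whether a site relies on JavaScript, a quick check is to fetch the raw HTML, which is what the crawler sees, and look for text that you know appears on the rendered page. The following Python sketch illustrates the idea; it assumes the `requests` package, and the URL and phrase are placeholders.

```python
# Check whether page content is present in the raw HTML (crawlable) or
# appears only after JavaScript runs in a browser (not crawlable).
import requests

raw_html = requests.get("https://www.example.com/page").text  # placeholder URL
phrase = "text that is visible on the rendered page"          # placeholder phrase

if phrase in raw_html:
    print("Content is in the raw HTML; the crawler can read it.")
else:
    print("Content is missing from the raw HTML; it is likely rendered by JavaScript.")
```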
Prerequisite step
If you want to connect to a website that requires authentication, you must know the credentials that are required to access the site.
- For a website that requires basic authentication, get the following information:
- Username
- The username of a user with access to the content that you want to connect to on the website.
- Password
- The password that is associated with the username.
- For a website that requires Windows NT LAN Manager (NTLM) authentication, get the following information:
- Username
- The username of a user with access to the content that you want to connect to on the website.
- Password
- The password that is associated with the username.
- NTLM domain name
- The NTLM domain name of the user that is authenticating with the site.
- NTLM host name
- The hostname of the NTLM server.
- For a website that requires form-based authentication, choose how you want to access the site from the following options:
- Direct access: Submits the form without first fetching the login page. Make a note of the following information:
- Form action URL
- The URL to send the form data to when the form is submitted. For example, `/action_page.php`.
- Required fields
- Find out the field values that must be provided in the form.
- Indirect access: Fetches the login page and fills in the form fields. Make a note of the following information so you can provide it later:
- Form login URL
- URL of the website's login page.
- Form name
- Name of the login form.
- Required fields
- Find out the field values that must be provided in the login form.
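For context, the following sketch shows how the basic and NTLM credential fields are typically used by an ordinary HTTP client. It assumes the third-party `requests` and `requests-ntlm` Python packages, and all hostnames, domains, and credentials are placeholders; Discovery sends these credentials for you after you enter them in the collection settings.

```python
import requests
from requests.auth import HTTPBasicAuth
from requests_ntlm import HttpNtlmAuth  # from the requests-ntlm package

# Basic authentication: a username and its password.
requests.get(
    "https://www.example.com/protected/",
    auth=HTTPBasicAuth("crawl-user", "secret"),
)

# NTLM authentication: the NTLM domain name is combined with the username;
# the NTLM host name identifies the server that performs the authentication.
requests.get(
    "https://intranet.example.com/protected/",
    auth=HttpNtlmAuth("EXAMPLEDOMAIN\\crawl-user", "secret"),
)
```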
Connecting to a web crawl data source
If you want to crawl a group of URLs that includes some websites that require authentication and some that don't, consider creating a different collection for each authentication type.
From your Discovery project, complete the following steps:
- From the navigation pane, choose Manage collections.
- Click New collection.
- Click Web crawl, and then click Next.
- Name the collection.
- If the language of the website is not English, select the appropriate language.
For a list of supported languages, see Language support.
- Optional: Change the synchronization schedule.
The Web crawl data source is designed to be used with websites that change only once or twice a week. To ensure that your collection captures all website updates, schedule the crawl to occur weekly.
For more information, see Crawl schedule options.
- In the Specify where you want to crawl section, add the website URL to the Starting URLs field, and then click Add. Repeat to add more starting URLs.
The starting URLs are where the crawler begins to crawl. Use the full URL, for example `http://www.example.com/`. By default, the crawler can descend into subtrees of a starting URL, but it crawls only URLs within the path that is supplied in the starting URL. A starting URL limits what is crawled in two ways:
- It crawls only the same domain name as the starting URL.
- It crawls all URL content up to and including the last slash (`/`) in the starting URL; only URLs that begin with this prefix are crawled. To crawl a subtree of the site that falls outside this prefix, specify its URL in Starting URLs. (See the sketch after this list.)
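The following sketch shows one way to interpret those two limitations; it is an illustration, not Discovery's implementation, and the URLs are placeholders.

```python
# A candidate URL is in scope only if it is on the same domain as the
# starting URL and begins with the starting URL's path up to and
# including its last slash.
from urllib.parse import urlparse

def in_scope(start_url: str, candidate: str) -> bool:
    start, cand = urlparse(start_url), urlparse(candidate)
    prefix = start.path[: start.path.rfind("/") + 1]  # path up to the last slash
    return cand.netloc == start.netloc and cand.path.startswith(prefix)

print(in_scope("http://www.example.com/docs/guide.html",
               "http://www.example.com/docs/intro.html"))  # True: same domain and prefix
print(in_scope("http://www.example.com/docs/guide.html",
               "http://www.example.com/blog/post.html"))   # False: outside /docs/
```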
- If the URL starts with HTTPS: In the Advanced Configuration section, set the Ignore certificate switch to On to ignore any SSL certificates on the target website.
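In terms of an ordinary HTTP client, this setting corresponds to skipping TLS certificate validation, as in the following sketch. It assumes the `requests` package, and the URL is a placeholder.

```python
# Equivalent of "Ignore certificate": fetch an HTTPS page without
# validating the site's TLS certificate (for example, a self-signed one).
import requests

resp = requests.get("https://www.example.com/", verify=False)
print(resp.status_code)
```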
- Optional: Click Authentication settings to specify the authentication type to apply to one or more of the starting URLs:
- Choose the starting URL.
- Choose the authentication type from the following options:
- Basic authentication
- NTLM authentication
- FORM authentication
- For Basic authentication, provide the following details:
- Username
- The username of a user with access to the content that you want to connect to on the website.
- Password
- The password that is associated with the user.
- For NTLM authentication, provide the following details:
- Username
- The username of a user with access to the content that you want to connect to on the website.
- Password
- The password that is associated with the user.
- NTLM domain name
- The NTLM domain name that belongs to the user who is authenticating.
- NTLM host name
- The hostname of the NTLM server.
- For FORM authentication, provide the following details:
- In Form type, select one of the following options:
- Direct
- Click this option if you do not want to fetch the login page.
- Indirect
- Click this option if you want to fetch the login page and fill in the parameters of the login form.
- Complete the following fields if you choose Direct:
- Form action url
- The URL to send the form data to when the form is submitted.
- Form method
- Specify GET.
- Complete the following fields if you choose Indirect:
- Form login url
- The URL of the website's login page. This field is required if you select the Indirect form type.
- Form name
- The name of the login form. This field is required if you select the Indirect form type.
- Form method
- Specify POST.
- In the Form parameters section, list the key-value pairs of the form parameters. Complete the Key and Value fields, and then click + to add one or more form parameters.
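To make the Direct and Indirect form types more concrete, here is a rough Python sketch of the two flows, assuming the `requests` package. The URLs, form name, and field names are placeholders; Discovery performs these requests for you, and the sketch only shows what each field configures.

```python
import requests

session = requests.Session()

# Direct (Form method GET): send the form parameters straight to the
# form action URL without fetching the login page first.
session.get(
    "https://www.example.com/action_page.php",                # Form action url
    params={"username": "crawl-user", "password": "secret"},  # Form parameters
)

# Indirect (Form method POST): fetch the login page, locate the named
# login form, fill in its fields, and submit it. A real client would
# parse the page to find the form's action URL and any hidden fields.
login_page = session.get("https://www.example.com/login")     # Form login url
# ... parse login_page.text to find <form name="loginForm" ...> ...
session.post(
    "https://www.example.com/action_page.php",
    data={"username": "crawl-user", "password": "secret"},
)
```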
- Optional: If you are using a proxy server to access the data source server, in the Proxy settings section set the Enable proxy settings switch to On, and then add values to the following fields:
- Username
- The proxy server username to use to authenticate with the proxy server if the proxy server requires authentication.
- Password
- The proxy server password to use to authenticate with the proxy server if the proxy server requires authentication.
- Proxy server domains
- The domain or domains that the hosts reside in. You can specify a wildcard in this field, such as an asterisk (`*`) to crawl all domains or a leading asterisk (`*.server1.bar.com`) to crawl domains that match a pattern.
- Proxy server host name or IP address
- The hostname, if you want to access the server by using a LAN, or the IP address of the server that you want to use as the proxy server.
- Proxy server port number
- The network port that you want to connect to on the proxy server.
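The proxy fields map onto an ordinary HTTP client call roughly as follows; the sketch assumes the `requests` package, and the host, port, and credentials are placeholders.

```python
# Route a request through an authenticating proxy server.
import requests

proxy = "http://proxy-user:proxy-pass@proxy.example.com:8080"  # username, password, host, port
resp = requests.get(
    "https://www.example.com/",
    proxies={"http": proxy, "https": proxy},
    timeout=30,
)
print(resp.status_code)
```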
- Optional: Complete the following fields in Advanced Configuration:
- Code page to use
- Specify the character encoding of the website pages. If unspecified, the default value of `UTF-8` is used. If you are crawling Chinese websites, specify `UTF-8`.
- URL Path Depth
- The number of levels of site paths to crawl. For example, if you specify the starting URL `https://www.example.com` and a path depth of `4`, the crawler accesses the page `https://www.example.com/some/more/examples/index.html`, which is located at a path four levels away from the root URL. You can enter a positive value only. If unspecified, the default value is `5`. The maximum path depth allowed is `20`.
- Maximum hops
- The number of consecutive links to follow from the starting URL. If unspecified, the default value is `5`. The maximum number of links that the crawler can follow is `20`. To not allow any hops, enter `0`.
- Ignore robots.txt
- Enable this setting if you want the crawler to ignore the allow and deny rules that the website outlines in its robots.txt file. Keep in mind that sites typically use the file to improve crawl results. For example, they might use the robots.txt file to prevent duplicate information from being crawled, to prevent draft content from being read, or to delay crawling so as not to overload their site. (A sketch of how robots.txt rules are checked follows this list.)
- Rules to crawl domain
- Specify the domain names that you want to either allow or forbid the crawler to crawl.
Domain names are case-sensitive, and the wildcard character (`*`) can occur anywhere in the domain name.
The order of the rules is significant. The crawler applies the first rule that matches a candidate URL. The default rule, `forbid domain *`, forbids all web crawling and must occur last in the list of domain rules. (A sketch of first-match rule evaluation follows this list.)
For example, you can define the following types of rules:
- To exclude the `www.ibm.com` domain: `forbid domain www.ibm.com`
- To crawl any domain that ends with `ibm.com`: `allow domain *.ibm.com`
- To crawl only port `443` on IBM domains that begin with `server`: `allow domain server*.ibm.com:443`
- Rules to crawl URL prefixes
- Specify the HTTP and HTTPS prefixes that you want to either allow or forbid the crawler to crawl.
The wildcard character (`*`) can occur one or more times in the URL.
The order of the rules is significant. The crawler applies the first rule that matches a candidate URL.
For example, you can define the following types of rules:
- To crawl pages in the public directory on this domain: `allow prefix http://*.ibm.com/public/*`
- To exclude all other directories on this domain: `forbid prefix http://*.ibm.com/*`
- Advanced crawler properties
- Only use when instructed to do so by IBM Support.
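The following sketch illustrates two of the settings above: how a polite crawler checks a URL against a site's robots.txt file, and how first-match evaluation of domain rules works. It uses Python's standard library with glob-style wildcard matching as an approximation; the URLs, user agent, and rules are placeholders, and this is not Discovery's actual implementation.

```python
from fnmatch import fnmatch
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# 1. robots.txt: the rules the crawler honors unless "Ignore robots.txt" is on.
robots = RobotFileParser("https://www.example.com/robots.txt")
robots.read()  # fetch and parse the file
print(robots.can_fetch("my-crawler", "https://www.example.com/drafts/page.html"))

# 2. Domain rules: the first rule that matches a candidate URL wins.
rules = [
    ("forbid", "www.ibm.com"),  # exclude this host
    ("allow", "*.ibm.com"),     # crawl any domain that ends with ibm.com
    ("forbid", "*"),            # default rule; must occur last
]

def domain_allowed(url: str) -> bool:
    host = urlparse(url).netloc
    for action, pattern in rules:
        if fnmatch(host, pattern):  # glob-style wildcard match
            return action == "allow"
    return False

print(domain_allowed("https://server1.ibm.com/public/"))  # True (matches *.ibm.com)
print(domain_allowed("https://www.ibm.com/"))             # False (first rule forbids)
```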
- Optional: If you want to ignore any SSL certificates on the target website, set the Ignore certificate switch to On. This option applies to HTTPS URLs only.
- If you want the crawler to extract text from images on the site, expand More processing settings, and set Apply optical character recognition (OCR) to On. When OCR is enabled and your documents contain images, processing takes longer. For more information, see Optical character recognition.
- Click Finish.
The collection is created quickly. It takes more time for the data to be processed as it is added to the collection.
If you want to check the progress, go to the Activity page. From the navigation pane, click Manage collections, and then click to open the collection.