Specify the pages and subdomains you want your Webcrawler external content connector to retrieve from your specified web source.
Before you begin
A connector administrator must have already created the Webcrawler external content connector that you want to configure crawl settings for. To learn about this procedure, see Create a Webcrawler external content connector.
Role required: sn_ext_conn.xcc_admin
About this task
This task is optional. By default, the Webcrawler external content connector crawls all pages and subdomains from its specified source system. Only perform this task if you want to specify inclusion or exclusion
filters for the subdomains to crawl or pages to retrieve when running content crawls.
Content is only retrieved from the source system if it passes all of your configured crawl setting filters. If any crawl setting filter excludes a
content item, the external content connector doesn't retrieve it.
Each Webcrawler connector can retrieve up to 50,000 items (URLs) from its source system when running content crawls.
Note: This is an exception to the general content crawl limit of one million (1,000,000) items.
Procedure
-
Navigate to .
-
In the Connectors list, select the record for the Webcrawler external content connector whose settings you want to modify.
-
In the connector editor's
Settings tab, select Crawl settings.
-
Select one of the following Content filtering options:
- To crawl all pages and subdomains from the source system, select Crawl all content.
-
To crawl only a specified set of pages and subdomains from the source system, select Include only these URLs, then use the Add URL field and Add
button to enter URLs or wildcard URL expressions for pages and subdomains that you want to include in the crawl.
For example, you might enter https://support.apple.com/ipad to include only searchable content from the specified page or subdomain. Alternately, you might enter
https://support.apple.com/ipad** to include every page or subdomain with a URL that matches the specified wildcard expression.
-
To crawl all except a specified set of pages and subdomains from the source system, select Exclude only these URLs, then use the Add URL field and
Add button to enter URLs or wildcard URL expressions for pages and subdomains that you want to exclude from the crawl.
For example, you might enter https://knowledgebase.paloaltonetworks.com/KCSArticleDetail to exclude searchable content from the specified page or subdomain. Alternately, you might enter
https://knowledgebase.paloaltonetworks.com/KCSArticleDetail** to exclude every page or subdomain with a URL that matches the specified wildcard expression.
Note: Wildcard URL expressions can include a URL prefix followed by the ** suffix. They match all URLs that begin with the specified prefix.
-
Select Save and validate.
Result
The Webcrawler external content connector is updated with your modified crawl settings.
What to do next
To retrieve content from the public web source using your modified crawl settings, create and run a one-time content crawl for your Webcrawler external content connector. To learn about creating and running one-time content crawls, see Create a content crawl for an external content connector.