Web Spiders
In Web Spiders, you can view, add, and update the web spiders you have created.
Web Spiders List
Accessing the screen
Click on [AI/RAG] -> [Web Spiders].

Item Description

| Item | Description | 
|---|---|
| Enabled | Indicates whether the web spider is enabled. | 
| Title | Displays the title of the web spider. | 
| The Source for Crawling | Displays the target to crawl. | 
| History | Click to view the crawl history. | 
| Updated on | Displays the date and time when the web spider was last updated. | 
Web spiders editor
Accessing the screen
Click on [AI/RAG] -> [Web Spiders].

From the Web Spiders list page, click on the [Title] of the web spider you want to edit.

Basic Settings

| Item | Description | 
|---|---|
| Title | Set the title of the web spider. | 
| Memo | Enter a memo. | 
| Data Import API | Select the endpoint for data import. | 
| The Source for Crawling | Select the target for crawling. Currently supported targets: | 
| Crawl Limit | Set the maximum number of pages to crawl. Specify 0 for unlimited. | 
| Collecting Images | Enable to collect images. | 
| Force Update | Enable to force an update. | 
| Status | Select whether the web spider is enabled. | 
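
The "0 means unlimited" convention for Crawl Limit can be sketched as follows. This is a minimal illustration only: the function name `pages_remaining` and its parameters are hypothetical and are not part of the product.

```python
from typing import Optional


def pages_remaining(crawl_limit: int, crawled: int) -> Optional[int]:
    """Return how many more pages may be crawled, or None when unlimited.

    Hypothetical helper: a Crawl Limit of 0 is treated as "no limit".
    """
    if crawl_limit == 0:
        return None
    return max(crawl_limit - crawled, 0)


print(pages_remaining(0, 500))    # None (unlimited)
print(pages_remaining(100, 40))   # 60
print(pages_remaining(100, 100))  # 0
```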
Website crawling settings
General

| Item | Description | 
|---|---|
| Start Urls | Enter the URLs where crawling starts. To specify multiple URLs, enter one per line. | 
| Allowed Urls | Enter the URLs that are allowed to be crawled. To specify multiple URLs, enter one per line. | 
| Sitemap Urls | Enter the sitemap URL. | 
| Denied Urls | Enter the URLs to exclude from crawling. | 
| Allowed langs | If the site has multiple languages, enter the languages to crawl. | 
| Follow the links | Enable to crawl by following HTML links. | 
| Follow the secondary links | Enable to follow secondary links. | 
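
As a rough mental model of how Allowed Urls and Denied Urls interact, the sketch below filters candidate URLs. It rests on assumptions that this page does not confirm: entries are treated as plain URL prefixes, a denied entry takes precedence over an allowed one, and an empty allow list permits everything. The function names are hypothetical, and the actual crawler may match URLs differently.

```python
def parse_entries(field_value: str) -> list[str]:
    """Split a multi-line settings field into trimmed, non-empty entries."""
    return [line.strip() for line in field_value.splitlines() if line.strip()]


def should_crawl(url: str, allowed: list[str], denied: list[str]) -> bool:
    """Assumed rule: deny wins; an empty allow list allows every URL."""
    if any(url.startswith(prefix) for prefix in denied):
        return False
    return not allowed or any(url.startswith(prefix) for prefix in allowed)


# One entry per line, as entered in the settings form.
allowed = parse_entries("https://example.com/docs/\nhttps://example.com/blog/\n")
denied = parse_entries("https://example.com/docs/private/")

print(should_crawl("https://example.com/docs/intro", allowed, denied))      # True
print(should_crawl("https://example.com/docs/private/a", allowed, denied))  # False
print(should_crawl("https://example.com/shop/", allowed, denied))           # False
```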
Data transformation and import settings

| Item | Description | 
|---|---|
| CSS selector for identifying main content | Enter the CSS selector that identifies the main content. | 
| CSS selector for identifying categories | Enter the CSS selector that identifies categories. | 
| Strings to remove from title tag | Enter the strings to remove from the title tag. | 
| CSS selector for the part to be removed from the main content | Enter the CSS selector for the elements to remove from the main content. | 
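
The effect of "Strings to remove from title tag" can be pictured with the sketch below. It assumes each configured string is removed literally and that leftover surrounding whitespace is trimmed; the function name `clean_title` and the sample title are hypothetical, and the crawler's actual behavior may differ.

```python
def clean_title(title: str, strings_to_remove: list[str]) -> str:
    """Remove each configured string from a page title, then trim whitespace."""
    for unwanted in strings_to_remove:
        title = title.replace(unwanted, "")
    return title.strip()


# A site-wide suffix is a typical candidate for removal.
print(clean_title("Getting Started | Example Docs", [" | Example Docs"]))
# prints "Getting Started"
```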
Content Structure Required for Saving Crawl Data
To save crawl results as content, the following content structure must be included.

| Item Name (Optional) | Repetition | Item Setting | Slug | Annotation (Optional) | 
|---|---|---|---|---|
| Date | | Date picker. Also include seconds (hh:mm:ss): Enabled | ymd | The updated date will be set. | 
| Contents | 1 | HTML. Allow all tags: Enabled | data | Contains content converted to markdown format. | 
| URL | 1 | Single-line text | url | | 
| Hash Value | 1 | Single-line text | etag | Used to check for updates to the content. | 
| Language | 1 | Single-line text | lang | | 
| Main Content CSS Selector | 1 | Single-line text | selector | Specifies the content to extract from the page. | 
| Response Status | 1 | Number | response_status | | 
| Content Size | 1 | Number | content-length | | 
| Content Type | 1 | Single-line text | content-type | | 
| Manual Adjustment Flag | 1 | Single choice. 0: Disabled (Default), 1: Enabled | manual_override_flag | When enabled, the crawler will not overwrite. | 
| Domain | 1 | Single-line text | domain | | 
| Description | 1 | Single-line text | description | | 
| Icon URL | 1 | Single-line text | icon_url | | 
| OGP Image URL | 1 | Single-line text | ogp_image_url | | 
| Images | 20 | Grouping of the 3 Items Below | images | | 
| - Image URL | | File (from File manager) | image_url | | 
| - Image src | | Single-line text | image_src | | 
| - Alt Tag | | Single-line text | alt | | 
| Last Modified | 1 | Date picker. Also include time (hh:mm): Enabled | last-modified | | 
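
To show how the `etag` and `manual_override_flag` items might work together, here is a minimal sketch of an overwrite decision. The slugs come from the table above, but everything else is an assumption: the hashing scheme (SHA-256 of the markdown content), the helper names `content_hash` and `should_overwrite`, and the dictionary record shape are hypothetical and do not describe the crawler's actual implementation.

```python
import hashlib


def content_hash(markdown: str) -> str:
    """Assumed hashing scheme (SHA-256); the real algorithm is not documented here."""
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()


def should_overwrite(saved: dict, crawled_markdown: str) -> bool:
    """Decide whether a new crawl result should replace a saved record."""
    # manual_override_flag = 1: the record is protected from the crawler.
    if str(saved.get("manual_override_flag", "0")) == "1":
        return False
    # etag: an identical hash means the page content has not changed.
    return saved.get("etag") != content_hash(crawled_markdown)


saved = {
    "url": "https://example.com/docs/intro",
    "data": "# Intro",
    "etag": content_hash("# Intro"),
    "manual_override_flag": "0",
}

print(should_overwrite(saved, "# Intro"))            # False (unchanged)
print(should_overwrite(saved, "# Intro (updated)"))  # True (content changed)
saved["manual_override_flag"] = "1"
print(should_overwrite(saved, "# Intro (updated)"))  # False (protected)
```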
Crawl History
Accessing the screen
Click on [AI/RAG] -> [Web Spiders].

From the Web Spiders list page, click on the [History] of the web spider whose crawl history you want to view.

Item Description

| Item | Description | 
|---|---|
| Status | Displays the current state of the crawl. | 
| The Source for Crawling | Displays the target of the crawl. | 
| Content | Displays the content definition name where the crawled pages are registered. | 
| Start Urls | Displays the URL where the crawl started. | 
| Start Date and Time | Displays the date and time when the crawl was started. | 
| End Date and Time | Displays the date and time when the crawl ended. | 
| Processing time | Displays the processing time of the crawl. | 
| Reason for Termination | Displays the reason the crawl ended. | 
| Crawled count | Displays the number of pages processed during the crawl. | 
| Log | Click to view logs related to the crawl. | 
| Rerun | Click to rerun the crawl. | 
Support
If you have any other questions, please contact us or check out our Slack community.