Web Crawling
Kapa provides a specialized web scraper designed for extracting data from public websites. It allows you to ingest documentation, tutorials, blogs, and other web content into Kapa.
Prerequisites
- URLs of the websites you want to crawl
- Basic understanding of CSS selectors, used to define which content to extract (see the short primer after this list)
- Access to the Kapa platform
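If CSS selectors are new to you, the minimal sketch below shows the selector forms this guide relies on, using the BeautifulSoup library. The HTML and selector targets are invented for illustration and are not tied to any particular site.

```python
# A short CSS selector primer using BeautifulSoup (pip install beautifulsoup4).
# The HTML below is invented for illustration; only the selector syntax matters.
from bs4 import BeautifulSoup

html = """
<article id="main">
  <h1>Getting started</h1>
  <p class="intro">Welcome to the docs.</p>
  <nav class="sidebar">Navigation links</nav>
</article>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.select("article"))    # tag selector: every <article> element
print(soup.select(".sidebar"))   # class selector: elements with class="sidebar"
print(soup.select("#main"))      # id selector: the element with id="main"
print(soup.select("article p"))  # descendant selector: <p> elements inside <article>
```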
Data ingested
When you connect Kapa to websites using Web Crawling, the following data is ingested (an illustrative record shape follows the list):
- Page URLs
- Page titles and full content (converted to markdown)
- Hierarchical structures (if discoverable)
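To make the list concrete, a crawled page might be represented by a record like the one below. This is an illustrative shape only; the field names are assumptions, not Kapa's actual schema.

```python
# Hypothetical shape of one ingested page. Illustrative only;
# the field names are assumptions, not Kapa's actual schema.
page_record = {
    "url": "https://docs.example.com/getting-started",      # page URL
    "title": "Getting Started",                             # page title
    "content": "# Getting Started\n\nInstall the SDK ...",  # full content, converted to markdown
    "breadcrumbs": ["Docs", "Guides", "Getting Started"],   # hierarchical structure, if discoverable
}
```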
Setup
The Web Crawling source setup follows a two-step process:
Step 1: Crawling
In this step, you define the set of web pages to include in your knowledge base:
- Navigate to the Sources tab in the Kapa platform
- Click Add new source
- Select Web Crawling as the source type
- Enter a Start URL or specify multiple start URLs
- Configure crawling parameters, such as URL patterns to include/exclude and performance toggles (a sketch of how such patterns filter URLs follows this list)
- Click Crawl to initiate the crawling process
- Review the Crawl Preview to ensure the right pages are included
- Once satisfied, proceed to the parsing step
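To see how include/exclude patterns narrow a crawl, consider the sketch below. The pattern syntax and the filtering logic are assumptions made for illustration; the platform defines the exact matching rules Kapa applies.

```python
# A sketch of include/exclude URL filtering. The matching logic here is an
# assumption for illustration, not Kapa's actual pattern semantics.
import re

INCLUDE_PATTERNS = [r"/docs/"]           # keep URLs matching any of these
EXCLUDE_PATTERNS = [r"/docs/archive/"]   # unless they also match one of these

def should_crawl(url: str) -> bool:
    included = any(re.search(p, url) for p in INCLUDE_PATTERNS)
    excluded = any(re.search(p, url) for p in EXCLUDE_PATTERNS)
    return included and not excluded

for url in [
    "https://example.com/docs/getting-started",
    "https://example.com/docs/archive/v1-notes",
    "https://example.com/blog/release",
]:
    print(url, "->", "crawl" if should_crawl(url) else "skip")
```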
Step 2: Parsing
After defining which pages to crawl, you need to tell Kapa which parts of each page to extract:
- In the Content Selection section, specify CSS selectors that capture the main content (a sketch of what this extraction does follows this list)
- Configure exclusion parameters to remove irrelevant elements
- Click Convert to preview the parsed content
- Review the Source Preview to ensure the content is extracted correctly
- Iterate on your selectors until you achieve clean content extraction
- Click Save to finalize the source configuration
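Conceptually, selector-based extraction keeps the element matched by the main content selector and strips everything matched by the exclusion selectors. The sketch below shows that idea using BeautifulSoup; Kapa's internal parser may work differently.

```python
# Conceptual sketch of selector-based extraction: keep the main content
# element, strip excluded elements inside it. Uses BeautifulSoup for
# illustration; Kapa's internal parser may differ.
from bs4 import BeautifulSoup

html = """
<body>
  <nav class="navigation">Site nav</nav>
  <article>
    <h1>Configuring webhooks</h1>
    <p>Webhooks let you react to events.</p>
    <div class="sidebar">Related links</div>
  </article>
  <footer>Copyright</footer>
</body>
"""
MAIN_SELECTOR = "article"                        # content to extract
EXCLUDE_SELECTORS = [".sidebar", ".navigation"]  # elements to strip

soup = BeautifulSoup(html, "html.parser")
main = soup.select_one(MAIN_SELECTOR)
for selector in EXCLUDE_SELECTORS:
    for element in main.select(selector):
        element.decompose()  # remove the element from the tree

print(main.get_text(separator="\n", strip=True))
# Prints the heading and paragraph, without the sidebar or site nav.
```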
Best practices
Optimizing your crawl
- Start specific, then expand: Begin with a focused set of URLs before crawling an entire site
- Use URL patterns strategically: Include/exclude patterns can significantly improve crawl efficiency
- Target content containers: Select CSS elements that contain just the main content, avoiding navigation, footers, etc.
- Test iteratively: Use the preview functionality to refine your selectors
- Exclude irrelevant elements: Remove navigation bars, sidebars, and other non-content elements
- JavaScript rendering: Enable only when needed, as it dramatically increases crawl time
- Limit by date when possible: For frequently updated sites, focus on recent content
Multiple start URLs
For sites where standard crawling isn't effective, you can provide multiple start URLs:
- Create a text file (`.txt`) with each URL on a separate line:

  ```
  https://docs.kapa.ai/core-concepts
  https://docs.kapa.ai/authentication
  https://docs.kapa.ai/feedback
  ```

- Upload this file in the Start URL field
- Configure parsing as normal
Common crawling patterns
Documentation sites
For typical documentation sites:
- Include pattern: `/docs/`
- Main content selector: `article`, `.content`, or `.documentation`
- Exclude selectors: `.sidebar`, `.navigation`, `.toc`
Blog platforms
For blog platforms:
- Include pattern: `/blog/`
- Main content selector: `.post-content`, `article`, or `.entry-content`
- Exclude selectors: `.related-posts`, `.comments`, `.author-bio`
Forums and community sites
For community forums:
- Include pattern: `/forum/`, `/community/`
- Main content selector: `.topic-content`, `.post-body`, or `.thread-content`
- Exclude selectors: `.user-info`, `.reaction-buttons`, `.signature`
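Class names vary widely between sites, so treat the values above as starting points to verify in the preview, not guarantees. For reference, they can be gathered in one place like this (the dict structure is illustrative, not a Kapa configuration format):

```python
# The starting-point patterns from the lists above, gathered in one place.
# The dict structure is illustrative, not a Kapa configuration format.
CRAWL_PRESETS = {
    "documentation": {
        "include": ["/docs/"],
        "content": ["article", ".content", ".documentation"],
        "exclude": [".sidebar", ".navigation", ".toc"],
    },
    "blog": {
        "include": ["/blog/"],
        "content": [".post-content", "article", ".entry-content"],
        "exclude": [".related-posts", ".comments", ".author-bio"],
    },
    "forum": {
        "include": ["/forum/", "/community/"],
        "content": [".topic-content", ".post-body", ".thread-content"],
        "exclude": [".user-info", ".reaction-buttons", ".signature"],
    },
}
```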
Troubleshooting
- No pages found:
  - Check your start URL and inclusion patterns
  - Verify the site doesn't block crawlers
  - If the site requires JavaScript to render its pages, enable the "Render JavaScript" option; otherwise the crawler won't find any pages
- Too many pages crawled: Refine your URL patterns to be more specific
- Empty content after parsing: Your CSS selector might be incorrect; try a more general selector (see the debugging sketch after this list)
- Missing content sections: You may need to include multiple CSS selectors separated by commas (for example, `article, .content`)
- Error messages in parsed content: Add those elements to your "Selectors to exclude" list
- Crawl taking too long: Narrow your URL scope or exclude large sections of the site
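When a selector returns empty or incomplete content, it can help to test candidate selectors against a single page before re-running the crawl. Below is a minimal debugging sketch, assuming the `requests` and `beautifulsoup4` packages and a reachable page (the URL and candidate selectors are placeholders):

```python
# Quick selector check: fetch one page and report how much text each
# candidate selector captures. A local debugging aid, not part of Kapa.
import requests
from bs4 import BeautifulSoup

url = "https://docs.example.com/getting-started"  # placeholder URL
candidates = ["article", ".content", ".documentation", "main"]

soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
for selector in candidates:
    matches = soup.select(selector)
    text_length = sum(len(m.get_text(strip=True)) for m in matches)
    print(f"{selector!r}: {len(matches)} match(es), {text_length} characters")
```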