Web Crawling
Kapa provides a specialized web scraper designed for extracting data from public websites. It allows you to ingest documentation, tutorials, blogs, and other web content into Kapa.
Prerequisites
- URLs of the websites you want to crawl
- Basic understanding of CSS selectors, used to define which content to extract (see the short primer after this list)
- Access to the Kapa platform
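If CSS selectors are new to you, the minimal sketch below shows the selector forms this guide relies on, using the BeautifulSoup library. The HTML and selector targets are invented for illustration and are not tied to any particular site.

```python
# A short CSS selector primer using BeautifulSoup (pip install beautifulsoup4).
# The HTML below is invented for illustration; only the selector syntax matters.
from bs4 import BeautifulSoup

html = """
<article id="main">
  <h1>Getting started</h1>
  <p class="intro">Welcome to the docs.</p>
  <nav class="sidebar">Navigation links</nav>
</article>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.select("article"))    # tag selector: every <article> element
print(soup.select(".sidebar"))   # class selector: elements with class="sidebar"
print(soup.select("#main"))      # id selector: the element with id="main"
print(soup.select("article p"))  # descendant selector: <p> elements inside <article>
```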
Data ingested
When you connect Kapa to websites using Web Crawling, the following data is ingested (an illustrative record shape follows the list):
- Page URLs
- Page titles and full content (converted to markdown)
- Hierarchical structures (if discoverable)
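To make the list concrete, a crawled page might be represented by a record like the one below. This is an illustrative shape only; the field names are assumptions, not Kapa's actual schema.

```python
# Hypothetical shape of one ingested page. Illustrative only;
# the field names are assumptions, not Kapa's actual schema.
page_record = {
    "url": "https://docs.example.com/getting-started",      # page URL
    "title": "Getting Started",                             # page title
    "content": "# Getting Started\n\nInstall the SDK ...",  # full content, converted to markdown
    "breadcrumbs": ["Docs", "Guides", "Getting Started"],   # hierarchical structure, if discoverable
}
```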
Setup
The Web Crawling source setup follows a two-step process:
Step 1: Crawling
In this step, you define the set of web pages to include in your knowledge base:
- Navigate to the Sources tab in the Kapa platform
- Click Add new source
- Select Web Crawling as the source type
- Enter a Start URL or specify multiple start URLs
- Configure crawling parameters, such as URL patterns to include/exclude and performance toggles (a sketch of how such patterns filter URLs follows this list)
- Click Crawl to initiate the crawling process
- Review the Crawl Preview to ensure the right pages are included
- Once satisfied, proceed to the parsing step
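To see how include/exclude patterns narrow a crawl, consider the sketch below. The pattern syntax and the filtering logic are assumptions made for illustration; the platform defines the exact matching rules Kapa applies.

```python
# A sketch of include/exclude URL filtering. The matching logic here is an
# assumption for illustration, not Kapa's actual pattern semantics.
import re

INCLUDE_PATTERNS = [r"/docs/"]           # keep URLs matching any of these
EXCLUDE_PATTERNS = [r"/docs/archive/"]   # unless they also match one of these

def should_crawl(url: str) -> bool:
    included = any(re.search(p, url) for p in INCLUDE_PATTERNS)
    excluded = any(re.search(p, url) for p in EXCLUDE_PATTERNS)
    return included and not excluded

for url in [
    "https://example.com/docs/getting-started",
    "https://example.com/docs/archive/v1-notes",
    "https://example.com/blog/release",
]:
    print(url, "->", "crawl" if should_crawl(url) else "skip")
```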
Step 2: Parsing
After defining which pages to crawl, you need to tell Kapa which parts of each page to extract:
- In the Content Selection section, specify CSS selectors that capture the main content (a sketch of what this extraction does follows this list)
- Configure exclusion parameters to remove irrelevant elements
- Click Convert to preview the parsed content
- Review the Source Preview to ensure the content is extracted correctly
- Iterate on your selectors until you achieve clean content extraction
- Click Save to finalize the source configuration
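Conceptually, selector-based extraction keeps the element matched by the main content selector and strips everything matched by the exclusion selectors. The sketch below shows that idea using BeautifulSoup; Kapa's internal parser may work differently.

```python
# Conceptual sketch of selector-based extraction: keep the main content
# element, strip excluded elements inside it. Uses BeautifulSoup for
# illustration; Kapa's internal parser may differ.
from bs4 import BeautifulSoup

html = """
<body>
  <nav class="navigation">Site nav</nav>
  <article>
    <h1>Configuring webhooks</h1>
    <p>Webhooks let you react to events.</p>
    <div class="sidebar">Related links</div>
  </article>
  <footer>Copyright</footer>
</body>
"""
MAIN_SELECTOR = "article"                        # content to extract
EXCLUDE_SELECTORS = [".sidebar", ".navigation"]  # elements to strip

soup = BeautifulSoup(html, "html.parser")
main = soup.select_one(MAIN_SELECTOR)
for selector in EXCLUDE_SELECTORS:
    for element in main.select(selector):
        element.decompose()  # remove the element from the tree

print(main.get_text(separator="\n", strip=True))
# Prints the heading and paragraph, without the sidebar or site nav.
```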
Best practices
Optimizing your crawl
- Start specific, then expand: Begin with a focused set of URLs before crawling an entire site
- Use URL patterns strategically: Include/exclude patterns can significantly improve crawl efficiency
- Target content containers: Select CSS elements that contain just the main content, avoiding navigation, footers, etc.
- Test iteratively: Use the preview functionality to refine your selectors
- Exclude irrelevant elements: Remove navigation bars, sidebars, and other non-content elements
- JavaScript rendering: Enable only when needed, as it dramatically increases crawl time
- Limit by date when possible: For frequently updated sites, focus on recent content
Multiple start URLs
For sites where standard crawling isn't effective, you can provide multiple start URLs:
- Create a text file (`.txt`) with each URL on a separate line:

  ```
  https://docs.kapa.ai/core-concepts
  https://docs.kapa.ai/authentication
  https://docs.kapa.ai/feedback
  ```

- Upload this file in the Start URL field
- Configure parsing as normal
Common crawling patterns
Documentation sites
For typical documentation sites:
- Include pattern: `/docs/`
- Main content selector: `article`, `.content`, or `.documentation`
- Exclude selectors: `.sidebar`, `.navigation`, `.toc`
Blog platforms
For blog platforms:
- Include pattern: `/blog/`
- Main content selector: `.post-content`, `article`, or `.entry-content`
- Exclude selectors: `.related-posts`, `.comments`, `.author-bio`
Forums and community sites
For community forums:
- Include pattern: `/forum/`, `/community/`
- Main content selector: `.topic-content`, `.post-body`, or `.thread-content`
- Exclude selectors: `.user-info`, `.reaction-buttons`, `.signature`
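Class names vary widely between sites, so treat the values above as starting points to verify in the preview, not guarantees. For reference, they can be gathered in one place like this (the dict structure is illustrative, not a Kapa configuration format):

```python
# The starting-point patterns from the lists above, gathered in one place.
# The dict structure is illustrative, not a Kapa configuration format.
CRAWL_PRESETS = {
    "documentation": {
        "include": ["/docs/"],
        "content": ["article", ".content", ".documentation"],
        "exclude": [".sidebar", ".navigation", ".toc"],
    },
    "blog": {
        "include": ["/blog/"],
        "content": [".post-content", "article", ".entry-content"],
        "exclude": [".related-posts", ".comments", ".author-bio"],
    },
    "forum": {
        "include": ["/forum/", "/community/"],
        "content": [".topic-content", ".post-body", ".thread-content"],
        "exclude": [".user-info", ".reaction-buttons", ".signature"],
    },
}
```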
Troubleshooting
- No pages found:
  - Check your start URL and inclusion patterns
  - Verify the site doesn't block crawlers
  - If the site requires JavaScript to render its pages, enable the "Render JavaScript" option; otherwise the crawler won't find any pages
- Too many pages crawled: Refine your URL patterns to be more specific
- Empty content after parsing: Your CSS selector might be incorrect; try a more general selector (see the debugging sketch after this list)
- Missing content sections: You may need to include multiple CSS selectors separated by commas (for example, `article, .content`)
- Error messages in parsed content: Add those elements to your "Selectors to exclude" list
- Crawl taking too long: Narrow your URL scope or exclude large sections of the site
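When a selector returns empty or incomplete content, it can help to test candidate selectors against a single page before re-running the crawl. Below is a minimal debugging sketch, assuming the `requests` and `beautifulsoup4` packages and a reachable page (the URL and candidate selectors are placeholders):

```python
# Quick selector check: fetch one page and report how much text each
# candidate selector captures. A local debugging aid, not part of Kapa.
import requests
from bs4 import BeautifulSoup

url = "https://docs.example.com/getting-started"  # placeholder URL
candidates = ["article", ".content", ".documentation", "main"]

soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
for selector in candidates:
    matches = soup.select(selector)
    text_length = sum(len(m.get_text(strip=True)) for m in matches)
    print(f"{selector!r}: {len(matches)} match(es), {text_length} characters")
```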