Web Crawling

Kapa provides a specialized web scraper for extracting content from public websites. Use it to ingest documentation, tutorials, blog posts, and other web content into Kapa.

Prerequisites

  • URLs of the websites you want to crawl
  • Basic understanding of CSS selectors (for defining which content to extract)
  • Access to the Kapa platform

Data ingested

When you connect Kapa to websites using Web Crawling, the following data is ingested:

  • Page URLs
  • Page titles and full content, converted to markdown (see the sketch after this list)
  • Hierarchical structures (if discoverable)
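Kapa performs this markdown conversion server-side. Purely to give a feel for the transformation, here is a minimal sketch using the third-party markdownify package (an illustrative assumption; it is not Kapa's actual pipeline):

    # Illustrative only: approximates how crawled HTML ends up as markdown.
    # markdownify is a third-party package (pip install markdownify) and is
    # not part of Kapa -- the real conversion happens server-side.
    from markdownify import markdownify as md

    html = "<h2>Getting started</h2><p>Install the <code>widget</code>.</p>"
    print(md(html))
    # Roughly: "Getting started\n---------------\n\nInstall the `widget`."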

Setup

The Web Crawling source setup follows a two-step process:

  1. Crawling: defining which pages to include
  2. Parsing: extracting the relevant content from each page

Step 1: Crawling

In this phase, you define the universe of web pages to include in your knowledge base:

  1. Navigate to the Sources tab in the Kapa platform
  2. Click Add new source
  3. Select Web Crawling as the source type
  4. Enter a Start URL or specify multiple start URLs
  5. Configure crawling parameters, such as URLs to include/exclude and performance toggles (see the pattern sketch after this list)
  6. Click Crawl to initiate the crawling process
  7. Review the Crawl Preview to ensure the right pages are included
  8. Once satisfied, proceed to the parsing step
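Kapa applies your include/exclude patterns during the crawl itself. As a mental model only (Kapa's exact matching rules may differ), substring-style filtering behaves roughly like this sketch:

    # Conceptual sketch of include/exclude URL filtering -- not Kapa's
    # actual matching logic. Patterns here are treated as plain substrings.
    INCLUDE = ["/docs/"]
    EXCLUDE = ["/docs/archive/"]  # hypothetical path to skip

    def should_crawl(url: str) -> bool:
        if not any(p in url for p in INCLUDE):
            return False
        return not any(p in url for p in EXCLUDE)

    print(should_crawl("https://example.com/docs/quickstart"))   # True
    print(should_crawl("https://example.com/docs/archive/old"))  # False
    print(should_crawl("https://example.com/blog/launch"))       # False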

Step 2: Parsing

After defining which pages to crawl, you need to tell Kapa which parts of each page to extract:

  1. In the Content Selection section, specify CSS selectors to extract the main content (a local testing sketch follows this list)
  2. Configure exclusion parameters to remove irrelevant elements
  3. Click Convert to preview the parsed content
  4. Review the Source Preview to ensure the content is extracted correctly
  5. Iterate on your selectors until you achieve clean content extraction
  6. Click Save to finalize the source configuration
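The Convert preview in the platform is the authoritative check. If you prefer to iterate on selectors locally before pasting them into Kapa, here is a minimal sketch using the requests and beautifulsoup4 packages (the URL and selector are placeholders):

    # Local selector experiment -- approximates, but is not, Kapa's parser.
    # pip install requests beautifulsoup4
    import requests
    from bs4 import BeautifulSoup

    url = "https://docs.example.com/quickstart"  # placeholder URL
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

    # Try the same CSS selector you plan to enter in Content Selection.
    matches = soup.select("article")
    if not matches:
        print("Selector matched nothing -- try a more general one.")
    for node in matches:
        print(node.get_text(" ", strip=True)[:200])  # first 200 characters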

Best practices

Optimizing your crawl

  • Start specific, then expand: Begin with a focused set of URLs before crawling an entire site
  • Use URL patterns strategically: Include/exclude patterns can significantly improve crawl efficiency
  • Target content containers: Select CSS elements that contain just the main content, avoiding navigation, footers, etc.
  • Test iteratively: Use the preview functionality to refine your selectors
  • Exclude irrelevant elements: Remove navigation bars, sidebars, and other non-content elements
  • JavaScript rendering: Enable only when needed, as it dramatically increases crawl time
  • Limit by date when possible: For frequently updated sites, focus on recent content

Multiple start URLs

For sites where standard crawling isn't effective, you can provide multiple start URLs:

  1. Create a text file (.txt) with each URL on a separate line (or generate one from a sitemap; see the sketch after this list):

    https://docs.kapa.ai/core-concepts
    https://docs.kapa.ai/authentication
    https://docs.kapa.ai/feedback

  2. Upload this file in the Start URL field
  3. Configure parsing as normal
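If the site publishes a sitemap, you can generate this file instead of writing it by hand. A sketch using only the Python standard library (the sitemap location is an assumption; many sites serve one at /sitemap.xml):

    # Build a start-URL .txt file from a standard sitemap.xml.
    import urllib.request
    import xml.etree.ElementTree as ET

    SITEMAP = "https://docs.example.com/sitemap.xml"  # placeholder location
    NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

    with urllib.request.urlopen(SITEMAP) as resp:
        root = ET.fromstring(resp.read())

    urls = [loc.text.strip() for loc in root.iter(NS + "loc")]
    with open("start-urls.txt", "w") as f:
        f.write("\n".join(urls))
    print(f"Wrote {len(urls)} URLs to start-urls.txt")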

Common crawling patterns

Documentation sites

For typical documentation sites:

  • Include pattern: /docs/
  • Main content selector: article, .content, or .documentation
  • Exclude selectors: .sidebar, .navigation, .toc

Blog platforms

For blog platforms:

  • Include pattern: /blog/
  • Main content selector: .post-content, article, or .entry-content
  • Exclude selectors: .related-posts, .comments, .author-bio

Forums and community sites

For community forums:

  • Include pattern: /forum/, /community/
  • Main content selector: .topic-content, .post-body, or .thread-content
  • Exclude selectors: .user-info, .reaction-buttons, .signature
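Whatever the site type, exclusion selectors conceptually remove matching elements before the page is converted. You can approximate the effect locally with beautifulsoup4 (this mirrors, but is not, Kapa's actual parser; the selectors echo the documentation-site example above):

    # Approximate the effect of "Selectors to exclude" locally.
    from bs4 import BeautifulSoup

    html = """
    <article>
      <nav class="toc">1. Intro 2. Setup</nav>
      <p>The actual documentation text.</p>
      <div class="sidebar">Related links</div>
    </article>
    """
    soup = BeautifulSoup(html, "html.parser")
    main = soup.select_one("article")           # main content selector
    for junk in main.select(".sidebar, .toc"):  # selectors to exclude
        junk.decompose()                        # drop the element entirely
    print(main.get_text(" ", strip=True))       # -> The actual documentation text.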

Troubleshooting

  • No pages found:
    • Check your start URL and inclusion patterns
    • Verify the site doesn't block crawlers (a robots.txt check sketch follows this list)
    • If your website requires JavaScript, the crawler won't find any pages unless you enable the "Render JavaScript" option
  • Too many pages crawled: Refine your URL patterns to be more specific
  • Empty content after parsing: Your CSS selector might be incorrect; try a more general selector
  • Missing content sections: You may need to include multiple CSS selectors separated by commas
  • Error messages in parsed content: Add those elements to your "Selectors to exclude" list
  • Crawl taking too long: Narrow your URL scope or exclude large sections of the site
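For the crawler-blocking case above, you can check a site's robots.txt yourself with the Python standard library (the user agent below is a placeholder, not necessarily the one Kapa's crawler sends):

    # Quick robots.txt check using only the standard library.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://docs.example.com/robots.txt")  # placeholder site
    rp.read()

    # "kapa-crawler" is a hypothetical user agent for illustration.
    print(rp.can_fetch("kapa-crawler", "https://docs.example.com/docs/"))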