Full Project Setup Example

This guide walks you through setting up a full project with Kapa. The tutorial uses the popular data integration platform Airbyte as an example because it has many of the most frequently used sources. Airbyte is also a Kapa user.

To get started, we will focus on the following sources:

  1. Developer Documentation (Web Crawl) 📖
  2. Tutorials (Web Crawl) 🕸️
  3. Community Forum (Discourse) 💬
  4. Technical Blogs (Web Crawl) 📝
  5. GitHub Issues (GitHub) 🐞
  6. YouTube Videos (YouTube) 📺

Developer Documentation (Web Crawl) 📖

The Airbyte Developer Documentation provides customized guidance for users, contributors, and cloud infrastructure management, including setup tutorials and contribution guidelines that we want to ingest into Kapa. We will use the Web Crawl source to achieve this.

Airbyte Docs

Step 1: Crawling

Follow the steps below to configure a data source.

  1. Start URL: Navigate to the crawl configuration page and enter the main documentation URL as the start URL. Here we set it to https://docs.airbyte.io.

  2. URLs to Include: For instance, include all URLs with https://docs.airbyte.com.

  3. Exclude URLs: Here, we exclude all archive links, which are irrelevant, by using the substring /archive/.

  4. Performance Toggles: Use this section to index all links from the Airbyte docs sitemap and remove parameters from the links.

  5. Initiate Crawl: Kick off the crawl task by clicking the Crawl button.

Airbyte Developer Docs Crawling
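
The include, exclude, and parameter-removal settings above can be pictured as a small URL filter. The following is only a minimal sketch of the idea, not Kapa's actual crawler: the function names are hypothetical, and it assumes that both the include and exclude fields are applied as plain substring matches.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical values mirroring the form fields described above.
INCLUDE_PATTERNS = ["https://docs.airbyte.com"]
EXCLUDE_SUBSTRINGS = ["/archive/"]


def strip_params(url: str) -> str:
    """Drop the query string and fragment so variants of a page collapse to one URL."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def should_crawl(url: str) -> bool:
    """Keep a URL only if it matches an include pattern and no exclude substring.

    Assumption: both checks are simple substring matches.
    """
    clean = strip_params(url)
    included = any(pattern in clean for pattern in INCLUDE_PATTERNS)
    excluded = any(fragment in clean for fragment in EXCLUDE_SUBSTRINGS)
    return included and not excluded
```

Under these assumptions, a docs page is kept, while an archived docs page or a page on another host is dropped.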

Step 2: Parsing

After the crawl task is complete, proceed to the Content Selection step. Provide appropriate CSS selectors to finalize the source setup.

  1. Selector: For instance, use article > .markdown here to target the main content area.

  2. Selectors to exclude: Here, use the CSS selector a.hash-link to remove anchor links from the main content.

  3. Convert: Click the Convert button to preview the page post-parsing.

  4. Initiate Conversion: If all looks good, proceed with the conversion task by clicking the Save button.

Airbyte Developer Docs Parsing
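
To illustrate what the two selectors do, here is a rough sketch using only Python's standard library. It is not how Kapa parses pages: the sample page and the function name are made up, the input must be well-formed XHTML, and the ElementTree path expressions only approximate the CSS selectors article > .markdown and a.hash-link by matching the class attribute exactly.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample page for illustration only.
PAGE = """
<html><body>
  <nav>Sidebar links</nav>
  <article>
    <div class="markdown">
      <h2>Quickstart<a class="hash-link" href="#quickstart">#</a></h2>
      <p>Install the connector.</p>
    </div>
  </article>
</body></html>
""".strip()


def extract_main_text(xhtml: str) -> str:
    """Return the text of the main content area without hash-link anchors."""
    root = ET.fromstring(xhtml)
    # Approximates the include selector `article > .markdown`.
    main = root.find(".//article/*[@class='markdown']")
    if main is None:
        return ""
    # Approximates the exclude selector `a.hash-link`:
    # remove matching anchors from whichever element contains them.
    for parent in list(main.iter()):
        for child in list(parent):
            if child.tag == "a" and child.get("class") == "hash-link":
                parent.remove(child)
    # Collapse whitespace so the result is a single clean line.
    return " ".join(" ".join(main.itertext()).split())


print(extract_main_text(PAGE))
```

The sidebar text is left out because it sits outside the selected element, and the "#" anchor disappears because it matches the exclude rule.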

Tutorials (Web Crawl) 🕸️

Airbyte maintains a large and up-to-date catalogue of technical tutorials. This is a great source for Kapa because users often ask specific setup questions, such as 'How do I integrate with Airflow', that tutorials tend to answer. To set up an Airbyte Tutorials source, follow the simple steps below:

  1. Start URL: Use https://airbyte.com/tutorials as the start URL for this source.

  2. URLs to Include: Specify https://airbyte.com/tutorials to index all tutorial pages.

  3. Performance Toggles: Enable the crawl sitemap option to index links from the Sitemap.

  4. Initiate Crawl: Initiate the crawl task by clicking the Crawl button.

  5. Content Selection: Once the crawl task is finished, finalize the content selection by using appropriate CSS selectors.

Airbyte Tutorials Crawling

Airbyte Tutorials Parsing

Community Forum (Discourse) 💬

The Airbyte Discourse Forum includes a wide range of user discussions on topics such as Q&A, Troubleshooting, Guides, and more, which Kapa aims to source answers from. The simple steps to configure the Discourse source are listed below:

  1. Enter the Discourse link in the URL text box to validate it as a valid Discourse site.

  2. Select the post age from the dropdown menu to limit posts based on relevancy. By default, all forum posts are included.

  3. Utilize the option Include only posts marked as solved to filter and include only solved posts.

  4. To finalize, click the Save button to start fetching all Discourse forum posts that match the provided options.

Airbyte Discourse Forum
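
The post age and solved-only options act as two independent filters. The sketch below shows one way to think about them; the ForumTopic type, its fields, and the function name are hypothetical and do not reflect the Discourse API or Kapa's implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class ForumTopic:
    """Hypothetical, simplified view of a forum topic."""
    title: str
    created_at: datetime
    solved: bool


def select_topics(
    topics: List[ForumTopic],
    max_age_days: Optional[int],
    only_solved: bool,
    now: datetime,
) -> List[ForumTopic]:
    """Apply both filters; max_age_days=None means all posts are included."""
    selected = []
    for topic in topics:
        too_old = (
            max_age_days is not None
            and now - topic.created_at > timedelta(days=max_age_days)
        )
        if too_old:
            continue
        if only_solved and not topic.solved:
            continue
        selected.append(topic)
    return selected
```

With no age limit and the solved-only option off, every topic passes, which matches the default described in step 2.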

Technical Blogs (Web Crawl) 📝

Here are the steps involved in setting up a source to fetch Airbyte Technical Blogs tagged with Data Insights:

  1. Start URL: Use https://airbyte.com/blog-categories/data-insights as the start URL for this source.

  2. URLs to Include: Specify https://airbyte.com/blog/ to index all blog posts.

  3. CSS/XPath selectors to include: For instance, we utilize the CSS selector #content > main.main-wrapper > .section_article .article_grid-top-wrapper > .article_grid-category-wrapper > a[href$='/data-insights'] to exclusively match blog posts tagged with Data Insights.

  4. Crawling and Parsing: Follow the subsequent steps to finalize the Crawling and Parsing conversion processes.

Airbyte Technical Blogs Crawling

Airbyte Technical Blogs Parsing
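
The selector in step 3 works as a page-level filter: a blog post is kept only if it contains a category link whose href ends with /data-insights. The sketch below illustrates that idea with Python's standard HTML parser. It is a simplification, not Kapa's matching logic: the class and function names are hypothetical, and it checks only the a[href$='/data-insights'] part of the selector while ignoring the ancestor chain.

```python
from html.parser import HTMLParser


class CategoryLinkFinder(HTMLParser):
    """Records whether any <a> tag has an href ending with the given suffix."""

    def __init__(self, suffix: str):
        super().__init__()
        self.suffix = suffix
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.endswith(self.suffix):
            self.found = True


def is_tagged(html: str, suffix: str = "/data-insights") -> bool:
    """Return True if the page links to a category ending with `suffix`."""
    finder = CategoryLinkFinder(suffix)
    finder.feed(html)
    return finder.found
```

A full CSS engine would also require the link to sit inside the listed wrapper elements, which makes the real selector stricter than this check.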

GitHub Issues (GitHub) 🐞

Using GitHub Issues as a source lets Kapa surface issues from a public repository. Here are the steps to set up a GitHub Issues source for the Airbyte repo to fetch open issues tagged with the labels issue or type/bug:

  1. Connect your GitHub repository by completing the Owner and Name fields. Let's use Owner as airbytehq and Name as airbyte to establish the connection with the repo.

  2. Specify the Issue State by selecting Open here.

  3. Define Issue Age for relevance.

  4. Next, specify a list of labels in the Include Issue labels section by selecting the type/bug label.

  5. Once all set, click the Save button to initiate fetching all GitHub issues based on the config.

Airbyte GitHub Issues
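
For intuition, the same settings map closely onto the query parameters of GitHub's public REST endpoint for listing repository issues (state, labels, since). The sketch below only builds such a URL and makes no network call. Whether Kapa uses this endpoint is an assumption, the function name is hypothetical, and treating Issue Age as the since parameter is an approximation, because since filters by last update time rather than creation time.

```python
from datetime import datetime, timedelta, timezone
from typing import List
from urllib.parse import urlencode


def issues_url(
    owner: str,
    name: str,
    state: str,
    labels: List[str],
    max_age_days: int,
    now: datetime,
) -> str:
    """Build a GitHub REST API URL that lists issues matching the form fields."""
    # `since` expects an ISO 8601 timestamp; it filters on the last update time.
    since = (now - timedelta(days=max_age_days)).strftime("%Y-%m-%dT%H:%M:%SZ")
    query = urlencode({
        "state": state,
        "labels": ",".join(labels),
        "since": since,
    })
    return f"https://api.github.com/repos/{owner}/{name}/issues?{query}"


print(issues_url(
    "airbytehq", "airbyte", "open", ["type/bug"], 30,
    datetime(2024, 1, 31, tzinfo=timezone.utc),
))
```

Note that this endpoint also returns pull requests, so a real client would need to filter those out.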

YouTube Videos (YouTube) 📺

The YouTube source retrieves transcripts from a channel's playlists so that Kapa can recommend relevant videos in response to user queries. To configure an Airbyte YouTube source, follow two straightforward steps:

  1. YouTube Channel ID: Connect to the YouTube channel using the Channel ID.

  2. Channel Playlists: Select all relevant playlists to be ingested. By default, all playlists from the specific channel will be ingested.

Airbyte YouTube Playlists

Review & Ingest 🚀

When a source is ready for review, the Review button is enabled in the Actions column of the source list page.

Review Source

If the staged changes look good, go ahead and ingest the source, which makes its content available to Kapa.

Ingest Source

Additionally, Kapa offers the option to configure auto-refresh by scheduling a weekly cron job, which can be enabled within the source settings.

Auto Refresh Source
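
A weekly schedule is commonly written as a five-field cron expression. The sketch below shows one such expression and a small helper that computes the next run time. The exact day and time Kapa uses are not stated here, so the Sunday-at-midnight schedule and the function name are purely illustrative assumptions.

```python
from datetime import datetime, timedelta

# Hypothetical schedule: minute 0, hour 0, any day of month, any month, Sunday.
WEEKLY_CRON = "0 0 * * 0"


def next_weekly_run(now: datetime) -> datetime:
    """Return the next Sunday 00:00 strictly after `now`."""
    # datetime.weekday() uses Monday=0 ... Sunday=6.
    days_ahead = (6 - now.weekday()) % 7
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate
```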