Full Project Setup Example
This guide walks you through setting up a full project with Kapa. The tutorial uses the popular data integration platform Airbyte as its example, because Airbyte has many of the most frequently used source types and is itself a Kapa user.
To get started, we will focus on the following sources:
- Developer Documentation (Web Crawl)
- Tutorials (Web Crawl)
- Community Forum (Discourse)
- Technical Blogs (Web Crawl)
- GitHub Issues (GitHub)
- YouTube Videos (YouTube)
Developer Documentation (Web Crawl)
The Airbyte Developer Documentation provides customized guidance for users, contributors, and cloud infrastructure management, including setup tutorials and contribution guidelines that we want to ingest into Kapa. We will use the Web Crawl source to achieve this.
Step 1: Crawling
Follow the steps below to configure a data source.
- Start URL: Navigate to the crawl configuration page and enter the main documentation URL as the start URL. Here we set it to https://docs.airbyte.io.
- URLs to Include: Specify which URLs the crawler should keep. Here we include all URLs matching https://docs.airbyte.com.
- Exclude URLs: We exclude all archive links because they are irrelevant, using the substring /archive/.
- Performance Toggles: Use this section to index all links from the Airbyte docs sitemap and to remove parameters from the links.
- Initiate Crawl: Start the crawl task by clicking the Crawl button.
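The include and exclude rules above can be pictured as a simple URL filter. The following is a minimal sketch, not Kapa's actual crawler: it assumes the include rule is a prefix match and the exclude rule is a substring match, and the names strip_parameters and should_crawl are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

# Values taken from the walkthrough above; how Kapa matches them is an assumption.
INCLUDE_PREFIXES = ["https://docs.airbyte.com"]
EXCLUDE_SUBSTRINGS = ["/archive/"]


def strip_parameters(url: str) -> str:
    """Drop the query string and fragment so variants of a page collapse to one URL."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def should_crawl(url: str) -> bool:
    """Keep a URL if it matches an include prefix and contains no exclude substring."""
    clean = strip_parameters(url)
    included = any(clean.startswith(prefix) for prefix in INCLUDE_PREFIXES)
    excluded = any(fragment in clean for fragment in EXCLUDE_SUBSTRINGS)
    return included and not excluded
```

With these rules, a page such as https://docs.airbyte.com/quickstart?utm=1 would be kept (after its parameters are removed), while anything under /archive/ or outside the docs domain would be skipped.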
Step 2: Parsing
After the crawl task is complete, proceed to the Content Selection step. Provide appropriate CSS selectors to finalize the source setup.
- Selector: For instance, use article > .markdown here to target the main content area.
- Selectors to exclude: Use the CSS selector a.hash-link to remove anchor links from the main content.
- Convert: Click the Convert button to preview the page after parsing.
- Initiate Conversion: If everything looks good, start the conversion task by clicking the Save button.
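To see what the two selectors do to a page, here is a small sketch using only Python's standard library. It is an illustration of the selector semantics, not Kapa's parser: it hard-codes the include selector article > .markdown and the exclude selector a.hash-link, assumes well-formed HTML, and the names MainContentExtractor and extract_main_text are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they are not tracked on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class MainContentExtractor(HTMLParser):
    """Collect text under `article > .markdown`, skipping `a.hash-link` nodes."""

    def __init__(self):
        super().__init__()
        # One (tag, selected, excluded) entry per currently open element.
        self.stack = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        parent_tag, parent_selected, parent_excluded = (
            self.stack[-1] if self.stack else (None, False, False)
        )
        # `article > .markdown`: a .markdown element whose direct parent is <article>.
        selected = parent_selected or (parent_tag == "article" and "markdown" in classes)
        # `a.hash-link`: an <a> element carrying the hash-link class.
        excluded = parent_excluded or (tag == "a" and "hash-link" in classes)
        self.stack.append((tag, selected, excluded))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack and data.strip():
            _, selected, excluded = self.stack[-1]
            if selected and not excluded:
                self.chunks.append(data.strip())


def extract_main_text(html: str) -> str:
    parser = MainContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Navigation, footers, and the "#" anchor links next to headings fall outside the selected area, so only the body text of the documentation page remains.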
Tutorials (Web Crawl)
Airbyte maintains a large and up-to-date catalogue of technical tutorials. This is a great source for Kapa because users often ask specific setup questions, such as 'How do I integrate with Airflow', that the tutorials answer. To set up an Airbyte Tutorials source, follow the steps below:
- Start URL: Use https://airbyte.com/tutorials as the start URL for this source.
- URLs to Include: Specify https://airbyte.com/tutorials to index all tutorial pages.
- Performance Toggles: Enable the crawl sitemap option to index links from the sitemap.
- Initiate Crawl: Start the crawl task by clicking the Crawl button.
- Content Selection: Once the crawl task is finished, finalize the content selection by using appropriate CSS selectors.
Community Forum (Discourse)
The Airbyte Discourse Forum includes a wide range of user discussions on topics such as Q&A, Troubleshooting, Guides, and more, which Kapa aims to source answers from. The steps to configure the Discourse source are listed below:
- Enter the forum link in the URL text box so that it can be validated as a Discourse site.
- Select the post age from the dropdown menu to limit posts based on relevancy. By default, all forum posts are included.
- Use the Include only posts marked as solved option to include only solved posts.
- To finalize, click the Save button to start fetching all Discourse forum posts that match the provided options.
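The effect of the post age and solved-only options can be sketched as a filter over forum posts. This is a simplified illustration under assumed names: the ForumPost record and the select_posts function are hypothetical and do not reflect Kapa's or Discourse's data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class ForumPost:
    """Hypothetical stand-in for a forum topic."""
    title: str
    created_at: datetime
    solved: bool


def select_posts(
    posts: List[ForumPost],
    now: datetime,
    max_age_days: Optional[int] = None,  # None mirrors the default: no age limit
    solved_only: bool = False,
) -> List[ForumPost]:
    """Return the posts that pass the age limit and the solved-only option."""
    selected = []
    for post in posts:
        too_old = (
            max_age_days is not None
            and now - post.created_at > timedelta(days=max_age_days)
        )
        if too_old:
            continue
        if solved_only and not post.solved:
            continue
        selected.append(post)
    return selected
```

Leaving both options at their defaults keeps every post, which matches the default behavior described above.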
Technical Blogs (Web Crawl)
Here are the steps to set up a source that fetches Airbyte technical blog posts tagged with Data Insights:
- Start URL: Use https://airbyte.com/blog-categories/data-insights as the start URL for this source.
- URLs to Include: Specify https://airbyte.com/blog/ to index all blog posts.
- CSS/XPath selectors to include: For instance, we use the CSS selector #content > main.main-wrapper > .section_article .article_grid-top-wrapper > .article_grid-category-wrapper > a[href$='/data-insights'] to exclusively match blog posts tagged with Data Insights.
- Crawling and Parsing: Follow the subsequent steps to finalize the crawling and parsing processes.
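The key part of the selector above is a[href$='/data-insights'], where $= matches links whose href ends with the given suffix. The sketch below checks only that final condition and ignores the ancestor chain of the full selector; it is an illustration rather than Kapa's matching logic, and the names LinkCollector and has_category_link are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> element on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def has_category_link(html: str, suffix: str = "/data-insights") -> bool:
    """True if any link ends with the suffix, like the CSS `[href$=...]` test."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return any(href.endswith(suffix) for href in collector.hrefs)
```

A blog post whose category link points to /blog-categories/data-insights passes the check, while a post in another category does not.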
GitHub Issues (GitHub)
Using GitHub Issues as a source lets Kapa surface issues from a public repository. Here are the steps to set up a GitHub Issues source for the Airbyte repo that fetches open issues tagged with the labels issue or type/bug:
- Connect your GitHub repository by completing the Owner and Name fields. Let's use airbytehq as the Owner and airbyte as the Name to establish the connection with the repo.
- Specify the Issue State by selecting Open here.
- Set the Issue Age to limit issues by relevance.
- Next, specify a list of labels in the Include Issue labels section by selecting the type/bug label.
- Once everything is set, click the Save button to start fetching all GitHub issues that match the configuration.
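For orientation, the same filters map onto query parameters of GitHub's public REST endpoint for listing repository issues (state and labels). The sketch below only builds the request URL and makes no network call. Whether Kapa uses this endpoint is an assumption, and build_issues_url is a hypothetical helper.

```python
from typing import Sequence
from urllib.parse import urlencode


def build_issues_url(
    owner: str,
    name: str,
    state: str = "open",
    labels: Sequence[str] = ("type/bug",),
) -> str:
    """Build a GitHub REST API URL that lists issues matching the filters."""
    # The API expects labels as a single comma-separated value.
    query = urlencode({"state": state, "labels": ",".join(labels)})
    return f"https://api.github.com/repos/{owner}/{name}/issues?{query}"


# Owner and Name values from the walkthrough above.
url = build_issues_url("airbytehq", "airbyte")
```

Note that this GitHub endpoint also returns pull requests alongside issues, so a real integration would need to filter those out.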
YouTube Videos (YouTube)
An Airbyte YouTube source retrieves transcripts from the playlists of a YouTube channel, which Kapa can then recommend in response to user queries. To configure it, follow these two steps:
- YouTube Channel ID: Connect to the YouTube channel using the Channel ID.
- Channel Playlists: Select all relevant playlists to be ingested. By default, all playlists from the specific channel will be ingested.
Review & Ingest
When a source is ready for review, the Review button is enabled in the Actions column of the source list page.
If the staged changes look good, go ahead and ingest the source, which makes its content available to Kapa.
Additionally, Kapa offers the option to configure auto-refresh by scheduling a weekly cron job, which can be enabled within the source settings.