S3 Storage

Kapa provides an integration to pull files from S3 storage as data sources. This is helpful if you need to give kapa access to a large number of files, and it also works for files that are not public.

How do I permission my bucket so kapa can access it?

Kapa requires an Access Key and Secret Access Key to pull files from S3 storage. The user associated with these keys needs the following permissions on the bucket:

  • list object permissions
  • read object permissions
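
On AWS, these two permissions typically correspond to the s3:ListBucket action on the bucket and the s3:GetObject action on its objects. The following is a minimal sketch of a read-only policy built in Python, assuming AWS IAM; the bucket name my-docs-bucket and the helper build_read_only_policy are hypothetical, and other S3-compatible providers may name these permissions differently.

```python
import json


def build_read_only_policy(bucket: str, prefix: str = "") -> dict:
    """Build an IAM policy document granting list and read access to a bucket.

    Assumes AWS IAM action names; `bucket` and `prefix` are illustrative values.
    """
    # Objects are addressed as <bucket-arn>/<key>; narrow to the prefix if one is given.
    object_resource = f"arn:aws:s3:::{bucket}/{prefix}*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Reading is granted on the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [object_resource],
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_read_only_policy("my-docs-bucket"), indent=2))
```

The printed JSON can be attached to the user whose keys you give to kapa. Granting only these two actions keeps the credentials read-only.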

What file formats are supported?

  • This integration supports pulling the following file types:
    • Markdown: .md
    • Text: .txt
    • Word: .docx
  • Files of other types are ignored.
  • You can mix different file types in your bucket.
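
To preview which files in a bucket would be picked up, you can filter a list of object keys by the three extensions listed above. This is a rough sketch, not kapa's actual implementation; the helper names are hypothetical, and the case-insensitive comparison is an assumption.

```python
from pathlib import PurePosixPath

# Extensions listed as supported by the integration.
SUPPORTED_EXTENSIONS = {".md", ".txt", ".docx"}


def is_supported(object_key: str) -> bool:
    """Return True if the key ends in a supported extension.

    Assumption: extensions are compared case-insensitively.
    """
    return PurePosixPath(object_key).suffix.lower() in SUPPORTED_EXTENSIONS


def split_by_support(object_keys):
    """Split keys into (supported, ignored) lists, preserving input order."""
    supported = [key for key in object_keys if is_supported(key)]
    ignored = [key for key in object_keys if not is_supported(key)]
    return supported, ignored


if __name__ == "__main__":
    keys = ["a.md", "docs/b.txt", "c.pdf", "d.docx", "img/e.png"]
    supported, ignored = split_by_support(keys)
    print("supported:", supported)
    print("ignored:", ignored)
```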

What if my file types are not supported?

In that case you have to convert them to one of the supported formats. Fortunately, many projects are available to do that (e.g., Tesseract for PDFs). Reach out to the kapa team, who would be happy to help design a solution.

How should I organize my files in the bucket?

There are no strict requirements on file structure. Kapa starts looking for files at the root of the bucket or, if a bucket_prefix is specified, at that prefix. It discovers all files, including those in subdirectories.
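
The effect of a bucket_prefix can be illustrated on a plain list of object keys. The sketch below only mimics the described behavior; discover_keys is a hypothetical helper, and skipping keys that end in "/" (folder placeholder objects) is an assumption.

```python
def discover_keys(all_keys, bucket_prefix=""):
    """Return the keys under bucket_prefix, including nested ones, sorted.

    An empty prefix means discovery starts at the root of the bucket.
    Assumption: keys ending in "/" are folder placeholders and are skipped.
    """
    return sorted(
        key
        for key in all_keys
        if key.startswith(bucket_prefix) and not key.endswith("/")
    )


if __name__ == "__main__":
    keys = ["readme.md", "docs/", "docs/a.md", "docs/guides/b.txt", "other/c.md"]
    print(discover_keys(keys))           # every file in the bucket
    print(discover_keys(keys, "docs/"))  # only files under docs/, at any depth
```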

How can I map the files in my bucket to URLs?

You can give kapa a URL for each file in your bucket. When kapa references one of these files in an answer, it points to that URL. Without a mapping between the files in the bucket and URLs, kapa cannot point to a URL when referencing them. To provide a mapping, add an index.json file to your bucket. The mapping file must have the following format:

[
  {
    "object_key": "example_file_1.md",
    "source_url": "https://docs.example.com/example_file_1.md"
  },
  {
    "object_key": "example_file_2.md",
    "source_url": "https://docs.example.com/example_file_2.md"
  },
  ...
]

The mapping file must be placed at the root of your bucket or, if a bucket_prefix is specified, directly under that prefix.

Adding an index.json file to your bucket is optional, and not all files have to be listed in it.
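
If your files are published under a common base URL, the mapping can be generated instead of written by hand. The sketch below assumes each file's URL is the base URL followed by its object key; build_index and the base URL are hypothetical, and whether object_key should be relative to the bucket root or to the bucket_prefix is not specified here, so keys are passed through unchanged.

```python
import json
from urllib.parse import quote


def build_index(object_keys, base_url):
    """Build index.json entries mapping each object key to a source URL.

    Assumption: the published URL is base_url + "/" + object key.
    """
    base = base_url.rstrip("/")
    return [
        # quote() percent-encodes characters such as spaces; "/" is left intact.
        {"object_key": key, "source_url": f"{base}/{quote(key)}"}
        for key in object_keys
    ]


if __name__ == "__main__":
    index = build_index(
        ["example_file_1.md", "example_file_2.md"],
        "https://docs.example.com",
    )
    # Write valid JSON (no trailing commas) that can be uploaded as index.json.
    with open("index.json", "w", encoding="utf-8") as handle:
        json.dump(index, handle, indent=2)
```

Because the mapping is optional per file, you can pass only the keys that have a public URL and leave the rest out.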

When kapa successfully maps a file to a URL, it shows the URL in the review screen. Otherwise, it shows the file path.