Supported Document Formats
LongTrainer can ingest a wide range of document formats, so your bot's knowledge base can draw on many kinds of data sources.
Supported File Types
| Format | Extensions | Loader |
|---|---|---|
| PDF | .pdf | PyPDFLoader |
| Word | .docx | Docx2txtLoader |
| CSV | .csv | CSVLoader |
| HTML | .html, .htm | BSHTMLLoader |
| Markdown | .md, .markdown | UnstructuredMarkdownLoader |
| Plain Text | .txt | UnstructuredMarkdownLoader |
| Any format | All above + more | UnstructuredFileLoader (via use_unstructured=True) |
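The table can be read as a lookup from file extension to loader. A minimal sketch of that lookup follows. The names `LOADER_BY_EXTENSION` and `pick_loader` are hypothetical and used only for illustration; they are not part of LongTrainer's API, and the library's internal dispatch may differ.

```python
from pathlib import Path

# Hypothetical mapping that mirrors the table above (loader names as strings).
LOADER_BY_EXTENSION = {
    ".pdf": "PyPDFLoader",
    ".docx": "Docx2txtLoader",
    ".csv": "CSVLoader",
    ".html": "BSHTMLLoader",
    ".htm": "BSHTMLLoader",
    ".md": "UnstructuredMarkdownLoader",
    ".markdown": "UnstructuredMarkdownLoader",
    ".txt": "UnstructuredMarkdownLoader",
}


def pick_loader(path: str, use_unstructured: bool = False) -> str:
    """Return the loader name the table associates with this file."""
    if use_unstructured:
        # The flag overrides extension-based detection.
        return "UnstructuredFileLoader"
    suffix = Path(path).suffix.lower()
    if suffix not in LOADER_BY_EXTENSION:
        raise ValueError(f"No built-in loader for {suffix!r}; try use_unstructured=True")
    return LOADER_BY_EXTENSION[suffix]
```

Lower-casing the suffix means `NOTES.MD` and `notes.md` resolve to the same loader in this sketch; whether LongTrainer itself is case-insensitive is not stated here.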
Adding Documents
From File Paths
# Auto-detected by extension
trainer.add_document_from_path("report.pdf", bot_id)
trainer.add_document_from_path("data.csv", bot_id)
trainer.add_document_from_path("notes.md", bot_id)
# Use UnstructuredFileLoader for any file type
trainer.add_document_from_path("presentation.pptx", bot_id, use_unstructured=True)
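For a folder of mixed files, the flag can be chosen per file. The sketch below is one way to do that, assuming a `trainer` object that exposes `add_document_from_path` as in the calls above. `NATIVE_EXTENSIONS` and `ingest_folder` are hypothetical helper names, not LongTrainer functions.

```python
from pathlib import Path

# Extensions handled without use_unstructured (from the Supported File Types table).
NATIVE_EXTENSIONS = {".pdf", ".docx", ".csv", ".html", ".htm", ".md", ".markdown", ".txt"}


def ingest_folder(trainer, folder: str, bot_id: str) -> list[str]:
    """Add every file directly inside `folder` to the bot; return the paths added."""
    added = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue  # skip sub-directories
        if path.suffix.lower() in NATIVE_EXTENSIONS:
            # Extension is in the table: rely on auto-detection.
            trainer.add_document_from_path(str(path), bot_id)
        else:
            # Anything else falls back to the Unstructured loader.
            trainer.add_document_from_path(str(path), bot_id, use_unstructured=True)
        added.append(str(path))
    return added
```

A call such as `ingest_folder(trainer, "docs", bot_id)` would then load the folder in sorted filename order. Files in formats that Unstructured cannot parse would still fail at load time, so wrapping the call in error handling may be worthwhile.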
From Web Links
# URLs
trainer.add_document_from_link(["https://example.com/article"], bot_id)
# YouTube videos (transcript extraction)
trainer.add_document_from_link(["https://youtube.com/watch?v=..."], bot_id)
From Wikipedia
trainer.add_document_from_query("Artificial Intelligence", bot_id)
Pre-Loaded Documents
Pass pre-loaded LangChain Document objects directly:
from langchain_core.documents import Document
documents = [Document(page_content="Custom content", metadata={"source": "manual"})]
trainer.pass_documents(documents, bot_id)
Unstructured Data
When use_unstructured=True, LongTrainer uses LangChain's UnstructuredFileLoader, which supports:
csv, doc, docx, epub, image, md, msg, odt, org, pdf, pptx, rtf, rst, tsv, xlsx
This requires system dependencies; see Installation.
Text Splitting
All documents are automatically split into chunks for FAISS indexing:
| Parameter | Default | Description |
|---|---|---|
| chunk_size | 2048 | Maximum characters per chunk |
| chunk_overlap | 200 | Overlap between consecutive chunks |
Configure these when creating the LongTrainer instance:
trainer = LongTrainer(chunk_size=1024, chunk_overlap=100)
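To see how the two parameters interact, here is a naive fixed-window splitter. It is only an illustration: this page does not say which splitter LongTrainer uses internally, and real splitters usually also try to break on separators such as paragraphs or sentences. The function name `split_text` is hypothetical.

```python
def split_text(text: str, chunk_size: int = 2048, chunk_overlap: int = 200) -> list[str]:
    """Naive splitter: each chunk starts chunk_size - chunk_overlap characters after the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Small values make the overlap visible: each chunk repeats the last
# character of the previous one.
print(split_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

In this sketch a larger overlap preserves more context across chunk boundaries at the cost of more chunks to index, while a smaller chunk size produces more, shorter chunks.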