DocRetrieval: Your Local, Privacy-Focused Knowledge Base
In the age of Large Language Models (LLMs), the ability to "chat" with your own data is the holy grail. However, most solutions require uploading your sensitive contracts, research papers, or personal journals to the cloud. DocRetrieval changes the game.
It is a robust, local RAG (Retrieval-Augmented Generation) system that runs entirely on your hardware. By leveraging the power of Ollama, LlamaIndex, and ChromaDB, it transforms your static folder of documents into an intelligent, queryable knowledge base—without a single byte leaving your machine.
Why DocRetrieval?
1. Deep PDF Understanding with Marker
Most RAG systems struggle with complex PDFs (columns, tables, math equations). DocRetrieval integrates marker-pdf, a deep learning pipeline that OCRs and converts PDFs into clean, structured Markdown. This ensures that your LLM actually "reads" the content correctly, rather than just seeing a garbled mess of text.
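To make that concrete, consider a small two-column pricing table (the table itself is made up for illustration). A naive text extractor might flatten it into a run of words like `Plan Basic Pro Price $10 $20`, losing which price belongs to which plan, while Marker emits structured Markdown that preserves the row/column relationships:

```
| Plan  | Price |
|-------|-------|
| Basic | $10   |
| Pro   | $20   |
```

Because the table survives as Markdown, the embedded chunk keeps its structure and the LLM can answer questions like "How much does Pro cost?" correctly.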
2. Absolute Privacy & Control
Your data is yours. DocRetrieval uses local embeddings (via HuggingFace models like BAAI/bge-large-en-v1.5) and local inference (via Ollama). Whether you are a law firm handling contracts or a developer organizing technical docs, you can rest assured that no third-party API is training on your secrets.
3. GPU-Accelerated Performance
Built with performance in mind, the system is optimized for NVIDIA GPUs. From the heavy lifting of OCR during ingestion to rapid vector similarity search in ChromaDB, each stage uses CUDA acceleration where available, keeping the system snappy even with large datasets.

Under the Hood: The Tech Stack
For the developers out there, DocRetrieval is built on a modern Python stack designed for extensibility:
- LlamaIndex: The orchestration framework managing the data flow between files, embeddings, and the LLM.
- ChromaDB: A high-performance, open-source vector database that stores your document embeddings for semantic search.
- Rich: Provides a beautiful, informative CLI experience with progress bars, spinners, and formatted output.
- Gradio: Powers the user-friendly web interface for those who prefer a browser over a terminal.
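To show how these pieces fit together, here is a configuration sketch in the style of LlamaIndex's current integration packages (`llama-index-llms-ollama`, `llama-index-embeddings-huggingface`, `llama-index-vector-stores-chroma`). This is not DocRetrieval's actual source; the model names, paths, and collection name are illustrative, and it assumes a running Ollama server:

```python
# Configuration sketch: local embeddings + local LLM + persistent vector store.
# Model names, paths, and the collection name are illustrative assumptions,
# not DocRetrieval's actual settings.
import chromadb
from llama_index.core import (Settings, SimpleDirectoryReader, StorageContext,
                              VectorStoreIndex)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.vector_stores.chroma import ChromaVectorStore

# Everything stays local: embeddings run in-process, generation goes to Ollama.
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5")
Settings.llm = Ollama(model="llama3", request_timeout=120.0)

# A persistent ChromaDB collection backs the vector index on disk.
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("docs")
storage_context = StorageContext.from_defaults(
    vector_store=ChromaVectorStore(chroma_collection=collection)
)

# Ingest a folder, build the index, and ask a question.
docs = SimpleDirectoryReader("./my-documents", recursive=True).load_data()
index = VectorStoreIndex.from_documents(docs, storage_context=storage_context)
print(index.as_query_engine().query("What are the termination conditions?"))
```

The key design point is that the embedding model and the LLM are set globally via `Settings`, so every index and query engine built afterwards inherits the same local components.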
How to Use DocRetrieval
Getting started is as simple as cloning the repo and setting up your environment. Here is a typical workflow:
Step 1: Ingest Your Data
Point the CLI at your document folder. The system will recursively find PDFs, text files, and markdown docs, process them, and build the vector index.
doc-retrieval ingest ./my-documents
Watch as `rich` displays real-time progress of the OCR and chunking process.
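The scan-and-chunk step above can be sketched with the standard library alone. This is a simplified illustration, not the real pipeline (which routes PDFs through Marker and uses LlamaIndex's node parsers); `find_documents` and `chunk_text` are hypothetical helper names:

```python
# Simplified sketch of the ingestion scan: recursively find supported files,
# then split text into fixed-size, overlapping chunks. Helper names are
# hypothetical; the real pipeline sends PDFs through Marker first.
from pathlib import Path

SUPPORTED = {".pdf", ".txt", ".md"}

def find_documents(root: str) -> list[Path]:
    """Recursively collect files whose extension we know how to process."""
    return sorted(p for p in Path(root).rglob("*") if p.suffix.lower() in SUPPORTED)

def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` chars overlapping by `overlap` chars,
    so a sentence cut at one boundary still appears whole in a neighbor."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]
```

The overlap matters: without it, a fact straddling a chunk boundary would be split across two embeddings and could be missed at query time.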
Step 2: Ask Questions
Once indexed, you can query your knowledge base immediately from the command line:
doc-retrieval query "What are the termination conditions in the 2024 contract?"
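Conceptually, the query step embeds your question with the same local model used at ingestion, then asks the vector store for the nearest stored chunks; the scores reported in query output are similarity values. A stdlib-only sketch of that ranking, with toy 2-D vectors standing in for real embeddings:

```python
# Toy illustration of retrieval: rank stored chunk embeddings by cosine
# similarity to the query embedding. Real embeddings come from the local
# HuggingFace model; these tiny vectors merely stand in for them.
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query: list[float], chunks: list[list[float]], k: int = 2) -> list[tuple[int, float]]:
    """Return (index, score) for the k chunks most similar to the query."""
    scored = sorted(
        ((i, round(cosine(query, v), 3)) for i, v in enumerate(chunks)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return scored[:k]
```

The top-ranked chunks are then handed to the LLM as context, which is what lets the model answer from your documents instead of from its training data.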
Step 3: Interactive Chat & Web UI
For a more conversational experience, launch the interactive mode or the Gradio web server:
# Terminal Chat
doc-retrieval interactive
# Web Interface
doc-retrieval gradio
Every query returns a grounded answer along with the source chunks it retrieved and their similarity scores. Asking, for example, how tables are handled might produce:

> DocRetrieval uses the `marker-pdf` library to perform deep learning-based OCR. Unlike traditional extraction tools that simply pull text, Marker analyzes the layout to reconstruct tables into valid Markdown format before embedding them. This ensures the LLM can understand the structural relationship between rows and columns.

Sources:
1. src/doc_retrieval/ingestion/processor.py (score: 0.892)
2. README.md (score: 0.815)
Conclusion: DocRetrieval bridges the gap between your private data and modern AI. It's open-source, powerful, and respects your privacy.
Check out the code, contribute, or star the project on GitHub:
View on GitHub