DocRetrieval: Your Local, Privacy-Focused Knowledge Base

Jan 21, 2026 · 5 min read

In the age of Large Language Models (LLMs), the ability to "chat" with your own data is the holy grail. However, most solutions require uploading your sensitive contracts, research papers, or personal journals to the cloud. DocRetrieval changes the game.

It is a robust, local RAG (Retrieval-Augmented Generation) system that runs entirely on your hardware. By leveraging the power of Ollama, LlamaIndex, and ChromaDB, it transforms your static folder of documents into an intelligent, queryable knowledge base—without a single byte leaving your machine.

Why DocRetrieval?

1. Deep PDF Understanding with Marker

Most RAG systems struggle with complex PDFs (columns, tables, math equations). DocRetrieval integrates marker-pdf, a deep learning pipeline that OCRs and converts PDFs into clean, structured Markdown. This ensures that your LLM actually "reads" the content correctly, rather than just seeing a garbled mess of text.
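Clean Markdown also pays off downstream: chunks can follow the document's own structure instead of arbitrary character windows. As a hypothetical illustration (not DocRetrieval's actual chunker), a heading-aware splitter might look like this:

```python
import re

def split_markdown_by_headings(markdown: str) -> list[str]:
    """Split a Markdown document into chunks, one per heading section.

    Keeps each heading together with its body, so a table under
    "## Termination" stays in one retrievable chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        # A new ATX heading (e.g. "## Fees") starts a new chunk.
        if re.match(r"^#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

doc = "# Contract\nIntro text.\n## Termination\n| Party | Notice |\n|---|---|\n| A | 30 days |\n"
for chunk in split_markdown_by_headings(doc):
    print(chunk.split("\n", 1)[0])  # prints "# Contract" then "## Termination"
```

Because Marker emits real headings and tables rather than a garbled text stream, structure-aware splitting like this keeps related content together at retrieval time.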

2. Absolute Privacy & Control

Your data is yours. DocRetrieval uses local embeddings (via HuggingFace models like BAAI/bge-large-en-v1.5) and local inference (via Ollama). Whether you are a law firm handling contracts or a developer organizing technical docs, you can rest assured that no third-party API is training on your secrets.

3. GPU-Accelerated Performance

Built with performance in mind, the system is optimized for NVIDIA GPUs. From the heavy lifting of OCR during ingestion to the rapid vector similarity search using ChromaDB, every step utilizes CUDA acceleration where possible, making it snappy even with large datasets.
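In practice, "CUDA where possible" usually means selecting the device at startup and falling back gracefully. A hedged sketch with PyTorch (assuming torch underlies the pipeline, as both Marker and the HuggingFace embedding models are torch-based; this is not DocRetrieval's literal startup code):

```python
import torch

def pick_device() -> str:
    """Prefer CUDA when a compatible GPU is visible, else fall back to CPU."""
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(f"Running on: {device}")
```

The same pattern applies at every stage: OCR, embedding, and inference all run on the GPU when one is available, and still work (more slowly) on CPU-only machines.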

Under the Hood: The Tech Stack

For the developers out there, DocRetrieval is built on a modern Python stack designed for extensibility:

  • LlamaIndex: The orchestration framework managing the data flow between files, embeddings, and the LLM.
  • ChromaDB: A high-performance, open-source vector database that stores your document embeddings for semantic search.
  • Rich: Provides a beautiful, informative CLI experience with progress bars, spinners, and formatted output.
  • Gradio: Powers the user-friendly web interface for those who prefer a browser over a terminal.
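To make ChromaDB's role concrete, here is a toy in-memory stand-in (deliberately not ChromaDB's actual API) showing the core idea a vector database implements: store an embedding vector per chunk, then return the top-k most similar chunks for a query vector, scored by cosine similarity:

```python
import math

class ToyVectorStore:
    """Minimal stand-in for a vector database: add vectors, query top-k."""

    def __init__(self) -> None:
        self.items: list[tuple[str, list[float]]] = []

    def add(self, text: str, embedding: list[float]) -> None:
        self.items.append((text, embedding))

    @staticmethod
    def _cosine(a: list[float], b: list[float]) -> float:
        # Cosine similarity: dot product over the product of vector norms.
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def query(self, embedding: list[float], k: int = 2) -> list[tuple[str, float]]:
        scored = [(text, self._cosine(embedding, vec)) for text, vec in self.items]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

store = ToyVectorStore()
store.add("termination clause", [1.0, 0.0, 0.2])
store.add("payment schedule", [0.0, 1.0, 0.1])
print(store.query([0.9, 0.1, 0.2], k=1))
```

A real system swaps the toy lists for ChromaDB's persistent, indexed storage, but the retrieval contract is the same, and similarity scores like the ones shown in the example output later in this post come from exactly this kind of comparison.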

How to Use DocRetrieval

Getting started is as simple as cloning the repo and setting up your environment. Here is a typical workflow:

Step 1: Ingest Your Data

Point the CLI at your document folder. The system will recursively find PDFs, text files, and markdown docs, process them, and build the vector index.

doc-retrieval ingest ./my-documents
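Under the hood, the recursive discovery step can be sketched with the standard library alone (the extension list here is illustrative; the real CLI may support more formats):

```python
from pathlib import Path

SUPPORTED = {".pdf", ".txt", ".md"}

def find_documents(root: str) -> list[Path]:
    """Recursively collect supported document files under `root`."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )
```

Each discovered file is then routed to the appropriate processor (Marker for PDFs, plain loading for text and Markdown) before chunking and embedding.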

Watch as `rich` displays real-time progress of the OCR and chunking process.
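For a sense of what that looks like in code, `rich` makes wrapping any loop in a live progress bar a one-liner (a generic illustration, not DocRetrieval's actual ingestion loop):

```python
from rich.progress import track

documents = ["a.pdf", "b.md", "c.txt"]

# track() wraps an iterable and renders a progress bar as items are consumed.
for doc in track(documents, description="Chunking documents..."):
    pass  # process each document here
```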

Step 2: Ask Questions

Once indexed, you can query your knowledge base immediately from the command line:

doc-retrieval query "What are the termination conditions in the 2024 contract?"

Step 3: Interactive Chat & Web UI

For a more conversational experience, launch the interactive mode or the Gradio web server:

# Terminal Chat
doc-retrieval interactive
# Web Interface
doc-retrieval gradio

Example output:
Q: How does the system handle PDF tables?
A: The system utilizes the marker-pdf library to perform deep learning-based OCR. Unlike traditional extraction tools that simply pull text, Marker analyzes the layout to reconstruct tables into valid Markdown format before embedding them. This ensures the LLM can understand the structural relationship between rows and columns.
Sources:
1. src/doc_retrieval/ingestion/processor.py (score: 0.892)
2. README.md (score: 0.815)

Conclusion

DocRetrieval bridges the gap between your private data and modern AI. It's open-source, powerful, and it respects your privacy.

Check out the code, contribute, or star the project on GitHub:

View on GitHub