Getting Started with RAG Retriever

This guide walks you through installing RAG Retriever and loading your first documentation.

Installation

Install RAG Retriever using pipx:
```
# On macOS
brew install pipx
pipx install rag-retriever

# On Windows/Linux
python -m pip install --user pipx
pipx install rag-retriever
```
Core Features: The basic installation includes everything needed for:
- Web content crawling and indexing
- Basic PDF text extraction
- Markdown and text file processing
- Vector storage and semantic search
- Confluence space integration
- DuckDuckGo web search
- GitHub repository integration
- Basic image analysis and indexing
- JSON output formatting
- Configurable relevance scoring
- Local file and directory processing
Optional Features: If you need advanced features, install additional dependencies:

For OCR Support (scanned documents & image text extraction):
- macOS: brew install tesseract
- Windows: Install Tesseract
For Advanced PDF Processing (complex layouts & tables):
- macOS: brew install poppler
- Windows: Install Poppler
Note: Install these only if you need their specific features. The core functionality works without them.
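
If you do install the optional tools, it can help to confirm that they are on your PATH. The following is a minimal Python sketch and not part of RAG Retriever: the `find_tools` helper is hypothetical, and the executable names are assumptions (`tesseract` for Tesseract OCR, and `pdftoppm` as one of the command-line utilities that ship with Poppler). How RAG Retriever itself locates these tools is not covered in this guide.

```python
import shutil
from typing import Dict, Iterable

# Executable names are assumptions: `tesseract` ships with Tesseract OCR,
# and `pdftoppm` is one of the utilities installed with Poppler.
OPTIONAL_TOOLS = ("tesseract", "pdftoppm")


def find_tools(names: Iterable[str]) -> Dict[str, bool]:
    """Map each executable name to True if it is found on PATH."""
    return {name: shutil.which(name) is not None for name in names}


if __name__ == "__main__":
    for tool, found in find_tools(OPTIONAL_TOOLS).items():
        status = "found" if found else "missing"
        print(f"{tool}: {status}")
```

A tool reported as missing only matters if you need the feature it supports; the core functionality does not depend on either one.
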
Initialize the configuration:
```
rag-retriever --init
```
Add your OpenAI API key to the config file at ~/.config/rag-retriever/config.yaml:
```
api:
  openai_api_key: "sk-your-api-key-here"
```
Security Note: During installation, RAG Retriever automatically sets strict file permissions (600) on config.yaml to ensure it's only readable by you. This helps protect your API key.
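
To confirm the permissions yourself, a short check along these lines may be useful. This is a sketch under assumptions: the `is_owner_only` helper is hypothetical, it relies on POSIX permission bits (so it is meaningful on macOS and Linux, not on Windows), and it only inspects the file without changing it.

```python
import stat
from pathlib import Path


def is_owner_only(path: Path) -> bool:
    """Return True if group and other users have no permissions on `path`."""
    mode = stat.S_IMODE(path.stat().st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


if __name__ == "__main__":
    # Config location as given in this guide.
    config = Path.home() / ".config" / "rag-retriever" / "config.yaml"
    if not config.exists():
        print(f"{config} not found; run rag-retriever --init first")
    elif is_owner_only(config):
        print("config.yaml is accessible only by you")
    else:
        print("config.yaml is accessible to other users; consider chmod 600")
```

A file with mode 600 passes this check because only the owner's read and write bits are set.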

Loading Your First Documentation

Let's load some documentation to test the setup. We'll try both web documentation and a GitHub repository:

Loading Web Documentation

rag-retriever --fetch https://www.happycoders.eu/java/java-23-features --max-depth 0

Loading a GitHub Repository

# Load a popular open-source repository
rag-retriever --github-repo https://github.com/openai/openai-quickstart-python.git

# You can also specify a branch and file types
rag-retriever --github-repo https://github.com/openai/openai-python.git --branch main --file-extensions .py .md

# Example with a larger repository
rag-retriever --github-repo https://github.com/langchain-ai/langchain.git --branch master --file-extensions .py .md
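
If you load repositories from a script, the same flags can be assembled programmatically. The sketch below is illustrative only: `build_github_command` is a hypothetical helper, and it uses just the `--github-repo`, `--branch`, and `--file-extensions` flags shown above. It prints the command rather than running it.

```python
import shlex
from typing import List, Optional, Sequence


def build_github_command(
    repo_url: str,
    branch: Optional[str] = None,
    file_extensions: Sequence[str] = (),
) -> List[str]:
    """Build the argument list for loading a GitHub repository."""
    cmd = ["rag-retriever", "--github-repo", repo_url]
    if branch:
        cmd += ["--branch", branch]
    if file_extensions:
        # --file-extensions takes one or more values, e.g. .py .md
        cmd += ["--file-extensions", *file_extensions]
    return cmd


if __name__ == "__main__":
    cmd = build_github_command(
        "https://github.com/openai/openai-python.git",
        branch="main",
        file_extensions=[".py", ".md"],
    )
    # Print the command; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```

Building the command as a list avoids shell quoting problems when a URL or branch name contains unusual characters.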

Processing Images

# Process a single image (e.g., architecture diagram)
rag-retriever --ingest-image diagrams/system-architecture.png

# Process all images in a directory
rag-retriever --ingest-image-directory docs/diagrams/

# Process an image from a URL
rag-retriever --ingest-image https://example.com/images/diagram.png

When processing images, RAG Retriever:

- Analyzes the image content using AI vision models
- Generates detailed textual descriptions
- Makes visual content searchable alongside your documentation
- Supports common image formats (PNG, JPG, JPEG, GIF, WEBP)
- Can process both local files and image URLs

Note: Image processing settings like the vision model and token limits are configured in your config.yaml file. See the configuration guide for details.
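
Before ingesting a directory, you may want to see which files match the supported formats. The sketch below is a hypothetical pre-flight check, not RAG Retriever's own logic: `find_supported_images` is an illustrative helper, and it assumes a case-insensitive match on the file suffix and no recursion into subdirectories, neither of which this guide specifies for the CLI.

```python
from pathlib import Path
from typing import List

# Formats listed in this guide: PNG, JPG, JPEG, GIF, WEBP.
SUPPORTED_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".webp"}


def find_supported_images(directory: Path) -> List[Path]:
    """Return files directly inside `directory` with a supported image suffix."""
    return sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )


if __name__ == "__main__":
    folder = Path("docs/diagrams")
    if folder.is_dir():
        for image in find_supported_images(folder):
            print(image)
    else:
        print(f"{folder} does not exist")
```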

After loading the GitHub repository, you should see output similar to this:

INFO:rag_retriever.document_processor.github_loader:Loading GitHub repository: https://github.com/openai/openai-quickstart-python.git
INFO:rag_retriever.vectorstore.store:Processing 5 documents (total size: 17054 chars) into 12 chunks
INFO:rag_retriever.vectorstore.store:Successfully added chunks to vector store
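
The "Processing ... documents" line carries the numbers worth checking after a load. As a rough illustration, the sketch below extracts them with a regular expression. It assumes the log format matches the sample above exactly, which may differ between versions; `parse_ingest_line` and `IngestStats` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

LOG_PATTERN = re.compile(
    r"Processing (\d+) documents \(total size: (\d+) chars\) into (\d+) chunks"
)


class IngestStats(NamedTuple):
    documents: int
    total_chars: int
    chunks: int


def parse_ingest_line(line: str) -> Optional[IngestStats]:
    """Extract document, character, and chunk counts, or None if no match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return IngestStats(*(int(group) for group in match.groups()))


if __name__ == "__main__":
    line = (
        "INFO:rag_retriever.vectorstore.store:Processing 5 documents "
        "(total size: 17054 chars) into 12 chunks"
    )
    stats = parse_ingest_line(line)
    print(stats)
    # Rough figure only: it ignores any overlap between chunks.
    print(f"average chars per chunk: {stats.total_chars / stats.chunks:.0f}")
```

A chunk count of zero, or a line that never appears, suggests the load did not index anything.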

Verifying the Content

Let's verify that the content was properly indexed by running search queries:

# Search web documentation
rag-retriever --query "Java 23 Markdown Documentation Comments JavaDoc syntax" --score-threshold 0.5

# Search GitHub repository content
rag-retriever --query "How to use the OpenAI API client" --score-threshold 0.5

A high relevance score in the results (for example, 0.6636) indicates that the content was indexed successfully and closely matches the query.
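
To make the effect of `--score-threshold` concrete, here is a small sketch of threshold filtering. It is an illustration of the idea, not RAG Retriever's implementation: the `Result` type and `filter_by_score` helper are hypothetical, the sample scores other than 0.6636 are made up, and it assumes the comparison is inclusive (score greater than or equal to the threshold).

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    score: float
    source: str


def filter_by_score(results: List[Result], threshold: float) -> List[Result]:
    """Keep results scoring at or above `threshold`, highest score first."""
    kept = [r for r in results if r.score >= threshold]
    return sorted(kept, key=lambda r: r.score, reverse=True)


if __name__ == "__main__":
    # Hypothetical scores for illustration; only 0.6636 appears in this guide.
    sample = [
        Result(0.48, "doc-b"),
        Result(0.6636, "doc-a"),
        Result(0.52, "doc-c"),
    ]
    for result in filter_by_score(sample, 0.5):
        print(f"{result.score:.4f}  {result.source}")
```

Raising the threshold trades recall for precision: fewer results come back, but each is a closer match.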

💡 TIP: While these examples focus on new technology features, RAG Retriever is valuable for any knowledge that isn't part of the LLM's training data. This includes:

- Your organization's architecture decisions and patterns
- Team-specific coding conventions and best practices
- Internal tech stack preferences and standards
- Project-specific implementation details
- Private APIs or internal tools documentation
- Company-specific business logic and requirements

Using with AI Coding Assistants

RAG Retriever is designed to work with various AI coding assistants. For detailed instructions on setting up and configuring your preferred AI coding assistant with RAG Retriever, please refer to our AI Assistant Setup Guide.

Next Steps

Load more documentation relevant to your projects:

# Web documentation
rag-retriever --fetch URL --max-depth DEPTH

# Local files
rag-retriever --ingest-file PATH
rag-retriever --ingest-directory PATH

# Web search (using DuckDuckGo)
rag-retriever --web-search "your search query" --results 5

# You can then fetch content from the web search results using --fetch
rag-retriever --fetch https://found-url-from-search.com --max-depth 0

# Load from Confluence (requires configuration in ~/.config/rag-retriever/config.yaml)
rag-retriever --confluence --space-key TEAM

# Clean up vector store if needed
rag-retriever --clean

Explore all available options:

# Core options
--init                Initialize user configuration files in standard locations
--fetch URL           URL to fetch and index
--max-depth N         Maximum depth for recursive URL loading (default: 2)
--query STRING        Search query to find relevant content
--limit N             Maximum number of results to return
--score-threshold N   Minimum relevance score threshold
--truncate            Truncate content in search results (default: show full content)
--json                Output results in JSON format
--clean               Clean (delete) the vector store
--verbose             Enable verbose output for troubleshooting

# File ingestion options
--ingest-file PATH          Path to a local markdown or text file to ingest
--ingest-directory PATH     Path to a directory containing markdown and text files to ingest

# Image processing options
--ingest-image PATH         Path to an image file or URL to analyze and ingest
--ingest-image-directory PATH  Path to a directory containing images to analyze and ingest

# Web search options
--web-search STRING     Perform a web search using DuckDuckGo
--results N             Number of results to return for web search (default: 5)

# GitHub options
--github-repo URL     URL of the GitHub repository to load
--branch STRING       Specific branch to load from the repository
--file-extensions EXT [EXT ...]  Specific file extensions to load (e.g., .py .md .js)

# Confluence options
--confluence          Load content from Confluence using configured settings
--space-key STRING    Confluence space key to load content from
--parent-id STRING    Confluence parent page ID to start loading from
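
The `--json` option makes it possible to consume search results from a script. The sketch below shows one way this might look. It is hedged on two points: it assumes that with `--json` the command writes only JSON to standard output, and it makes no assumption about the structure of that JSON, which is not documented in this guide. `build_query_command` and `run_query` are hypothetical helpers.

```python
import json
import subprocess
from typing import Any, List, Optional


def build_query_command(
    query: str,
    limit: Optional[int] = None,
    score_threshold: Optional[float] = None,
) -> List[str]:
    """Build the argument list for a JSON-formatted search query."""
    cmd = ["rag-retriever", "--query", query, "--json"]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    if score_threshold is not None:
        cmd += ["--score-threshold", str(score_threshold)]
    return cmd


def run_query(query: str, **options: Any) -> Any:
    """Run the query and parse stdout as JSON (assumes stdout holds only JSON)."""
    completed = subprocess.run(
        build_query_command(query, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout)
```

For example, `run_query("How to use the OpenAI API client", score_threshold=0.5)` would return whatever JSON value the command prints; inspect it once before relying on specific keys.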

Review the full configuration guide for detailed setup options.

Troubleshooting

If you're experiencing issues:

- Verify that content was successfully loaded using --fetch or --ingest commands
- Check the configuration guide for proper setup
- Use the --verbose flag for detailed logging output
- Make sure your OpenAI API key is correctly configured
- For AI assistant integration issues, refer to the AI Assistant Setup Guide
