RAG Configuration¶

Customize Retrieval-Augmented Generation (RAG) settings to improve data extraction accuracy for your use case.

RAG is used by the Materials Data Identifier Agent

RAG runs automatically during data extraction. It retrieves relevant context from the article text before the LLM is asked whether the article reports material compositions and corresponding property values, which is how articles are screened. The parameters below let you tune RAG behavior for your requirements, such as article length, domain-specific models, or extraction complexity.

Configuration Parameters¶

Chunking Parameters¶

These parameters control how articles are split into chunks for vector storage.

`chunk_size` (int)¶

Size of each text chunk, in characters, used when creating embeddings for the vector database.

`chunk_overlap` (int)¶

Number of overlapping characters between consecutive chunks to maintain context continuity.

Default Values

chunk_size = 1000
chunk_overlap = 25
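To make the interaction between the two parameters concrete, here is a minimal sketch of character-based chunking with overlap. The function `split_into_chunks` is hypothetical and written only for illustration; the splitter ComProScanner actually uses is not described on this page and may behave differently (for example, by preferring sentence or paragraph boundaries).

```python
def split_into_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 25) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap.

    Hypothetical helper for illustration; not part of ComProScanner.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# A 2500-character article with the default values shown above
article = "x" * 2500
chunks = split_into_chunks(article)
print(len(chunks), [len(c) for c in chunks])  # 3 [1000, 1000, 550]
```

Under this simple scheme each chunk starts `chunk_size - chunk_overlap` characters after the previous one, so a larger overlap produces more chunks for the same article.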

Embedding Model Parameters¶

`embedding_model` (str)¶

Name of the embedding model to use for creating vector databases for RAG.

Supported Providers:

| Provider | Format | Example |
| --- | --- | --- |
| HuggingFace | `huggingface:model-name` | `huggingface:thellert/physbert_cased` |
| Sentence Transformers | `sentence-transformers:model-name` | `sentence-transformers:all-mpnet-base-v2` |
| OpenAI | `openai:model-name` or bare `model-name` (default behavior) | `openai:text-embedding-3-small` or `text-embedding-3-small` |

Default Value

embedding_model = "huggingface:thellert/physbert_cased"
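The `provider:model-name` convention can be illustrated with a small parser. This is only a sketch: `parse_embedding_model` and `KNOWN_PROVIDERS` are hypothetical names, and treating any string without a recognized prefix as an OpenAI model name is an assumption based on the "default behavior" note in the table, not a description of the library's internals.

```python
# Provider prefixes listed in the table above
KNOWN_PROVIDERS = ("huggingface", "sentence-transformers", "openai")


def parse_embedding_model(spec: str) -> tuple[str, str]:
    """Split an embedding_model string into (provider, model_name).

    Hypothetical helper for illustration; not part of ComProScanner.
    """
    provider, sep, name = spec.partition(":")
    if sep and provider in KNOWN_PROVIDERS:
        if not name:
            raise ValueError(f"missing model name in {spec!r}")
        return provider, name
    # Assumption: no recognized prefix means a bare OpenAI model name
    return "openai", spec


print(parse_embedding_model("huggingface:thellert/physbert_cased"))
print(parse_embedding_model("text-embedding-3-small"))
```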

Recommended Models for Scientific Text

PhysBERT (`huggingface:thellert/physbert_cased`): a specialized text embedding model for physics, designed to improve information retrieval, citation classification, and clustering of physics literature.
MaterialsBERT (`huggingface:pranav-s/MaterialsBERT`): a version of PubMedBERT fine-tuned on a dataset of 2.4 million materials science abstracts.
MatSciBERT (`huggingface:m3rg-iitd/matscibert`): a materials-domain language model for text mining and information extraction.

Retrieval Parameters¶

These parameters control how relevant context is retrieved during extraction.

`rag_db_path` (str)¶

Custom path where the vector databases for articles that mention the target property are stored and loaded for RAG processing.

`rag_top_k` (int)¶

Number of most relevant text chunks to retrieve from the vector database for context.

`rag_max_tokens` (int)¶

Maximum number of tokens for RAG model responses.

Default Values

rag_db_path = "db"
rag_top_k = 3
rag_max_tokens = 512
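As a rough illustration of what `rag_top_k` controls, the sketch below ranks chunks by similarity to a query vector and keeps the best `top_k`. Everything here is assumed for the example: the helper names are hypothetical, the two-dimensional vectors are toy values rather than real embeddings, and cosine similarity is a common choice but not confirmed as the metric the vector database uses.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_vec, chunk_vecs, chunks, top_k=3):
    """Return the top_k chunks whose vectors are most similar to query_vec."""
    ranked = sorted(
        zip(chunks, chunk_vecs),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in ranked[:top_k]]


# Toy data: made-up 2-D vectors standing in for real embeddings
chunks = [
    "d33 = 350 pC/N was measured for the poled ceramic",
    "Samples were sintered at 1200 C for 2 h",
    "The authors acknowledge funding support",
    "The piezoelectric coefficient increased with doping",
]
chunk_vecs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.9, 0.1]]
query_vec = [1.0, 0.0]

for chunk in retrieve_top_k(query_vec, chunk_vecs, chunks, top_k=2):
    print(chunk)
```

A higher `rag_top_k` passes more context to the chat model, at the cost of a longer prompt.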

RAG Chat Model Parameters¶

`rag_chat_model` (str)¶

Chat model to use for RAG-based context retrieval and synthesis.

`rag_base_url` (str)¶

Custom base URL for RAG chat model API (useful for local or custom deployments).

Default Values

rag_chat_model = "gpt-4o-mini"
rag_base_url = None
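Each provider example on this page pairs a style of `rag_chat_model` name with an API key variable. The sketch below collects those pairings into a pre-flight check that reports a missing key before a long extraction run starts. The helpers `required_env_var` and `missing_api_key` are hypothetical, and the prefix table is inferred only from the examples on this page; it is not a statement of how ComProScanner routes model names internally.

```python
import os
from typing import Mapping, Optional

# (model-name prefix, environment variable) pairs inferred from the
# configuration examples on this page; None means no key is listed.
PREFIX_TO_ENV_VAR: list[tuple[str, Optional[str]]] = [
    ("ollama/", None),
    ("together_ai/", "TOGETHER_API_KEY"),
    ("openrouter/", "OPENROUTER_API_KEY"),
    ("cohere/", "COHERE_API_KEY"),
    ("fireworks_ai/", "FIREWORKS_API_KEY"),
    ("gemini", "GEMINI_API_KEY"),
    ("claude", "ANTHROPIC_API_KEY"),
    ("gpt", "OPENAI_API_KEY"),
]


def required_env_var(rag_chat_model: str) -> Optional[str]:
    """Return the API key variable associated with a model name, if any is known."""
    for prefix, env_var in PREFIX_TO_ENV_VAR:
        if rag_chat_model.startswith(prefix):
            return env_var
    return None  # unknown prefix: nothing to check in this sketch


def missing_api_key(rag_chat_model: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the name of the required variable if it is unset or empty, else None."""
    env_var = required_env_var(rag_chat_model)
    if env_var and not env.get(env_var):
        return env_var
    return None


# With an empty environment, the default chat model needs OPENAI_API_KEY
print(missing_api_key("gpt-4o-mini", env={}))  # OPENAI_API_KEY
```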

Configuration Examples¶

Using OpenAI¶

API Key Required: Set OPENAI_API_KEY in your .env file.

from comproscanner import ComProScanner

scanner = ComProScanner(output_dir="output")

# Process articles with custom chunking
scanner.process_articles(
    property_keywords={
        "exact_keywords": ["d33"],
        "substring_keywords": [" d 33 "]
    },
    rag_db_path="embeddings/piezo",
    chunk_size=800,
    chunk_overlap=50,
    embedding_model="openai:text-embedding-3-small"
)

# Extract with GPT-4o
scanner.extract_composition_property_data(
    main_extraction_keyword="d33",
    rag_db_path="embeddings/piezo",
    rag_chat_model="gpt-4o",
    rag_max_tokens=1024,
    rag_top_k=5,
)

Using Google Gemini¶

API Key Required: Set GEMINI_API_KEY in your .env file.

scanner.extract_composition_property_data(
    main_extraction_keyword="d33",
    rag_db_path="embeddings/piezo",
    rag_chat_model="gemini-2.0-flash",
    rag_max_tokens=1024,
    rag_top_k=4,
)

Using Anthropic Claude¶

API Key Required: Set ANTHROPIC_API_KEY in your .env file.

scanner.extract_composition_property_data(
    main_extraction_keyword="d33",
    rag_db_path="embeddings/piezo",
    rag_chat_model="claude-3-5-sonnet-20241022",
    rag_max_tokens=2048,
    rag_top_k=4,
)

Using Local Ollama¶

scanner.extract_composition_property_data(
    main_extraction_keyword="d33",
    rag_db_path="embeddings/piezo",
    rag_chat_model="ollama/llama3.1",
    rag_base_url="http://localhost:11434",
    rag_max_tokens=512,
    rag_top_k=3,
)

Using Together AI¶

API Key Required: Set TOGETHER_API_KEY in your .env file.

scanner.extract_composition_property_data(
    main_extraction_keyword="d33",
    rag_db_path="embeddings/piezo",
    rag_chat_model="together_ai/meta-llama/Llama-3-70b-chat-hf",
    rag_max_tokens=1024,
    rag_top_k=4,
)

Using OpenRouter¶

API Key Required: Set OPENROUTER_API_KEY in your .env file.

scanner.extract_composition_property_data(
    main_extraction_keyword="d33",
    rag_db_path="embeddings/piezo",
    rag_chat_model="openrouter/meta-llama/llama-3-70b-instruct",
    rag_max_tokens=1024,
    rag_top_k=4,
)

Using Cohere¶

API Key Required: Set COHERE_API_KEY in your .env file.

scanner.extract_composition_property_data(
    main_extraction_keyword="d33",
    rag_db_path="embeddings/piezo",
    rag_chat_model="cohere/command-r-plus",
    rag_max_tokens=1024,
    rag_top_k=4,
)

Using Fireworks AI¶

API Key Required: Set FIREWORKS_API_KEY in your .env file.

scanner.extract_composition_property_data(
    main_extraction_keyword="d33",
    rag_db_path="embeddings/piezo",
    rag_chat_model="fireworks_ai/accounts/fireworks/models/llama-v3-8b-instruct",
    rag_max_tokens=1024,
    rag_top_k=4,
)

Using Domain-Specific Embeddings¶

# Same keyword dictionary as in the OpenAI example
property_keywords = {
    "exact_keywords": ["d33"],
    "substring_keywords": [" d 33 "]
}

# PhysBERT for physics/materials science
scanner.process_articles(
    property_keywords=property_keywords,
    embedding_model="huggingface:thellert/physbert_cased",
    chunk_size=1000,
    chunk_overlap=50
)

# MaterialsBERT for materials science
scanner.process_articles(
    property_keywords=property_keywords,
    embedding_model="huggingface:pranav-s/MaterialsBERT",
    chunk_size=1000,
    chunk_overlap=50
)

Dependencies¶

Install required packages based on your chosen providers:

OpenAI¶

pip install langchain-openai

Google Gemini¶

pip install langchain-google-genai

Anthropic Claude¶

pip install langchain-anthropic

Ollama¶

pip install langchain-ollama

# Install Ollama locally
# Visit: https://ollama.ai/download

Other Providers¶

# Together AI
pip install langchain-together

# Cohere
pip install langchain-cohere

# For HuggingFace embeddings
pip install sentence-transformers
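To see which of the optional packages above are already available in the current environment, a check along these lines may help. It is a minimal sketch: `installed_providers` is a hypothetical helper, and the module names are the import names that correspond to the pip packages listed in this section.

```python
from importlib.util import find_spec

# Import names for the pip packages listed in this section
PROVIDER_MODULES = {
    "OpenAI": "langchain_openai",
    "Google Gemini": "langchain_google_genai",
    "Anthropic Claude": "langchain_anthropic",
    "Ollama": "langchain_ollama",
    "Together AI": "langchain_together",
    "Cohere": "langchain_cohere",
    "HuggingFace embeddings": "sentence_transformers",
}


def installed_providers(modules: dict[str, str] = PROVIDER_MODULES) -> dict[str, bool]:
    """Map each provider label to True if its module can be found, without importing it."""
    return {label: find_spec(module) is not None for label, module in modules.items()}


for label, available in installed_providers().items():
    print(f"{label}: {'installed' if available else 'missing'}")
```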

Next Steps¶

Explore Data Extraction
Review Evaluation Methods
