RAG Configuration¶
Customize Retrieval-Augmented Generation (RAG) settings for improved data extraction accuracy specific to your use case.
RAG is Used by Materials Data Identifier Agent
RAG is automatically used during data extraction to retrieve relevant context from article text before querying the LLM to understand whether the article has material compositions and corresponding property values for screening. The parameters below allow you to customize RAG behavior based on your specific requirements, such as article length, domain-specific models, or extraction complexity.
Configuration Parameters¶
Chunking Parameters¶
These parameters control how articles are split into chunks for vector storage.
chunk_size (int)¶
Size of text chunks in characters for creating vector database embeddings.
chunk_overlap (int)¶
Number of overlapping characters between consecutive chunks to maintain context continuity.
Default Values
chunk_size = 1000chunk_overlap = 25
Embedding Model Parameters¶
embedding_model (str)¶
Name of the embedding model to use for creating vector databases for RAG.
Supported Providers:
| Provider | Format | Example |
|---|---|---|
| HuggingFace | huggingface:model-name | huggingface:thellert/physbert_cased |
| Sentence Transformers | sentence-transformers:model-name | sentence-transformers:all-mpnet-base-v2 |
| OpenAI | • openai:model-name• model-name (default behavior) | • openai:text-embedding-3-small • text-embedding-3-small |
Default Value
embedding_model = "huggingface:thellert/physbert_cased"
Recommended Models for Scientific Text
- PhysBERT:
huggingface:thellert/physbert_cased- A specialized text embedding model for physics, designed to improve information retrieval, citation classification, and clustering of physics literature - MatBERT:
huggingface:pranav-s/MaterialsBERT- A fine-tuned version of PubMedBERT on a dataset of 2.4 million materials science abstracts - MatSciBERT:
huggingface:m3rg-iitd/matscibert- A material domain language model for text mining and information extraction
Retrieval Parameters¶
These parameters control how relevant context is retrieved during extraction.
rag_db_path (str)¶
Custom path to store or load the vector databases of property-mentioned articles for RAG processing.
rag_top_k (int)¶
Number of most relevant text chunks to retrieve from the vector database for context.
rag_max_tokens (int)¶
Maximum number of tokens for RAG model responses.
Default Values
rag_db_path = "db"rag_top_k = 3rag_max_tokens = 512
RAG Chat Model Parameters¶
rag_chat_model (str)¶
Chat model to use for RAG-based context retrieval and synthesis.
rag_base_url (str)¶
Custom base URL for RAG chat model API (useful for local or custom deployments).
Default Values
rag_chat_model = "gpt-4o-mini"rag_base_url = None
Configuration Examples¶
Using OpenAI¶
API Key Required: Set OPENAI_API_KEY in your .env file.
from comproscanner import ComProScanner
scanner = ComProScanner(output_dir="output")
# Process articles with custom chunking
scanner.process_articles(
property_keywords={
"exact_keywords": ["d33"],
"substring_keywords": [" d 33 "]
},
rag_db_path="embeddings/piezo",
chunk_size=800,
chunk_overlap=50,
embedding_model="openai:text-embedding-3-small"
)
# Extract with GPT-4o
scanner.extract_composition_property_data(
main_extraction_keyword="d33",
rag_db_path="embeddings/piezo",
rag_chat_model="gpt-4o",
rag_max_tokens=1024,
rag_top_k=5,
)
Using Google Gemini¶
API Key Required: Set GEMINI_API_KEY in your .env file.
scanner.extract_composition_property_data(
main_extraction_keyword="d33",
rag_db_path="embeddings/piezo",
rag_chat_model="gemini-2.0-flash",
rag_max_tokens=1024,
rag_top_k=4,
)
Using Anthropic Claude¶
API Key Required: Set ANTHROPIC_API_KEY in your .env file.
scanner.extract_composition_property_data(
main_extraction_keyword="d33",
rag_db_path="embeddings/piezo",
rag_chat_model="claude-3-5-sonnet-20241022",
rag_max_tokens=2048,
rag_top_k=4,
)
Using Local Ollama¶
scanner.extract_composition_property_data(
main_extraction_keyword="d33",
rag_db_path="embeddings/piezo",
rag_chat_model="ollama/llama3.1",
rag_base_url="http://localhost:11434",
rag_max_tokens=512,
rag_top_k=3,
)
Using Together AI¶
API Key Required: Set TOGETHER_API_KEY in your .env file.
scanner.extract_composition_property_data(
main_extraction_keyword="d33",
rag_db_path="embeddings/piezo",
rag_chat_model="together_ai/meta-llama/Llama-3-70b-chat-hf",
rag_max_tokens=1024,
rag_top_k=4,
)
Using OpenRouter¶
API Key Required: Set OPENROUTER_API_KEY in your .env file.
scanner.extract_composition_property_data(
main_extraction_keyword="d33",
rag_db_path="embeddings/piezo",
rag_chat_model="openrouter/meta-llama/llama-3-70b-instruct",
rag_max_tokens=1024,
rag_top_k=4,
)
Using Cohere¶
API Key Required: Set COHERE_API_KEY in your .env file.
scanner.extract_composition_property_data(
main_extraction_keyword="d33",
rag_db_path="embeddings/piezo",
rag_chat_model="cohere/command-r-plus",
rag_max_tokens=1024,
rag_top_k=4,
)
Using Fireworks AI¶
API Key Required: Set FIREWORKS_API_KEY in your .env file.
scanner.extract_composition_property_data(
main_extraction_keyword="d33",
rag_db_path="embeddings/piezo",
rag_chat_model="fireworks_ai/accounts/fireworks/models/llama-v3-8b-instruct",
rag_max_tokens=1024,
rag_top_k=4,
)
Using Domain-Specific Embeddings¶
# PhysBERT for physics/materials science
scanner.process_articles(
property_keywords=property_keywords,
embedding_model="huggingface:thellert/physbert_cased",
chunk_size=1000,
chunk_overlap=50
)
# MatBERT for materials science
scanner.process_articles(
property_keywords=property_keywords,
embedding_model="huggingface:pranav-s/MaterialsBERT",
chunk_size=1000,
chunk_overlap=50
)
Dependencies¶
Install required packages based on your chosen providers:
OpenAI¶
Google Gemini¶
Anthropic Claude¶
Ollama¶
Other Providers¶
# Together AI
pip install langchain-together
# Cohere
pip install langchain-cohere
# For HuggingFace embeddings
pip install sentence-transformers
Next Steps¶
- Explore Data Extraction
- Review Evaluation Methods