ComProScanner Documentation¶
Welcome¶
ComProScanner is a comprehensive Python package designed to extract composition-property relationships from scientific articles, with a particular focus on materials science. It provides tools for metadata collection, article processing from various publishers, extraction of composition-property data, evaluation of extraction performance, and visualization of results.
Key Features¶
Automated Data Extraction
Extract composition-property relationships in structured format from scientific literature automatically using AI-powered agents.
Multi-Publisher Support
Process articles from Elsevier, Wiley, Springer, and IOP through TDM API integration, and process local PDFs from any publisher.
Comprehensive Evaluation
Assess extraction quality quickly and automatically with built-in semantic and agentic evaluation methods.
Rich Visualization
Create beautiful charts, graphs, and knowledge graphs from extracted data and evaluation results out of the box.
Quick Start¶
Get started with ComProScanner in just a few steps:
Installation¶
Install ComProScanner from PyPI, where the package is published as comproscanner:
pip install comproscanner
Basic Usage¶
from comproscanner import ComProScanner

# Initialize with property of interest
scanner = ComProScanner(main_property_keyword="piezoelectric")

# Collect metadata
scanner.collect_metadata(
    base_queries=["piezoelectric", "piezoelectricity"],
)

# Define property keywords for filtering
property_keywords = {
    "exact_keywords": ["d33"],
    "substring_keywords": [" d 33 "],
}

# Process articles
scanner.process_articles(
    property_keywords=property_keywords,
    source_list=["elsevier", "springer"],
)

# Extract data
scanner.extract_composition_property_data(
    main_extraction_keyword="d33",
)
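The two keyword lists in `property_keywords` suggest two matching modes. The sketch below is one plausible reading, not ComProScanner's actual filter: it assumes exact keywords must appear as standalone tokens, while substring keywords match anywhere in the text, spaces included. The helper name `mentions_property` is hypothetical and exists only for illustration.

```python
import re


def mentions_property(text: str, property_keywords: dict) -> bool:
    """Return True if `text` matches any exact or substring keyword.

    Hypothetical helper for illustration; not part of ComProScanner.
    """
    # Exact keywords (assumed semantics): match only as a standalone token,
    # so "d33" does not match inside a longer word such as "id33".
    for keyword in property_keywords.get("exact_keywords", []):
        pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
        if re.search(pattern, text):
            return True
    # Substring keywords (assumed semantics): plain containment check,
    # with the surrounding spaces in " d 33 " taken literally.
    return any(
        keyword in text
        for keyword in property_keywords.get("substring_keywords", [])
    )


property_keywords = {
    "exact_keywords": ["d33"],
    "substring_keywords": [" d 33 "],
}

print(mentions_property("The measured d33 was 350 pC/N.", property_keywords))  # True
print(mentions_property("a high d 33 value was reported", property_keywords))  # True
print(mentions_property("sample id33 was discarded", property_keywords))       # False
```

Under these assumptions, the spaced-out form `" d 33 "` catches text where the subscript was separated from the symbol, while the exact form avoids false hits inside unrelated words.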
Workflow Overview¶
ComProScanner follows a sequential workflow:
Basic Flowchart¶
graph LR
    A[Metadata Collection] --> B[Article Processing]
    B --> C[Data Extraction]
    C --> D[Post Processing]
    D --> E[Visualization]
    D --> F[Evaluation]
Workflow Diagram¶

- Metadata Collection - Find relevant scientific articles in the Scopus database
- Article Processing - Retrieve full-text articles from various publishers to prepare them for data extraction
- Data Extraction - Use multiple AI agents to extract structured data from the collected articles
- Post Processing - Clean the extracted data, evaluate it, and create charts for visualization
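The stage ordering in the flowchart can be sketched as a small dependency graph. This is only an illustration of the ordering, using Python's standard-library `graphlib` (Python 3.9+); the `WORKFLOW` dictionary and `execution_order` function are hypothetical names, not part of ComProScanner.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages that must finish before it can start,
# mirroring the arrows in the flowchart.
WORKFLOW = {
    "Metadata Collection": set(),
    "Article Processing": {"Metadata Collection"},
    "Data Extraction": {"Article Processing"},
    "Post Processing": {"Data Extraction"},
    "Visualization": {"Post Processing"},
    "Evaluation": {"Post Processing"},
}


def execution_order(workflow: dict) -> list:
    """Return one valid order in which the stages can run."""
    return list(TopologicalSorter(workflow).static_order())


order = execution_order(WORKFLOW)
# The first four stages form a strict chain; Visualization and Evaluation
# both depend only on Post Processing, so they may come in either order.
print(order[:4])
```

The graph makes the one branch point explicit: everything up to Post Processing is strictly sequential, and the two final stages are independent of each other.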
Core Modules¶
- Metadata Extractor - Function for collecting and filtering article metadata from the Scopus database.
- Article Processor - Function for processing articles from different publishers: Elsevier, Wiley, IOP, Springer, and local PDFs (any publisher).
- Composition-Property Extractor - Function for the AI-agent-powered extraction flow that produces composition-property and synthesis data in a structured format.
- Evaluator - Functions for automated semantic and agentic evaluation methods to assess extraction quality.
- Visualizer - Functions for visualizing extracted data and evaluation results out of the box.
What's Next?¶
- Getting Started - Learn the basics and get ComProScanner up and running quickly.
- User Guide - Comprehensive guides for all features and capabilities.
- Advanced Configuration - Advanced features like RAG configuration and custom flows.
Paper¶
Read the details of ComProScanner in the following preprint: arXiv:2510.20362
Citation¶
If you use ComProScanner in your research, please cite:
@misc{roy2025comproscannermultiagentbasedframework,
    title={ComProScanner: A multi-agent based framework for composition-property structured data extraction from scientific literature},
    author={Aritra Roy and Enrico Grisan and John Buckeridge and Chiara Gattinoni},
    year={2025},
    eprint={2510.20362},
    archivePrefix={arXiv},
    primaryClass={physics.comp-ph},
    url={https://arxiv.org/abs/2510.20362},
}
Community & Support¶
- GitHub: slimeslab/ComProScanner
- PyPI: comproscanner
- Issues: Report a bug
- Email: contact@aritraroy.live
License¶
ComProScanner is licensed under the MIT License.