ComProScanner Documentation

Welcome

ComProScanner is a comprehensive Python package for extracting composition-property relationships from scientific articles, with a particular focus on materials science. It provides tools for metadata collection, article processing from various publishers, extraction of composition-property data, evaluation of extraction performance, and visualization of results.

Key Features

Automated Data Extraction

Extract composition-property relationships in structured format from scientific literature automatically using AI-powered agents.

Multi-Publisher Support

Process articles from Elsevier, Wiley, Springer, and IOP through TDM API integration, with local PDF support for any publisher.

Comprehensive Evaluation

Built-in semantic and agentic evaluation methods assess extraction quality quickly and automatically.

Rich Visualization

Create beautiful charts, graphs, and knowledge graphs from extracted data and evaluation results out of the box.

Quick Start

Get started with ComProScanner in just a few steps:

Installation

pip install comproscanner
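
To confirm that the install succeeded, one option is to query the installed distribution through the standard library. This is a minimal sketch: `installed_version` is a hypothetical helper written for this page, not part of ComProScanner, and it assumes the distribution name matches the `pip install` name above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str):
    """Return the installed version string for dist_name, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the version string if the package is installed, otherwise None.
print(installed_version("comproscanner"))
```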

Basic Usage

from comproscanner import ComProScanner

# Initialize with property of interest
scanner = ComProScanner(main_property_keyword="piezoelectric")

# Collect metadata
scanner.collect_metadata(
    base_queries=["piezoelectric", "piezoelectricity"],
)

# Define property keywords for filtering
property_keywords = {
    "exact_keywords": ["d33"],
    "substring_keywords": [" d 33 "]
}

# Process articles
scanner.process_articles(
    property_keywords=property_keywords,
    source_list=["elsevier", "springer"]
)

# Extract data
scanner.extract_composition_property_data(
    main_extraction_keyword="d33"
)
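
The `property_keywords` dictionary above separates exact keywords from substring keywords. The matching rules ComProScanner applies internally are not described on this page, so the sketch below only illustrates one plausible reading: exact keywords match as whole tokens, and substring keywords match anywhere in the raw text, spaces included. `matches_keywords` is a hypothetical helper, not a library function.

```python
import re


def matches_keywords(text: str, property_keywords: dict) -> bool:
    """Illustrative filter (assumed semantics, not ComProScanner's own code)."""
    # Assumption: an exact keyword must appear as a whole token,
    # so "d33" matches "the d33 value" but not "d333".
    for keyword in property_keywords.get("exact_keywords", []):
        if re.search(rf"(?<!\w){re.escape(keyword)}(?!\w)", text):
            return True
    # Assumption: a substring keyword matches verbatim anywhere in the text,
    # which is why " d 33 " carries its surrounding spaces.
    for keyword in property_keywords.get("substring_keywords", []):
        if keyword in text:
            return True
    return False


keywords = {"exact_keywords": ["d33"], "substring_keywords": [" d 33 "]}
print(matches_keywords("The d33 value was measured at room temperature.", keywords))
print(matches_keywords("The coefficient d 33 was measured.", keywords))
print(matches_keywords("No relevant property is mentioned.", keywords))
```

Under this reading, the substring entry catches text where the subscript was split from its symbol during extraction, which a whole-token match on `d33` would miss.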

Workflow Overview

ComProScanner follows a sequential workflow:

Basic Flowchart

graph LR
    A[Metadata Collection] --> B[Article Processing]
    B --> C[Data Extraction]
    C --> D[Post Processing]
    D --> E[Visualization]
    D --> F[Evaluation]

Workflow Diagram

  1. Metadata Collection - Find relevant scientific articles in the Scopus database
  2. Article Processing - Retrieve full-text articles from various publishers and prepare them for data extraction
  3. Data Extraction - Use multiple AI agents to extract structured data from the collected articles
  4. Post Processing - Clean the extracted data, evaluate it, and create charts for visualization
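
The stage ordering in the flowchart can also be written as a small dependency graph. The sketch below is only an illustration of that ordering using the standard-library `graphlib` module (Python 3.9+). The `WORKFLOW` dictionary and `stage_order` helper are hypothetical names, and nothing here calls ComProScanner itself.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages that must finish before it starts,
# mirroring the arrows in the flowchart.
WORKFLOW = {
    "Metadata Collection": set(),
    "Article Processing": {"Metadata Collection"},
    "Data Extraction": {"Article Processing"},
    "Post Processing": {"Data Extraction"},
    "Visualization": {"Post Processing"},
    "Evaluation": {"Post Processing"},
}


def stage_order(workflow: dict) -> list:
    """Return one valid execution order for the stages."""
    return list(TopologicalSorter(workflow).static_order())


for stage in stage_order(WORKFLOW):
    print(stage)
```

The first four stages form a strict chain, so their order is fixed. Visualization and Evaluation both depend only on Post Processing, so either may come first.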

Core Modules

  • Metadata Extractor
    Function for collecting and filtering article metadata from the Scopus database.

  • Article Processor
    Function for processing articles from different publishers (Elsevier, Wiley, IOP, and Springer) and local PDFs from any publisher.

  • Composition-Property Extractor
    Function for the AI-agent-powered extraction flow, which produces composition-property and synthesis data in a structured format.

  • Evaluator
    Functions for automated semantic and agentic evaluation of extraction quality.

  • Visualizer
    Functions for visualizing extracted data and evaluation results out of the box.

Paper

Read the details of ComProScanner in the following preprint: arXiv:2510.20362

Citation

If you use ComProScanner in your research, please cite:

@misc{roy2025comproscannermultiagentbasedframework,
      title={ComProScanner: A multi-agent based framework for composition-property structured data extraction from scientific literature},
      author={Aritra Roy and Enrico Grisan and John Buckeridge and Chiara Gattinoni},
      year={2025},
      eprint={2510.20362},
      archivePrefix={arXiv},
      primaryClass={physics.comp-ph},
      url={https://arxiv.org/abs/2510.20362},
}

License

ComProScanner is licensed under the MIT License.