Data Extraction¶
The data extraction module uses the CrewAI framework with specialized agents to extract composition-property relationships and synthesis data in structured form.
Basic Usage¶
Parameters¶
Required Parameters¶
main_extraction_keyword (str)¶
The specific property to extract from the articles, e.g., "d33" for piezoelectric coefficient.
Optional Parameters¶
start_row (int)¶
Row number from the metadata CSV file to start processing from (for resuming).
num_rows (int)¶
Number of rows (articles) to process.
is_test_data_preparation (bool)¶
Flag to indicate if test data preparation is to be performed. When True, the function will prepare test data by collecting DOIs with composition-property data.
test_doi_list_file (str)¶
Path to a text file containing the test DOIs. Required if is_test_data_preparation is True. This file will store DOIs that contain composition-property data for evaluation purposes.
total_test_data (int)¶
Total number of test articles to collect when is_test_data_preparation is True. The function will stop processing once this many DOIs with composition data are found.
is_only_consider_test_doi_list (bool)¶
Flag to indicate if only the test DOI list should be considered for processing. Should be set to True if the test_doi_list_file already contains the required number of test DOIs and you want to process only those DOIs.
test_random_seed (int)¶
Random seed for test data preparation, so that the same DOIs are selected on every run.
checked_doi_list_file (str)¶
Path to a text file listing the DOIs that have already been processed. Used to avoid reprocessing the same papers.
json_results_file (str)¶
Path to the JSON results file where extracted data will be saved.
csv_results_file (str)¶
Path to the CSV results file where extracted data will be saved if is_save_csv is True.
is_save_csv (bool)¶
Flag to indicate if the results should be saved in CSV format in addition to JSON.
is_extract_synthesis_data (bool)¶
Flag to indicate if the synthesis data (methods, precursors, characterization techniques) should be extracted along with composition-property data.
is_save_relevant (bool)¶
Flag to indicate if only papers with composition-property data should be saved. If True, only saves papers that contain composition data. If False, saves all processed papers regardless of whether they contain composition data.
is_data_clean (bool)¶
Flag to indicate if the extracted data should be cleaned after processing. When True, applies data cleaning strategies to improve data quality.
cleaning_strategy (str)¶
The cleaning strategy to use when is_data_clean is True. Options are "full" (with periodic element validation) or "basic" (without periodic element validation).
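As an illustration of what periodic element validation could involve in the "full" strategy, the sketch below checks that every element symbol in a formula exists in the periodic table. The function name has_valid_elements and the regex-based tokenization are assumptions made for this example; the module's actual cleaning logic may differ.

```python
import re

# All 118 element symbols of the periodic table.
ELEMENTS = set("""
H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn
Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe Cs Ba La
Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt Au Hg Tl Pb Bi Po
At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr Rf Db Sg Bh Hs Mt Ds Rg
Cn Nh Fl Mc Lv Ts Og
""".split())


def has_valid_elements(formula: str) -> bool:
    """Hypothetical check: True if every symbol in `formula` is a real element."""
    symbols = re.findall(r"[A-Z][a-z]?", formula)
    # Whatever remains after removing symbols, digits, dots and parentheses
    # (for example an unresolved variable such as "x") makes the formula invalid.
    leftovers = re.sub(r"[A-Z][a-z]?|[0-9.()]", "", formula)
    return bool(symbols) and not leftovers and all(s in ELEMENTS for s in symbols)


print(has_valid_elements("Eu1.90Dy0.10Ge2O7"))
print(has_valid_elements("Xx2O3"))
```

A check of this kind would reject compositions that contain made-up symbols or unresolved variables, which is the sort of entry a "basic" strategy without element validation would let through.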
materials_data_identifier_query (str)¶
Custom query to identify if materials data is present in the paper. Must be designed to expect a 'yes/no' answer. If not provided, defaults to a query asking about material chemical composition and the corresponding property value.
model (str)¶
Name of the LLM to use for extraction. Supports various providers (OpenAI, Anthropic, Google, etc.).
api_base (str)¶
Base URL for standard API endpoints when using custom API services.
base_url (str)¶
Base URL for the model service, used for custom or local model deployments.
api_key (str)¶
API key for the model service. Can also be set via environment variables for specific providers.
output_log_folder (str)¶
Base folder path to save detailed logs for each processed paper. Logs will be saved in {output_log_folder}/{doi}/ subdirectory. Logs will be in JSON format if is_log_json is True, otherwise plain text.
is_log_json (bool)¶
Flag to indicate if logs should be saved in JSON format. If True, logs will be structured as JSON objects. If False, logs will be plain text.
task_output_folder (str)¶
Base folder path to save output files for each processed paper. Output files will be saved in {task_output_folder}/{doi}/ subdirectory.
verbose (bool)¶
Flag to enable verbose output in the terminal during processing.
temperature (float)¶
Sampling temperature for text generation; controls randomness. Lower values (0.0-0.3) make the output more deterministic, while higher values (0.7-1.0) make it more creative and diverse.
top_p (float)¶
Nucleus sampling parameter for text generation; controls diversity by sampling only from the smallest set of tokens whose cumulative probability reaches p. Lower values focus on high-probability tokens, while higher values allow more diversity.
timeout (int)¶
Request timeout in seconds for API calls to the LLM.
frequency_penalty (float)¶
Frequency penalty for text generation to reduce repetition. Higher values discourage repetition, while lower values allow it.
max_tokens (int)¶
Maximum number of tokens for LLM completion responses.
rag_db_path (str)¶
Custom path to the vector database used by the Retrieval-Augmented Generation (RAG) tool.
embedding_model (str)¶
Name of the embedding model used to read the vector database for RAG.
rag_chat_model (str)¶
Name of the chat model to use for RAG responses during extraction.
rag_max_tokens (int)¶
Maximum number of tokens for RAG chat model responses.
rag_top_k (int)¶
Number of top relevant documents to retrieve from the vector database for RAG.
rag_base_url (str)¶
Base URL for the RAG model service, used for custom or local model deployments.
**flow_optional_args (dict)¶
Optional arguments for the MaterialsFlow class to customize extraction behavior by giving additional notes, examples, and allowed methods/techniques.
Default Values
start_row = 0
num_rows = All rows
is_test_data_preparation = False
test_doi_list_file = None
total_test_data = 50
is_only_consider_test_doi_list = False
test_random_seed = 42
checked_doi_list_file = "checked_dois.txt"
json_results_file = "results.json"
csv_results_file = "results.csv"
is_extract_synthesis_data = True
is_save_csv = False
is_save_relevant = True
is_data_clean = False
cleaning_strategy = "full"
materials_data_identifier_query = "Is there any material chemical composition and corresponding {main_property_keyword} value mentioned in the paper? Give one word answer. Either yes or no."
model = "gpt-4o-mini"
api_base = None
base_url = None
api_key = None
output_log_folder = None
is_log_json = False
task_output_folder = None
verbose = True
temperature = 0.1
top_p = 0.9
timeout = 60
frequency_penalty = None
max_tokens = 2048
rag_db_path = "db"
embedding_model = "huggingface:thellert/physbert_cased"
rag_chat_model = "gpt-4o-mini"
rag_max_tokens = 512
rag_top_k = 3
rag_base_url = None
flow_optional_args = {}
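The resume-related parameters (start_row, num_rows, and checked_doi_list_file) can be pictured with the minimal sketch below. The helper select_rows is hypothetical and only illustrates the documented behavior of starting at a given row, limiting the number of rows, and skipping DOIs that were already processed; it is not the module's implementation.

```python
from typing import Iterable, List, Optional


def select_rows(
    dois: List[str],
    start_row: int = 0,
    num_rows: Optional[int] = None,
    checked: Iterable[str] = (),
) -> List[str]:
    """Hypothetical helper: pick the DOIs that a run would still need to process."""
    # num_rows=None mirrors the documented default of processing all rows.
    end = None if num_rows is None else start_row + num_rows
    window = dois[start_row:end]
    already_done = set(checked)
    return [doi for doi in window if doi not in already_done]


all_dois = ["10.1000/a", "10.1000/b", "10.1000/c", "10.1000/d"]
print(select_rows(all_dois, start_row=1, num_rows=2, checked=["10.1000/b"]))
```

In this sketch, rows 1 and 2 are selected and the already-checked DOI is dropped, so only one article remains to process.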
Extraction Agents¶
The extraction process involves five specialized agents working in sequence to identify and extract relevant data from the articles based on the specified property keyword.
1. Materials Data Identifier (1️⃣)¶
Purpose: Determines whether the article text contains the target material composition and property data.
Default Query:
Is there any material chemical composition and corresponding {main_property_keyword} value mentioned in the paper? Give one word answer. Either yes or no.
Output: Yes/No
Used Tools:
RAG Tool
Retrieval-Augmented Generation (RAG) queries the vector database that was built from the property-mentioning articles during article processing, giving the LLM the relevant context it needs for accurate identification.
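The identifier step boils down to filling the query template with the property keyword and reducing the model's reply to a yes/no decision. The sketch below shows one way this could look; build_identifier_query and parse_yes_no are hypothetical helpers, and how the module actually normalizes the LLM reply is an assumption.

```python
import re

# Default query text as documented above.
DEFAULT_QUERY = (
    "Is there any material chemical composition and corresponding "
    "{main_property_keyword} value mentioned in the paper? "
    "Give one word answer. Either yes or no."
)


def build_identifier_query(keyword: str, template: str = DEFAULT_QUERY) -> str:
    """Fill the property keyword into the query template."""
    return template.format(main_property_keyword=keyword)


def parse_yes_no(answer: str) -> bool:
    """Hypothetical normalizer: map a free-text reply to True (yes) or False (no)."""
    match = re.match(r"\W*(yes|no)\b", answer.strip().lower())
    if match is None:
        raise ValueError(f"Expected a yes/no answer, got: {answer!r}")
    return match.group(1) == "yes"


print(build_identifier_query("d33"))
print(parse_yes_no("Yes."))
```

A custom materials_data_identifier_query should keep the same one-word yes/no contract so that a reply can be interpreted unambiguously.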
2. Composition-Property Data Extractor (2️⃣) & Composition-Property Data Formatter (3️⃣)¶
Purpose: The Composition-Property Data Extractor extracts compositions and property values from the article text, along with the corresponding unit and material family. The Composition-Property Data Formatter then formats the extracted data into structured JSON similar to the following example.
Output Format:
{
  "composition_data": {
    "compositions_property_values": {
      "Eu1.90Dy0.10Ge2O7": 0.66,
      "Eu1.90La0.10Ge2O7": 0.36,
      "Eu1.90Ho0.10Ge2O7": 0.62
    },
    "property_unit": "pC/N",
    "family": "RE2B2O7"
  }
}
Used Tools:
MaterialParser Tool
The MaterialParser Tool is used by the Composition-Property Data Formatter agent. Material-parser is a deep learning model, developed by Foppiano et al., designed specifically for parsing chemical compositions whose fractions are given as variables, e.g., \(Na_{(1-x)}Li_xTiO_3\) where x = 0.1, 0.3, and 0.4. The tool uses the material-parser model to resolve such variable-fraction compositions into standardized final compositions. The previous example would be parsed into three distinct compositions: Na(0.9)Li(0.1)TiO3, Na(0.7)Li(0.3)TiO3, and Na(0.6)Li(0.4)TiO3.
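To make the expansion concrete, the sketch below substitutes values of x into a plain-text template and evaluates each parenthesized fraction. This is only a small arithmetic illustration of the expected result, not the material-parser model itself; the helper expand_composition and the plain-text template notation "Na(1-x)Li(x)TiO3" are assumptions made for this example.

```python
import ast
import operator
import re

# Arithmetic operators allowed inside a parenthesized fraction.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node, variables):
    """Safely evaluate a small arithmetic expression tree such as `1-x`."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body, variables)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return float(node.value)
    if isinstance(node, ast.Name):
        return float(variables[node.id])
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, variables)
        right = _evaluate(node.right, variables)
        return _OPS[type(node.op)](left, right)
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_evaluate(node.operand, variables)
    raise ValueError("Unsupported expression in composition template")


def expand_composition(template: str, variables: dict) -> str:
    """Replace every parenthesized expression with its numeric value."""
    def substitute(match):
        value = _evaluate(ast.parse(match.group(1), mode="eval"), variables)
        return f"({round(value, 6):g})"
    return re.sub(r"\(([^()]+)\)", substitute, template)


for x in (0.1, 0.3, 0.4):
    print(expand_composition("Na(1-x)Li(x)TiO3", {"x": x}))
```

The sketch handles only simple arithmetic in parentheses; parenthesized chemical groups such as (OH) are outside its scope and would raise an error.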
3. Synthesis Data Extractor (4️⃣) & Synthesis Data Formatter (5️⃣)¶
Purpose: The Synthesis Data Extractor extracts synthesis-related data from the article text, including the method, precursors, steps, and characterization techniques. The Synthesis Data Formatter then formats the extracted data into structured JSON similar to the following example.
Output Format:
{
  "synthesis_data": {
    "method": "solid-state reaction",
    "precursors": ["Eu2O3", "GeO2", "Dy2O3", "La2O3", "Ho2O3"],
    "steps": [
      "Starting materials Eu2O3, GeO2, Dy2O3, La2O3 and Ho2O3 were combined in stoichiometric ratios with each dopant at 5 mol%.",
      "Samples were first heated at 800°C for 2 hours in pure alumina crucibles under open atmosphere.",
      "Materials were then heated to 1150°C for 10 hours followed by slow cooling.",
      "Resulting materials were ground into powder for further characterization.",
      "Ceramic discs were formed from obtained powder materials with 1 mm thickness and 10 mm diameter.",
      "Ceramic discs were compacted using uniaxial pressing under 250 MPa pressure with 2 wt% of 5 wt% PVA aqueous solution as binder.",
      "Samples were heated at 600°C for 30 minutes to eliminate organic additives.",
      "Sintering was conducted at 1400°C for 4 hours.",
      "Silver paste was applied to disc surfaces and fired at 650°C for 1 hour to form surface electrodes.",
      "Electric field of 9-18 kV/mm was applied in silicon oil bath at 120°C for 30 minutes followed by 24-hour aging."
    ],
    "characterization_techniques": [
      "TG/DTA",
      "XRD",
      "SEM",
      "EDX",
      "photoluminescence spectroscopy",
      "LCR meter",
      "d33 meter"
    ]
  }
}
Extraction Workflow Diagram¶

Flow Optional Arguments¶
Customize extraction behavior by providing additional examples, notes, and allowed methods/techniques via the flow_optional_args dictionary, whose values are formatted strings or lists of strings:
flow_optional_args = {
    # Doubled braces ({{ and }}) inside an f-string produce literal { and }.
    "expected_composition_property_example": f"""
    {{
        "compositions":
        {{
            "Ba0.99Ca0.01Ti0.68Zr0.32O3": 375,
            "Ba0.98Ca0.02Ti0.78Zr0.22O3": 350,
            "Ba0.97Ca0.03Ti0.88Zr0.12O3": 325,
            "Ba0.96Ca0.04Ti0.98Zr0.02O3": 300
        }},
        "property_unit": "pC/N",
        "family": "BaTiO3"
    }}""",
    "expected_variable_composition_property_example": f"""
    {{
        "compositions":
        {{
            "0.5NaNbO3": 375,
            "(1-x)Na0.2K2(x)Bi0.5TiO3 - (y)NaNbO3 where x=0, y=0.5": 350,
            "(1-x)Na0.2K2(x)Bi0.5TiO3 - (y)NaNbO3 where x=0.1, y=0.4": 325,
            "(1-x)Na0.2K2(x)Bi0.5TiO3 - (y)NaNbO3 where x=0.2, y=0.3": 375,
            "(1-x)Na0.2K2(x)Bi0.5TiO3 - (y)NaNbO3 where x=0.3, y=0.1": 425
        }},
        "property_unit": "pC/N",
        "family": "NaNbO3"
    }}""",
    "composition_property_extraction_task_notes": [
        "Write complete chemical formulas",
        "Include crystal structure if mentioned",
        "Note measurement conditions"
    ],
    "synthesis_extraction_task_notes": [
        "Use short method names",
        "List all precursors",
        "Include processing temperatures"
    ],
    "allowed_synthesis_methods": [
        "Solid-state reaction",
        "Sol-gel",
        "Hydrothermal",
        "Chemical vapor deposition"
    ],
    "allowed_characterization_techniques": [
        "XRD",
        "SEM",
        "TEM",
        "FTIR"
    ]
}

# `scanner` is assumed to be the already-initialized object from Basic Usage.
scanner.extract_composition_property_data(
    main_extraction_keyword="d33",
    **flow_optional_args
)
Allowed Entities for **flow_optional_args
expected_composition_property_example (str): Example of expected composition-property JSON format for compositions and target properties. The string should be properly formatted similar to the example provided above.
expected_variable_composition_property_example (str): Example of expected variable composition-property JSON format for compositions with variable components and target properties. The string should be properly formatted similar to the example provided above.
composition_property_extraction_agent_notes (list): Notes for the extraction agent to consider when performing the extraction.
composition_property_extraction_task_notes (list): Notes for the extraction task to consider when performing the extraction by the composition-property data extraction agent.
composition_property_formatting_agent_notes (list): Notes for the formatting agent to consider when formatting the extracted data.
composition_property_formatting_task_notes (list): Notes for the formatting task to consider when formatting the extracted composition-property data by the composition-property data formatting agent.
synthesis_extraction_agent_notes (list): Notes for the synthesis data extraction agent to consider when performing the extraction.
synthesis_extraction_task_notes (list): Notes for the synthesis data extraction task to consider when performing the extraction by the synthesis data extraction agent.
synthesis_formatting_agent_notes (list): Notes for the synthesis data formatting agent to consider when formatting the extracted data.
synthesis_formatting_task_notes (list): Notes for the synthesis data formatting task to consider when formatting the extracted synthesis data by the synthesis data formatting agent.
allowed_synthesis_methods (list): List of allowed synthesis methods to guide the extraction process. If specified, only these methods should be considered during extraction.
allowed_characterization_techniques (list): List of allowed characterization techniques to guide the extraction process. If specified, only these techniques should be considered during extraction.
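Because flow_optional_args is passed as keyword arguments, a misspelled key is easy to miss. The sketch below checks a dictionary against the entity names listed above before the call is made; the helper unknown_flow_args is hypothetical, and whether the library itself rejects or ignores unknown keys is not stated here.

```python
# Entity names taken from the list above.
ALLOWED_FLOW_ARGS = {
    "expected_composition_property_example",
    "expected_variable_composition_property_example",
    "composition_property_extraction_agent_notes",
    "composition_property_extraction_task_notes",
    "composition_property_formatting_agent_notes",
    "composition_property_formatting_task_notes",
    "synthesis_extraction_agent_notes",
    "synthesis_extraction_task_notes",
    "synthesis_formatting_agent_notes",
    "synthesis_formatting_task_notes",
    "allowed_synthesis_methods",
    "allowed_characterization_techniques",
}


def unknown_flow_args(flow_args: dict) -> list:
    """Hypothetical pre-flight check: return keys that are not documented entities."""
    return sorted(key for key in flow_args if key not in ALLOWED_FLOW_ARGS)


candidate = {
    "allowed_synthesis_methods": ["Sol-gel"],
    "synthesis_extraction_notes": ["List all precursors"],  # misspelled on purpose
}
print(unknown_flow_args(candidate))
```

Running such a check before calling extract_composition_property_data surfaces typos early instead of leaving a note silently unused.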
Article Specific Metadata Collection¶
Once data extraction is complete, article-specific metadata (DOI, title, authors, journal, publication year, publisher, open-access information, and keywords) is collected via the Scopus API or the OA.Works API and included in the final output JSON/CSV files along with the extracted data.
{
  "article_metadata": {
    "doi": "10.1016/j.apradiso.2024.111655",
    "title": "Novel smart materials with high curie temperatures: Eu1.90Dy0.10Ge2O7, Eu1.90La0.10Ge2O7 and Eu1.90Ho0.10Ge2O7",
    "journal": "Applied Radiation and Isotopes",
    "year": "2025",
    "isOpenAccess": false,
    "authors": [
      {
        "name": "Esra Öztürk",
        "affiliation_id": "60020484",
        "affiliation_name": "Hacettepe Üniversitesi",
        "affiliation_country": "Turkey"
      },
      {
        "name": "Nilgun Kalaycioglu Ozpozan",
        "affiliation_id": "122321412",
        "affiliation_name": "Erciyes Ün.",
        "affiliation_country": "Türkiye"
      },
      {
        "name": "Volkan Kalem",
        "affiliation_id": "60193845",
        "affiliation_name": "Konya Technical University",
        "affiliation_country": "Turkey"
      }
    ],
    "keywords": ["Curie"]
  }
}
Final Output Example¶
{
  "10.1016/j.apradiso.2024.111655": {
    "composition_data": {
      "compositions_property_values": {
        "Eu1.90Dy0.10Ge2O7": 0.66,
        "Eu1.90La0.10Ge2O7": 0.36,
        "Eu1.90Ho0.10Ge2O7": 0.62
      },
      "property_unit": "pC/N",
      "family": "RE2B2O7"
    },
    "synthesis_data": {
      "method": "solid-state reaction",
      "precursors": ["Eu2O3", "GeO2", "Dy2O3", "La2O3", "Ho2O3"],
      "steps": [
        "Starting materials Eu2O3, GeO2, Dy2O3, La2O3 and Ho2O3 were combined in stoichiometric ratios with each dopant at 5 mol%.",
        "Samples were first heated at 800°C for 2 hours in pure alumina crucibles under open atmosphere.",
        "Materials were then heated to 1150°C for 10 hours followed by slow cooling.",
        "Resulting materials were ground into powder for further characterization.",
        "Ceramic discs were formed from obtained powder materials with 1 mm thickness and 10 mm diameter.",
        "Ceramic discs were compacted using uniaxial pressing under 250 MPa pressure with 2 wt% of 5 wt% PVA aqueous solution as binder.",
        "Samples were heated at 600°C for 30 minutes to eliminate organic additives.",
        "Sintering was conducted at 1400°C for 4 hours.",
        "Silver paste was applied to disc surfaces and fired at 650°C for 1 hour to form surface electrodes.",
        "Electric field of 9-18 kV/mm was applied in silicon oil bath at 120°C for 30 minutes followed by 24-hour aging."
      ],
      "characterization_techniques": [
        "TG/DTA",
        "XRD",
        "SEM",
        "EDX",
        "photoluminescence spectroscopy",
        "LCR meter",
        "d33 meter"
      ]
    },
    "article_metadata": {
      "doi": "10.1016/j.apradiso.2024.111655",
      "title": "Novel smart materials with high curie temperatures: Eu1.90Dy0.10Ge2O7, Eu1.90La0.10Ge2O7 and Eu1.90Ho0.10Ge2O7",
      "journal": "Applied Radiation and Isotopes",
      "year": "2025",
      "isOpenAccess": false,
      "authors": [
        {
          "name": "Esra Öztürk",
          "affiliation_id": "60020484",
          "affiliation_name": "Hacettepe Üniversitesi",
          "affiliation_country": "Turkey"
        },
        {
          "name": "Nilgun Kalaycioglu Ozpozan",
          "affiliation_id": "122321412",
          "affiliation_name": "Erciyes Ün.",
          "affiliation_country": "Türkiye"
        },
        {
          "name": "Volkan Kalem",
          "affiliation_id": "60193845",
          "affiliation_name": "Konya Technical University",
          "affiliation_country": "Turkey"
        }
      ],
      "keywords": ["Curie"]
    }
  }
  // More articles...
}
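Since the final output is keyed by DOI, it is straightforward to flatten it into one row per composition for further analysis. The sketch below works on an in-memory dictionary shaped like the example above; flatten_results is a hypothetical helper, and in practice the dictionary would presumably come from loading the file given in json_results_file with json.load.

```python
def flatten_results(results: dict) -> list:
    """Hypothetical helper: one row per (DOI, composition) pair."""
    rows = []
    for doi, record in results.items():
        composition_data = record.get("composition_data", {})
        values = composition_data.get("compositions_property_values", {})
        for formula, value in values.items():
            rows.append({
                "doi": doi,
                "composition": formula,
                "value": value,
                "unit": composition_data.get("property_unit"),
                "family": composition_data.get("family"),
            })
    return rows


# Trimmed copy of the example output shown above.
sample = {
    "10.1016/j.apradiso.2024.111655": {
        "composition_data": {
            "compositions_property_values": {
                "Eu1.90Dy0.10Ge2O7": 0.66,
                "Eu1.90La0.10Ge2O7": 0.36,
                "Eu1.90Ho0.10Ge2O7": 0.62,
            },
            "property_unit": "pC/N",
            "family": "RE2B2O7",
        }
    }
}

for row in flatten_results(sample):
    print(row["composition"], row["value"], row["unit"])
```

Articles without composition data (possible when is_save_relevant is False) simply contribute no rows in this sketch.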
Next Steps¶
- Learn about Evaluation
- Explore Visualization
- Configure Advanced RAG