
Data Cleaning

The data cleaning module post-processes the extracted chemical formulas. It removes entries based on abbreviations and periodic-table element checks, resolves arithmetic expressions and fractional compositions, and standardizes brackets, among other steps.

Basic Usage

from comproscanner import ComProScanner

# Initialize scanner
scanner = ComProScanner(main_property_keyword="piezoelectric")

# Clean extracted data
scanner.clean_data(
    json_results_file="extracted_results.json"
)

Parameters

Required Parameters

json_results_file (str)

Path to the JSON results file containing extracted data that needs to be cleaned.

Optional Parameters

is_save_separate_results (bool)

Whether to save the cleaned results to a separate file (see cleaned_json_results_file).

cleaned_json_results_file (str)

Path to the cleaned JSON results file, which contains the articles that have relevant composition-property data.

is_save_composition_property_file (bool)

Whether to save composition-property values to a separate file as a dictionary.

composition_property_file (str)

Path to the cleaned composition-property file containing a dictionary of composition-property data.

cleaning_strategy (str)

The cleaning strategy to use: either full or basic. Both strategies apply comprehensive cleaning, including abbreviation removal, arithmetic resolution, and bracket standardization. The full strategy additionally ensures that the remaining entries have compositions consisting only of periodic-table elements.
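To make the cleaning steps named above more concrete, here is a minimal, self-contained sketch of bracket standardization, simple arithmetic resolution, and a periodic-element check. It is not ComProScanner's implementation: the function names (`standardize_brackets`, `resolve_arithmetic`, `has_only_periodic_elements`), the regular expressions, and the example formula are all assumptions made for illustration, and the library's real rules may differ.

```python
import re
from decimal import Decimal

# Symbols of the 118 known elements.
ELEMENTS = set("""
H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn
Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe Cs Ba La
Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt Au Hg Tl Pb Bi Po
At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr Rf Db Sg Bh Hs Mt Ds Rg
Cn Nh Fl Mc Lv Ts Og
""".split())


def standardize_brackets(formula: str) -> str:
    """Hypothetical step: map square and curly brackets to round ones."""
    return formula.translate(str.maketrans("[]{}", "()()"))


def resolve_arithmetic(formula: str) -> str:
    """Hypothetical step: replace 'a-b' or 'a+b' subscripts with their value."""
    def evaluate(match: re.Match) -> str:
        left, op, right = Decimal(match[1]), match[2], Decimal(match[3])
        value = left + right if op == "+" else left - right
        # Decimal avoids binary floating-point noise such as 0.8500000001.
        return format(value.normalize(), "f")

    number = r"\d+(?:\.\d+)?"
    return re.sub(rf"({number})([+-])({number})", evaluate, formula)


def has_only_periodic_elements(formula: str) -> bool:
    """Hypothetical check: every symbol in the formula is a known element."""
    # Accept only element-like tokens, digits, dots, and round brackets.
    if not re.fullmatch(r"(?:[A-Z][a-z]?|[0-9.()])+", formula):
        return False
    tokens = re.findall(r"[A-Z][a-z]?", formula)
    return bool(tokens) and all(token in ELEMENTS for token in tokens)


raw = "[Ba1-0.15Ca0.15]TiO3"
cleaned = resolve_arithmetic(standardize_brackets(raw))
print(cleaned, has_only_periodic_elements(cleaned))
print("PZT", has_only_periodic_elements("PZT"))
```

In this sketch, an abbreviation such as PZT fails the element check because Z is not an element symbol. An abbreviation whose letters all happen to be element symbols would still pass, which is one reason a separate abbreviation-removal step is useful.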

Default Values

is_save_separate_results = True
cleaned_json_results_file = "cleaned_results.json"
is_save_composition_property_file = True
composition_property_file = "composition_property.json"
cleaning_strategy = "full"
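The defaults above can be overridden per call. The sketch below collects them in a dictionary and merges in any overrides before they are passed to `clean_data`. The helper `build_clean_data_kwargs` and its validation are hypothetical and not part of ComProScanner; only the parameter names and default values come from this page.

```python
# Default values as listed on this page.
DEFAULTS = {
    "is_save_separate_results": True,
    "cleaned_json_results_file": "cleaned_results.json",
    "is_save_composition_property_file": True,
    "composition_property_file": "composition_property.json",
    "cleaning_strategy": "full",
}


def build_clean_data_kwargs(json_results_file: str, **overrides) -> dict:
    """Hypothetical helper: merge overrides into the documented defaults."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise TypeError(f"Unknown option(s): {sorted(unknown)}")
    kwargs = {"json_results_file": json_results_file, **DEFAULTS, **overrides}
    if kwargs["cleaning_strategy"] not in ("full", "basic"):
        raise ValueError("cleaning_strategy must be 'full' or 'basic'")
    return kwargs


kwargs = build_clean_data_kwargs(
    "extracted_results.json",
    cleaning_strategy="basic",
)
# With `scanner` created as in Basic Usage:
# scanner.clean_data(**kwargs)
```

Passing the optional parameters explicitly, even at their default values, keeps the output paths and the chosen strategy visible in the calling script.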

Next Steps