Data Cleaning
The data cleaning module cleans the chemical formulas in extracted data. It removes entries based on abbreviation and periodic-element checks, resolves arithmetic expressions and fractional compositions, and standardizes brackets, among other steps.
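To make two of these steps concrete, here is a minimal sketch of what bracket standardization and fraction resolution could look like on a single formula string. It is not ComProScanner's implementation: the function names `standardize_brackets` and `resolve_fractions` are hypothetical, and it assumes that "standardization" means converting square and curly brackets to parentheses and that fractions appear as simple `a/b` subscripts.

```python
import re

# Assumption: standardizing brackets means mapping [] and {} to ().
BRACKET_MAP = str.maketrans({"[": "(", "]": ")", "{": "(", "}": ")"})


def standardize_brackets(formula: str) -> str:
    """Replace square and curly brackets with parentheses (hypothetical helper)."""
    return formula.translate(BRACKET_MAP)


def resolve_fractions(formula: str) -> str:
    """Turn simple a/b subscripts into decimals (hypothetical helper)."""

    def _to_decimal(match: re.Match) -> str:
        numerator, denominator = int(match.group(1)), int(match.group(2))
        if denominator == 0:
            # Leave malformed fractions untouched rather than guessing.
            return match.group(0)
        return f"{numerator / denominator:g}"

    return re.sub(r"(\d+)/(\d+)", _to_decimal, formula)


print(standardize_brackets("Pb[Zr0.52Ti0.48]O3"))
print(resolve_fractions("Na1/2Bi1/2TiO3"))
```

The module itself may handle many more cases (nested expressions, variables such as x, mixed notations), so treat this only as an illustration of the kind of normalization involved.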
Basic Usage
from comproscanner import ComProScanner
# Initialize scanner
scanner = ComProScanner(main_property_keyword="piezoelectric")
# Clean extracted data
scanner.clean_data(
    json_results_file="extracted_results.json"
)
Parameters
Required Parameters
json_results_file (str)
Path to the JSON results file containing extracted data that needs to be cleaned.
Optional Parameters
is_save_separate_results (bool)
Whether to save separate cleaned results files.
cleaned_json_results_file (str)
Path to the cleaned JSON results file with articles having relevant composition-property data.
is_save_composition_property_file (bool)
Whether to save composition-property values to a separate file as a dictionary.
composition_property_file (str)
Path to the cleaned composition-property file containing a dictionary of composition-property data.
cleaning_strategy (str)
The cleaning strategy to use, either full or basic. Both strategies perform comprehensive cleaning, including abbreviation removal, arithmetic resolution, and bracket standardization. The full strategy additionally keeps only entries whose compositions consist solely of periodic elements.
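The extra check applied by the full strategy can be pictured with a small sketch. This is an illustration under assumptions, not the library's code: `has_only_elements` is a hypothetical helper that strips digits, decimal points, and brackets from a formula and then verifies that everything left parses as valid element symbols.

```python
import re

# Symbols of the 118 known elements.
ELEMENTS = set("""
H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn
Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe Cs Ba La
Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt Au Hg Tl Pb Bi Po
At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr Rf Db Sg Bh Hs Mt Ds Rg
Cn Nh Fl Mc Lv Ts Og
""".split())


def has_only_elements(formula: str) -> bool:
    """Return True if the formula is built only from element symbols (hypothetical helper)."""
    # Drop subscripts, decimal points, brackets, and whitespace.
    letters = re.sub(r"[0-9.()\[\]{}\s]", "", formula)
    # Split what remains into candidate symbols: one capital, optional lowercase.
    tokens = re.findall(r"[A-Z][a-z]?", letters)
    # Every character must belong to a token, and every token must be an element.
    return (
        bool(tokens)
        and "".join(tokens) == letters
        and all(token in ELEMENTS for token in tokens)
    )


print(has_only_elements("Pb(Zr0.52Ti0.48)O3"))  # True
print(has_only_elements("PZT"))                 # False: Z and T are not elements
```

A check like this cannot catch every abbreviation, since some acronyms happen to be spelled entirely with valid symbols, which is presumably why abbreviation removal is a separate step in both strategies.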
Default Values
- is_save_separate_results = True
- cleaned_json_results_file = "cleaned_results.json"
- is_save_composition_property_file = True
- composition_property_file = "composition_property.json"
- cleaning_strategy = "full"
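For reference, the sketch below spells out every optional parameter with its documented default, so the call is equivalent to the Basic Usage example. The parameter names and values come from this page. The `clean_kwargs` dictionary and the `run_cleaning` wrapper are hypothetical names used only for illustration, and the call itself assumes the comproscanner package is installed and that extracted_results.json exists.

```python
# Keyword arguments mirroring the documented defaults.
clean_kwargs = {
    "json_results_file": "extracted_results.json",  # required
    "is_save_separate_results": True,
    "cleaned_json_results_file": "cleaned_results.json",
    "is_save_composition_property_file": True,
    "composition_property_file": "composition_property.json",
    "cleaning_strategy": "full",  # or "basic"
}


def run_cleaning(kwargs: dict) -> None:
    """Hypothetical wrapper: run the cleaning step with explicit arguments."""
    # Imported here so the dictionary above can be inspected without the package.
    from comproscanner import ComProScanner

    scanner = ComProScanner(main_property_keyword="piezoelectric")
    scanner.clean_data(**kwargs)


# Uncomment once the package is installed and the input file is available:
# run_cleaning(clean_kwargs)
```

To try the lighter strategy, change only `cleaning_strategy` to "basic" and leave the other values as they are.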
Next Steps
- Learn about Evaluation
- Explore Visualization
- Configure Advanced RAG