Data Cleaning¶
The data cleaning module removes entries whose compositions contain abbreviations or symbols that are not periodic-table elements, resolves arithmetic expressions and fractional compositions, and standardizes brackets in extracted chemical formulas. It also includes advanced composition cleaning features that transform raw compositions into standardized, resolved forms.
Basic Usage¶
from comproscanner import ComProScanner
# Initialize scanner
scanner = ComProScanner(main_property_keyword="piezoelectric")
# Clean extracted data
scanner.clean_data(
    json_results_file="extracted_results.json"
)
Parameters¶
Required Parameters¶
json_results_file (str)¶
Path to the JSON results file containing extracted data that needs to be cleaned.
Optional Parameters¶
is_save_separate_results (bool)¶
Whether to save separate cleaned results files.
cleaned_json_results_file (str)¶
Path to the cleaned JSON results file with articles having relevant composition-property data.
is_save_composition_property_file (bool)¶
Whether to save composition-property values to a separate file as a dictionary.
composition_property_file (str)¶
Path to the cleaned composition-property file containing a dictionary of composition-property data.
cleaning_strategy (str)¶
The cleaning strategy to use: either full or basic. Both strategies apply comprehensive cleaning, including abbreviation removal, arithmetic resolution, and bracket standardization. The full strategy additionally keeps only entries whose compositions consist solely of periodic-table elements.
apply_advanced_cleaning (bool)¶
Flag to indicate if advanced composition cleaning transformations should be applied. When True, applies all advanced cleaning processes including Miller indices removal, coefficient expansion, normalization, and zero-coefficient element removal. When False, returns basic cleaned compositions only.
Default Values
- is_save_separate_results = True
- cleaned_json_results_file = "cleaned_results.json"
- is_save_composition_property_file = True
- composition_property_file = "composition_property.json"
- cleaning_strategy = "full"
- apply_advanced_cleaning = True
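As an illustration, the call below passes every optional parameter explicitly, using the default values listed above. The names `EXPLICIT_CLEANING_OPTIONS` and `run_cleaning` are hypothetical helpers introduced only for this sketch; the parameter names and values come from this page.

```python
# Every optional parameter spelled out with its documented default value.
EXPLICIT_CLEANING_OPTIONS = {
    "json_results_file": "extracted_results.json",
    "is_save_separate_results": True,
    "cleaned_json_results_file": "cleaned_results.json",
    "is_save_composition_property_file": True,
    "composition_property_file": "composition_property.json",
    "cleaning_strategy": "full",  # or "basic"
    "apply_advanced_cleaning": True,
}


def run_cleaning(options=EXPLICIT_CLEANING_OPTIONS):
    """Hypothetical wrapper: forwards the options to clean_data()."""
    # Imported here so the options can be inspected without the package installed.
    from comproscanner import ComProScanner

    scanner = ComProScanner(main_property_keyword="piezoelectric")
    scanner.clean_data(**options)
```

Calling `run_cleaning()` with no arguments should behave like the Basic Usage example, since every value shown matches its default.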
Cleaning Process Flow¶
The data cleaning process follows this workflow:
graph TD
A[Start: Raw Extracted Data] --> B[Basic Validation]
B --> C[Element Validation]
C --> D[Unicode Conversion]
D --> E[Arithmetic Resolution]
E --> F{apply_advanced_cleaning?}
F -->|True| G[Advanced Cleaning Pipeline]
F -->|False| M[Basic Cleaned Data]
G --> H[Miller Indices Removal]
H --> I[Coefficient Expansion]
I --> J[Coefficient Normalization]
J --> K[Zero-Coefficient Removal]
K --> L[Resolved Composition Data]
M --> N[End: Cleaned Results]
L --> N
style A fill:#e1f5ff
style N fill:#e7f5e1
style G fill:#fff4e1
style H fill:#ffe1f5
style I fill:#ffe1f5
style J fill:#ffe1f5
style K fill:#ffe1f5
Process Stages¶
1. Basic Validation¶
Removes invalid keys, abbreviations, and special characters from compositions.
2. Element Validation¶
Verifies compositions contain only valid periodic elements (for full strategy only).
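The idea behind this check can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper name `has_only_periodic_elements` and the tokenization rule (an uppercase letter optionally followed by one lowercase letter) are assumptions made for the example.

```python
import re

# Symbols of the 118 known elements.
ELEMENT_SYMBOLS = set(
    "H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co "
    "Ni Cu Zn Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb "
    "Te I Xe Cs Ba La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re "
    "Os Ir Pt Au Hg Tl Pb Bi Po At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es "
    "Fm Md No Lr Rf Db Sg Bh Hs Mt Ds Rg Cn Nh Fl Mc Lv Ts Og".split()
)

TOKEN_RE = re.compile(r"[A-Z][a-z]?")


def has_only_periodic_elements(composition):
    """Return True if every letter token in the composition is an element symbol."""
    tokens = TOKEN_RE.findall(composition)
    # Letters left over after removing the tokens (e.g. a stray "x") are not elements.
    leftover = TOKEN_RE.sub("", composition)
    if not tokens or any(ch.isalpha() for ch in leftover):
        return False
    return all(token in ELEMENT_SYMBOLS for token in tokens)
```

With this rule, `BaTiO3` passes, while an abbreviation such as `PZT` fails because `Z` and `T` are not element symbols. Abbreviations that happen to spell valid symbols would still pass, so a real implementation likely needs additional checks.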
3. Unicode Conversion¶
Converts subscript Unicode characters to regular digits for standardization.
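A minimal sketch of this conversion, assuming only the ten Unicode subscript digits need mapping (the helper name `subscripts_to_digits` is hypothetical):

```python
# Translation table from Unicode subscript digits to ASCII digits.
SUBSCRIPT_DIGITS = str.maketrans("₀₁₂₃₄₅₆₇₈₉", "0123456789")


def subscripts_to_digits(formula):
    """Replace subscript digits with regular digits; other characters are unchanged."""
    return formula.translate(SUBSCRIPT_DIGITS)
```

For example, `subscripts_to_digits("BaTiO₃")` returns `"BaTiO3"`.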
4. Arithmetic Resolution¶
Evaluates mathematical expressions and fractional compositions.
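The sketch below shows one way such a step could work. It assumes, purely for illustration, that expressions appear as parenthesized numeric terms such as `(1-0.15)`; the actual input patterns handled by the library are not specified here, and the helper names are hypothetical. The expression is evaluated by walking a parsed syntax tree rather than calling `eval`, so only numbers and basic operators are accepted.

```python
import ast
import operator
import re

_OPERATORS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

# Assumed pattern: a parenthesized, purely numeric expression with at least one operator.
_EXPRESSION_RE = re.compile(r"\(\s*(\d[\d.\s]*(?:[-+*/]\s*\d[\d.\s]*)+)\)")


def evaluate_expression(expression):
    """Evaluate a numeric expression such as '1-0.15' or '1/4' without eval()."""

    def _walk(node):
        if isinstance(node, ast.Expression):
            return _walk(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return float(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPERATORS:
            return _OPERATORS[type(node.op)](_walk(node.left), _walk(node.right))
        raise ValueError(f"unsupported expression: {expression!r}")

    return _walk(ast.parse(expression.strip(), mode="eval"))


def resolve_arithmetic(formula):
    """Replace each parenthesized numeric expression with its evaluated value."""

    def _replace(match):
        value = evaluate_expression(match.group(1))
        # Fixed precision, then drop trailing zeros and a dangling decimal point.
        return f"{value:.6f}".rstrip("0").rstrip(".")

    return _EXPRESSION_RE.sub(_replace, formula)
```

Under these assumptions, `resolve_arithmetic("Ba(1-0.15)Ca0.15TiO3")` gives `"Ba0.85Ca0.15TiO3"`.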
5. Advanced Cleaning Pipeline¶
When apply_advanced_cleaning=True, the following sub-processes are executed sequentially:
Miller Indices Removal¶
Removes crystal plane notations like (002), (111), (100), etc. from chemical formulas.
Coefficient Expansion¶
Expands coefficient patterns in chemical formulas including:
- Leading coefficients: Multiplies all elements inside parentheses by leading coefficient
- Trailing coefficients: Multiplies all elements inside parentheses by trailing coefficient
- Parenthetical coefficients: Expands nested brackets with complex coefficient multiplication
Coefficient Normalization¶
Removes trailing zeros from element coefficients for cleaner representation.
Zero-Coefficient Removal¶
Removes elements with coefficient values of 0 or 0.0 from formulas.
Advanced Cleaning Examples¶
Miller Indices Removal¶
Removes crystal plane notations from chemical formulas:
| Input Formula | Output Formula |
|---|---|
| AlN (002) | AlN |
| ZnO (101) | ZnO |
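A rough sketch of this step, assuming Miller indices always appear as exactly three digits in parentheses (the helper name `strip_miller_indices` and the pattern are illustrative, not the library's own):

```python
import re

# Assumed notation: three digits in parentheses, optionally preceded by whitespace.
MILLER_RE = re.compile(r"\s*\(\d{3}\)")


def strip_miller_indices(formula):
    """Remove crystal-plane notations such as (002) from a formula."""
    return MILLER_RE.sub("", formula).strip()
```

Decimal coefficients such as `(0.97)` contain a dot and are left untouched by this pattern, although a three-digit integer coefficient in parentheses would be removed as well.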
Coefficient Expansion¶
Leading Coefficient Expansion¶
Multiplies all elements inside parentheses by the coefficient before the opening bracket:
| Input Formula | Output Formula |
|---|---|
| 0.7(K0.48Na0.52NbO3) | K0.336Na0.364NbO3 |
| (0.15)Dy2O3 | Dy0.3O0.45 |
Trailing Coefficient Expansion¶
Multiplies all elements inside parentheses by the coefficient after the closing bracket:
| Input Formula | Output Formula |
|---|---|
| (K0.5Na0.5)(0.97)NbO3 | K0.485Na0.485NbO3 |
| (Bi0.5Na0.5)0.94TiO3 | Bi0.47Na0.47TiO3 |
Parenthetical Coefficient Expansion¶
Handles nested brackets and complex coefficient multiplication:
| Input Formula | Output Formula |
|---|---|
| [(K0.5Na0.5)0.96Bi0.04]NbO3 | K0.48Na0.48Bi0.04NbO3 |
| [Ba0.85Ca0.15]0.99TiO3 | Ba0.8415Ca0.1485TiO3 |
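The trailing and parenthetical cases above can be approximated with the sketch below. It is an illustration under stated assumptions rather than the library's algorithm: the function names are hypothetical, groups are expanded innermost first, a group without a trailing coefficient is treated as multiplied by 1, and leading coefficients are not handled.

```python
import re

# An element symbol with an optional numeric coefficient.
ELEMENT_RE = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")
# An innermost bracket group, optionally followed by a coefficient
# that may itself be wrapped in parentheses, e.g. (K0.5Na0.5)(0.97).
GROUP_RE = re.compile(r"[(\[]([^()\[\]]+)[)\]](?:\(?(\d+(?:\.\d+)?)\)?)?")


def format_coefficient(value):
    """Format a coefficient without trailing zeros; a value of 1 is left implicit."""
    text = f"{value:.6f}".rstrip("0").rstrip(".")
    return "" if text == "1" else text


def scale_group(group, factor):
    """Multiply every element coefficient inside a bracket group by factor."""
    parts = []
    for symbol, amount in ELEMENT_RE.findall(group):
        scaled = (float(amount) if amount else 1.0) * factor
        parts.append(symbol + format_coefficient(scaled))
    return "".join(parts)


def expand_groups(formula):
    """Expand bracket groups from the innermost outward until none remain."""

    def _expand(match):
        factor = float(match.group(2)) if match.group(2) else 1.0
        return scale_group(match.group(1), factor)

    while GROUP_RE.search(formula):
        formula = GROUP_RE.sub(_expand, formula)
    return formula
```

For instance, `expand_groups("[(K0.5Na0.5)0.96Bi0.04]NbO3")` first resolves the inner group to `K0.48Na0.48`, then drops the outer brackets, giving `"K0.48Na0.48Bi0.04NbO3"`. Elements outside any bracket, such as `NbO3` here, are left unchanged.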
Coefficient Normalization¶
Removes trailing zeros from element coefficients:
| Input Formula | Output Formula |
|---|---|
| Pb0.90La0.10 | Pb0.9La0.1 |
| Zr0.200Ti0.800O2 | Zr0.2Ti0.8O2 |
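One possible way to express this normalization, shown as a sketch with a hypothetical helper name, is to rewrite every decimal number in the formula:

```python
import re

DECIMAL_RE = re.compile(r"\d+\.\d+")


def normalize_coefficients(formula):
    """Strip trailing zeros (and a dangling decimal point) from decimal coefficients."""
    return DECIMAL_RE.sub(lambda m: m.group().rstrip("0").rstrip("."), formula)
```

Integer coefficients such as the `2` in `O2` contain no decimal point and are left as they are.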
Zero-Coefficient Element Removal¶
Removes elements with zero coefficients:
| Input Formula | Output Formula |
|---|---|
| BaTiZr0O3 | BaTiO3 |
| K0.5Na0.5Nb0O3 | K0.5Na0.5O3 |
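A minimal sketch of this step, assuming the formula is already flat (no brackets remain after coefficient expansion); the helper name `drop_zero_elements` is hypothetical:

```python
import re

# An element symbol with an optional numeric coefficient.
ELEMENT_RE = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def drop_zero_elements(formula):
    """Rebuild a flat formula, skipping elements whose coefficient equals zero."""
    kept = []
    for symbol, amount in ELEMENT_RE.findall(formula):
        # A missing coefficient means 1, so the element is kept.
        if amount == "" or float(amount) != 0:
            kept.append(symbol + amount)
    return "".join(kept)
```

Because the formula is rebuilt from its element tokens, any characters that are not part of a symbol or coefficient are discarded, which is why this sketch should only run on already resolved compositions.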
Original vs Resolved Compositions
The advanced cleaning process transforms raw extracted compositions into standardized, resolved forms. In custom implementations that use the DataCleaner class directly, the apply_advanced_cleaning parameter makes it possible to preserve both versions for traceability: the original extracted composition (for reference and validation) and the fully resolved composition (for analysis and database storage).
Next Steps¶
- Learn about Evaluation
- Explore Visualization
- Configure Advanced RAG