automate figshare data file article CRUD
- update keywords in pyproject.toml to better reflect growing benchmark scope
- add script for EDA of 103 phononDB PBE structures in data/phonons/phonondb_103_pbe_eda.py
- data-files.yml add MD5 checksums for data integrity
- data.py add new data file path for phononDB structures
- better error handling in Figshare upload scripts
- more figshare module unit tests
janosh committed Jan 14, 2025
1 parent f644ed2 commit afb09d5
Showing 16 changed files with 761 additions and 178 deletions.
6 changes: 3 additions & 3 deletions contributing.md
@@ -69,10 +69,10 @@ assert len(wbm_init_atoms) == 256_963
1. **`e_correction_per_atom_mp2020`**: [`MaterialsProject2020Compatibility`] energy corrections in eV/atom.
1. **`e_correction_per_atom_mp_legacy`**: Legacy [`MaterialsProjectCompatibility`] energy corrections in eV/atom. Having both old and new corrections allows updating predictions from older models like MEGNet that were trained on MP formation energies treated with the old correction scheme.
1. **`e_above_hull_mp2020_corrected_ppd_mp`**: Energy above hull distances in eV/atom after applying the MP2020 correction scheme. The convex hull in question is the one spanned by all ~145k Materials Project `ComputedStructureEntries`. Matbench Discovery takes these as ground truth for material stability. Any value above 0 is assumed to be an unstable/metastable material.
-1. **`site_stats_fingerprint_init_final_norm_diff`**: The norm of the difference between the initial and final site fingerprints. This is a volume-independent measure of how much the structure changed during DFT relaxation. Uses the `matminer` [`SiteStatsFingerprint`](https://github.com/hackingmaterials/matminer/blob/33bf112009b67b108f1008b8cc7398061b3e6db2/matminer/featurizers/structure/sites.py#L21-L33) (v0.8.0).
+1. **`site_stats_fingerprint_init_final_norm_diff`**: The norm of the difference between the initial and final site fingerprints. This is a volume-independent measure of how much the structure changed during DFT relaxation. Uses the `matminer` [`SiteStatsFingerprint`](https://github.com/hackingmaterials/matminer/blob/33bf1120/matminer/featurizers/structure/sites.py#L21-L33) (v0.8.0).

-[`MaterialsProject2020Compatibility`]: https://github.com/materialsproject/pymatgen/blob/02a4ca8aa0277b5f6db11f4de4fdbba129de70a5/pymatgen/entries/compatibility.py#L823
-[`MaterialsProjectCompatibility`]: https://github.com/materialsproject/pymatgen/blob/02a4ca8aa0277b5f6db11f4de4fdbba129de70a5/pymatgen/entries/compatibility.py#L766
+[`MaterialsProject2020Compatibility`]: https://github.com/materialsproject/pymatgen/blob/02a4ca8aa/pymatgen/entries/compatibility.py#L823
+[`MaterialsProjectCompatibility`]: https://github.com/materialsproject/pymatgen/blob/02a4ca8aa/pymatgen/entries/compatibility.py#L766

## 📥   Direct Download

51 changes: 51 additions & 0 deletions data/phonons/phonondb_103_pbe_eda.py
@@ -0,0 +1,51 @@
"""Exploratory data analysis of the 103 structures in Togo's phononDB PBE dataset.
Used for the anharmonic phonon analysis, specifically the thermal conductivity kappa.
See https://github.com/atztogo/phonondb/blob/bba206/README.md#url-links-to-phono3py-
finite-displacement-method-inputs-of-103-compounds-on-mdr-at-nims-pbe for details.
"""

# %%
from collections import defaultdict

import ase.io
import moyopy
import moyopy.interface
import pymatviz as pmv

from matbench_discovery.data import DataFiles

__date__ = "2025-01-14"


# %%
atoms_list = ase.io.read(DataFiles.phonondb_pbe_structures.path, index=":")


# %% visually inspect first 12 structures
fig = pmv.structure_3d_plotly(atoms_list[:12], n_cols=3, scale=0.5)
fig.show()


# %%
elem_counts: dict[str, int] = defaultdict(int)
for atoms in atoms_list:
    for symb in atoms.symbols:
        elem_counts[symb] += 1


# %%
fig = pmv.ptable_heatmap_plotly(elem_counts, fmt=".0f")
fig.show()


# %% plot spacegroup distribution
spg_nums: dict[str, int] = {}
for atoms in atoms_list:
    moyo_cell = moyopy.interface.MoyoAdapter.from_atoms(atoms).data
    moyo_data = moyopy.MoyoDataset(moyo_cell)
    spg_nums[atoms.info["material_id"]] = moyo_data.number

fig = pmv.spacegroup_sunburst(spg_nums.values(), show_counts="value+percent")
fig.show()
10 changes: 5 additions & 5 deletions data/wbm/readme.md
@@ -2,7 +2,7 @@

The **WBM dataset** was published in [Predicting stable crystalline compounds using chemical similarity][wbm paper] (nat comp mat, Jan 2021). The authors generated 257,487 structures through single-element substitutions on Materials Project (MP) source structures. The replacement element was chosen based on chemical similarity determined by a matrix data-mined from the [Inorganic Crystal Structure Database (ICSD)](https://icsd.products.fiz-karlsruhe.de).

-The resulting novel structures were relaxed using MP-compatible VASP inputs (i.e. using `pymatgen`'s [`MPRelaxSet`](https://github.com/materialsproject/pymatgen/blob/c4998d92525921c3da0aec0f94ed1429c6786c42/pymatgen/io/vasp/MPRelaxSet.yaml)) and identical POTCARs in an attempt to create a database of Materials Project compatible novel crystals. Any degradation in model performance from training to test set should therefore largely be a result of extrapolation error rather than covariate shift in the underlying data.
+The resulting novel structures were relaxed using MP-compatible VASP inputs (i.e. using `pymatgen`'s [`MPRelaxSet`](https://github.com/materialsproject/pymatgen/blob/c4998d92/pymatgen/io/vasp/MPRelaxSet.yaml)) and identical POTCARs in an attempt to create a database of Materials Project compatible novel crystals. Any degradation in model performance from training to test set should therefore largely be a result of extrapolation error rather than covariate shift in the underlying data.

The authors performed 5 rounds of element substitution starting from structures on the Materials Project convex hull, each time relaxing all generated structures and adding those found to lie on the convex hull back to the pool of parent structures for the next iteration of element substitution. In total, ~20k or close to 10% were found to lie on the Materials Project convex hull.

@@ -19,7 +19,7 @@ Each iteration has varying numbers of materials which are counted by the 2nd int
The full set of processing steps used to curate the WBM test set from the raw data files (downloaded from [URLs listed below](#--links-to-wbm-files)) can be found in [`data/wbm/compile_wbm_test_set.py`](https://github.com/janosh/matbench-discovery/blob/-/data/wbm/compile_wbm_test_set.py). Processing steps taken:

- re-format material IDs: `step_1-0->wbm-1-1`, `step_1-1->wbm-1-2`, ...
-- correctly align initial structures to DFT-relaxed [`ComputedStructureEntries`](https://github.com/materialsproject/pymatgen/blob/02a4ca8aa0277b5f6db11f4de4fdbba129de70a5/pymatgen/entries/computed_entries.py#L536) (the initial structure files had 6 extra structures inserted towards the end of step 3 which had no corresponding IDs in the summary file)
+- correctly align initial structures to DFT-relaxed [`ComputedStructureEntries`](https://github.com/materialsproject/pymatgen/blob/02a4ca8aa/pymatgen/entries/computed_entries.py#L536) (the initial structure files had 6 extra structures inserted towards the end of step 3 which had no corresponding IDs in the summary file)
- remove 6 pathological structures (with 0 volume)
- remove formation energy outliers below -5 and above 5 eV/atom (502 and 22 crystals respectively out of 257,487 total, including an anomaly of 500 structures at exactly -10 eV/atom)

@@ -29,7 +29,7 @@ The full set of processing steps used to curate the WBM test set from the raw da
<img src="./figs/hist-wbm-e-form-per-atom.svg" alt="WBM formation energy histogram indicating outlier cutoffs">
</slot>

-- apply the [`MaterialsProject2020Compatibility`](https://github.com/materialsproject/pymatgen/blob/02a4ca8aa0277b5f6db11f4de4fdbba129de70a5/pymatgen/entries/compatibility.py#L823) energy correction scheme to the formation energies
+- apply the [`MaterialsProject2020Compatibility`](https://github.com/materialsproject/pymatgen/blob/02a4ca8aa/pymatgen/entries/compatibility.py#L823) energy correction scheme to the formation energies
- compute energy to the Materials Project convex hull constructed from all MP `ComputedStructureEntries` queried on 2023-02-07 ([database release 2022.10.28](https://docs.materialsproject.org/changes/database-versions#v2022.10.28))

Invoking the script `python compile_wbm_test_set.py` will auto-download and regenerate the WBM test set files from scratch. If you find
@@ -70,9 +70,9 @@ The number of materials in each iteration of element substitution before and aft
| **before** | 61,466 | 52,755 | 79,160 | 40,314 | 23,268 | 256,963 |
| **after** | 54,209 | 45,979 | 66,528 | 34,531 | 14,241 | 215,488 |

-[`get_protostructure_label_from_spglib`]: https://github.com/CompRhys/aviary/blob/a8da6c468a2407fd14687de327fe181c5de0169f/aviary/wren/utils.py#L140
+[`get_protostructure_label_from_spglib`]: https://github.com/CompRhys/aviary/blob/a8da6c46/aviary/wren/utils.py#L140
 [`aviary`]: https://github.com/CompRhys/aviary
-[`compile_wbm_test_set.py`]: https://github.com/janosh/matbench-discovery/blob/eec1e22c69bc1b0183d7f9138f9e60d1ae733e09/data/wbm/compile_wbm_test_set.py#L587
+[`compile_wbm_test_set.py`]: https://github.com/janosh/matbench-discovery/blob/eec1e22c6/data/wbm/compile_wbm_test_set.py#L587

## 🔗 &thinsp; Links to WBM Files

20 changes: 19 additions & 1 deletion matbench_discovery/data-files.yml
@@ -2,6 +2,7 @@ all_mp_tasks:
   url: https://figshare.com/ndownloader/files/43350447
   path: mp/2023-03-16-all-mp-tasks.json.gz
   description: Complete copy of the Materials Project database on 2023-03-16 (14 GB) (release [v2022.10.28](https://docs.materialsproject.org/changes/database-versions#v2022.10.28))
+  md5: e301824b358fb5cecfb5e1f769f025dd

 alignn_checkpoint:
   url: https://figshare.com/ndownloader/files/40344436
@@ -12,23 +13,27 @@ mp_computed_structure_entries:
   url: https://figshare.com/ndownloader/files/40344436
   path: mp/2023-02-07-mp-computed-structure-entries.json.gz
   description: JSON-Serialized `pymatgen` [`ComputedStructureEntries`] containing all Materials Project DFT-relaxed structures and corresponding final energies as of 2023-02-07
+  md5: 76fc748db6b175bb80de4c276d27c235

 mp_elemental_ref_entries:
   url: https://figshare.com/ndownloader/files/40387775
   path: mp/2023-02-07-mp-elemental-reference-entries.json.gz
   description: Minimum energy `ComputedEntry` for each element in MP
+  md5: 6e93b6f38d6e27d6c811d3cafb23a070

 mp_energies:
   url: https://figshare.com/ndownloader/files/49083124
   path: mp/2023-01-10-mp-energies.csv.gz
   description: Materials Project formation energies and energies above convex hull in eV/atom as a fast-to-load CSV file
+  md5: 888579e287c8417e2202330c40e1367f

 mp_patched_phase_diagram:
   url: https://figshare.com/ndownloader/files/48241624
   path: mp/2023-02-07-ppd-mp.pkl.gz
   description: "[`PatchedPhaseDiagram`] constructed from all MP `pymatgen` `ComputedStructureEntries`"
+  md5: 60d19d691fa1d338aa496a40a9641bef

-mp_trj:
+mp_trj_json_gz:
   title: Materials Project Trajectory (MPtrj) Dataset
   url: https://figshare.com/ndownloader/files/43302033
   figshare: https://figshare.com/articles/dataset/23713842
@@ -39,36 +44,49 @@ mp_trj_extxyz:
   url: https://figshare.com/ndownloader/files/49034296
   path: mp/2024-09-03-mp-trj.extxyz.zip
   description: ~1.6 M Materials Project DFT relaxation frames (subsampled to reduce redundancy and faulty data) converted to `ase`-compatible extended XYZ format including energies (eV), forces (eV/Å) and stress (eV/ų).
+  md5: 7f433171e4e5f2ef9304dccd42d5488f

 wbm_computed_structure_entries:
   url: https://figshare.com/ndownloader/files/40344463
   path: wbm/2022-10-19-wbm-computed-structure-entries.json.bz2
   description: JSON-Serialized `pymatgen` [`ComputedStructureEntries`] containing all WBM DFT-relaxed structures and corresponding final energies
+  md5: 481959b65f28150ae6ee7297ddeba538

 wbm_relaxed_atoms:
   url: https://figshare.com/ndownloader/files/48169600
   path: wbm/2024-08-04-wbm-relaxed-atoms.extxyz.zip
   description: WBM relaxed structures as `ase` Atoms in extended XYZ format
+  md5: 4726643ac0dfbab69a4284454c891e68

 wbm_initial_structures:
   url: https://figshare.com/ndownloader/files/40344466
   path: wbm/2022-10-19-wbm-init-structs.json.bz2
   description: Unrelaxed WBM structures in `pymatgen` `Structure` format
+  md5: ff2c40a3a7bf65468852b67f0dbc67df

 wbm_initial_atoms:
   url: https://figshare.com/ndownloader/files/48169597
   path: wbm/2024-08-04-wbm-initial-atoms.extxyz.zip
   description: Unrelaxed WBM structures as `ase` Atoms in extended XYZ format
+  md5: 2a292211ca6acb30ed8416178d644098

 wbm_cses_plus_init_structs:
   url: https://figshare.com/ndownloader/files/40344469
   path: wbm/2022-10-19-wbm-computed-structure-entries+init-structs.json.bz2
   description: Both unrelaxed and DFT-relaxed WBM structures, the latter stored with their final VASP energies as `pymatgen` [`ComputedStructureEntries`]
+  md5: eaabe984d070156cc50a8a075cd5e315

 wbm_summary:
   url: https://figshare.com/ndownloader/files/44225498
   path: wbm/2023-12-13-wbm-summary.csv.gz
   description: Computed material properties only, no structures. Available properties are VASP energy, formation energy, energy above the convex hull, volume, band gap, number of sites per unit cell, and more.
+  md5: fb23d85dab61fab001b92cb0ac4f8f3d

+phonondb_pbe_structures:
+  url: https://figshare.com/ndownloader/files/51680888
+  path: phonons/2024-11-09-phononDB-PBE-103-structures.extxyz
+  description: 103 phononDB structures run by Togo with PBE settings received in private communication. See https://github.com/atztogo/phonondb/blob/bba206/README.md#url-links-to-phono3py-finite-displacement-method-inputs-of-103-compounds-on-mdr-at-nims-pbe for details.
+  md5: a396d4c517fa6d57defeffc6c83f0118

 _links: |
   [`PatchedPhaseDiagram`]: https://github.com/materialsproject/pymatgen/blob/v2023.5.10/pymatgen/analysis/phase_diagram.py#L1480-L1814
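
The `md5` fields added above make it possible to check a downloaded file against its recorded checksum. A minimal sketch of such a check, using only the standard library; `file_md5` and `verify_md5` are hypothetical helper names chosen for illustration, not functions from `matbench_discovery`:

```python
import hashlib


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    md5 = hashlib.md5()
    with open(path, "rb") as file:
        # iter() with a sentinel keeps reading until read() returns empty bytes
        for chunk in iter(lambda: file.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def verify_md5(path: str, expected_md5: str) -> None:
    """Raise ValueError if the file's MD5 digest differs from the expected one."""
    actual_md5 = file_md5(path)
    if actual_md5 != expected_md5.lower():
        raise ValueError(f"MD5 mismatch for {path}: {actual_md5=} != {expected_md5=}")
```

Chunked reading matters here because some of the listed files are large (the `all_mp_tasks` entry is described as 14 GB), so loading a whole file into memory just to hash it would be wasteful.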
11 changes: 9 additions & 2 deletions matbench_discovery/data.py
@@ -290,6 +290,7 @@ class DataFiles(Files):
     mp_elemental_ref_entries = "mp/2023-02-07-mp-elemental-reference-entries.json.gz"
     mp_energies = "mp/2023-01-10-mp-energies.csv.gz"
     mp_patched_phase_diagram = "mp/2023-02-07-ppd-mp.pkl.gz"
+    mp_trj_json_gz = "mp/2022-09-16-mp-trj.json.gz"
     mp_trj_extxyz = "mp/2024-09-03-mp-trj.extxyz.zip"
     # snapshot of every task (calculation) in MP as of 2023-03-16 (14 GB)
     all_mp_tasks = "mp/2023-03-16-all-mp-tasks.zip"
@@ -305,14 +306,20 @@ class DataFiles(Files):
     )
     wbm_summary = "wbm/2023-12-13-wbm-summary.csv.gz"
     alignn_checkpoint = "2023-06-02-pbenner-best-alignn-model.pth.zip"
-    mp_trj = "mp/2022-09-16-mp-trj.json"
+    phonondb_pbe_structures = "phonons/2024-11-09-phononDB-PBE-103-structures.extxyz"

     @functools.cached_property
     def yaml(self) -> dict[str, dict[str, str]]:
         """YAML data associated with the file."""
         yaml_path = f"{PKG_DIR}/data-files.yml"

         with open(yaml_path) as file:
-            return yaml.safe_load(file)
+            yaml_data = yaml.safe_load(file)
+
+        if self.name not in yaml_data:
+            raise ValueError(f"{self.name=} not found in {yaml_path}")
+
+        return yaml_data

     @property
     def url(self) -> str:
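
The new check in `DataFiles.yaml` fails fast when an enum member has no matching key in `data-files.yml`. The same idea can be sketched without the package: below, `METADATA` is a hand-written stand-in for the parsed YAML and `DataFile` is a hypothetical enum, neither of which is part of `matbench_discovery`. Unlike the property in the diff, which returns the whole mapping, this sketch returns only the member's own entry.

```python
from enum import Enum

# Stand-in for the parsed data-files.yml (assumption: values copied from the diff)
METADATA: dict[str, dict[str, str]] = {
    "wbm_summary": {
        "url": "https://figshare.com/ndownloader/files/44225498",
        "md5": "fb23d85dab61fab001b92cb0ac4f8f3d",
    },
}


class DataFile(Enum):
    """Hypothetical enum of data files, keyed by member name."""

    wbm_summary = "wbm/2023-12-13-wbm-summary.csv.gz"
    unlisted = "some/unlisted-file.csv"  # deliberately missing from METADATA

    @property
    def meta(self) -> dict[str, str]:
        """Metadata entry for this file; ValueError if none is recorded."""
        if self.name not in METADATA:
            raise ValueError(f"{self.name=} not found in METADATA")
        return METADATA[self.name]
```

Raising at lookup time means a member added to the enum without a matching metadata entry is reported by name, instead of surfacing later as a bare `KeyError`.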
2 changes: 1 addition & 1 deletion matbench_discovery/energy.py
@@ -76,7 +76,7 @@ def get_elemental_ref_entries(
     )

     # tested to agree with TRI's MP reference energies
-    # https://github.com/TRI-AMDD/CAMD/blob/1c965cba636531e542f4821a555b98b2d81ed034/camd/utils/data.py#L134
+    # https://github.com/TRI-AMDD/CAMD/blob/1c965cba636/camd/utils/data.py#L134
     mp_elemental_ref_energies = {
         elem: entry.energy_per_atom for elem, entry in mp_elem_ref_entries.items()
     }
75 changes: 71 additions & 4 deletions matbench_discovery/figshare.py
@@ -4,15 +4,35 @@
 import json
 import os
 from collections.abc import Mapping, Sequence
-from typing import Any
+from typing import Any, Final

 import requests
 from tqdm import tqdm

 from matbench_discovery import ROOT

-ENV_PATH = f"{ROOT}/site/.env"
-BASE_URL = "https://api.figshare.com/v2"
+ENV_PATH: Final[str] = f"{ROOT}/site/.env"
+BASE_URL: Final[str] = "https://api.figshare.com/v2"

+# Maps modeling tasks to their Figshare article IDs. New figshare articles will be
+# created if the ID is None. Be sure to paste the new article ID into the
+# ARTICLE_IDS dict below! It'll be printed by this script.
+ARTICLE_URL_PREFIX: Final = "https://figshare.com/articles/dataset"
+DOWNLOAD_URL_PREFIX: Final = "https://figshare.com/ndownloader/files"
+ARTICLE_IDS: Final[dict[str, int | None]] = {
+    "model_preds_discovery": 28187990,
+    "model_preds_geo_opt": 28187999,
+    "model_preds_phonons": None,
+    "data_files": 22715158,
+}

+# category IDs can be found at https://api.figshare.com/v2/categories
+CATEGORIES: Final[dict[int, str]] = {
+    25162: "Structure and dynamics of materials",
+    25144: "Inorganic materials (incl. nanomaterials)",
+    25186: "Cheminformatics and Quantitative Structure-Activity Relationships",
+}


 FIGSHARE_TOKEN = os.getenv("FIGSHARE_TOKEN")
 if not FIGSHARE_TOKEN and os.path.isfile(ENV_PATH):
@@ -96,7 +116,7 @@ def get_file_hash_and_size(
     return md5.hexdigest(), size


-def upload_file_to_figshare(article_id: int, file_path: str) -> int:
+def upload_file(article_id: int, file_path: str) -> int:
     """Upload a file to Figshare and return the file ID.

     Args:
@@ -129,3 +149,50 @@ def upload_file_to_figshare(article_id: int, file_path: str) -> int:
     # Complete upload
     make_request("POST", f"{endpoint}/{file_info['id']}")
     return file_info["id"]


+def article_exists(article_id: int | str) -> bool:
+    """Check if a Figshare article exists and is accessible.

+    Args:
+        article_id (int | str): The ID or URL of the article to check.

+    Returns:
+        bool: True if the article exists and is accessible, False otherwise.
+    """
+    article_url = (
+        f"{BASE_URL}/account/articles/{article_id}"
+        if isinstance(article_id, int)
+        else article_id
+    )
+    try:
+        make_request("GET", article_url)
+    except requests.HTTPError as exc:
+        if exc.response.status_code == 404:
+            return False
+        exc.add_note(f"{article_url=}")
+        raise
+    else:
+        return True


+def list_article_files(article_id: int) -> list[dict[str, Any]]:
+    """Get a list of files in a Figshare article.

+    Args:
+        article_id (int): ID of the article to list files from.

+    Returns:
+        list[dict[str, Any]]: List of file information dictionaries. Each dictionary
+            contains keys like 'name', 'id', 'size', 'computed_md5', etc.
+            Empty list if article doesn't exist.

+    Raises:
+        requests.HTTPError: If the request fails for any reason other than 404.
+    """
+    try:
+        return make_request("GET", f"{BASE_URL}/account/articles/{article_id}/files")
+    except requests.HTTPError as exc:
+        if exc.response.status_code == 404:
+            return []
+        raise
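
Both `article_exists` and `list_article_files` treat a 404 as an expected outcome and re-raise every other HTTP error. A minimal, network-free sketch of that pattern follows. `StubHTTPError` is a hypothetical stand-in for `requests.HTTPError`, and `fetch` is any callable that may raise it, so nothing here reflects the real Figshare API.

```python
from collections.abc import Callable


class StubHTTPError(Exception):
    """Hypothetical stand-in for requests.HTTPError carrying a status code."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def resource_exists(fetch: Callable[[str], object], url: str) -> bool:
    """Return False on 404, True on success, and re-raise any other HTTP error."""
    try:
        fetch(url)
    except StubHTTPError as exc:
        if exc.status_code == 404:
            return False
        # e.g. 401 or 500 should surface rather than be mistaken for "missing"
        raise
    return True
```

Only the 404 case is swallowed, so authentication or server failures are not silently reported as a missing article. The `exc.add_note(...)` call used in the diff requires Python 3.11 or later.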
18 changes: 6 additions & 12 deletions matbench_discovery/structure.py
@@ -61,13 +61,6 @@ def analyze_symmetry(
     """
     import moyopy

-    sym_key_map = {
-        "number": Key.spg_num,
-        "hall_number": Key.hall_num,
-        "site_symmetry_symbols": MbdKey.international_spg_name,
-        "wyckoffs": Key.wyckoff_symbols,
-    }

     results: dict[str, dict[str, str | int | list[str]]] = {}
     iterator = structures.items()
     if pbar:
@@ -79,12 +72,12 @@ def analyze_symmetry(
         )

     for struct_key, struct in iterator:
-        cell = moyopy.Cell(
+        moyo_cell = moyopy.Cell(
             struct.lattice.matrix, struct.frac_coords, struct.atomic_numbers
         )

         sym_data = moyopy.MoyoDataset(
-            cell, symprec=symprec, angle_tolerance=angle_tolerance
+            moyo_cell, symprec=symprec, angle_tolerance=angle_tolerance
         )

         if sym_data is None:
@@ -96,9 +89,10 @@ def analyze_symmetry(
         hall_symbol_entry = moyopy.HallSymbolEntry(hall_number=sym_data.hall_number)

         sym_info = {
-            new_key: getattr(sym_data, old_key)
-            for old_key, new_key in sym_key_map.items()
-        } | {
+            Key.spg_num: sym_data.number,
+            Key.hall_num: sym_data.hall_number,
+            MbdKey.international_spg_name: sym_data.site_symmetry_symbols,
+            Key.wyckoff_symbols: sym_data.wyckoffs,
             Key.n_sym_ops: sym_ops.num_operations,
             Key.n_rot_syms: len(sym_ops.rotations),
             Key.n_trans_syms: len(sym_ops.translations),