
Support large dataset preprocessing #452

Open
wants to merge 31 commits into base: develop

Conversation

chiang-yuan (Contributor)

This PR tries to resolve OOM errors and improve performance when loading very large datasets such as MPTrj (1.58M structures) or even bigger ones. To use this file, mpi4py is needed.

An additional file, preprocessing_data_mpi.py, is added to preserve backward compatibility, and the refactoring is kept to a minimum. Ideally, however, preprocessing_data.py could be replaced with the new file, provided we account for the import dependency on mpi4py.
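For context, here is a minimal sketch of the MPI-parallel pattern this PR describes (this is not the PR's actual code; the function and file names are illustrative): each rank lazily reads only its own stride of the input file with `ase.io.iread` and writes its own HDF5 shard, so no single process ever holds the full dataset in memory.

```python
# Hedged sketch, not the PR's implementation: each MPI rank reads a
# disjoint, strided subset of the frames and writes its own HDF5 shard.
from mpi4py import MPI
import ase.io
import h5py

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

def preprocess(xyz_path: str, out_prefix: str) -> None:
    # ase.io.iread yields frames lazily; the slice string f"{rank}::{size}"
    # selects every size-th frame starting at this rank's offset, so the
    # ranks partition the dataset without coordination.
    frames = ase.io.iread(xyz_path, index=f"{rank}::{size}")
    with h5py.File(f"{out_prefix}_{rank}.h5", "w") as f:
        for i, atoms in enumerate(frames):
            g = f.create_group(f"config_{i}")
            g.create_dataset("positions", data=atoms.get_positions())
            g.create_dataset("atomic_numbers", data=atoms.get_atomic_numbers())
            g.create_dataset("cell", data=atoms.get_cell()[:])

if __name__ == "__main__":
    preprocess("mptrj.extxyz", "processed")  # hypothetical file names
```

Run under MPI, e.g. `mpirun -n 8 python preprocessing_data_mpi.py`, so each rank emits its own shard; the shards can then be consumed independently downstream.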

ilyes319 changed the base branch from main to develop on June 21, 2024, 14:12
ilyes319 (Contributor)

Hey @chiang-yuan, thank you. Is this ready to be merged?

chiang-yuan (Contributor, Author)

It still needs some refactoring. It seems that modifying only the ase read part is not enough; I will refactor all of the HDF5 file-writing code as well, but it might take some time...
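To illustrate why the write path matters too (a hedged sketch under assumed names, not the planned refactor): lazy reading alone does not help if the writer still buffers every frame before the HDF5 write; streaming each frame into the file as it is read keeps peak memory at a single frame regardless of dataset size.

```python
# Anti-pattern: frames = list(ase.io.iread(xyz_path)) materializes the whole
# dataset before writing, reintroducing the OOM that lazy reading avoided.
import ase.io
import h5py

def write_streaming(xyz_path: str, h5_path: str) -> None:
    # Streaming variant: each frame is written as soon as it is read, so
    # memory use stays constant even for MPTrj-scale inputs.
    with h5py.File(h5_path, "w") as f:
        for i, atoms in enumerate(ase.io.iread(xyz_path)):
            g = f.create_group(f"config_{i}")
            g.create_dataset("positions", data=atoms.get_positions())
            g.create_dataset("atomic_numbers", data=atoms.get_atomic_numbers())
```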
