Some elements dropped while encoding to mod_pettifor representation #21

Open
sgbaird opened this issue Oct 22, 2022 · 3 comments

sgbaird commented Oct 22, 2022

The following produces a list of 118 unique elements (disclaimer: contains unrealistic entries):

np.unique([elem.symbol for elem in list(_data.keys())])
array(['Ac', 'Ag', 'Al', 'Am', 'Ar', 'As', 'At', 'Au', 'B', 'Ba', 'Be',
       'Bh', 'Bi', 'Bk', 'Br', 'C', 'Ca', 'Cd', 'Ce', 'Cf', 'Cl', 'Cm',
       'Cn', 'Co', 'Cr', 'Cs', 'Cu', 'Db', 'Ds', 'Dy', 'Er', 'Es', 'Eu',
       'F', 'Fe', 'Fl', 'Fm', 'Fr', 'Ga', 'Gd', 'Ge', 'H', 'He', 'Hf',
       'Hg', 'Ho', 'Hs', 'I', 'In', 'Ir', 'K', 'Kr', 'La', 'Li', 'Lr',
       'Lu', 'Lv', 'Mc', 'Md', 'Mg', 'Mn', 'Mo', 'Mt', 'N', 'Na', 'Nb',
       'Nd', 'Ne', 'Nh', 'Ni', 'No', 'Np', 'O', 'Og', 'Os', 'P', 'Pa',
       'Pb', 'Pd', 'Pm', 'Po', 'Pr', 'Pt', 'Pu', 'Ra', 'Rb', 'Re', 'Rf',
       'Rg', 'Rh', 'Rn', 'Ru', 'S', 'Sb', 'Sc', 'Se', 'Sg', 'Si', 'Sm',
       'Sn', 'Sr', 'Ta', 'Tb', 'Tc', 'Te', 'Th', 'Ti', 'Tl', 'Tm', 'Ts',
       'U', 'V', 'W', 'Xe', 'Y', 'Yb', 'Zn', 'Zr'], dtype='<U2')

However, when encoding these in the "mod_pettifor" representation, there are 103 unique values:

mod_petti = [encode(k, "mod_pettifor") for k in _data.keys()]
mod_petti_comp = dict(zip(mod_petti, _data.values()))

mod_petti_comp.keys()
dict_keys([23, 25, 93, 90, 101, 96, 59, 8, 69, 1, 16, 51, 80, 82, 12, 33, 64, 92, 26, 52, 55, 48, 20, 50, 70, 100, 53, 54, 71, 2, 94, 27, 57, 87, 65, 32, 24, 75, 79, 61, 63, 83, 43, 14, 10, 72, 15, 7, 86, 28, 3, 89, 62, 19, 22, 13, 81, 60, 67, 30, 0, 99, 56, 38, 34, 29, 21, 4, 31, 17, 36, 95, 66, 58, 74, 68, 85, 49, 45, 18, 73, 47, 77, 44, 91, 46, 98, 40, 37, 39, 78, 84, 76, 41, 88, 5, 97, 9, 6, 35, 42, 102, 11])
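
One way to see which symbols collapse onto the same code is to group the symbols by their encoded value. The following is only a minimal sketch: `TOY_CODES` and `find_collisions` are hypothetical names, and the toy table stands in for `encode(k, "mod_pettifor")` with illustrative values rather than the real scale.

```python
from collections import defaultdict

# Hypothetical stand-in for encode(symbol, "mod_pettifor").
# The values are illustrative only, not the real modified Pettifor scale.
TOY_CODES = {"He": 0, "Rf": 0, "Og": 0, "Ne": 1, "Fe": 71}


def find_collisions(symbols, encoder):
    """Return {code: [symbols]} for every code shared by more than one symbol."""
    groups = defaultdict(list)
    for symbol in symbols:
        groups[encoder(symbol)].append(symbol)
    return {code: syms for code, syms in groups.items() if len(syms) > 1}


collisions = find_collisions(TOY_CODES, TOY_CODES.__getitem__)
print(collisions)
```

With the real encoder, passing something like `lambda s: encode(s, "mod_pettifor")` as `encoder` should list the colliding symbols in the same way.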

Not sure if #15 is related.

This is a blocker for using matbench-genmetrics with xtal2png+imagen-pytorch in sparks-baird/xtal2png#204, but it's not super time-sensitive. The fact that it's producing values for all 118 elements of the periodic table, even though (I'm pretty sure) not all of them are represented in the training dataset, is a concern from the generative modeling standpoint.

For context, the script I'm running is https://github.com/sparks-baird/matbench-genmetrics/blob/main/scripts/load_imagen_pytorch_generated.py.

@kjappelbaum
Owner

kjappelbaum commented Oct 22, 2022

Yes, certain elements have non-unique codes in some of the encodings (hence the warning in #15). I can look into making a version of mod_pettifor that removes this issue.

TBH, I haven't yet looked into whether it is a bug or expected behavior.

@kjappelbaum kjappelbaum self-assigned this Oct 22, 2022
@sgbaird
Author

sgbaird commented Oct 28, 2022

Worked around it in the code. I just needed to remove the "symbol" column from the DataFrame I made. I wasn't using the "symbol" data anyway.

https://github.com/sparks-baird/matbench-genmetrics/blob/76dc21948b4a61eaa3224c56e289543aabacd985/src/matbench_genmetrics/utils/featurize.py#L66-L72

    mod_petti_df = pd.DataFrame(
        dict(symbol=_data.keys(), mod_petti=mod_petti_comp.keys(), contribution=mod_petti_comp.values()),
    ).sort_values("mod_petti")

changed to:

    mod_petti_df = pd.DataFrame(
        dict(mod_petti=mod_petti_comp.keys(), contribution=mod_petti_comp.values()),
    ).sort_values("mod_petti")
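
Note that `dict(zip(mod_petti, _data.values()))` keeps only the last value for each repeated key, so contributions from colliding elements overwrite each other. If summing them is acceptable (an assumption that may not suit every use case), the accumulation could look roughly like this sketch; `accumulate_by_code` and the toy data are hypothetical.

```python
from collections import defaultdict


def accumulate_by_code(codes, contributions):
    """Sum contributions that share a code instead of letting later ones overwrite earlier ones."""
    totals = defaultdict(float)
    for code, contribution in zip(codes, contributions):
        totals[code] += contribution
    return dict(totals)


# Hypothetical toy data: the first and third entries share code 0.
codes = [0, 71, 0]
contributions = [0.25, 0.5, 0.25]

overwritten = dict(zip(codes, contributions))  # the later duplicate wins
summed = accumulate_by_code(codes, contributions)  # duplicates are added up
```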

@kjappelbaum
Owner

Sorry for coming back to this so late.

Do you have a preferred way of solving this? I also do not like that

"Rf": 0,
"Db": 0,
"Sg": 0,
"Bh": 0,
"Hs": 0,
"Mt": 0,
"Ds": 0,
"Rg": 0,
"Cn": 0,
"Nh": 0,
"Fl": 0,
"Mc": 0,
"Lv": 0,
"Ts": 0,
"Og": 0,
"Uue": 0
all encode to the same value as He. The only question is what to replace them with. I see the following options:

  • Leave as is (the warning will still be raised, and users need to decide how to deal with it)
  • Remove the duplicated entries (encoding will then raise an exception if an element has no defined encoding; one could catch this and return some fill value)
  • Replace the values ourselves with something else
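
The second and third options can be sketched roughly as follows. This is only an illustration under stated assumptions: `TABLE` is a hypothetical excerpt of an encoding table, `UNDEFINED` is an assumed set of elements without a meaningful code, and none of the helper names below are part of the library.

```python
# Hypothetical excerpt of an encoding table; the real table is larger.
TABLE = {"He": 0, "Ne": 1, "Rf": 0, "Db": 0}
UNDEFINED = {"Rf", "Db"}  # assumed set of elements without a meaningful code


def drop_undefined(table, undefined):
    """Option 2: remove entries that only duplicate another element's code."""
    return {sym: code for sym, code in table.items() if sym not in undefined}


def encode_with_fill(symbol, table, fill=None):
    """Option 2, continued: return a fill value instead of raising KeyError."""
    return table.get(symbol, fill)


def assign_unique(table, undefined):
    """Option 3: give each undefined element its own code past the current maximum."""
    result = drop_undefined(table, undefined)
    next_code = max(result.values()) + 1
    for sym in table:
        if sym in undefined:
            result[sym] = next_code
            next_code += 1
    return result


pruned = drop_undefined(TABLE, UNDEFINED)
unique = assign_unique(TABLE, UNDEFINED)
```

Appending fresh codes past the maximum keeps every code unique, but whether that ordering is chemically sensible is a separate question.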
