Some elements dropped while encoding to mod_pettifor representation #21

sgbaird · 2022-10-22T18:24:20Z

The following produces a list of 118 unique elements (disclaimer: contains unrealistic entries):

np.unique([elem.symbol for elem in list(_data.keys())])
array(['Ac', 'Ag', 'Al', 'Am', 'Ar', 'As', 'At', 'Au', 'B', 'Ba', 'Be',
       'Bh', 'Bi', 'Bk', 'Br', 'C', 'Ca', 'Cd', 'Ce', 'Cf', 'Cl', 'Cm',
       'Cn', 'Co', 'Cr', 'Cs', 'Cu', 'Db', 'Ds', 'Dy', 'Er', 'Es', 'Eu',
       'F', 'Fe', 'Fl', 'Fm', 'Fr', 'Ga', 'Gd', 'Ge', 'H', 'He', 'Hf',
       'Hg', 'Ho', 'Hs', 'I', 'In', 'Ir', 'K', 'Kr', 'La', 'Li', 'Lr',
       'Lu', 'Lv', 'Mc', 'Md', 'Mg', 'Mn', 'Mo', 'Mt', 'N', 'Na', 'Nb',
       'Nd', 'Ne', 'Nh', 'Ni', 'No', 'Np', 'O', 'Og', 'Os', 'P', 'Pa',
       'Pb', 'Pd', 'Pm', 'Po', 'Pr', 'Pt', 'Pu', 'Ra', 'Rb', 'Re', 'Rf',
       'Rg', 'Rh', 'Rn', 'Ru', 'S', 'Sb', 'Sc', 'Se', 'Sg', 'Si', 'Sm',
       'Sn', 'Sr', 'Ta', 'Tb', 'Tc', 'Te', 'Th', 'Ti', 'Tl', 'Tm', 'Ts',
       'U', 'V', 'W', 'Xe', 'Y', 'Yb', 'Zn', 'Zr'], dtype='<U2')

However, when encoding these in the "mod_pettifor" representation, there are 103 unique values:

mod_petti = [encode(k, "mod_pettifor") for k in _data.keys()]
mod_petti_comp = dict(zip(mod_petti, _data.values()))

mod_petti_comp.keys()
dict_keys([23, 25, 93, 90, 101, 96, 59, 8, 69, 1, 16, 51, 80, 82, 12, 33, 64, 92, 26, 52, 55, 48, 20, 50, 70, 100, 53, 54, 71, 2, 94, 27, 57, 87, 65, 32, 24, 75, 79, 61, 63, 83, 43, 14, 10, 72, 15, 7, 86, 28, 3, 89, 62, 19, 22, 13, 81, 60, 67, 30, 0, 99, 56, 38, 34, 29, 21, 4, 31, 17, 36, 95, 66, 58, 74, 68, 85, 49, 45, 18, 73, 47, 77, 44, 91, 46, 98, 40, 37, 39, 78, 84, 76, 41, 88, 5, 97, 9, 6, 35, 42, 102, 11])
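The gap between 118 symbols and 103 codes can be traced by grouping symbols by their encoded value. The following is a minimal sketch, not the element-coder implementation: `TOY_MOD_PETTI` and `find_collisions` are hypothetical names, and only the zero values for He, Rf, and Db are taken from this thread (the values for Li and Be are placeholders). It also shows that `dict(zip(...))` keeps only the last value for each repeated key, which is how entries disappear silently.

```python
from collections import defaultdict

# Toy stand-in for the real encoding table. Only the zeros for He, Rf, and Db
# come from this thread; the values for Li and Be are placeholders.
TOY_MOD_PETTI = {"He": 0, "Rf": 0, "Db": 0, "Li": 1, "Be": 2}


def find_collisions(symbols, encode_fn):
    """Group symbols by encoded value; keep codes shared by two or more symbols."""
    by_code = defaultdict(list)
    for symbol in symbols:
        by_code[encode_fn(symbol)].append(symbol)
    return {code: syms for code, syms in by_code.items() if len(syms) > 1}


collisions = find_collisions(TOY_MOD_PETTI, TOY_MOD_PETTI.__getitem__)
print(collisions)  # {0: ['He', 'Rf', 'Db']}

# dict(zip(...)) keeps only the last value seen for each repeated key,
# so five symbols collapse to three entries here.
codes = [TOY_MOD_PETTI[s] for s in TOY_MOD_PETTI]
collapsed = dict(zip(codes, TOY_MOD_PETTI))
print(len(TOY_MOD_PETTI), len(collapsed))  # 5 3
```

With the real package, the same grouping could presumably be run by passing a function that wraps `encode(symbol, "mod_pettifor")` as `encode_fn`.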

Not sure if #15 is related.

This is a blocker for using matbench-genmetrics with xtal2png+imagen-pytorch in sparks-baird/xtal2png#204, but not super time-sensitive. The fact that it's producing values from all 118 periodic elements despite not all elements being represented in the training dataset (pretty sure) is a concern from the generative modeling standpoint.

For context, the script I'm running is https://github.com/sparks-baird/matbench-genmetrics/blob/main/scripts/load_imagen_pytorch_generated.py.


kjappelbaum · 2022-10-22T18:33:33Z

Yes, certain elements have non-unique codes in some of the encodings (hence the warning in #15). I can look into making a version of the mod_pettifor encoding that avoids this issue.

TBH, I haven't yet looked into whether it is a bug or expected behavior.

sgbaird · 2022-10-28T15:47:34Z

Worked around it in the code. I just needed to remove the "symbol" column from the DataFrame I made, since its length no longer matched the deduplicated mod_petti keys. I wasn't using the symbol data anyway.

https://github.com/sparks-baird/matbench-genmetrics/blob/76dc21948b4a61eaa3224c56e289543aabacd985/src/matbench_genmetrics/utils/featurize.py#L66-L72

    mod_petti_df = pd.DataFrame(
        dict(symbol=_data.keys(), mod_petti=mod_petti_comp.keys(), contribution=mod_petti_comp.values()),
    ).sort_values("mod_petti")

changed to:

    mod_petti_df = pd.DataFrame(
        dict(mod_petti=mod_petti_comp.keys(), contribution=mod_petti_comp.values()),
    ).sort_values("mod_petti")

remove symbol column: kjappelbaum/element-coder#21 (comment)
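Dropping the column makes the lengths agree, but a dictionary built with `dict(zip(...))` still keeps only the last contribution for each repeated code. If the contributions of colliding elements should be combined instead, summing per code is one option. This is a sketch under that assumption (the thread does not say which behavior is wanted), and the `codes` and `contributions` values are made up for illustration.

```python
from collections import defaultdict

# Hypothetical inputs: codes from an encoder with collisions, and made-up
# contribution values standing in for _data.values().
codes = [0, 0, 1, 2]
contributions = [0.25, 0.5, 0.125, 0.125]

# dict(zip(...)) silently overwrites: the first contribution for code 0 is lost.
overwritten = dict(zip(codes, contributions))

# Summing per code keeps every contribution.
summed = defaultdict(float)
for code, value in zip(codes, contributions):
    summed[code] += value

print(overwritten)   # {0: 0.5, 1: 0.125, 2: 0.125}
print(dict(summed))  # {0: 0.75, 1: 0.125, 2: 0.125}
```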

kjappelbaum · 2022-11-02T16:41:54Z

Sorry for coming back to this so late.

Do you have a preferred way of solving this? I also do not like that

element-coder/src/element_coder/data/raw/mod_petti.json

Lines 105 to 120 in fa6a025

    
    "Rf": 0,
    "Db": 0,
    "Sg": 0,
    "Bh": 0,
    "Hs": 0,
    "Mt": 0,
    "Ds": 0,
    "Rg": 0,
    "Cn": 0,
    "Nh": 0,
    "Fl": 0,
    "Mc": 0,
    "Lv": 0,
    "Ts": 0,
    "Og": 0,
    "Uue": 0

all encode to the same value as He. The only question is what to replace them with. I see the following options:

Leave as is (this will raise the warning, and users need to decide how to deal with it)
Remove the duplicated entries (encoding will then raise an exception for elements without a defined encoding; one could catch this and substitute a fill value)
Replace the values ourselves with something else
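To make the second option concrete, here is a rough sketch of removing duplicated entries and falling back to a fill value for elements that no longer have an encoding. The names `RAW_TABLE`, `drop_duplicates`, and `encode_with_fill` are hypothetical and not part of element-coder; the zeros for He and Rf follow the JSON excerpt above, while the value for Li is a placeholder.

```python
import warnings

# Toy table: the zeros for He and Rf follow the thread; Li's value is a placeholder.
RAW_TABLE = {"He": 0, "Rf": 0, "Li": 1}


def drop_duplicates(table):
    """Keep only the first symbol seen for each code."""
    seen, unique = set(), {}
    for symbol, code in table.items():
        if code not in seen:
            seen.add(code)
            unique[symbol] = code
    return unique


def encode_with_fill(symbol, table, fill=None):
    """Look up a symbol; if it is undefined, use `fill` or re-raise when fill is None."""
    try:
        return table[symbol]
    except KeyError:
        if fill is None:
            raise
        warnings.warn(f"No encoding defined for {symbol!r}; using fill value {fill}")
        return fill


unique_table = drop_duplicates(RAW_TABLE)
print(unique_table)  # {'He': 0, 'Li': 1}
print(encode_with_fill("Rf", unique_table, fill=-1))  # -1
```

Without a fill value, `encode_with_fill("Rf", unique_table)` raises `KeyError`, which matches the "will raise an exception" behavior described for that option. Using a sentinel such as -1 keeps the filled elements distinguishable from He.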

kjappelbaum self-assigned this Oct 22, 2022

sgbaird added a commit to sparks-baird/matbench-genmetrics that referenced this issue Oct 28, 2022

Workaround for non-unique mappings in mod_petti representation

ecb4e6b

remove symbol column: kjappelbaum/element-coder#21 (comment)

sgbaird mentioned this issue Oct 28, 2022

Workaround for non-unique mappings in mod_petti representation sparks-baird/matbench-genmetrics#66

Merged
