This repository contains the code for the paper:
Hussain, Z., Mata, R., Newell, B. R., & Wulff, D. U. (2024). Probing the contents of semantic representations from text, behavior, and brain data using the psychNorms metabase. arXiv. https://arxiv.org/abs/2412.04936
@misc{hussain2024probingcontentssemanticrepresentations,
title={Probing the contents of semantic representations from text, behavior, and brain data using the psychNorms metabase},
author={Zak Hussain and Rui Mata and Ben R. Newell and Dirk U. Wulff},
year={2024},
eprint={2412.04936},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2412.04936},
}
- To set up the environment, you can use the environment.yml file in the root directory of this repository.
- Before running any other code, make sure to run code/setup.py to download/generate the necessary data files. Please note that to reduce the download size of the representations, we have already subsetted them to their intersection with the psychNorms dataset.
- For licensing reasons, you will need to manually download SWOW-EN.R100.csv into data/free_assoc/ from the Small World of Words.
- To obtain the representations that we trained ourselves, you will need to run the notebooks in code/embed_training/.
- Analyses (code/rsa and code/rca) can then be run in the order implied by the numbering of the notebooks.
- Finally, figures can be generated by running the notebooks in code/figures/.
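Because the SWOW file has to be downloaded by hand, it can help to confirm it is in place before starting the notebooks. The following is a minimal sketch, not part of the repository: the helper name `missing_manual_downloads` is hypothetical, and it assumes the script is run from the repository root.

```python
from pathlib import Path

# Path taken from the setup steps above; running from the repository root is an assumption.
REPO_ROOT = Path(".")
SWOW_FILE = REPO_ROOT / "data" / "free_assoc" / "SWOW-EN.R100.csv"


def missing_manual_downloads(files):
    """Return the entries of `files` that do not exist on disk (hypothetical helper)."""
    return [path for path in files if not Path(path).is_file()]


if __name__ == "__main__":
    # Report any manually downloaded file that is not yet in place.
    for path in missing_manual_downloads([SWOW_FILE]):
        print(f"Missing manual download: {path}")
```

If nothing is printed, the file is where the analysis notebooks expect it.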
The original sources of the representations are as follows:
Text:
- CBOW_GoogleNews ('GoogleNews-vectors-negative300.bin.gz')
- fastText_CommonCrawl ('crawl-300d-2M.vec.zip')
- fastText_Wiki_News ('wiki-news-300d-1M.vec.zip')
- fastTextSub_OpenSub ('English, en, OpenSubtitles')
- GloVe_CommonCrawl ('glove.840B.300d.zip')
- GloVe_Twitter ('glove.twitter.27B.zip')
- GloVe_Wikipedia ('glove.6B.zip')
- LexVec_CommonCrawl ('Word Vectors (2.2GB)')
- morphoNLM ('HSMN+csmRNN')
- spherical_text_Wikipedia ('300-d')
Brain:
- microarray ('results/tungsten/word_projections.pickle')
- EEG_speech ('cognival-vectors/eeg_speech/naturalspeech_scaled.txt')
- EEG_text ('cognival-vectors/eeg_text/zuco_scaled.txt')
- fMRI_speech_hyper_align ('cognival-vectors/fmri/harry-potter/1000-random-voxels/', further processed with 'hyper alignment')
- fMRI_text_hyper_align ('cognival-vectors/fmri/alice/', further processed with 'hyper alignment')
- eye_tracking ('cognival-vectors/eye-tracking/all_scaled.txt')
Behavior:
- PPMI_SVD_SWOW ('SWOW-EN18', further processed with PPMI and SVD transformations)
- SGSoftMaxInput_SWOW ('SWOW-EN18', further processed with Skip-Gram Softmax embedding algorithm)
- SGSoftMaxOutput_SWOW ('SWOW-EN18', further processed with Skip-Gram Softmax embedding algorithm)
- PPMI_SVD_SouthFlorida ('Appendix A. The normed cues, their targets and related information', further processed with PPMI and SVD transformations)
- PPMI_SVD_EAT ('ea-thesaurus.json', further processed with PPMI and SVD transformations)
- THINGS ('spose_embedding_49d_sorted.txt' and 'items1854names.tsv')
- feature_overlap ('double_words.csv')
- norms_sensorimotor ('Lancaster_sensorimotor_norms_for_39707_words.csv')
- compo_attribs ('word_ratings.zip')
- SVD_sim_rel: 'AG203', 'BakerVerb', 'MartinezAldana', 'MC30', 'MEN3000', 'RG65', 'SimLex999', 'SimVerb3500', 'SL7576sem', 'SL7576vis', 'WP300', 'YP130', 'Atlasify240', 'GM30', 'MT287', 'MT771', 'Rel122', 'RW2034', 'WordSim353', 'Zie25', 'Zie30' (datasets were combined, min-max scaled and then processed with SVD transformation).
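Several of the behavior representations above are described as "further processed with PPMI and SVD transformations". As a rough illustration of what such a pipeline can look like, here is a minimal sketch assuming a cue-by-response count matrix. The function names (`ppmi`, `svd_embed`), the toy counts, and the choice of two dimensions are assumptions made for illustration; this is not the code in code/embed_training/ and the authors' settings may differ.

```python
import numpy as np


def ppmi(counts):
    """Positive pointwise mutual information of a cue-by-response count matrix."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    # Counts expected if cues and responses were independent.
    expected = row @ col / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(counts / expected)
    # Zero counts (or empty rows/columns) give -inf or NaN; treat them as no association.
    pmi[~np.isfinite(pmi)] = 0.0
    return np.maximum(pmi, 0.0)


def svd_embed(matrix, n_dims):
    """Project rows onto the top `n_dims` singular directions (U scaled by S)."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :n_dims] * s[:n_dims]


# Toy cue-by-response counts (made-up numbers, not SWOW data).
toy_counts = np.array([
    [10, 0, 2],
    [0, 8, 1],
    [3, 1, 6],
])
toy_embedding = svd_embed(ppmi(toy_counts), n_dims=2)
```

Each row of `toy_embedding` is then a low-dimensional vector for one cue word.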
Note: compo_attribs has been renamed to 'experiential attributes' in the paper and figures to be consistent with the terminology in the psychNorms metabase.
Information on the norms used in our analysis can be found in the psychNorms repository, and in the metadata file in data/psychNorms/psychNorms_metadata.csv.
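The setup notes mention that the representations were subsetted to their intersection with the psychNorms dataset. A minimal sketch of that kind of vocabulary intersection is shown below, assuming a representation stored as a word-to-vector mapping. The function name `subset_to_shared_vocab` and the toy words and vectors are hypothetical, and the actual file formats in this repository may differ.

```python
def subset_to_shared_vocab(embedding, norm_words):
    """Keep only the embedding entries whose word also appears in `norm_words`."""
    shared = set(embedding) & set(norm_words)
    # Preserve the embedding's original word order.
    return {word: vector for word, vector in embedding.items() if word in shared}


# Toy data (hypothetical words and vectors, for illustration only).
toy_embedding = {"dog": [0.1, 0.3], "cat": [0.2, 0.1], "xylem": [0.9, 0.4]}
toy_norm_words = ["cat", "dog", "house"]
toy_subset = subset_to_shared_vocab(toy_embedding, toy_norm_words)
```

Here `toy_subset` keeps "dog" and "cat" and drops "xylem", which has no entry among the norm words.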