The goals of this study were to answer the following questions:
- Do demographics play a major role in selecting the winner of the Nobel Prize in Physics?
- Which demographic factors have the biggest influence on the outcome?
- Who are the most likely winners of The Nobel Prize in Physics 2018?
To answer these questions, we collected demographic data on almost a thousand world-renowned physicists from DBpedia.
From the data, we constructed two sets of binary features. The first was a relatively high-dimensional feature set built directly from the original demographic data. The second was a lower-dimensional feature set, derived from the original features using the CorEx topic modeling approach.
We then split the data into training, validation and test sets for learning, model selection and assessment of generalization performance, respectively. Because the Nobel Prize in Physics cannot be awarded posthumously and we needed a model that predicts future laureates, the data was sampled so that the training set consisted of deceased physicists and the validation and test sets consisted of living physicists. A classifier two-sample hypothesis test formally detected that this sample selection bias introduced a covariate shift between the training set and the validation / test sets. We tried to correct for the covariate shift during learning by reweighting training samples according to their importance, estimated with the Kullback-Leibler Importance Estimation Procedure (KLIEP). Logistic regression, support vector machine and random forest classifiers were trained on both feature sets, with and without importance weighting, to predict Physics Nobel Laureates.
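To make the classifier two-sample test concrete, here is a minimal sketch in pure Python. It is not the code used in the study: the nearest-centroid classifier, the synthetic binary features and the function names are all assumptions chosen for brevity. The idea is to label source samples 0 and target samples 1, fit a classifier on half of the pooled data, and check whether its held-out accuracy is significantly above the 0.5 expected when both samples come from the same distribution (this null value assumes equal-size samples).

```python
import random
from math import comb

def nearest_centroid_fit(X, y):
    """Return the mean feature vector of each domain label (0 and 1)."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def nearest_centroid_predict(centroids, x):
    """Predict the domain whose centroid is closest in squared distance."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return 0 if dist(centroids[0]) <= dist(centroids[1]) else 1

def classifier_two_sample_test(source, target, seed=0):
    """Return (held-out accuracy, one-sided p-value) for H0: same distribution.

    Assumes source and target have equal size, so chance accuracy is 0.5.
    """
    rng = random.Random(seed)
    pooled = [(x, 0) for x in source] + [(x, 1) for x in target]
    rng.shuffle(pooled)
    half = len(pooled) // 2
    fit_part, eval_part = pooled[:half], pooled[half:]
    centroids = nearest_centroid_fit(
        [x for x, _ in fit_part], [t for _, t in fit_part]
    )
    correct = sum(
        nearest_centroid_predict(centroids, x) == t for x, t in eval_part
    )
    n = len(eval_part)
    # Under H0 the number of correct guesses follows Binomial(n, 0.5).
    p_value = sum(comb(n, k) for k in range(correct, n + 1)) / 2 ** n
    return correct / n, p_value

def sample_binary(n, probs, rng):
    """Draw n binary feature vectors with the given per-feature probabilities."""
    return [[int(rng.random() < p) for p in probs] for _ in range(n)]

# Synthetic stand-ins for the "deceased" (source) and "living" (target) samples.
rng = random.Random(42)
source = sample_binary(200, [0.2] * 5 + [0.5] * 5, rng)
target = sample_binary(200, [0.8] * 5 + [0.5] * 5, rng)

accuracy, p_value = classifier_two_sample_test(source, target)
print(f"held-out accuracy = {accuracy:.2f}, p-value = {p_value:.3g}")
```

A small p-value means the classifier can tell the two samples apart, which is evidence of a shift in the feature distribution. Any classifier can play the role of the nearest-centroid model here; a stronger one simply gives the test more power.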
A new performance measure known as normalized area under the Matthews Correlation Coefficient curve,
The optimal threshold of 0.513, which corresponds to the maximum value of the MCC and at which the true negative rate (TNR) is higher than the true positive rate (TPR), was chosen as the operating point of the model. This threshold was chosen because minimizing false positives (i.e. maximizing the TNR) is more important than minimizing false negatives (i.e. maximizing the TPR) when classifying physicists as laureates or non-laureates.
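The threshold selection described above can be sketched as a simple sweep over candidate thresholds, keeping the one with the highest MCC. The labels and scores below are made-up toy values, not the study's data, and the helper names are hypothetical.

```python
from math import sqrt

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0 when a margin is empty."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def confusion(y_true, scores, threshold):
    """Count TP, TN, FP, FN when predicting 1 for scores >= threshold."""
    tp = tn = fp = fn = 0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 0 and y == 0:
            tn += 1
        elif pred == 1 and y == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn

def best_mcc_threshold(y_true, scores):
    """Return (threshold, MCC) maximizing MCC over the observed scores."""
    best_value, best_threshold = -1.0, None
    for t in sorted(set(scores)):
        value = mcc(*confusion(y_true, scores, t))
        if value > best_value:
            best_value, best_threshold = value, t
    return best_threshold, best_value

# Toy data: 1 = laureate, 0 = non-laureate, with predicted probabilities.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.55, 0.70, 0.90]

threshold, value = best_mcc_threshold(y_true, scores)
tp, tn, fp, fn = confusion(y_true, scores, threshold)
print(f"threshold = {threshold}, MCC = {value:.3f}")
print(f"TPR = {tp / (tp + fn):.2f}, TNR = {tn / (tn + fp):.2f}")
```

Printing the TPR and TNR at the chosen threshold makes it easy to check whether the operating point favors avoiding false positives, as the text requires.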
The logistic regression classifier achieved an MCC of 0.36 when evaluated on the test data, which indicates that it performs much better than both random chance (MCC = 0) and the "naive" baseline classifier (MCC = 0.19). From this we concluded that there are significant underlying patterns in the demographic data that correlate with being a Physics Nobel Laureate. However, we avoided making strong statements about the classifier's performance in absolute terms and concluded that we would not be willing to make recommendations to the Nobel Committee based on its predictions. Essentially, the number of false positives was too high to make any substantial claims about biases that may be present when deciding the winners of the Nobel Prize in Physics.
In spite of this, we looked at the demographic factors that had the biggest influence on the logistic regression model classifying a physicist as a laureate. Being an experimental physicist was by far the most influential feature. The next two most influential features were having at least one doctoral student who is a physics laureate and living for at least 65-79 years. Other interesting influential features were being a citizen of France or Switzerland, working at Bell Labs or the University of Cambridge, being an alumnus in Asia and having at least two alma maters.
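One common way to read feature influence off a logistic regression model with binary features is to rank the features by the magnitude of their fitted coefficients. The text does not state that this is how influence was measured in the study, so the following is only an illustrative sketch: the gradient-descent fit, the synthetic data and the feature names are all assumptions.

```python
import random
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=300):
    """Fit unregularized logistic regression by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, t in zip(X, y):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - t
            grad_b += err
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Hypothetical binary features and a made-up generating model.
rng = random.Random(0)
names = ["is_experimentalist", "has_laureate_student", "uninformative"]
true_w, true_b = [3.0, 1.0, 0.0], -2.0
X = [[int(rng.random() < 0.5) for _ in names] for _ in range(500)]
y = [
    int(rng.random() < sigmoid(true_b + sum(w * x for w, x in zip(true_w, row))))
    for row in X
]

weights, bias = fit_logistic(X, y)

# Rank features by absolute coefficient; the sign gives the direction of influence.
ranked = sorted(zip(names, weights), key=lambda pair: abs(pair[1]), reverse=True)
for name, weight in ranked:
    print(f"{name:>22}: {weight:+.2f}")
```

Because every feature is binary, the coefficients are on a comparable scale and each one is the change in log-odds of being classified as a laureate when the feature is present. With continuous features, standardization would be needed before comparing magnitudes.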
Finally, we used the logistic regression model to predict the most likely winners of the 2018 Nobel Prize in Physics. We were unable to correctly predict the winners because they were not in the original list of physicists scraped from Wikipedia. However, we found that the actual winners (Gérard Mourou, Arthur Ashkin and Donna Strickland) possessed several of the most important demographic factors identified by the logistic regression model.
This study has shown that machine learning is a promising approach to identifying underlying patterns in Physics Nobel Laureate demographic data. However, providing insight into biases present in the selection process would require a predictive model based on the demographics of the nominators, the members of the Royal Swedish Academy of Sciences, the experts who assess the nominees' work and the nominees themselves.
The existence of the Nomination Archive shows that the Nobel Committee collects such data. Although the Committee has its reasons for not making this data public until 50 years later, we recommend that it explore internally the possibility of using the data to build predictive models. Such models may be able to provide significant insight into any biases that may be present in the selection process.