This repository provides evidence that the original procedure proposed in Trapp et al. for inferring epigenetic age from single-cell methylation data contains mistakes and produces wrong results. I specifically consider one of the strangest results, the so-called "ground zero", i.e. the global minimum of epigenetic age reached during embryogenesis. I show that ground zero is actually an artifact of the inherently flawed maximum likelihood estimation (MLE) procedure proposed by the authors. In short, the authors maximize a polynomial likelihood function with a brute-force search over a manually defined interval, although the global solution can be found directly. I show that the global solution of the MLE problem gives results different from those reported in their paper. The concrete procedure is provided in the notebook proof.
The problem of single-cell age inference is reduced to a maximum likelihood estimation problem, namely:
where
the weight
Thus, the weights and biases obtained by fitting multiple regression models on the bulk dataset are applied to predict cell age in the single-cell dataset. Combining equations (1) and (2), we have:
Objective (3) is not the whole cell age inference procedure. Two problems have to be resolved first:
- Probability values are not bounded in the proposed linear function approximation;
- The probabilities are not conditioned on the concrete single-cell methylation profile.
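
To illustrate the first problem, here is a minimal sketch. It assumes the per-site probability is a linear function of age, `weight * age + bias`; the function name and the numeric values of `weight` and `bias` are hypothetical and are not taken from the fitted models.

```python
import numpy as np


def linear_probability(age, weight, bias):
    """Assumed linear model of the methylation probability at one site."""
    return weight * np.asarray(age, dtype=float) + bias


# Hypothetical ages, weight and bias, chosen only for illustration.
ages = np.array([-30.0, 0.0, 40.0, 100.0])
p = linear_probability(ages, weight=0.01, bias=0.2)

# A valid probability must lie in [0, 1]; a linear function does not guarantee it.
out_of_range = (p < 0.0) | (p > 1.0)

for age, value, bad in zip(ages, p, out_of_range):
    status = "outside [0, 1]" if bad else "ok"
    print(f"age={age:7.1f}  p={value:6.3f}  {status}")
```

With these illustrative numbers, the values at the two extreme ages fall outside [0, 1], which is why some form of bounding is needed before the values can be treated as probabilities.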
For the first problem, the authors proposed to bound the linear function by manually predetermined constants, e.g.:
\begin{equation*}
\begin{aligned}
\text{if } p \geq 1 &\rightarrow p = 1 - \epsilon \\
\text{if } p \leq 0 &\rightarrow p = \epsilon
\end{aligned}
\end{equation*}
where
For the second problem, the authors proposed to condition the probabilities on the single-cell methylation profile by modifying equation (2):
where
The authors solve this problem with a brute-force approach, i.e. they split the search interval (
Below I propose a corrected procedure that avoids splitting the search interval into K uniformly spaced values. Instead, I find the global maximum of the likelihood function on the interval directly, using the properties of polynomial functions. Moreover, I show that the probability bounding procedure leads to wrong results when the search interval is expanded to negative ages (as was done in the original paper), and I show how this interval can be defined automatically from the properties of the monomials of the polynomial.
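
The difference between the two search strategies can be sketched as follows. This is a toy illustration, not the code from the notebook: the helper names `polynomial_argmax` and `grid_argmax`, the three linear factors and the interval are all hypothetical. It relies only on the fact that a polynomial attains its maximum on a closed interval either at an endpoint or at a real root of its derivative.

```python
import numpy as np
from numpy.polynomial import Polynomial


def polynomial_argmax(poly, lo, hi):
    """Exact maximum of a polynomial on [lo, hi]: check endpoints and critical points."""
    crit = poly.deriv().roots()
    # Keep only (numerically) real critical points that lie inside the interval.
    real = crit[np.abs(crit.imag) < 1e-9].real
    inside = real[(real >= lo) & (real <= hi)]
    candidates = np.concatenate(([lo, hi], inside))
    values = poly(candidates)
    best = int(np.argmax(values))
    return float(candidates[best]), float(values[best])


def grid_argmax(poly, lo, hi, k):
    """Brute-force search over k uniformly spaced points of [lo, hi]."""
    grid = np.linspace(lo, hi, k)
    values = poly(grid)
    best = int(np.argmax(values))
    return float(grid[best]), float(values[best])


# Toy likelihood: a product of three hypothetical linear factors (bias + weight * age).
# Polynomial coefficients are listed from the constant term upwards.
factors = [Polynomial([0.2, 0.1]), Polynomial([0.9, -0.1]), Polynomial([0.5, 0.05])]
likelihood = factors[0] * factors[1] * factors[2]

exact = polynomial_argmax(likelihood, 0.0, 8.0)
coarse = grid_argmax(likelihood, 0.0, 8.0, k=5)
print("exact :", exact)
print("coarse:", coarse)
```

The exact search needs only the roots of the derivative, so its result does not depend on a grid resolution K, while the brute-force result can never exceed it and generally moves as K changes.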
I want to thank Evgeny Efimov for data collection and preprocessing.
For now just put the link to this repo :3