
1. Introduction

Abdurrahman Abul-Basher edited this page Jun 3, 2021 · 30 revisions

About

leADS (multi-label learning based on active dataset subsampling) is a simple framework that subsamples pathway data to reduce the negative impact on training loss caused by imbalances in the distribution of pathways in the dataset. Specifically, leADS (Fig. \ref{fig:workflow}A) performs training in three iterative steps:

  1. Building an acquisition model. At the very first iteration, an initially empty set is filled with randomly selected data from a given pathway dataset (Fig. \ref{fig:workflow}a-b). Then, an ensemble of g members is constructed (Fig. \ref{fig:workflow}c), where each member of the ensemble is trained on a randomly selected portion of the training data.
  2. Dataset sub-sampling. During this step, a subset of pathway data is selected using one of the following four acquisition functions: entropy, mutual information, variation ratios, and normalized propensity scored precision at k (nPSP@k). For the chosen function, the top per% examples are retrieved, where per% ($\in (0, 100]$) is a prespecified hyperparameter indicating the subsampling proportion (Fig. \ref{fig:workflow}d).
  3. Train using sub-sampled data. Examples from the previous step are used to train leADS with the multi-label 1-vs-All approach, similar to mlLGPR (Fig. \ref{fig:workflow}e).
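As a rough illustration of step 2, the sketch below scores examples with three of the acquisition functions named above (entropy, mutual information, variation ratios) and then keeps the top per% of them. It is a minimal sketch under stated assumptions, not the leADS implementation: it treats a single binary pathway label, assumes each ensemble member outputs a probability for that label, and uses the standard ensemble-based definitions of these scores, which may differ in detail from those in the paper. All function names are hypothetical, and nPSP@k is omitted.

```python
import math


def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli variable with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def predictive_entropy(member_probs):
    """Entropy of the ensemble-averaged probability for one label."""
    mean_p = sum(member_probs) / len(member_probs)
    return binary_entropy(mean_p)


def mutual_information(member_probs):
    """Predictive entropy minus the average entropy of the individual members."""
    expected = sum(binary_entropy(p) for p in member_probs) / len(member_probs)
    return predictive_entropy(member_probs) - expected


def variation_ratio(member_probs, threshold=0.5):
    """Fraction of members whose vote differs from the majority vote."""
    votes = sum(1 for p in member_probs if p >= threshold)
    modal = max(votes, len(member_probs) - votes)
    return 1.0 - modal / len(member_probs)


def select_top(scores, per):
    """Indices of the top per% highest-scoring examples, with per in (0, 100]."""
    if not 0 < per <= 100:
        raise ValueError("per must lie in (0, 100]")
    k = max(1, math.ceil(len(scores) * per / 100.0))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical ensemble outputs: one row per example, one column per member.
    ensemble_probs = [[0.90, 0.95, 0.85], [0.10, 0.90, 0.50], [0.45, 0.55, 0.50]]
    scores = [mutual_information(row) for row in ensemble_probs]
    print(select_top(scores, per=50))
```

Mutual information is high only when members disagree with each other, whereas entropy is also high when all members agree that the label is uncertain, so the two functions can rank the same examples differently.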

The three steps are repeated $\tau (\in \mathbb{Z}_{\geq 1})$ times (Fig. \ref{fig:workflow}b-f). At each iteration $q$, some of the examples collected in the previous iteration $q-1$ are discarded at random (Fig. \ref{fig:workflow}A.g), which gives examples that were not selected in the top per% a chance to be used in round $q$. Once training is complete (Fig. \ref{fig:workflow}h): i) the reduced pathway dataset containing per% of the examples is produced; and ii) the trained model is stored, and can then be applied to predict metabolic pathways from a newly sequenced genome. leADS was evaluated on the pathway prediction task using 10 multi-organism pathway datasets, where it achieved competitive performance against state-of-the-art pathway inference algorithms. For more information about leADS, please see our paper.
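The control flow of the $\tau$ rounds can be sketched as follows. This is an illustrative toy rather than the leADS code: `subsample_loop`, `score_fn`, `init_frac`, and `drop_frac` are hypothetical names, the scoring callable stands in for training the ensemble and applying an acquisition function, and the sketch assumes that the random discard happens between rounds and is skipped after the final one.

```python
import math
import random


def subsample_loop(n_examples, score_fn, per, tau,
                   init_frac=0.3, drop_frac=0.2, seed=0):
    """Toy version of the score / select / discard cycle over tau rounds.

    score_fn(selected, candidates) returns one score per candidate index.
    All parameter names and defaults are assumptions made for illustration.
    """
    rng = random.Random(seed)
    all_idx = list(range(n_examples))
    k = max(1, math.ceil(n_examples * per / 100.0))

    # First round: start from a random subset of the data.
    selected = rng.sample(all_idx, max(1, int(n_examples * init_frac)))

    for q in range(1, tau + 1):
        # Step 1: stand-in for training the ensemble on `selected`
        # and scoring every example with an acquisition function.
        scores = score_fn(selected, all_idx)

        # Step 2: keep the top per% of examples by acquisition score.
        ranked = sorted(all_idx, key=lambda i: scores[i], reverse=True)
        selected = ranked[:k]

        # Step 3: the multi-label model would be trained on `selected` here.

        if q < tau:
            # Randomly discard some kept examples so that others
            # have a chance to enter the selection in the next round.
            n_drop = int(len(selected) * drop_frac)
            dropped = set(rng.sample(selected, n_drop))
            selected = [i for i in selected if i not in dropped]

    return selected


if __name__ == "__main__":
    # Hypothetical scorer: pretend higher indices are more informative.
    reduced = subsample_loop(10, lambda sel, cand: [float(i) for i in cand],
                             per=30, tau=3)
    print(reduced)
```

The returned index list plays the role of the reduced dataset with per% of the examples; in the actual framework the trained model would be stored alongside it.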

Citing

If you find leADS useful in your research, please consider citing the following paper:

M. A. Basher, Abdur Rahman and Hallam, Steven J. "Multi-label pathway prediction based on active dataset subsampling." bioRxiv (2020).

Contact information

For any inquiries or issues, please contact Abdurrahman Abul-Basher at: [email protected]
