
1. Introduction

Aditi Nagaraj Nallan edited this page Jun 21, 2021 · 30 revisions

About

leADS (multi-label learning based on Active Dataset Subsampling) is a simple framework that subsamples pathway data to reduce the negative effect that an imbalanced distribution of pathways in a dataset has on the training loss.

Fig. 1: leADS workflow

Specifically, leADS (Fig. 1) performs training in three iterative steps (see Training):

  1. Building an acquisition model: At the very first iteration, the working set of examples starts empty and is filled with randomly selected data from a given pathway dataset (Fig. 1. a-b). Then, an ensemble of g members is constructed (Fig. 1. c), where each member is trained on a randomly selected portion of the training data.
  2. Dataset sub-sampling: During this step, a subset of pathway data is selected using one of the following four acquisition functions: entropy, mutual information, variation ratios, and normalized propensity scored precision at k (nPSP@k). Whichever function is used, the top per% examples are retrieved, where per% ∈ (0, 100] is a prespecified hyperparameter indicating the subsampling proportion (Fig. 1. d).
  3. Train using sub-sampled data: Examples from the previous step are used to train leADS (Fig. 1. e) with a multi-label 1-vs-All approach (similar to mlLGPR).
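
To make step 2 more concrete, here is a minimal sketch of how three of the acquisition functions could be computed from ensemble predictions and used to keep the top per% examples. It uses the generic textbook definitions of predictive entropy, mutual information, and variation ratios, treating each pathway label as an independent binary variable; this is an assumption for illustration and may differ from the exact formulation in leADS. nPSP@k is omitted. The names `binary_entropy`, `acquisition_scores`, and `select_top` are hypothetical and are not part of the leADS code base.

```python
import math


def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli variable with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def acquisition_scores(member_probs):
    """Score one example from the predictions of an ensemble.

    member_probs[m][j] is the probability that ensemble member m assigns
    to label (pathway) j. Scores are summed over labels (an assumption).
    """
    g = len(member_probs)
    n_labels = len(member_probs[0])
    entropy = mutual_info = variation = 0.0
    for j in range(n_labels):
        probs = [member_probs[m][j] for m in range(g)]
        mean_p = sum(probs) / g
        h_of_mean = binary_entropy(mean_p)                      # total uncertainty
        mean_of_h = sum(binary_entropy(p) for p in probs) / g   # per-member uncertainty
        votes = sum(1 for p in probs if p >= 0.5)               # members predicting "present"
        entropy += h_of_mean
        mutual_info += h_of_mean - mean_of_h                    # disagreement between members
        variation += 1.0 - max(votes, g - votes) / g            # 1 - share of the majority vote
    return {
        "entropy": entropy,
        "mutual_information": mutual_info,
        "variation_ratios": variation,
    }


def select_top(scores, per):
    """Return indices of the top per% highest-scoring examples, per in (0, 100]."""
    if not 0 < per <= 100:
        raise ValueError("per must lie in (0, 100]")
    k = max(1, math.ceil(len(scores) * per / 100.0))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

In this sketch, an example on which all members agree confidently receives a mutual information near zero, while an example on which members disagree receives a higher score and is more likely to land in the top per%.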

The three steps described above are repeated 𝜏 times (Fig. 1. b-f). During each iteration q, some examples collected in the previous iteration q-1 are discarded at random (Fig. 1. g), which gives examples that were not selected in the top per% a chance to be used in round q. Once training is complete (Fig. 1. h), two output files are generated: 1) the reduced pathway data with per% examples, and 2) the trained model. These can then be applied to predict metabolic pathways from a newly sequenced genome (see Tutorial on pathway prediction).
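
The iteration scheme can be sketched as the loop below. This is one possible reading of the description above, not the leADS implementation: the function `subsample_rounds`, the `drop_fraction` parameter, and the `score_fn` callback (standing in for building the ensemble and applying an acquisition function) are all hypothetical, and model training is left as a comment.

```python
import math
import random


def subsample_rounds(n_examples, score_fn, per, tau, drop_fraction=0.3, seed=0):
    """Run tau rounds of subsampling and return the final selected indices.

    score_fn(candidates, q) must return one acquisition score per candidate
    index for round q (higher means more informative).
    """
    rng = random.Random(seed)
    all_idx = list(range(n_examples))
    k = max(1, math.ceil(n_examples * per / 100.0))

    # First iteration: start from randomly selected examples.
    selected = rng.sample(all_idx, k)

    for q in range(1, tau + 1):
        # Randomly discard part of what round q-1 collected (assumed fraction).
        n_keep = len(selected) - int(len(selected) * drop_fraction)
        kept = set(rng.sample(selected, n_keep))

        # Every example that was not kept competes for the freed slots.
        candidates = [i for i in all_idx if i not in kept]
        scores = score_fn(candidates, q)
        ranked = sorted(zip(candidates, scores), key=lambda t: t[1], reverse=True)
        refill = [i for i, _ in ranked[: k - n_keep]]

        selected = sorted(kept) + refill
        # A real pipeline would train the model on `selected` here.

    return selected
```

For example, `subsample_rounds(20, lambda cand, q: [float(i) for i in cand], per=25, tau=3)` returns five distinct indices out of twenty; the toy `score_fn` simply prefers larger indices.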

leADS was evaluated on the pathway prediction task using 10 multi-organism pathway datasets (see Download files). In these experiments, leADS performed competitively against state-of-the-art pathway inference algorithms (see Evaluation). For more information about leADS, please see our paper.

Citing

If you find leADS useful in your research, please consider citing the following paper:

M. A. Basher, Abdur Rahman and Hallam, Steven J. "Multi-label pathway prediction based on active dataset subsampling", bioRxiv (2020).

Contact information

For any inquiries or issues, please contact Abdurrahman Abul-Basher at: [email protected]
