
1. Introduction

nevetsmallah edited this page Oct 13, 2021 · 30 revisions

About

leADS (multi-label learning based on Active Dataset Subsampling) is a machine learning framework that subsamples pathway data to reduce the negative effect that imbalances in the distribution of pathways in a dataset have on the training loss.

Fig. 1: leADS workflow

Specifically, leADS (Fig. 1) performs training in three iterative steps (see Training):

  1. Building an acquisition model: In the very first iteration, an initially empty set is populated with data selected at random from a given pathway dataset (Fig. 1.1). An ensemble of g members is then constructed, where each member is trained on a randomly selected subset of the data.
  2. Dataset sub-sampling: During this step, a subset of pathway data is selected using one of the following four acquisition functions: entropy, mutual information, variation ratios, and normalized propensity scored precision at k (nPSP@k) (Fig. 1.2). Whichever function is used, the top per% of examples are retrieved, where per% ∈ (0, 100] is a prespecified hyperparameter indicating the subsampling proportion.
  3. Train using sub-sampled data: The examples from the previous step are used to train leADS with a multi-label 1-vs-All approach (similar to mlLGPR) (Fig. 1.3).
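
As a rough illustration of step 2, the sketch below scores examples with generic textbook forms of three of the acquisition functions (entropy, mutual information, and variation ratios) computed from ensemble predictions, then keeps the top per% of examples. It is not taken from the leADS source code: the function names, the `(g, n, labels)` array layout, and the 0.5 voting threshold are all assumptions made for this example, and nPSP@k is left out because its definition is not given on this page.

```python
import math
import numpy as np

def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of Bernoulli probabilities, clipped to avoid log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def entropy_score(probs):
    """probs has shape (g, n, labels): member x example x pathway label.
    Returns one score per example: entropy of the ensemble-mean prediction,
    summed over labels."""
    return binary_entropy(probs.mean(axis=0)).sum(axis=1)

def mutual_information_score(probs):
    """Entropy of the mean prediction minus the mean entropy of the members.
    High when members are individually confident but disagree."""
    expected = binary_entropy(probs).mean(axis=0).sum(axis=1)
    return entropy_score(probs) - expected

def variation_ratio_score(probs, threshold=0.5):
    """1 minus the fraction of members voting with the majority,
    averaged over labels (threshold is an assumed cut-off)."""
    votes = (probs >= threshold).mean(axis=0)      # fraction voting "present"
    mode_freq = np.maximum(votes, 1.0 - votes)     # share of the majority vote
    return (1.0 - mode_freq).mean(axis=1)

def select_top(scores, per):
    """Indices of the top per% highest-scoring examples, per in (0, 100]."""
    if not 0 < per <= 100:
        raise ValueError("per must lie in (0, 100]")
    k = max(1, math.ceil(len(scores) * per / 100.0))
    return np.argsort(-scores, kind="stable")[:k]

# Toy ensemble: g = 2 members, n = 3 examples, 1 label.
probs = np.array([[[0.5], [0.0], [1.0]],
                  [[0.5], [1.0], [1.0]]])
mi = mutual_information_score(probs)
picked = select_top(mi, per=33)   # the example the two members disagree on
```

In the toy data, the second example is the one on which the two members confidently disagree, so it receives the highest mutual information and variation ratio, while plain entropy cannot tell it apart from the first example, where both members are simply unsure.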

The three steps described above are repeated 𝜏 times. During each iteration q, some of the examples collected in the previous iteration q-1 are discarded at random so that examples not selected in the top per% can be used in round q. Once training is complete, two output files are generated: 1) the reduced pathway data containing per% of the examples, and 2) the trained model. These can then be applied to predict metabolic pathways from a newly sequenced genome (see Tutorial on pathway prediction).
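
The repeat-and-discard loop can be pictured with the following minimal sketch. It shows only one possible reading of the description above, not the actual leADS training code: `subsample_rounds`, `score_fn`, `discard_frac`, and the rule of refilling freed slots with the highest-scoring remaining examples are all hypothetical choices made for illustration.

```python
import math
import numpy as np

def subsample_rounds(n_examples, score_fn, per, tau, discard_frac=0.2, seed=0):
    """Run tau rounds of (score -> randomly discard -> refill).

    score_fn(selected) is a hypothetical callback standing in for steps 1-2:
    it receives the current selection and returns one acquisition score per
    example in the full dataset. discard_frac is an assumed fraction of the
    previous selection dropped at random in each round.
    """
    rng = np.random.default_rng(seed)
    k = max(1, math.ceil(n_examples * per / 100.0))   # size of the per% subset
    # First iteration: start from randomly selected examples.
    selected = rng.choice(n_examples, size=k, replace=False)
    for q in range(tau):
        scores = np.asarray(score_fn(selected), dtype=float)
        # Discard part of the previous selection at random.
        n_keep = k - int(discard_frac * k)
        kept = rng.choice(selected, size=n_keep, replace=False)
        # Refill the freed slots with the best-scoring examples not kept,
        # which gives previously unselected examples a chance to enter.
        mask = np.ones(n_examples, dtype=bool)
        mask[kept] = False
        candidates = np.flatnonzero(mask)
        order = candidates[np.argsort(-scores[candidates], kind="stable")]
        selected = np.concatenate([kept, order[: k - n_keep]])
    return np.sort(selected)

# Toy usage: fixed scores, so higher-indexed examples always look more useful.
result = subsample_rounds(
    n_examples=20,
    score_fn=lambda sel: np.arange(20),
    per=25,
    tau=3,
    discard_frac=0.4,
)
```

The subset size stays fixed at per% of the dataset across rounds; only its membership changes, which matches the statement that the final reduced dataset contains per% of the examples.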

leADS was evaluated on the pathway prediction task using 10 multi-organism pathway datasets (see Download files). In these experiments, leADS performed competitively against state-of-the-art pathway inference algorithms (see Evaluation). For more information, please consult our paper.

Citing

If you find leADS useful in your research, please cite the following paper [TO BE UPDATED]:

M. A. Basher, Abdur Rahman and Hallam, Steven J. "Multi-label pathway prediction based on active dataset subsampling", bioRxiv (2020).

Contact information

For any inquiries or issues on source code, please contact Steven Hallam and Abdurrahman Abul-Basher at: [email protected] and [email protected]

For any inquiries or issues on documentation, please contact Aditi N Nallan at: [email protected]
