create bundle for the soothie challenge
Nicolas HOMBERG committed Jul 31, 2024
1 parent f1d0479 commit f8cf49d
Showing 29 changed files with 1,793 additions and 7 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -105,6 +105,7 @@ sudo docker login -u hombergn
#upload on dockerhub
sudo docker push hombergn/hadaca3_light:latest
sudo docker push hombergn/hadaca3_pyr:latest
#Single command to build and push.
sudo docker build -t hombergn/hadaca3_light . && sudo docker push hombergn/hadaca3_light:latest
4 changes: 3 additions & 1 deletion bundle/overview.md
@@ -1,7 +1,9 @@
-Health data challenge (HADACA) is a serie of data challenge aiming to contribute to scientific crowdsourced benchmarking in the field of data analysis in health.
+Health data challenge (HADACA) is a series of data challenge aiming to contribute to scientific crowdsourced benchmarking in the field of data analysis in health.

The aim of a scientific data challenge is to improve the state-of-the-art from a quantitative reference point. In the field of methodological development for health data analysis, HADACA is seeking to provide a formal comparison of performance between new algorithms and state-of-the-art methods.

To carry out these methodological assessments, HADACA brings together scientists from a variety of disciplines to tackle a specific challenge. During the week-long conference, participants brainstorm and work together to solve the problem posed by the organisers. Teams compete against each other and then share their solutions publicly, so that all the participants can move on to the next stage together. Unlike classical workshops, HADACA challenges result in guidelines and scientific publications which are of use to the community. Offering authorship to competing teams, along with participation in manuscript design and writing, is a strong incentive that provides international visibility and recognition to participants.

HADACA challenges are a recurrent event: the 1st edition took place in 2018 in partnership with the Data Institute of University Grenoble-Alpes, and the 2nd edition took place in 2019 in partnership with the Ligue contre le Cancer, sponsored by EIT Health. The organization of the 3rd edition was delayed by the COVID pandemic. It is now scheduled for November 2024, in partnership with the M4DI project, an axis of the PEPR Santé Numérique of the Plan Innovation Santé 2030.

The official website: [hadaca3.sciencesconf.org](https://hadaca3.sciencesconf.org/)
3 changes: 1 addition & 2 deletions ingestion_program/sub_ingestion.R
@@ -54,8 +54,7 @@ total_time <- 0

predi_list = list()
for (dataset_name in 1:nb_datasets){
-# dir_name = dir_name = paste0(input,.Platform$file.sep,"input_data", .Platform$file.sep,"input_data_",toString( dataset_name),.Platform$file.sep)
-dir_name = dir_name = paste0(input,.Platform$file.sep,"input_data_",toString( dataset_name),.Platform$file.sep)
+dir_name = paste0(input,.Platform$file.sep,"input_data_",toString( dataset_name),.Platform$file.sep)
print(paste0("generating prediction for dataset:",toString(dataset_name) ))


1 change: 1 addition & 0 deletions phase-0-smoothie/bundle/FAQ.md
@@ -0,0 +1 @@
To complete
Binary file added phase-0-smoothie/bundle/HADACA3_com.png
24 changes: 24 additions & 0 deletions phase-0-smoothie/bundle/aim-of-the-challenge.md
@@ -0,0 +1,24 @@
## Introduction

Cellular heterogeneity in biological samples is a key factor that determines disease progression, but also influences biomedical analysis of samples and patient classification.

At the molecular level, the cellular composition of tissues is difficult to assess and quantify, as it is hidden within the bulk molecular profiles of samples (average profile of millions of cells), with all cells present in the tissue contributing to the recorded signal. Despite great promise, conventional computational approaches to quantifying cellular heterogeneity from mixtures of cells have encountered difficulties in providing robust and biologically relevant estimates.

Here, our focus will be on reference-based approaches, which are gaining increasing popularity. While each method presents its own set of advantages and limitations, all are inherently constrained by the quality of the reference data employed. We hypothesize that existing algorithms could be enhanced by leveraging multimodal data integration to improve the quality of references.

The objective of the HADACA3 challenge will be **to enhance existing cell-type deconvolution models by integrating multimodal datasets as reference data.**

## Program

**Phase 0:** Toy deconvolution challenge to test the Codabench framework and become familiar with the platform. A toy dataset representing a smoothie is used.

**Phase 1:** Second toy deconvolution challenge, using a dataset close to the challenge's target dataset.

**Phase 2:** Estimation of cell type heterogeneity from matching bulk methylomes and transcriptomes of pancreatic adenocarcinoma, using the following reference profiles for five different cell types (endothelial cells, fibroblasts, immune cells, basal-like cancer cells, classic-like cancer cells):

- bulk RNA-seq references
- DNAm references
- single-cell RNA-seq profiles

**Phase 3:** Auto-migration of the best phase 2 methods and evaluation on a previously unseen validation dataset.
11 changes: 11 additions & 0 deletions phase-0-smoothie/bundle/baseline.md
@@ -0,0 +1,11 @@
The baseline can be found in the `submission_script.R` file contained in the `starting_kit`.

## NNLS baseline

We propose a baseline that executes the following steps:

[1] Run the NNLS deconvolution algorithm on the RNA mix, using the bulk RNA-seq reference data, to generate an estimate of the proportion matrix.

[2] Run the NNLS deconvolution algorithm on the methylation mix, using the bulk methylation reference data, to generate a second estimate of the proportion matrix.

[3] Average the two estimates to generate a prediction of the proportion matrix.
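
The three steps above can be sketched in base R on toy data. This is a minimal sketch under stated assumptions, not the code shipped in `submission_script.R`: the helper names `nnls_fit` and `deconvolve`, the toy matrices, and the rescaling of each sample to sum to 1 are all hypothetical, and `optim` with `method = "L-BFGS-B"` and a lower bound of 0 stands in for a dedicated NNLS solver.

```r
# Toy sketch of the NNLS baseline (hypothetical helpers, base R only).
set.seed(1)

# Solve min ||A x - b||^2 subject to x >= 0 with a bound-constrained optimiser.
# Assumption: this replaces a dedicated NNLS routine for illustration purposes.
nnls_fit <- function(A, b) {
  AtA <- crossprod(A)       # t(A) %*% A
  Atb <- crossprod(A, b)    # t(A) %*% b
  fn <- function(x) sum((A %*% x - b)^2)
  gr <- function(x) as.vector(2 * (AtA %*% x - Atb))
  fit <- optim(rep(1 / ncol(A), ncol(A)), fn, gr,
               method = "L-BFGS-B", lower = 0)
  fit$par
}

# Deconvolve every sample (column) of a mix against a reference matrix,
# then rescale each sample so that its proportions sum to 1 (assumption).
deconvolve <- function(mix, ref) {
  prop <- apply(mix, 2, function(b) nnls_fit(ref, b))
  rownames(prop) <- colnames(ref)
  sweep(prop, 2, colSums(prop), "/")
}

# Toy references and mixes: features in rows, cell types or samples in columns.
cell_types <- c("endo", "fibro", "immune", "classic", "basal")
k <- length(cell_types)
n <- 4
ref_rna <- matrix(runif(50 * k), ncol = k, dimnames = list(NULL, cell_types))
ref_met <- matrix(runif(80 * k), ncol = k, dimnames = list(NULL, cell_types))
truth <- matrix(rexp(k * n), nrow = k)
truth <- sweep(truth, 2, colSums(truth), "/")
mix_rna <- ref_rna %*% truth
mix_met <- ref_met %*% truth

prop_rna <- deconvolve(mix_rna, ref_rna)   # step [1]
prop_met <- deconvolve(mix_met, ref_met)   # step [2]
prediction <- (prop_rna + prop_met) / 2    # step [3]
```

Because the toy mixes are noiseless, `prediction` should stay close to `truth`; on real data the two modalities will generally disagree, which is what the averaging step is meant to smooth out.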
84 changes: 84 additions & 0 deletions phase-0-smoothie/bundle/competition.yaml
@@ -0,0 +1,84 @@
version: '2'
title: Smoothie deconvolution
# docker_image: hombergn/hadaca3_light
docker_image: hombergn/hadaca3_pyr
queue: null
description: HADACA3
registration_auto_approve: true
enable_detailed_results: true
image: HADACA3_com.png
terms: terms.md
make_programs_available: true
make_input_data_available: true
pages:
  - title: Overview
    file: overview.md
  - title: Aim of the challenge
    file: aim-of-the-challenge.md
  - title: How do I start?
    file: how-do-i-start.md
  - title: Data
    file: data.md
  - title: Baseline
    file: baseline.md
  - title: Submission
    file: submission.md
  - title: Evaluation and scoring
    file: evaluation-and-scoring.md
  - title: FAQ
    file: FAQ.md
  - title: Organization
    file: organization.md
  - title: Sponsors
    file: sponsors.md
tasks:
  - index: 0
    name: LIGHT Cell type proportion estimation from transcriptome data1
    description: LIGHT COMETH data challenge
    reference_data: ground_truth.zip #rename to ground_truth!
    input_data: input_data.zip
    scoring_program: scoring_program.zip
    ingestion_program: ingestion_program.zip
solutions: []
phases:
  - index: 0
    name: Cell type proportion estimation from transcriptome and methylome data, using multimodal references.
    description: good luck
    start: '2024-03-20 '
    # end: '2024-07-20 11:00'
    max_submissions_per_day: 5
    max_submissions: 100
    execution_time_limit: 600
    auto_migrate_to_this_phase: false
    hide_output: false
    starting_kit: starting_kit.zip
    tasks:
      - 0

# Fact sheets to add more information in the leaderboard
fact_sheet: {
  "method_name": {
    "key": "method_name",
    "type": "text",
    "title": "Method name",
    "selection": "",
    "is_required": "false",
    "is_on_leaderboard": "true"
  }
}
leaderboards:
  - index: 0
    title: Scores
    key: score
    hidden: false
    columns:
      - title: Accuracy_mean
        key: Accuracy_mean
        index: 1
        sorting: desc
        hidden: false
      - title: Execution Time global
        key: Time
        index: 2
        sorting: desc
        hidden: false
76 changes: 76 additions & 0 deletions phase-0-smoothie/bundle/data.md
@@ -0,0 +1,76 @@
## Data source

Public data : *to be completed later*

Private data : *to be completed later*

## Data generation

The cell-type proportion matrices (ground truth) are drawn from a Dirichlet distribution.
Simulated mixes are obtained by convolving (matrix-multiplying) the reference matrix of the corresponding omic data with the cell-type proportion matrix.
Finally, Gaussian noise is added to the matrix of convoluted methylation profiles.

    # Generate cell-type proportions from a Dirichlet distribution,
    # with n the number of samples and alpha a vector of target proportions.
    # rdirichlet() returns one sample per row, so transpose to get cell types in rows.
    ground_truth = t(gtools::rdirichlet(n = n, alpha = alpha))
    # Convolution of references and proportions (features x samples)
    mix_rna = reference_rna %*% ground_truth
    # Gaussian noise, with data the simulated mixes and mean, sd the noise parameters
    noise = matrix(rnorm(prod(dim(data)), mean = mean, sd = sd), nrow = nrow(data))
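
The generation procedure described in this section can be run end to end on toy data. The sketch below is illustrative only: `rdirichlet_base` is a hypothetical base-R stand-in for `gtools::rdirichlet` (normalised Gamma draws), and the sizes, `alpha` values and noise parameters are made-up toy values, not the ones used to build the challenge data.

```r
# Toy end-to-end simulation of noisy methylation mixes (base R only).
set.seed(42)

# Hypothetical stand-in for gtools::rdirichlet: normalised independent Gamma draws.
# Returns an n x length(alpha) matrix with one sample per row.
rdirichlet_base <- function(n, alpha) {
  k <- length(alpha)
  g <- matrix(rgamma(n * k, shape = rep(alpha, each = n)), nrow = n)
  g / rowSums(g)
}

n_samples <- 30
n_features <- 100
alpha <- c(endo = 1, fibro = 2, immune = 2, classic = 3, basal = 2)  # toy values

# Toy methylation reference: features in rows, cell types in columns.
reference_met <- matrix(runif(n_features * length(alpha)), nrow = n_features,
                        dimnames = list(NULL, names(alpha)))

# Ground truth with cell types in rows and samples in columns.
ground_truth <- t(rdirichlet_base(n_samples, alpha))
rownames(ground_truth) <- names(alpha)

# Convolution of the reference with the proportions, then additive Gaussian noise.
mix_met <- reference_met %*% ground_truth
noise <- matrix(rnorm(prod(dim(mix_met)), mean = 0, sd = 0.05),
                nrow = nrow(mix_met))
noisy_mix_met <- mix_met + noise
```

Each column of `ground_truth` sums to 1, so every simulated sample is a convex combination of the reference profiles before noise is added. Note that this sketch does not clip the noisy values back into the [0, 1] range.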

## Data description

### Phase 1 :

*to be completed later*

### Phase 2 :

- `mixes_data.rds`, a list of matching DNA methylation and RNA-seq bulk data for 30 samples

      # read mixes data
      > mixes = readRDS("mixes_data.rds")
      > dim(mixes$mix_rna)
      [1] 18749    30
      > dim(mixes$mix_met)
      [1] 824678     30


- `reference_data.rds`, a list of 2 bulk references (RNA and methylation) plus single-cell count data and its associated metadata

      # read reference data
      > references = readRDS("reference_data.rds")

      # format of bulk RNA references
      > colnames(references$ref_bulkRNA)
      [1] "endo"    "fibro"   "immune"  "classic" "basal"
      > dim(references$ref_bulkRNA)
      [1] 18749     5

      # format of methylome references
      > colnames(references$ref_met)
      [1] "endo"    "fibro"   "immune"  "classic" "basal"
      > dim(references$ref_met)
      [1] 824678      5

      # format of scRNAseq references
      > dim(references$scRNAseq$counts)   # 23376 genes x 20146 cells
      [1] 23376 20146
      > dim(references$scRNAseq$metadata) # cell labels
      [1] 20146     1
      > table(references$scRNAseq$metadata[,1])
        basal classic    endo   fibro  immune
         2036    2178    8874    3946    3112
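
Before running a method, it can help to check that the loaded objects are mutually consistent. The sketch below is a hypothetical helper, not part of the starting kit: `check_inputs` only relies on the element names and shapes listed above, and it is exercised here on small mock objects rather than on the real `.rds` files. The mock metadata is a one-column data frame, which is an assumption about its type.

```r
# Hypothetical sanity checks on the structure of the mixes and references.
check_inputs <- function(mixes, references) {
  stopifnot(
    # mixes and bulk references share the same features (rows)
    nrow(mixes$mix_rna) == nrow(references$ref_bulkRNA),
    nrow(mixes$mix_met) == nrow(references$ref_met),
    # RNA and methylation mixes describe the same samples (columns)
    ncol(mixes$mix_rna) == ncol(mixes$mix_met),
    # both bulk references list the cell types in the same order
    identical(colnames(references$ref_bulkRNA), colnames(references$ref_met)),
    # one metadata row per single cell
    ncol(references$scRNAseq$counts) == nrow(references$scRNAseq$metadata)
  )
  invisible(TRUE)
}

# Small mock objects mirroring the layout described above (toy sizes).
cell_types <- c("endo", "fibro", "immune", "classic", "basal")
mixes <- list(mix_rna = matrix(0, 20, 3), mix_met = matrix(0, 40, 3))
references <- list(
  ref_bulkRNA = matrix(0, 20, 5, dimnames = list(NULL, cell_types)),
  ref_met = matrix(0, 40, 5, dimnames = list(NULL, cell_types)),
  scRNAseq = list(counts = matrix(0, 25, 10),
                  metadata = data.frame(cell_type = rep(cell_types, 2)))
)
ok <- check_inputs(mixes, references)
```

With the real data, the same call would be made on the lists returned by `readRDS("mixes_data.rds")` and `readRDS("reference_data.rds")`.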


### Phase 3 :

The validation dataset of phase 3 is kept private to avoid overfitting.

## Data Download

To download the dataset for this project, follow these steps:

- Go to the challenge page,
- Go to the *Get started* menu,
- Click on the *Files* tab,
- Download the `starting_kit`.
3 changes: 3 additions & 0 deletions phase-0-smoothie/bundle/evaluation-and-scoring.md
@@ -0,0 +1,3 @@
## How is the scoring metric computed?

The **score** (discriminating metric) is computed on the estimated proportion matrix. The metric is a combination of row-correlation, column-correlation and mean absolute error between the ground_truth and the estimate provided by the participants.
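
The three ingredients named above can be computed as follows. This is a sketch under stated assumptions, not the scoring program itself: the function name `score_components`, the use of Pearson correlation, and the averaging over rows and columns are all assumptions, and the way the organisers combine the three values into the final score is not reproduced here.

```r
# Hypothetical computation of the three components of the score.
# truth and estimate: cell types in rows, samples in columns.
score_components <- function(truth, estimate) {
  stopifnot(all(dim(truth) == dim(estimate)))
  # correlation across samples, one value per cell type (row)
  row_cor <- sapply(seq_len(nrow(truth)),
                    function(i) cor(truth[i, ], estimate[i, ]))
  # correlation across cell types, one value per sample (column)
  col_cor <- sapply(seq_len(ncol(truth)),
                    function(j) cor(truth[, j], estimate[, j]))
  c(row_correlation = mean(row_cor),
    column_correlation = mean(col_cor),
    mean_absolute_error = mean(abs(truth - estimate)))
}

# Toy ground truth (5 cell types, 10 samples) and a slightly perturbed estimate.
set.seed(7)
truth <- matrix(rexp(5 * 10), nrow = 5)
truth <- sweep(truth, 2, colSums(truth), "/")
estimate <- pmax(truth + matrix(rnorm(50, sd = 0.02), nrow = 5), 0)
estimate <- sweep(estimate, 2, colSums(estimate), "/")

components <- score_components(truth, estimate)
perfect <- score_components(truth, truth)
```

A perfect estimate gives both correlations equal to 1 and a mean absolute error of 0, which makes `perfect` a convenient reference point when reading `components`.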
13 changes: 13 additions & 0 deletions phase-0-smoothie/bundle/how-do-i-start.md
@@ -0,0 +1,13 @@
## Overview of the process

[1] Go to the challenge webpage on the Codabench platform, navigate to the *Get started* menu, and select the *Files* tab. Download the `starting_kit` corresponding to the phase you are participating in.

[2] Get familiar with the dataset description (*Data* tab) and the baseline methods suggested by the organizers (*Baseline* tab).

[3] Make a first submission using the baseline method (follow the guidelines in the *Submission* tab).

[4] Check your score (more information on the scoring and ranking pipelines can be found in the *Evaluation and scoring* tab).

[5] Improve your method by editing the `submission_script.R` contained in the `starting_kit`, then resubmit your program to improve your score.


37 changes: 37 additions & 0 deletions phase-0-smoothie/bundle/organization.md
@@ -0,0 +1,37 @@
HADACA 3rd ed. is organized in close collaboration between the **M4DI project, an axis of the PEPR Santé Numérique**, and the **ITMO Cancer Aviesan (2009-2023, project ACACIA)**.

M4DI is dedicated to Methods and models for multimodal and multi-scale data integration.

ACACIA (Artificial intelligence on multi-omics data to study tumor heterogeneity and develop clinical classifiers) is funded by the ITMO Cancer Aviesan call "Interdisciplinary approaches to oncogenic processes and therapeutic perspectives: Contributions of mathematics and informatics to oncology (MIC)".

## Scientific committee :

The role of the scientific committee is to precisely define the scientific question, select appropriate datasets and evaluation metrics, and design the challenge.

- A. Baudot (researcher in computational biology, MMG, Marseille, France),
- Y. Blum (researcher in computational biology, IGDR, Rennes, France),
- D. Causeur (professor in statistics, IRMAR, Rennes, France),
- S. Dejean (researcher in statistics, IMT, Toulouse, France),
- C. Lecellier (researcher in genomics, IGMM, Montpellier, France),
- M. Richard (researcher in computational biology, TIMC, Grenoble, France),
- P. Roy (professor in biostatistics, LBBE, Lyon, France),
- M. Térézol (researcher in bioinformatics, MMG, Marseille, France).


In addition, 3 external experts have joined the committee:

- Carl Herrmann (Professor in computational biology, Heidelberg University, Germany),
- Lionel Spinelli (researcher in computational biology),
- Franck Picard (researcher in statistics, LBMC, Lyon).

## Technical committee :

The organization and implementation of the data challenge are coordinated by:

- M. Richard,
- Y. Blum,
- F. Chuffart (IR bioinfo, INSERM),
- L. Lamothe (IR bioinfo, CNRS),
- N. Homberg (IR bioinfo, Univ Grenoble-Alpes),
- M. Térézol (IE bioinfo, CNRS), and
- H. Barbot (PhD student, co-supervised by M. Richard and Y. Blum).
9 changes: 9 additions & 0 deletions phase-0-smoothie/bundle/overview.md
@@ -0,0 +1,9 @@
This toy deconvolution challenge is an introduction to the Health Data Challenge (HADACA), a series of data challenges aiming to contribute to scientific crowdsourced benchmarking in the field of data analysis in health.

The aim of a scientific data challenge is to improve the state-of-the-art from a quantitative reference point. In the field of methodological development for health data analysis, HADACA is seeking to provide a formal comparison of performance between new algorithms and state-of-the-art methods.

To carry out these methodological assessments, HADACA brings together scientists from a variety of disciplines to tackle a specific challenge. During the week-long conference, participants brainstorm and work together to solve the problem posed by the organisers. Teams compete against each other and then share their solutions publicly, so that all the participants can move on to the next stage together. Unlike classical workshops, HADACA challenges result in guidelines and scientific publications which are of use to the community. Offering authorship to competing teams, along with participation in manuscript design and writing, is a strong incentive that provides international visibility and recognition to participants.

HADACA challenges are a recurrent event: the 1st edition took place in 2018 in partnership with the Data Institute of University Grenoble-Alpes, and the 2nd edition took place in 2019 in partnership with the Ligue contre le Cancer, sponsored by EIT Health. The organization of the 3rd edition was delayed by the COVID pandemic. It is now scheduled for December 2024, in partnership with the M4DI project, an axis of the PEPR Santé Numérique of the Plan Innovation Santé 2030.

The official website: [hadaca3.sciencesconf.org](https://hadaca3.sciencesconf.org/)
11 changes: 11 additions & 0 deletions phase-0-smoothie/bundle/sponsors.md
@@ -0,0 +1,11 @@
With support from ITMO Cancer of Aviesan within the framework of the 2021-2030 Cancer Control Strategy, on funds administered by Inserm

With support from M4DI, PEPR Santé Numérique

With support from LabEx PERSYVAL-2 (Grenoble-Alpes University)

With support from RT Math Bio Santé (CNRS)

With support from GRICAD Mesocentre (Grenoble-Alpes University)

With support from RIS (CNRS MITI)
59 changes: 59 additions & 0 deletions phase-0-smoothie/bundle/submission.md
@@ -0,0 +1,59 @@
## How to generate a prediction of the data?

[1] On your local machine, unzip `starting_kit.zip`, then open R in the `starting_kit` directory (e.g. open `submission_script.R` with RStudio).

The unzipped `starting_kit` directory now contains:

- A `submission_script.R` -> *to modify and to use to submit your code*
- The `reference_data.rds` -> *reference data, i.e. typical molecular profiles of expected cell types*
- The `mixes_data.rds` -> *mixes from which you will estimate cell type proportions (matching RNA and DNA methylation data)*

[2] In the R console, launch the following command (or run `submission_script.R` in RStudio):

    source("submission_script.R")

[3] The code in `submission_script.R` generates the following files:

- `zip_program` -> *for code submission, script format*
- `zip_results` -> *for result submission, table format*

Edit the `submission_script.R` to replace the baseline method by the method of your choice.

Edit the code inside the following chunk (i.e. the `program` function):

    ##
    ## YOUR CODE BEGINS HERE
    ##

    ##
    ## YOUR CODE ENDS HERE
    ##



## How to submit your results?

Now, submit your code (`zip_program`) or your results (`zip_results`) in the *My Submission* menu of the challenge.

On the *My Submission* webpage, the STATUS of your submission will go through the following steps:

Submitting > Submitted > Running > Finished

## How to see your score?

To view your score, go to the challenge page and navigate to the Leaderboard or Results section. There, you can see how your submission ranks and compare your score with those of other participants.

[1] Go to the *My Submission* menu.

[2] When the status of your submission is Finished (don't forget to refresh the page to update the status), click on the green button 'add to leaderboard' to see your score.

By clicking on your submission in the submissions summary table, you will have access to:

- details of your submission (downloaded):
  - submitted files,
  - prediction results (ingestion output),
  - scoring results (scoring outputs),
- some execution logs,
- a submission metadata edition menu.

[3] Check the leaderboard in the *Results* menu.

8 changes: 8 additions & 0 deletions phase-0-smoothie/bundle/terms.md
@@ -0,0 +1,8 @@
By participating in this challenge, you agree to publicly share your submissions.

You may make up to 5 submissions per day and 100 in total.

This challenge is governed by the general [ChaLearn contest rules](https://www.causality.inf.ethz.ch/GeneralChalearnContestRuleTerms.html).


