create bundle for the smoothie challenge

bcm-uga · Jul 31, 2024 · f8cf49d · f8cf49d
1 parent f1d0479
commit f8cf49d
Showing 29 changed files with 1,793 additions and 7 deletions.
diff --git a/README.md b/README.md
@@ -105,6 +105,7 @@ sudo docker login -u  hombergn
 
 #upload on dockerhub
 sudo docker push hombergn/hadaca3_light:latest
+sudo docker push hombergn/hadaca3_pyr:latest
 
 #Single command to build and push. 
 sudo docker build -t hombergn/hadaca3_light .  && sudo docker push hombergn/hadaca3_light:latest

diff --git a/bundle/overview.md b/bundle/overview.md
@@ -1,7 +1,9 @@
-Health data challenge (HADACA) is a serie of data challenge aiming to contribute to scientific crowdsourced benchmarking in the field of data analysis in health.
+Health data challenge (HADACA) is a series of data challenges aiming to contribute to scientific crowdsourced benchmarking in the field of data analysis in health.
 
 The aim of a scientific data challenge is to improve the state-of-the-art from a quantitative reference point. In the field of methodological development for health data analysis, HADACA is seeking to provide a formal comparison of performance between new algorithms and state-of-the-art methods.
 
 To carry out these methodological assessments, HADACA brings together scientists from a variety of disciplines to tackle a specific challenge. During the week-long conference, participants brainstorm and work together to solve the problem posed by the organisers. Teams compete against each other and then share their solution publicly, so that all the participants can move on to the next stage together. Contrarily to classical workshops, HADACA challenges result in guidelines and scientific publications which are of use to the community. Offering authorship to competing teams, along with participation in manuscript design and writing, is a strong incentive that provides international visibility and recognition to participants.
 
 HADACA challenges is a reccurent event, 1st edition occurred in 2018 in partnership with the Data Institute of University Grenoble-Alpes, 2nd edition occurred in 2019 in partnership with the Ligue contre le Cancer and sponsored by the EIT Health. 3rd edition organization was delayed by the COVID pandemic. It is now scheduled to November 2024, in partnership with the M4DI project, an axis of the PEPR Santé Numérique of the Plan Innovation Santé 2030.
+
+The official website: [hadaca3.sciencesconf.org](https://hadaca3.sciencesconf.org/)
diff --git a/ingestion_program/sub_ingestion.R b/ingestion_program/sub_ingestion.R
@@ -54,8 +54,7 @@ total_time <- 0
 
 predi_list = list()
 for (dataset_name in 1:nb_datasets){
-  # dir_name = dir_name = paste0(input,.Platform$file.sep,"input_data", .Platform$file.sep,"input_data_",toString( dataset_name),.Platform$file.sep)
-  dir_name = dir_name = paste0(input,.Platform$file.sep,"input_data_",toString( dataset_name),.Platform$file.sep)
+  dir_name = paste0(input,.Platform$file.sep,"input_data_",toString( dataset_name),.Platform$file.sep)
   print(paste0("generating prediction for dataset:",toString(dataset_name) ))
 
 

diff --git a/phase-0-smoothie/bundle/FAQ.md b/phase-0-smoothie/bundle/FAQ.md
@@ -0,0 +1 @@
+To complete
diff --git a/phase-0-smoothie/bundle/HADACA3_com.png b/phase-0-smoothie/bundle/HADACA3_com.png
diff --git a/phase-0-smoothie/bundle/aim-of-the-challenge.md b/phase-0-smoothie/bundle/aim-of-the-challenge.md
@@ -0,0 +1,24 @@
+## Introduction
+
+Cellular heterogeneity in biological samples is a key factor that determines disease progression, but also influences biomedical analysis of samples and patient classification. 
+
+At the molecular level, the cellular composition of tissues is difficult to assess and quantify, as it is hidden within the bulk molecular profiles of samples (average profile of millions of cells), with all cells present in the tissue contributing to the recorded signal. Despite great promise, conventional computational approaches to quantifying cellular heterogeneity from mixtures of cells have encountered difficulties in providing robust and biologically relevant estimates.
+
+Here, our focus will be on reference-based approaches, which are gaining increasing popularity. While each method presents its own set of advantages and limitations, all are inherently constrained by the quality of the reference data employed. We hypothesize that existing algorithms could be enhanced by leveraging multimodal data integration to improve the quality of references.
+
+The objective of the HADACA3 challenge will be **to enhance existing cell-type deconvolution models by integrating multimodal datasets as reference data.**
+
+## Program
+
+**Phase 0:** Toy deconvolution challenge to test the Codabench framework and familiarize participants with the platform. A toy dataset representing a smoothie is used.   
+
+**Phase 1:** Second toy deconvolution challenge aiming to handle a dataset close to the challenge target dataset.
+
+		
+**Phase 2:** Estimation of cell type heterogeneity from pancreatic adenocarcinoma matching bulk methylomes and transcriptomes using the following reference profiles, for five different cell types (endothelial cells, fibroblasts, immune cells, cancer cells basal-like, cancer cells classic-like):
+
+- bulk RNA-seq references
+- DNAm references 
+- single-cell RNA-seq profiles 
+
+**Phase 3:** Auto-migration of the best methods from phase 2 and evaluation on a previously unseen validation dataset.
diff --git a/phase-0-smoothie/bundle/baseline.md b/phase-0-smoothie/bundle/baseline.md
@@ -0,0 +1,11 @@
+The baseline can be found in the `submission_script.R` contained in the `starting_kit`.
+
+## NNLS baseline
+
+We propose a baseline that executes the following steps:
+
+[1] Run the NNLS deconvolution algorithm on RNA mix using bulk RNA-seq reference data to generate an estimate of the proportion matrix.
+
+[2] Run the NNLS deconvolution algorithm on methylation mix using bulk methylation reference data to generate an estimate of the proportion matrix.
+
+[3] Average the two estimates to generate a prediction of the proportion matrix.
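The three baseline steps listed in `baseline.md` can be sketched roughly as follows. This is a minimal illustration, not the code shipped in `submission_script.R` (which is not part of this diff). Assumptions: the helper names `nnls_solve` and `estimate_props`, the toy matrices, and the rescaling of each estimate to sum to 1 are all hypothetical, and base R `optim` with a lower bound of zero is used as a stand-in for a dedicated NNLS solver so that the example has no package dependency.

```r
# Non-negative least squares for one sample: minimise ||A x - b||^2 with x >= 0.
# Stand-in solver (assumption): bounded L-BFGS-B from base R, not a dedicated NNLS routine.
nnls_solve <- function(A, b) {
  fn <- function(x) sum((A %*% x - b)^2)
  gr <- function(x) as.vector(2 * crossprod(A, A %*% x - b))
  optim(rep(1 / ncol(A), ncol(A)), fn, gr, method = "L-BFGS-B", lower = 0)$par
}

# Estimate a (cell types x samples) proportion matrix, one mix column at a time.
estimate_props <- function(mix, ref) {
  props <- apply(mix, 2, function(b) {
    coefs <- nnls_solve(ref, b)
    total <- sum(coefs)
    if (total > 0) coefs / total else coefs  # assumed rescaling to proportions
  })
  rownames(props) <- colnames(ref)
  props
}

# Hypothetical toy data: 3 cell types, 2 samples, noise-free mixes.
set.seed(1)
cell_types <- c("A", "B", "C")
ref_rna <- matrix(runif(50 * 3), nrow = 50, dimnames = list(NULL, cell_types))
ref_met <- matrix(runif(80 * 3), nrow = 80, dimnames = list(NULL, cell_types))
truth <- matrix(c(0.2, 0.3, 0.5,
                  0.6, 0.1, 0.3), nrow = 3, dimnames = list(cell_types, NULL))
mix_rna <- ref_rna %*% truth
mix_met <- ref_met %*% truth

# [1] RNA estimate, [2] methylation estimate, [3] average of the two.
pred_rna <- estimate_props(mix_rna, ref_rna)
pred_met <- estimate_props(mix_met, ref_met)
pred <- (pred_rna + pred_met) / 2
```

Averaging two estimates that each sum to 1 per sample keeps the averaged columns summing to 1, which is why the rescaling is applied before step [3] in this sketch.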
diff --git a/phase-0-smoothie/bundle/competition.yaml b/phase-0-smoothie/bundle/competition.yaml
@@ -0,0 +1,84 @@
+version: '2'
+title: Smoothie deconvolution
+# docker_image: hombergn/hadaca3_light
+docker_image: hombergn/hadaca3_pyr
+queue: null
+description: HADACA3
+registration_auto_approve: true
+enable_detailed_results: true
+image: HADACA3_com.png
+terms: terms.md
+make_programs_available: true
+make_input_data_available: true
+pages:
+- title: Overview
+  file: overview.md
+- title: Aim of the challenge
+  file: aim-of-the-challenge.md
+- title: How do I start?
+  file: how-do-i-start.md
+- title: Data
+  file: data.md
+- title: Baseline
+  file: baseline.md
+- title: Submission
+  file: submission.md
+- title: Evaluation and scoring
+  file: evaluation-and-scoring.md
+- title: FAQ
+  file: FAQ.md
+- title: Organization 
+  file: organization.md
+- title: Sponsors
+  file: sponsors.md  
+tasks:
+- index: 0
+  name: LIGHT Cell type proportion estimation from transcriptome data1
+  description: LIGHT COMETH data challenge
+  reference_data: ground_truth.zip  #rename to ground_truth! 
+  input_data: input_data.zip
+  scoring_program: scoring_program.zip
+  ingestion_program: ingestion_program.zip
+solutions: []
+phases:
+- index: 0
+  name: Cell type proportion estimation from transcriptome and methylome data, using multimodal references.
+  description: good luck 
+  start: '2024-03-20 '
+  # end: '2024-07-20 11:00'
+  max_submissions_per_day: 5
+  max_submissions: 100
+  execution_time_limit: 600
+  auto_migrate_to_this_phase: false
+  hide_output: false
+  starting_kit: starting_kit.zip
+  tasks:
+  - 0
+
+# Fact sheets to add more information in the leaderboard
+fact_sheet: {
+    "method_name": {
+        "key": "method_name",
+        "type": "text",
+        "title": "Method name",
+        "selection": "",
+        "is_required": "false",
+        "is_on_leaderboard": "true"
+    }
+}
+leaderboards:
+- index: 0
+  title: Scores
+  key: score
+  hidden: false
+  columns:
+  - title: Accuracy_mean
+    key: Accuracy_mean
+    index: 1
+    sorting: desc
+    hidden: false  
+  - title: Execution Time global
+    key: Time
+    index: 2
+    sorting: desc
+    hidden: false
diff --git a/phase-0-smoothie/bundle/data.md b/phase-0-smoothie/bundle/data.md
@@ -0,0 +1,76 @@
+## Data source
+
+Public data :  *to be completed later*
+
+Private data :  *to be completed later*
+
+## Data generation
+
+The cell-type proportion matrices (ground truth) are simulated by a Dirichlet distribution. 
+Simulated mixes were obtained using a convolution of the cell-type proportion matrix with the reference matrix of the corresponding omic data. 
+Finally, Gaussian noise was added to the matrix of convoluted methylation profiles.
+
+       # R function to generate cell-type proportion matrices using Dirichlet distribution
+       ground_truth = gtools::rdirichlet(n = n, alpha = alpha) # with n the number of samples to generate, and alpha a vector of targeted proportions; returns an n x length(alpha) matrix
+       # Convolution of references and proportions
+       mix_rna = reference_rna %*% t(ground_truth) # transpose so that cell types are in rows of the proportion matrix
+       # Function to generate gaussian noise
+       noise = matrix(rnorm(prod(dim(data)), mean = mean, sd = sd), nrow = nrow(data)) # with data corresponding to the simulated mixes, and mean and standard deviation (sd) representing the parameters of the noise
+
+## Data description
+
+### Phase 1 : 
+
+ *to be completed later*
+
+### Phase 2 : 
+
+- mixes_data.rds, a list of matching DNA methylation and RNA-seq bulk data, for 30 samples
+
+       # read mixes data
+       mixes = readRDS("mixes_data.rds")
+       dim(mixes$mix_rna)
+       [1] 18749    30
+       dim(mixes$mix_met)
+       [1] 824678     30
+
+
+- reference_data.rds, a list of 2 bulk references (RNA and methylation) and 1 single-cell count matrix with its associated metadata.
+
+       # read reference data
+       reference = readRDS("reference_data.rds")
+
+       # format of bulk RNA references
+       colnames(reference$ref_bulkRNA)
+       [1] "endo"    "fibro"   "immune"  "classic" "basal"
+       dim(reference$ref_bulkRNA)
+       [1] 18749     5
+
+       # format of methylome references
+       > colnames(reference$ref_met)
+       [1] "endo"    "fibro"   "immune"  "classic" "basal"  
+       > dim(reference$ref_met)
+       [1] 824678      5
+
+       # format of scRNAseq references
+       > dim(reference$scRNAseq$counts) # 23376 gene expression for 20146 cells
+       [1] 23376      20146
+       > dim(reference$scRNAseq$metadata) # cell labels
+       [1] 20146      1
+       > table(reference$scRNAseq$metadata[,1])
+       [1] basal classic    endo   fibro  immune 
+           2036    2178    8874    3946    3112 
+
+
+### Phase 3 : 
+
+The validation datasets of phase 3 are kept private to avoid overfitting.
+
+## Data Download
+
+To download the dataset for this project, follow these steps :
+
+ - Go to the challenge page,
+ - Go to the *Get started* menu,
+ - Click on the *Files* tab,
+ - Download the `starting_kit`.
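The simulation described under *Data generation* in `data.md` can be run end to end with the sketch below, which makes the matrix orientations explicit. Assumptions: the sizes, the `alpha` vector, the noise parameters and the random reference matrix are hypothetical placeholders, and Dirichlet draws are produced with normalised gamma variates in base R as a stand-in for `gtools::rdirichlet`.

```r
set.seed(42)
n_samples  <- 30
n_features <- 100                      # hypothetical; real references are far larger
cell_types <- c("endo", "fibro", "immune", "classic", "basal")
alpha      <- rep(1, length(cell_types))  # hypothetical targeted proportions

# Dirichlet draws: normalise independent gamma variates row by row.
# The matrix is filled column-wise, so each column gets its own shape parameter.
gam <- matrix(rgamma(n_samples * length(alpha),
                     shape = rep(alpha, each = n_samples)),
              nrow = n_samples)
props <- gam / rowSums(gam)            # samples x cell types, rows sum to 1

# Ground truth oriented as cell types x samples for the matrix product.
ground_truth <- t(props)
rownames(ground_truth) <- cell_types

# Hypothetical reference profiles: features x cell types.
reference <- matrix(runif(n_features * length(cell_types)),
                    nrow = n_features, dimnames = list(NULL, cell_types))

# Convolution of references and proportions: features x samples.
mix <- reference %*% ground_truth

# Additive Gaussian noise with hypothetical parameters.
noise <- matrix(rnorm(prod(dim(mix)), mean = 0, sd = 0.01), nrow = nrow(mix))
mix_noisy <- mix + noise
```

Keeping the proportion matrix as cell types x samples means each column of `mix` is the weighted sum of reference profiles for one sample.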
diff --git a/phase-0-smoothie/bundle/evaluation-and-scoring.md b/phase-0-smoothie/bundle/evaluation-and-scoring.md
@@ -0,0 +1,3 @@
+## How is the scoring metric computed?
+
+The **score** (discriminating metric) is computed on the estimated proportion matrix. The metric is a combination of row-correlation, column-correlation and mean absolute error between the ground_truth and the estimate provided by the participants.
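The three ingredients named in `evaluation-and-scoring.md` can be computed as in the sketch below. The source does not say how they are combined into the final score, nor which correlation is used, so this example only returns the components. Assumptions: the function name `score_components`, the use of Pearson correlation, and averaging the per-row and per-column correlations are all hypothetical choices.

```r
# Components of a deconvolution score for (cell types x samples) matrices.
# Assumption: Pearson correlation, averaged over rows and over columns.
score_components <- function(truth, estimate) {
  stopifnot(all(dim(truth) == dim(estimate)))
  row_cor <- mean(sapply(seq_len(nrow(truth)),
                         function(i) cor(truth[i, ], estimate[i, ])))
  col_cor <- mean(sapply(seq_len(ncol(truth)),
                         function(j) cor(truth[, j], estimate[, j])))
  mae <- mean(abs(truth - estimate))
  c(row_cor = row_cor, col_cor = col_cor, mae = mae)
}

# Hypothetical 3 cell types x 3 samples ground truth.
truth <- matrix(c(0.2, 0.3, 0.5,
                  0.6, 0.1, 0.3,
                  0.1, 0.7, 0.2), nrow = 3)

# A perfect estimate gives correlations of 1 and an error of 0.
perfect <- score_components(truth, truth)

# A uniform shift keeps both correlations at 1 but raises the error,
# which is why an absolute-error term complements the correlations.
shifted <- score_components(truth, truth + 0.05)
```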
diff --git a/phase-0-smoothie/bundle/how-do-i-start.md b/phase-0-smoothie/bundle/how-do-i-start.md
@@ -0,0 +1,13 @@
+## Overview of the process
+
+[1] Go to the challenge webpage on Codabench platform, navigate to the *Get started* menu, and select the *Files* tab. Download the `starting_kit` corresponding to the phase you are participating in.
+
+[2] Get familiar with the dataset description (*Data* tab) and the baseline methods suggested by the organizer (*Baseline* tab).
+
+[3] Make a first submission using the baseline method (follow the guidelines in the *Submission* tab)
+
+[4] Check your score (more information on the scoring and ranking pipelines can be found on the *Evaluation and scoring* tab).
+
+[5] Improve your method by editing the `submission_script.R` contained in the `starting_kit` and resubmit your program to enhance your score.
+
+
diff --git a/phase-0-smoothie/bundle/organization.md b/phase-0-smoothie/bundle/organization.md
@@ -0,0 +1,37 @@
+HADACA 3rd ed. is organized in close collaboration between the **M4DI project, an axis of the PEPR Santé Numérique**, and the **ITMO Cancer Aviesan (2009-2023, project ACACIA)**. 
+
+M4DI is dedicated to Methods and models for multimodal and multi-scale data integration. 
+
+ACACIA (Artificial intelligence on multi-omics data to study tumor heterogeneity and develop clinical classifiers) is funded by the ITMO Cancer Aviesan call "Interdisciplinary approaches to oncogenic processes and therapeutic perspectives: Contributions of mathematics and informatics to oncology (MIC)".  
+
+## Scientific committee : 
+
+The role of the scientific committee is to precisely define the scientific question, select appropriate datasets and evaluation metrics, and design the challenge. 
+
+-  A. Baudot (researcher in computational biology, MMG, Marseille, France),
+-  Y. Blum (researcher in computational biology, IGDR, Rennes, France),
+-  D. Causeur (professor in statistics, IRMAR, Rennes, France),
+-  S. Dejean (researcher in statistics, IMT, Toulouse, France),
+-  C. Lecellier (researcher in genomics, IGMM, Montpellier, France),
+-  M. Richard (researcher in computational biology, TIMC, Grenoble, France),
+-  P. Roy (professor in biostatistics, LBBE, Lyon, France),
+-  M. Térézol (researcher in bioinformatics, MMG, Marseille, France).
+
+
+In addition, 3 external experts have joined the committee : 
+
+- Carl Herrmann (Professor in computational biology, Heidelberg University, Germany),
+- Lionel Spinelli (researcher in computational biology),
+- Franck Picard (researcher in statistics, LBMC, Lyon).
+
+## Technical committee : 
+
+The organization and implementation of the data challenge is coordinated by :
+
+- M. Richard 
+- Y. Blum
+- F. Chuffart (IR bioinfo INSERM)
+- L. Lamothe (IR bioinfo CNRS),
+- N. Homberg (IR bioinfo, Univ Grenoble-Alpes),
+- M. Térézol (IE bioinfo CNRS) and
+- H. Barbot (PhD student, co-supervised by M.Richard and Y. Blum).
diff --git a/phase-0-smoothie/bundle/overview.md b/phase-0-smoothie/bundle/overview.md
@@ -0,0 +1,9 @@
+This toy deconvolution challenge is an introduction to the Health data challenge (HADACA), which is a series of data challenges aiming to contribute to scientific crowdsourced benchmarking in the field of data analysis in health.
+
+The aim of a scientific data challenge is to improve the state-of-the-art from a quantitative reference point. In the field of methodological development for health data analysis, HADACA is seeking to provide a formal comparison of performance between new algorithms and state-of-the-art methods.
+
+To carry out these methodological assessments, HADACA brings together scientists from a variety of disciplines to tackle a specific challenge. During the week-long conference, participants brainstorm and work together to solve the problem posed by the organisers. Teams compete against each other and then share their solutions publicly, so that all the participants can move on to the next stage together. In contrast to classical workshops, HADACA challenges result in guidelines and scientific publications which are of use to the community. Offering authorship to competing teams, along with participation in manuscript design and writing, is a strong incentive that provides international visibility and recognition to participants.
+
+HADACA challenges are a recurrent event: the 1st edition took place in 2018 in partnership with the Data Institute of University Grenoble-Alpes, and the 2nd edition took place in 2019 in partnership with the Ligue contre le Cancer, sponsored by EIT Health. The organization of the 3rd edition was delayed by the COVID pandemic. It is now scheduled for December 2024, in partnership with the M4DI project, an axis of the PEPR Santé Numérique of the Plan Innovation Santé 2030.
+
+The official website: [hadaca3.sciencesconf.org](https://hadaca3.sciencesconf.org/)
diff --git a/phase-0-smoothie/bundle/sponsors.md b/phase-0-smoothie/bundle/sponsors.md
@@ -0,0 +1,11 @@
+With support from ITMO Cancer of Aviesan within the framework of the 2021-2030 Cancer Control Strategy, on funds administered by Inserm
+
+With support from M4DI, PEPR Santé Numérique
+
+With support from LabEx PERSYVAL-2 (Grenoble-Alpes University)
+
+With support from RT Math Bio Santé (CNRS)
+
+With support from GRICAD Mesocentre (Grenoble-Alpes University)
+
+With support from RIS (CNRS MITI)
diff --git a/phase-0-smoothie/bundle/submission.md b/phase-0-smoothie/bundle/submission.md
@@ -0,0 +1,59 @@
+## How to generate a prediction of the data?
+
+[1] On your local machine, unzip the starting_kit.zip. Then open R in the starting_kit directory (e.g. open submission_script.R with RStudio).
+
+The unzipped starting_kit directory now contains:
+
+- A `submission_script.R` -> *to modify and to use to submit your code*
+- The `reference_data.rds` -> *reference data, i.e. typical molecular profiles of expected cell types*
+- The `mixes_data.rds` -> *mixes from which you will estimate cell type proportions (matching RNA and DNA methylation data)*
+
+[2] In the R console launch the following command (or run the `submission_script.R` in RStudio):
+
+		source("submission_script.R")
+
+[3] The code of the  `submission_script.R`  generates the files:
+- `zip_program`  -> *for code submission, script format*
+- `zip_results`  -> *for result submission, table format*
+
+Edit the `submission_script.R` to replace the baseline method by the method of your choice. 
+
+Edit the code inside the following chunk (i.e. the `program` function):     
+		## 
+		## YOUR CODE BEGINS HERE 
+		##
+
+		##
+		## YOUR CODE ENDS HERE
+		## 
+
+
+
+## How to submit your results?
+
+Now, let’s submit your code (`zip_program`) or your result (`zip_results`) in the *My Submission* menu of the challenge.
+
+On the *My Submission* webpage, the STATUS of your submission will go through the following steps:
+ -> Submitting > Submitted > Running > Finished
+
+## How to see your score?
+
+To view your score, go to the challenge page and navigate to the Leaderboard or Results section. Here, you can see how your submission ranks and compare your score with other participants.
+
+[1] Go to the *My Submission* menu 
+
+[2] When the status of your submission is finished (don't forget to refresh the page to update the status), click on the green button 'add to leaderboard' to see your score
+
+By clicking on your submission in the submissions summary table, you will have access to:
+
+  - details of your submission (downloaded)
+	-> submitted files, 
+	-> prediction results (ingestion output) 
+	-> scoring results (scoring outputs) 
+			
+  - some execution logs
+
+  - a submission metadata edition menu
+
+[3] Check the leaderboard in the *Results*  menu
+
diff --git a/phase-0-smoothie/bundle/terms.md b/phase-0-smoothie/bundle/terms.md
@@ -0,0 +1,8 @@
+By participating in this challenge, you agree to publicly share your submissions.
+
+You may submit 5 submissions every day and 100 in total.
+
+This challenge is governed by the general [ChaLearn contest rules](https://www.causality.inf.ethz.ch/GeneralChalearnContestRuleTerms.html).
+
+
+