The Critical Assessment of Function Annotation (CAFA) is a community-wide challenge designed to provide a large-scale assessment of computational methods for predicting protein function.
More information can be found at http://biofunctionprediction.org/cafa/ and in the CAFA2 paper (Jiang et al., 2016).
This toolset provides an assessment for CAFA submissions based on precision and recall.
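As a rough illustration of what a precision and recall assessment involves, the sketch below computes protein-centric precision and recall at a single score threshold. It is not the code used in this repository: the function name `precision_recall`, the input layout and the example identifiers are assumptions made for illustration, and the sketch does not propagate terms through the ontology.

```python
def precision_recall(predictions, truth, threshold):
    """Average precision and recall at one score threshold.

    predictions: {protein: {go_term: score}}   (assumed layout)
    truth:       {protein: set of annotated go_terms}
    """
    precisions = []
    recalls = []
    for protein, true_terms in truth.items():
        if not true_terms:
            continue  # a benchmark protein is expected to have annotations
        scored = predictions.get(protein, {})
        predicted = set(t for t, s in scored.items() if s >= threshold)
        hits = len(predicted & true_terms)
        if predicted:
            # Precision is averaged only over proteins with a prediction
            # at or above the threshold.
            precisions.append(hits / float(len(predicted)))
        # Recall is averaged over every benchmark protein.
        recalls.append(hits / float(len(true_terms)))
    precision = sum(precisions) / len(precisions) if precisions else 0.0
    recall = sum(recalls) / len(recalls) if recalls else 0.0
    return precision, recall


# Placeholder data: two benchmark proteins with made-up scores.
predictions = {
    "P1": {"GO:0000001": 0.9, "GO:0000002": 0.4},
    "P2": {"GO:0000003": 0.2},
}
truth = {
    "P1": {"GO:0000001", "GO:0000003"},
    "P2": {"GO:0000003"},
}

for tau in (0.2, 0.5):
    pr, rc = precision_recall(predictions, truth, tau)
    print("threshold %.1f: precision %.2f, recall %.2f" % (tau, pr, rc))
```

Sweeping the threshold over a range of values yields the points of a precision-recall curve.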
For bug reports, comments or questions, please email nzhou[AT]iastate.edu.
- Python 2.7 or Python 3
- Python packages can be downloaded from their sites or installed from repositories:
$ sudo apt install python-biopython python-yaml python-matplotlib python-seaborn
We provide two main functions to assist in the evaluation of GO-term prediction within the scope of CAFA: the main assessment function and the plot function.
`assess_main.py`
- The only input needed is the configuration file `config.yaml`, where the following four parameters are specified in the first section, `assess`.
  - First parameter `file`: prediction file formatted according to the CAFA3 format.
  - Second parameter `obo`: path to the Gene Ontology obo file. The latest version can be downloaded here. Note that the obo file used here should not be older than the one used in the prediction. The obo files used in both CAFA2 and CAFA3 are provided in the `./precrec/` folder.
  - Third parameter `benchmark`: path to the benchmark folder. The benchmark folder must follow a specific layout, including two sub-directories: groundtruth and lists. Please refer to the auxiliary function `benchmark_folder.py` for the creation of this folder, as well as the general creation of benchmarks. Benchmarks from CAFA2 and CAFA3 are given in this repository under `./precrec/benchmark`.
  - Fourth parameter `results`: folder where results are saved. A `pr_rc` folder will be created within the results folder.
  - Note that only the first section, `assess`, of the configuration file is used here; the rest of the configuration file can be ignored for this function.
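Before running the assessment, it can help to confirm that the `assess` section is complete and that the benchmark folder has the expected sub-directories. The sketch below assumes the section has already been parsed from `config.yaml` into a plain dictionary. The helper `check_assess_section` and the example paths marked as hypothetical are illustrations, not part of this toolset.

```python
import os

REQUIRED_KEYS = ("file", "obo", "benchmark", "results")
BENCHMARK_SUBDIRS = ("groundtruth", "lists")


def check_assess_section(assess):
    """Return a list of problems found in an `assess` section (a dict)."""
    problems = []
    for key in REQUIRED_KEYS:
        if not assess.get(key):
            problems.append("missing parameter: " + key)
    benchmark = assess.get("benchmark")
    if benchmark:
        for sub in BENCHMARK_SUBDIRS:
            if not os.path.isdir(os.path.join(benchmark, sub)):
                problems.append("benchmark folder lacks sub-directory: " + sub)
    return problems


# Example section; the obo file name and results path are hypothetical.
example = {
    "file": "ZZZ_1_9606.txt",
    "obo": "./precrec/go.obo",           # hypothetical file name
    "benchmark": "./precrec/benchmark",
    "results": "./results",              # hypothetical folder
}

for problem in check_assess_section(example):
    print(problem)
```

An empty list means the four parameters are present and the benchmark folder contains both sub-directories. The check does not validate the contents of those folders.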
`plot.py`
- The only input needed is the configuration file `config.yaml`, where the following parameters are specified in the second section, `plot`.
  - First parameter `results`: the results from the `assess_main.py` function.
  - Second parameter `title`: title of the plot. Optional.
  - Third parameter `smooth`: whether the precision-recall curves should be smoothed. Input 'Y' or 'N'.
  - Fourth parameter(s) `fileN`: name of the result file to be plotted. Up to 12 files can be added. These results will be drawn on the same plot.
  - Example: if the prediction file is `ZZZ_1_9606.txt`, the result file in the results folder will be `ZZZ_1_9606_results.txt`. Input only `ZZZ_1_9606` in the above parameter for plotting.
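The naming rule in the example above can be sketched in a few lines of Python. The helpers `plot_name`, `result_file` and `plot_section` are hypothetical and not part of this toolset, and numbering the `fileN` keys as `file1`, `file2`, ... is an assumption based on the parameter name.

```python
import os

MAX_FILES = 12  # the plot function accepts up to 12 result files


def plot_name(prediction_file):
    """Name to enter under a fileN parameter for a prediction file."""
    base = os.path.basename(prediction_file)
    return os.path.splitext(base)[0]


def result_file(prediction_file):
    """Result file name expected in the results folder."""
    return plot_name(prediction_file) + "_results.txt"


def plot_section(results_dir, prediction_files, title=None, smooth="N"):
    """Build a dictionary mirroring the `plot` section (assumed key names)."""
    if smooth not in ("Y", "N"):
        raise ValueError("smooth must be 'Y' or 'N'")
    if len(prediction_files) > MAX_FILES:
        raise ValueError("at most %d files can be plotted together" % MAX_FILES)
    section = {"results": results_dir, "smooth": smooth}
    if title:
        section["title"] = title  # the title is optional
    for i, name in enumerate(prediction_files, start=1):
        section["file%d" % i] = plot_name(name)
    return section


print(plot_name("ZZZ_1_9606.txt"))
print(result_file("ZZZ_1_9606.txt"))
print(plot_section("./results", ["ZZZ_1_9606.txt"], title="Example"))
```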
CAFA3 released its protein targets in September 2016. Each protein target has a unique CAFA3 ID. To run the above assessment function, each protein should be represented by its CAFA3 ID. However, the benchmark proteins generated by the benchmark creation tool are identified by UniProt Accession IDs. Therefore, we here provide functions to convert between UniProt IDs and CAFA3 IDs. We also provide a function that converts benchmark files generated by the benchmark creation tool to a benchmark folder that can feed into this program.
- `benchmark_folder.py`
  - Refer to `python benchmark_folder.py -h` for the syntax of using this function by itself.
  - If using our benchmark creation tool, the `benchmark_pipeline.sh` file is a good example of how to generate a benchmark folder for `assess_main.py` from the raw benchmarks.
  - Input your own folder names and different gaf file names in the blanks left in `benchmark_pipeline.sh`.
- `./ID_conversion/ID_conversion.py`
  - Two functions are written in this Python script: one converts UniProt Accessions to CAFA3 IDs, and the other converts CAFA3 IDs to UniProt Accessions.
  - First function: `uniprotac_to_cafaid(taxon, uniprotacs)`.
  - Second function: `cafaid_to_uniprot(taxon, cafaids)`.
  - Refer to the comments in the script `./ID_conversion/ID_conversion.py` and the third example below for usage.
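The idea behind the two conversion functions can be illustrated with a small stand-alone sketch. It does not call or reproduce `ID_conversion.py`: the lookup table, the identifiers in it, the helper names `to_cafa_ids` and `to_accessions`, and the choice to return `None` for unknown identifiers are all assumptions made for illustration.

```python
# Hypothetical per-taxon lookup table; the identifiers are placeholders,
# not real UniProt Accessions or CAFA3 IDs.
TABLE = {
    "8355": {"ACC_A": "CAFA_ID_A", "ACC_B": "CAFA_ID_B"},
}


def to_cafa_ids(taxon, accessions, table=TABLE):
    """Map accessions to CAFA-style IDs for one taxon (None if unknown)."""
    mapping = table.get(str(taxon), {})
    return [mapping.get(acc) for acc in accessions]


def to_accessions(taxon, cafa_ids, table=TABLE):
    """Map CAFA-style IDs back to accessions by inverting the same table."""
    mapping = table.get(str(taxon), {})
    reverse = dict((cafa, acc) for acc, cafa in mapping.items())
    return [reverse.get(cid) for cid in cafa_ids]


ids = to_cafa_ids(8355, ["ACC_B", "ACC_X"])
print(ids)
print(to_accessions(8355, ["CAFA_ID_A"]))
```

Both directions share one table per taxon, so a round trip through the two helpers returns the original identifiers for every entry present in the table.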
./assess_main.py config.yaml
./plot.py config.yaml
./ID_conversion/ID_conversion.py ./ID_conversion/example_uniprot_accession_8355.txt 8355 ./ID_conversion/example_output.txt
./benchmark_pipeline.sh
Zhou, N., Jiang, Y., Bergquist, T.R. et al. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biol 20, 244 (2019) doi:10.1186/s13059-019-1835-8
Jiang, Y., Oron, T., Clark, W. et al. An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Genome Biol 17, 184 (2016) doi:10.1186/s13059-016-1037-6