Merge branch 'feature/CELE-78' into feature/CELE-99
aranega committed Sep 26, 2024
2 parents d104bf2 + 5fcc821 commit 4bd83d5
Showing 41 changed files with 3,162 additions and 210 deletions.
5 changes: 0 additions & 5 deletions extraction/requirements.txt

This file was deleted.

37 changes: 37 additions & 0 deletions ingestion/README.md
@@ -0,0 +1,37 @@
# C-Elegans Utility CLI Tool

## Installation

Currently you can only install from source.
This should change once the CLI is ready for user testing.

Clone the repository and `cd` into the `ingestion` directory.
Set up a virtual environment first if you intend to use one.
Then run:

```console
pip install .
```

You should now have the CLI available to run. Try it out by running:

```console
celegans --help
```

## Usage

```TODO```

## Development

Set up a virtual environment with conda or an equivalent tool so you have a clean Python environment to work with.
To install the project dependencies and development packages, run:

```console
pip install -e ".[dev]"
```

You should now be able to run the CLI, make changes to it, and see them reflected in the script entry point output.

Before pushing any code to the remote repository, be sure to run the code formatter and the unit tests.
96 changes: 88 additions & 8 deletions format-ingestion.md → ingestion/format-ingestion.md
@@ -1,4 +1,49 @@
# Format of data ingested in the database
# Data ingest specification

This document describes the requirements and expectations for all data ingested into the C-Elegans application.

- [Dataset Identifier](#dataset-identifier)
- [EM data](#em-data)
- [3D data](#3d-data)
- [Format of data ingested in the database](#format-of-data-ingested-in-the-database)
- [Format of `neurons.json`](#format-of-neuronsjson)
- [Format of `datasets.json`](#format-of-datasetsjson)
- [Format of `connections/xxx.json`](#format-of-connectionsxxxjson)
- [Format of `annotations/xxx.json`](#format-of-annotationsxxxjson)
- [Bucket Storage](#bucket-storage)

## Dataset Identifier

All ingested data is contextualized within a dataset identifier.
The identifier will segregate the data in the database and in the GCP bucket, ensuring that the data is easily indexed and managed.

It is important to note that the dataset identifier is associated with all data in the database, so it must match the `id`s used in the ingested files.

> [!WARNING]
> The dataset identifier should not contain spaces or special characters.
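
As an illustration only, a conservative check could restrict identifiers to lowercase letters, digits, underscores and hyphens; the exact character set is an assumption, since the rule above only forbids spaces and special characters:

```python
import re

# Hypothetical helper: the allowed character set below is an assumption,
# not the official rule ("no spaces or special characters").
DATASET_ID_RE = re.compile(r"^[a-z0-9][a-z0-9_-]*$")

def is_valid_dataset_id(dataset_id: str) -> bool:
    """Return True if the identifier looks safe for database keys and bucket paths."""
    return bool(DATASET_ID_RE.match(dataset_id))

assert is_valid_dataset_id("white_1986_jsh")
assert not is_valid_dataset_id("white 1986 (JSH)")
```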
## Segmentations

Segmentation files are JSON files that encode the positions of neuron labels.
They MUST follow the file path naming scheme `**/*s<slice>.json`, where `slice` is a positive integer.
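
As a minimal sketch (assuming the naming scheme above is the only constraint), one way to recognise a segmentation file and recover its slice number:

```python
import re
from pathlib import Path

# Illustration only: the ingestion tool's actual matching rules may differ.
SEGMENTATION_RE = re.compile(r"s(\d+)\.json$")

def segmentation_slice(path: Path) -> int | None:
    """Return the slice number encoded in a segmentation file name, or None."""
    m = SEGMENTATION_RE.search(path.name)
    return int(m.group(1)) if m else None

assert segmentation_slice(Path("segmentations/s001.json")) == 1
assert segmentation_slice(Path("notes.json")) is None
```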

## EM data

Electron microscopy (EM) data MUST follow the file path naming scheme `**/<slice>/<y>_<x>_<z>.jpg`, where `slice`, `x`, `y` and `z` are positive integers.

Files MUST be `jpg` images with equal width and height.
The images are tiled by zoom level, each level doubling the resolution of the previous one.
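
A small sketch of how a tile path could be decomposed under this scheme (which component encodes the zoom level is not assumed here):

```python
import re
from pathlib import Path
from typing import NamedTuple

# Illustration only, derived from the naming scheme "**/<slice>/<y>_<x>_<z>.jpg".
TILE_RE = re.compile(r"^(\d+)_(\d+)_(\d+)\.jpg$")

class Tile(NamedTuple):
    slice_index: int
    y: int
    x: int
    z: int

def parse_tile(path: Path) -> Tile | None:
    """Parse an EM tile path into its integer components, or return None."""
    m = TILE_RE.match(path.name)
    if m is None or not path.parent.name.isdigit():
        return None
    y, x, z = (int(g) for g in m.groups())
    return Tile(slice_index=int(path.parent.name), y=y, x=x, z=z)

assert parse_tile(Path("em/13/0_1_5.jpg")) == Tile(slice_index=13, y=0, x=1, z=5)
```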

<!-- TODO: understand the impact of varying metersPerUnit (e.g 2nm voxels) in the map projection -->

## 3D data

For the 3D data, we upload all STL files following a format of `<neuron name>-*.stl`, like in <https://github.com/zhenlab-ltri/catmaid-data-explorer/tree/3d-viewer/server/3d-models>.

> [!NOTE]
> Synapses are not currently being uploaded.

## Format of data ingested in the database

The management script is able to ingest data represented in a JSON format.
Different files are necessary:
@@ -10,7 +55,7 @@ Different files are necessary:

Those files are automatically exported from a third-party tool and shouldn't be edited manually.

## Format of `neurons.json`
### Format of `neurons.json`

This file defines a list of JSON objects as its root structure:

@@ -41,7 +86,7 @@ Each JSON object represents a neuron with this schema:
```


## Format of `datasets.json`
### Format of `datasets.json`

This file defines a list of JSON objects as its root structure.

@@ -73,7 +118,7 @@ Each JSON object represents a specific dataset with this schema:
}
```

## Format of `connections/xxx.json`
> [!WARNING]
> It is important to note that the dataset `id`s defined in `datasets.json` MUST match the [Dataset Identifier](#dataset-identifier) specified during the ingestion process so the data can be correlated.

### Format of `connections/xxx.json`

The `connections` directory encodes the information about the different connections per dataset.
Each file in this directory is named after the `id` of a dataset present in the `datasets.json` file, e.g. a dataset defined with the `id` `white_1986_jsh` defines its connections in the file `connections/white_1986_jsh.json`.
@@ -95,7 +143,7 @@ The schema is the following:

For each of those objects, `ids`, `post_tid`, `pre_tid` and `syn` must have the same number of elements when `ids` is present.
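
A minimal sketch of that length check over a single connection object, assuming the four fields are plain JSON arrays (the full schema is given above):

```python
# Illustration only: checks just the array-length constraint described in the text.
def arrays_consistent(connection: dict) -> bool:
    """When "ids" is present, "ids", "post_tid", "pre_tid" and "syn" must have equal lengths."""
    if "ids" not in connection:
        return True
    expected = len(connection["ids"])
    return all(len(connection[k]) == expected for k in ("post_tid", "pre_tid", "syn"))

assert arrays_consistent({"ids": [1, 2], "post_tid": [3, 4], "pre_tid": [5, 6], "syn": [1, 1]})
assert not arrays_consistent({"ids": [1], "post_tid": [], "pre_tid": [5], "syn": [1]})
```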

## Format of `annotations/xxx.json`
### Format of `annotations/xxx.json`

The `annotations` directory encodes annotations about the different parts (`head` or `complete`) following the naming convention `<part>.annotations.json`, e.g. the annotations for the `head` are located in `annotations/head.annotations.json`.

@@ -115,7 +163,39 @@ Here is the schema for the `head.annotations.json` file (the `complete.annotatio

The types of annotations can be `increase`, `variable`, `postembryonic`, `decrease` or `stable`.

### Note:
## Bucket Storage

The cloud storage of the ingested files will be organized in the following pattern:

```console
.
├── dataset-1
│   ├── 3d
│   │   ├── nervering.stl
│   │   ├── ADAL.stl
│   │   ├── ADAR.stl
│   │   ├── ADEL.stl
│   │   └── ...
│   ├── em
│   │   ├── ...
│   │   ├── 13
│   │   │   ├── 0_0_5.jpg
│   │   │   ├── 0_1_4.jpg
│   │   │   ├── 0_1_5.jpg
│   │   │   └── ...
│   │   ├── ...
│   │   └── metadata.json
│   └── segmentations
│       ├── s000.json
│       ├── s001.json
│       └── ...
├── dataset-2
├── dataset-3
...
```

Each dataset has its own base directory named after the dataset identifier. Inside each dataset directory there are three subdirectories:

- `3d`: contains the 3D models for the neurons, with file names following `<neuron name>.stl`, with the exception of `nervering.stl`.
- `em`: stores each slice tileset in its own subdirectory plus a `metadata.json` file with the information required to render the tiles in the frontend application _(TODO: define `metadata.json` format)_.
- `segmentations`: stores all the segmentation JSON files following the naming scheme `s<slice>.json`, where `slice` is a positive integer (it can contain leading zeros).

The existing repository contains a `trajectories` folder with a set of JSON files.
Those files are not ingested anymore; they are part of a legacy system.
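
As a rough sketch of how object names could be assembled under this layout (these helper names are illustrative, not the ingestion tool's API):

```python
# Illustrative helpers only; the real upload logic lives in the ingestion package.
def stl_key(dataset_id: str, neuron: str) -> str:
    return f"{dataset_id}/3d/{neuron}.stl"

def em_tile_key(dataset_id: str, slice_index: int, y: int, x: int, z: int) -> str:
    return f"{dataset_id}/em/{slice_index}/{y}_{x}_{z}.jpg"

def segmentation_key(dataset_id: str, slice_index: int, width: int = 3) -> str:
    return f"{dataset_id}/segmentations/s{slice_index:0{width}d}.json"

assert stl_key("dataset-1", "ADAL") == "dataset-1/3d/ADAL.stl"
assert segmentation_key("dataset-1", 1) == "dataset-1/segmentations/s001.json"
```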
163 changes: 97 additions & 66 deletions ingestion/ingestion/__main__.py
@@ -2,96 +2,127 @@

import logging
import sys
from pathlib import Path
from argparse import ArgumentDefaultsHelpFormatter, ArgumentParser

from pydantic import ValidationError

from ingestion.errors import DataValidationError, ErrorWriter
from ingestion.filesystem import find_data_files, load_data
from ingestion.schema import Data
from ingestion.extract import add_flags as add_extract_flags
from ingestion.extract import extract_cmd
from ingestion.ingest import add_add_dataset_flags as add_ingest_add_dataset_flags
from ingestion.ingest import add_flags as add_ingest_flags
from ingestion.ingest import ingest_cmd
from ingestion.logging import setup_logger

logger = logging.getLogger(__name__)


def main():
import argparse
import os
def split_argv(argv: list[str], delimiter: str) -> list[list[str]]:
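# Split argv into groups; each group after the first starts with the delimiter itself.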
out: list[list[str]] = []
temp: list[str] = []

parser = argparse.ArgumentParser(
description="This is a python script to read c-elegans ingestion"
"files and validate its content."
)
for arg in argv:
if arg == delimiter:
out.append(temp)
temp = [arg]
continue
temp.append(arg)

if temp:
out.append(temp)

def directory(raw_path: str) -> Path:
if not os.path.isdir(raw_path):
raise argparse.ArgumentTypeError(f"{raw_path} is not an existing directory")
return Path(os.path.abspath(raw_path))

parser.add_argument(
"-i",
"--ingestion-dir",
help="input files to be ingested (default: current directory)",
type=directory,
default=os.path.curdir,
return out


def _main(argv: list[str] | None = None):
parser = ArgumentParser(
prog="celegans",
description="Support tool for the C-Elegans application",
formatter_class=ArgumentDefaultsHelpFormatter,
)

parser.add_argument(
"--overwrite",
help="overwrite files in the bucket",
default=False,
action="store_true",
def add_debug_flag(parser: ArgumentParser):
parser.add_argument(
"--debug",
help="runs with debug logs",
default=False,
action="store_true",
)

add_debug_flag(parser)

subparsers = parser.add_subparsers(dest="command")

# subcommand for the extraction of segmentation files
parser_extract = subparsers.add_parser(
name="extract",
help="extracs segentations from the bitmap files",
formatter_class=ArgumentDefaultsHelpFormatter,
)

parser.add_argument(
"--prune",
help="prune files in the bucket before upload",
default=False,
action="store_true",
add_extract_flags(parser_extract)
add_debug_flag(parser_extract)

# subcommand for the file ingestion
parser_ingest = subparsers.add_parser(
name="ingest",
help="ingest files into the C-Elegans deployment",
formatter_class=ArgumentDefaultsHelpFormatter,
)

parser.add_argument(
"--debug",
help="runs the ingestion with debug logs",
default=False,
action="store_true",
add_ingest_flags(parser_ingest)
add_debug_flag(parser_ingest)

subparsers_ingest = parser_ingest.add_subparsers(dest="ingest_subcommand")

parser_ingest_add_dataset = subparsers_ingest.add_parser(
name="add-dataset",
help="ingests a dataset data",
formatter_class=ArgumentDefaultsHelpFormatter,
)

args = parser.parse_args()
add_ingest_add_dataset_flags(parser_ingest_add_dataset)

if args.debug:
logging.basicConfig(level=logging.DEBUG)
else:
logging.basicConfig(level=logging.INFO)
if argv is not None and len(argv) == 0:
parser.print_help(sys.stderr)
sys.exit(0)

data_files = find_data_files(args.ingestion_dir)
json_data = load_data(data_files)
args = parser.parse_args(argv)

err_header = (
"Seems like we found something unexpected with your data.\n"
"Bellow is an overview of what we think may be wrong.\n"
"If you think this is an error on our side, please reach out!\n"
)
setup_logger(args.debug)

try:
Data.model_validate(json_data)
except ValidationError as e:
sys.stdout.write(
DataValidationError(e).humanize(
w=ErrorWriter(),
header=err_header,
data_files=data_files,
)
match args.command:
case "ingest":
ingest_cmd(args)
case "extract":
extract_cmd(args, debug=args.debug)
except KeyboardInterrupt as e:
if args.debug:
raise
logger.error(
"execution interrupted, some resources may have not uploaded properly!"
)
except Exception as e:
if args.debug:
raise
print(f"{type(e).__name__}: {e}", file=sys.stderr)
sys.exit(1)

print("OK")

def main(argv: list[str] | None = None):
"""Calls main but is inspects argv and splits accordingly"""

if __name__ == "__main__":
try:
import pydantic as _
except ImportError:
print('error: missing pydantic; try "pip install pydantic"')
sys.exit(1)
if argv is None:
argv = sys.argv[1:]

if "ingest" in argv and "add-dataset" in argv:
argvl = split_argv(argv, "add-dataset")

# TODO: print help of missing "add-dataset" if repeated flags are detected

for add_dataset_args in argvl[1:]:
_main(argvl[0] + add_dataset_args)
else:
_main(argv)


if __name__ == "__main__":
main()