Merge branch 'feature/CELE-78' into feature/CELE-99
aranega committed Sep 26, 2024
2 parents d104bf2 + 5fcc821 commit 4bd83d5
Showing 41 changed files with 3,162 additions and 210 deletions.
5 changes: 0 additions & 5 deletions extraction/requirements.txt

This file was deleted.

37 changes: 37 additions & 0 deletions ingestion/README.md
@@ -0,0 +1,37 @@
# C-Elegans Utility CLI Tool

## Installation

Currently you can only install from source.
This should change once the CLI is ready for user testing.

Clone the repository and `cd` into the `ingestion` directory.
Set up a virtual environment first if you intend to use one.
Then run:

```console
pip install .
```

You should now have the CLI available to run. Try it out by running:

```console
celegans --help
```

## Usage

```TODO```

## Development

Set up a virtual environment with conda or an equivalent tool so you have a clean Python environment to work with.
To install the project dependencies and development packages, run:

```console
pip install -e ".[dev]"
```

You should now be able to run the CLI, make changes to it, and see them reflected in the script entry point output.

Before pushing any code to the remote repository, be sure to run the code formatter and the unit tests.
96 changes: 88 additions & 8 deletions format-ingestion.md → ingestion/format-ingestion.md
@@ -1,4 +1,49 @@
# Format of data ingested in the database
# Data ingest specification

This document describes the requirements and expectations for all data ingested into the C-Elegans application.

- [Dataset Identifier](#dataset-identifier)
- [EM data](#em-data)
- [3D data](#3d-data)
- [Format of data ingested in the database](#format-of-data-ingested-in-the-database)
- [Format of `neurons.json`](#format-of-neuronsjson)
- [Format of `datasets.json`](#format-of-datasetsjson)
- [Format of `connections/xxx.json`](#format-of-connectionsxxxjson)
- [Format of `annotations/xxx.json`](#format-of-annotationsxxxjson)
- [Bucket Storage](#bucket-storage)

## Dataset Identifier

All ingested data is contextualized within a dataset identifier.
The identifier will segregate the data in the database and in the GCP bucket, ensuring that the data is easily indexed and managed.

It is important to note that the dataset identifier is associated with all data in the database, so it must match the `id`s used in the ingested files.

> [!WARNING]
> The dataset identifier should not contain spaces or special characters.
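
As an illustration only, a conservative check could restrict identifiers to lowercase letters, digits, underscores and hyphens; the exact character set is an assumption, since the rule above only forbids spaces and special characters:

```python
import re

# Hypothetical helper: the allowed character set below is an assumption,
# not the official rule ("no spaces or special characters").
DATASET_ID_RE = re.compile(r"^[a-z0-9][a-z0-9_-]*$")

def is_valid_dataset_id(dataset_id: str) -> bool:
    """Return True if the identifier looks safe for database keys and bucket paths."""
    return bool(DATASET_ID_RE.match(dataset_id))

assert is_valid_dataset_id("white_1986_jsh")
assert not is_valid_dataset_id("white 1986 (JSH)")
```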
## Segmentations

Segmentation files are JSON files that encode the positions of neuron labels.
They MUST follow the file path naming scheme `**/*s<slice>.json`, where `slice` is a positive integer.
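
As a minimal sketch (assuming the naming scheme above is the only constraint), one way to recognise a segmentation file and recover its slice number:

```python
import re
from pathlib import Path

# Illustration only: the ingestion tool's actual matching rules may differ.
SEGMENTATION_RE = re.compile(r"s(\d+)\.json$")

def segmentation_slice(path: Path) -> int | None:
    """Return the slice number encoded in a segmentation file name, or None."""
    m = SEGMENTATION_RE.search(path.name)
    return int(m.group(1)) if m else None

assert segmentation_slice(Path("segmentations/s001.json")) == 1
assert segmentation_slice(Path("notes.json")) is None
```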

## EM data

Electron microscopy (EM) data MUST follow the file path naming scheme `**/<slice>/<y>_<x>_<z>.jpg`, where `slice`, `x`, `y` and `z` are positive integers.

Files MUST be `jpg` images with equal width and height.
The images are tiled by zoom level, each level doubling the resolution of the previous one.
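
A small sketch of how a tile path could be decomposed under this scheme (which component encodes the zoom level is not assumed here):

```python
import re
from pathlib import Path
from typing import NamedTuple

# Illustration only, derived from the naming scheme "**/<slice>/<y>_<x>_<z>.jpg".
TILE_RE = re.compile(r"^(\d+)_(\d+)_(\d+)\.jpg$")

class Tile(NamedTuple):
    slice_index: int
    y: int
    x: int
    z: int

def parse_tile(path: Path) -> Tile | None:
    """Parse an EM tile path into its integer components, or return None."""
    m = TILE_RE.match(path.name)
    if m is None or not path.parent.name.isdigit():
        return None
    y, x, z = (int(g) for g in m.groups())
    return Tile(slice_index=int(path.parent.name), y=y, x=x, z=z)

assert parse_tile(Path("em/13/0_1_5.jpg")) == Tile(slice_index=13, y=0, x=1, z=5)
```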

<!-- TODO: understand the impact of varying metersPerUnit (e.g 2nm voxels) in the map projection -->

## 3D data

For the 3D data, we upload all STL files following a format of `<neuron name>-*.stl`, like in <https://github.com/zhenlab-ltri/catmaid-data-explorer/tree/3d-viewer/server/3d-models>.

> [!NOTE]
> Synapses are not currently being uploaded.

## Format of data ingested in the database

The management script is able to ingest data represented in a JSON format.
Different files are necessary:
@@ -10,7 +55,7 @@ Different files are necessary:

Those files are automatically exported from a third-party tool and shouldn't be edited manually.

## Format of `neurons.json`
### Format of `neurons.json`

This file defines a list of JSON objects as its root structure:

@@ -41,7 +86,7 @@ Each JSON object represents a neuron with this schema:
```


## Format of `datasets.json`
### Format of `datasets.json`

This file defines a list of JSON objects as its root structure.

@@ -73,7 +118,7 @@ Each JSON object represents a specific dataset with this schema:
}
```

## Format of `connections/xxx.json`
> [!WARNING]
> It is important to note that the dataset `id`s defined in `datasets.json` MUST match the [Dataset Identifier](#dataset-identifier) specified during the ingestion process so the data can be correlated.

### Format of `connections/xxx.json`

The `connections` directory encodes the information about the different connections per dataset.
Each file in this directory is named after the `id` of a dataset present in the `datasets.json` file, e.g. a dataset defined with the `id` `white_1986_jsh` defines its connections in the file `connections/white_1986_jsh.json`.
@@ -95,7 +143,7 @@ The schema is the following:

For each of those objects, `ids`, `post_tid`, `pre_tid` and `syn` must have the same number of elements when `ids` is present.
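
A minimal sketch of that length check over a single connection object, assuming the four fields are plain JSON arrays (the full schema is given above):

```python
# Illustration only: checks just the array-length constraint described in the text.
def arrays_consistent(connection: dict) -> bool:
    """When "ids" is present, "ids", "post_tid", "pre_tid" and "syn" must have equal lengths."""
    if "ids" not in connection:
        return True
    expected = len(connection["ids"])
    return all(len(connection[k]) == expected for k in ("post_tid", "pre_tid", "syn"))

assert arrays_consistent({"ids": [1, 2], "post_tid": [3, 4], "pre_tid": [5, 6], "syn": [1, 1]})
assert not arrays_consistent({"ids": [1], "post_tid": [], "pre_tid": [5], "syn": [1]})
```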

## Format of `annotations/xxx.json`
### Format of `annotations/xxx.json`

The `annotations` directory encodes annotations about the different parts (`head` or `complete`) following the naming convention `<part>.annotations.json`, e.g. the annotations for the `head` are located in `annotations/head.annotations.json`.

@@ -115,7 +163,39 @@ Here is the schema for the `head.annotations.json` file (the `complete.annotatio

The types of annotations can be `increase`, `variable`, `postembryonic`, `decrease` or `stable`.

### Note:
## Bucket Storage

The cloud storage of the ingested files will be organized in the following pattern:

```console
.
├── dataset-1
│   ├── 3d
│   │   ├── nervering.stl
│   │   ├── ADAL.stl
│   │   ├── ADAR.stl
│   │   ├── ADEL.stl
│   │   └── ...
│   ├── em
│   │   ├── ...
│   │   ├── 13
│   │   │   ├── 0_0_5.jpg
│   │   │   ├── 0_1_4.jpg
│   │   │   ├── 0_1_5.jpg
│   │   │   └── ...
│   │   ├── ...
│   │   └── metadata.json
│   └── segmentations
│       ├── s000.json
│       ├── s001.json
│       └── ...
├── dataset-2
├── dataset-3
...
```

Each dataset has its own base directory named after the dataset identifier. Inside each dataset directory there are three subdirectories:

- `3d`: contains the 3D models for the neurons, with file names following `<neuron name>.stl`, with the exception of `nervering.stl`.
- `em`: stores each slice tileset in its own subdirectory plus a `metadata.json` file with the information required to render the tiles in the frontend application _(TODO: define `metadata.json` format)_.
- `segmentations`: stores all the segmentation JSON files following the naming scheme `s<slice>.json`, where `slice` is a positive integer (it can contain leading zeros).

The existing repository contains a `trajectories` folder with a set of JSON files.
Those files are not ingested anymore; they are part of a legacy system.
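
As a rough sketch of how object names could be assembled under this layout (these helper names are illustrative, not the ingestion tool's API):

```python
# Illustrative helpers only; the real upload logic lives in the ingestion package.
def stl_key(dataset_id: str, neuron: str) -> str:
    return f"{dataset_id}/3d/{neuron}.stl"

def em_tile_key(dataset_id: str, slice_index: int, y: int, x: int, z: int) -> str:
    return f"{dataset_id}/em/{slice_index}/{y}_{x}_{z}.jpg"

def segmentation_key(dataset_id: str, slice_index: int, width: int = 3) -> str:
    return f"{dataset_id}/segmentations/s{slice_index:0{width}d}.json"

assert stl_key("dataset-1", "ADAL") == "dataset-1/3d/ADAL.stl"
assert segmentation_key("dataset-1", 1) == "dataset-1/segmentations/s001.json"
```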
163 changes: 97 additions & 66 deletions ingestion/ingestion/__main__.py
@@ -2,96 +2,127 @@

import logging
import sys
from pathlib import Path
from argparse import ArgumentDefaultsHelpFormatter, ArgumentParser

from pydantic import ValidationError

from ingestion.errors import DataValidationError, ErrorWriter
from ingestion.filesystem import find_data_files, load_data
from ingestion.schema import Data
from ingestion.extract import add_flags as add_extract_flags
from ingestion.extract import extract_cmd
from ingestion.ingest import add_add_dataset_flags as add_ingest_add_dataset_flags
from ingestion.ingest import add_flags as add_ingest_flags
from ingestion.ingest import ingest_cmd
from ingestion.logging import setup_logger

logger = logging.getLogger(__name__)


def main():
import argparse
import os
def split_argv(argv: list[str], delimiter: str) -> list[list[str]]:
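# Split argv into groups; each group after the first starts with the delimiter itself.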
out: list[list[str]] = []
temp: list[str] = []

parser = argparse.ArgumentParser(
description="This is a python script to read c-elegans ingestion"
"files and validate its content."
)
for arg in argv:
if arg == delimiter:
out.append(temp)
temp = [arg]
continue
temp.append(arg)

if temp:
out.append(temp)

def directory(raw_path: str) -> Path:
if not os.path.isdir(raw_path):
raise argparse.ArgumentTypeError(f"{raw_path} is not an existing directory")
return Path(os.path.abspath(raw_path))

parser.add_argument(
"-i",
"--ingestion-dir",
help="input files to be ingested (default: current directory)",
type=directory,
default=os.path.curdir,
return out


def _main(argv: list[str] | None = None):
parser = ArgumentParser(
prog="celegans",
description="Support tool for the C-Elegans application",
formatter_class=ArgumentDefaultsHelpFormatter,
)

parser.add_argument(
"--overwrite",
help="overwrite files in the bucket",
default=False,
action="store_true",
def add_debug_flag(parser: ArgumentParser):
parser.add_argument(
"--debug",
help="runs with debug logs",
default=False,
action="store_true",
)

add_debug_flag(parser)

subparsers = parser.add_subparsers(dest="command")

# subcommand for the extraction of segmentation files
parser_extract = subparsers.add_parser(
name="extract",
help="extracs segentations from the bitmap files",
formatter_class=ArgumentDefaultsHelpFormatter,
)

parser.add_argument(
"--prune",
help="prune files in the bucket before upload",
default=False,
action="store_true",
add_extract_flags(parser_extract)
add_debug_flag(parser_extract)

# subcommand for the file ingestion
parser_ingest = subparsers.add_parser(
name="ingest",
help="ingest files into the C-Elegans deployment",
formatter_class=ArgumentDefaultsHelpFormatter,
)

parser.add_argument(
"--debug",
help="runs the ingestion with debug logs",
default=False,
action="store_true",
add_ingest_flags(parser_ingest)
add_debug_flag(parser_ingest)

subparsers_ingest = parser_ingest.add_subparsers(dest="ingest_subcommand")

parser_ingest_add_dataset = subparsers_ingest.add_parser(
name="add-dataset",
help="ingests a dataset data",
formatter_class=ArgumentDefaultsHelpFormatter,
)

args = parser.parse_args()
add_ingest_add_dataset_flags(parser_ingest_add_dataset)

if args.debug:
logging.basicConfig(level=logging.DEBUG)
else:
logging.basicConfig(level=logging.INFO)
if argv is not None and len(argv) == 0:
parser.print_help(sys.stderr)
sys.exit(0)

data_files = find_data_files(args.ingestion_dir)
json_data = load_data(data_files)
args = parser.parse_args(argv)

err_header = (
"Seems like we found something unexpected with your data.\n"
"Bellow is an overview of what we think may be wrong.\n"
"If you think this is an error on our side, please reach out!\n"
)
setup_logger(args.debug)

try:
Data.model_validate(json_data)
except ValidationError as e:
sys.stdout.write(
DataValidationError(e).humanize(
w=ErrorWriter(),
header=err_header,
data_files=data_files,
)
match args.command:
case "ingest":
ingest_cmd(args)
case "extract":
extract_cmd(args, debug=args.debug)
except KeyboardInterrupt as e:
if args.debug:
raise
logger.error(
"execution interrupted, some resources may have not uploaded properly!"
)
except Exception as e:
if args.debug:
raise
print(f"{type(e).__name__}: {e}", file=sys.stderr)
sys.exit(1)

print("OK")

def main(argv: list[str] | None = None):
"""Calls main but is inspects argv and splits accordingly"""

if __name__ == "__main__":
try:
import pydantic as _
except ImportError:
print('error: missing pydantic; try "pip install pydantic"')
sys.exit(1)
if argv is None:
argv = sys.argv[1:]

if "ingest" in argv and "add-dataset" in argv:
argvl = split_argv(argv, "add-dataset")

# TODO: print help of missing "add-dataset" if repeated flags are detected

for add_dataset_args in argvl[1:]:
_main(argvl[0] + add_dataset_args)
else:
_main(argv)


if __name__ == "__main__":
main()