Feature/cele 46 #24

Merged 30 commits · Aug 28, 2024
Changes from 12 commits (30 commits total):
c8a081a
CELE-46 Initial work on ingestion validation
dvcorreia Aug 14, 2024
fbb54b7
CELE-46 Update dataset ingestion format doc
dvcorreia Aug 14, 2024
41399df
Merge branch 'develop' into feature/CELE-46
dvcorreia Aug 16, 2024
f92aca3
CELE-46 ingestion structures schema validation
dvcorreia Aug 16, 2024
6f7e9b0
CELE-46 Update data ingestion validation and tests
dvcorreia Aug 16, 2024
c015701
CELE-46 Updated connection elements size rule
dvcorreia Aug 20, 2024
c1b6492
CELE-46 Updated format ingestion docs
dvcorreia Aug 20, 2024
605011a
CELE-46 Remove outdated comments
dvcorreia Aug 20, 2024
519b6a7
CELE-46 Code dust
dvcorreia Aug 20, 2024
a2a5e7e
CELE-46 Code review syntactic changes
dvcorreia Aug 21, 2024
c24669a
CELE-46 Move ingestion validation code to project root
dvcorreia Aug 21, 2024
bdccca2
CELE-46 Specification of enum type for DatasetType
dvcorreia Aug 21, 2024
04fd692
CELE-46 Add subset of reference ingestion data as test fixtures
dvcorreia Aug 21, 2024
dd1e02e
CELE-46 Set test fixtures to be ignored by github diff
dvcorreia Aug 21, 2024
7edc1d1
CELE-46 Set test fixtures as vendored in gitattributes
dvcorreia Aug 21, 2024
c3eb860
CELE-46 Set test fixtures as vendored in gitattributes (fix)
dvcorreia Aug 21, 2024
fd9f978
CELE-46 Set test fixtures as vendored in gitattributes (fix 2)
dvcorreia Aug 21, 2024
84c45d9
CELE-46 Work on humanizing pydantic's validation errors
dvcorreia Aug 21, 2024
fc81b94
CELE-46 Load ingestion data from the file system
dvcorreia Aug 22, 2024
4607422
CELE-46 Options pattern for ingestion data error writer
dvcorreia Aug 22, 2024
bc829b8
CELE-46 Validation file is now called schema
dvcorreia Aug 22, 2024
d6f3ecb
CELE-46 Error file snippet feature
dvcorreia Aug 23, 2024
35242b1
CELE-46 Ignore mypy errors for json_source_map
dvcorreia Aug 23, 2024
91f0b23
CELE-46 Small error print format adjustment
dvcorreia Aug 23, 2024
b82d6bd
CELE-46 Add ingestion error line in file path
dvcorreia Aug 26, 2024
54e3a64
CELE-46 Ingestion data validation error wrapper and external lib for …
dvcorreia Aug 26, 2024
e933e67
CELE-46 Remove flake.nix file
dvcorreia Aug 27, 2024
c0a810b
CELE-46 Add test for filesystem find and load ingestion files
dvcorreia Aug 27, 2024
3b1b177
CELE-46 Code review suggestions
dvcorreia Aug 28, 2024
5610682
CELE-46 Remove flake.nix
dvcorreia Aug 28, 2024
12 changes: 6 additions & 6 deletions format-ingestion.md
@@ -6,7 +6,7 @@ Different files are necessary:
* `neurons.json` that encodes the information about the neurons in general
* `datasets.json` that encodes the information about the different datasets
* `connections/xxx.json` that encodes the different connections for dedicated datasets
-* `annotations/xxx.json` that encodes annotatinos for different zones of the anatomy
+* `annotations/xxx.json` that encodes annotations for different zones of the anatomy

Those files are automatically exported from a third-party tool and shouldn't be edited manually.

@@ -63,9 +63,9 @@ Each JSON object represents a specific dataset with this schema:
{
"id": string // unique ID for the dataset
"name": string // display name of the dataset
-"type": string // type of dataset: "complete" or "head"
-"time": int // time of the dataset
-"visualTime": int // visualTime of the dataset
+"type": string // type of dataset: "complete", "head" or "tail"
+"time": float // time of the dataset
+"visualTime": float // visualTime of the dataset
"description": string // description of the dataset
"axes": [ // OPTIONAL: different axes and their representation, not used but can appear in the file
...
@@ -89,11 +89,11 @@ The schema is the following:
"pre": string, // the name of a neuron as defined in "neurons.json"
"pre_tid": [ ... ], // a list of int where each int represents the ID of a pre synapse for a dedicated pre neuron
"syn": [ ... ], // a list of int where each int represents the weight of a post or pre synapses (indice matches the neuron in pre/post_tid)
-"typ": int // the type of connection ("electrical" or "chemical")
+"typ": int // the type of connection ("electrical" (0) or "chemical" (2))
}
```

-For each of those objects: `ids`, `post_tid`, `pre_tid` and `syn` need to have the same number of elements.
+For each of those objects: `ids`, `post_tid`, `pre_tid` and `syn` need to have the same number of elements when `ids` is present.
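The updated size rule can be sketched as a standalone check (plain Python over a parsed JSON object, not the project's pydantic code; field names come from the schema above, the neuron names in the example are hypothetical):

```python
def check_connection_sizes(conn: dict) -> None:
    """Enforce the rule above: when "ids" is present (and non-empty), it must
    have the same number of elements as "post_tid", "pre_tid" and "syn"."""
    ids = conn.get("ids", [])
    if not ids:
        return  # no "ids": the size rule does not apply
    lengths = {len(ids), len(conn["post_tid"]), len(conn["pre_tid"]), len(conn["syn"])}
    if len(lengths) != 1:
        raise ValueError(
            "ids, post_tid, pre_tid and syn must have the same number of elements"
        )

# A connection without "ids" is accepted as-is:
check_connection_sizes(
    {"pre": "ADAL", "post": "ADAR", "post_tid": [7], "pre_tid": [3], "syn": [2], "typ": 2}
)
```

A connection carrying a 2-element `ids` alongside 1-element `post_tid`/`pre_tid`/`syn` lists would raise a `ValueError` under this check.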

## Format of `annotations/xxx.json`

1 change: 1 addition & 0 deletions ingestion/ingestion/__init__.py
@@ -0,0 +1 @@
# This is intentionally left blank
106 changes: 106 additions & 0 deletions ingestion/ingestion/validator.py
@@ -0,0 +1,106 @@
from __future__ import annotations

from enum import Enum, IntEnum
from typing import Literal

from pydantic import BaseModel, Field, RootModel, model_validator


class Neuron(BaseModel):
inhead: bool # int used as bool, is the neuron part of the head or not
name: str # name of the neuron, can be same as classes, or L or R of classes
emb: bool # int used as bool
nt: str # neurotransmitter type
intail: bool # int used as bool
classes: str # general name of the neuron
typ: str # type of the neuron


class DatasetType(str, Enum):
COMPLETE = "complete"
HEAD = "head"
TAIL = "tail"


class Axe(BaseModel):
face: str
axisIndex: int
axisTransform: int


class Dataset(BaseModel):
id: str
name: str
type: DatasetType
time: float # TODO: should add validation gte than 0?
visualTime: float # TODO: should add validation gte than 0?
description: str
axes: list[Axe] | None = Field(
default=None, description="different axes and their representation"
)


class ConnectionType(IntEnum):
ELECTRICAL = 0
CHEMICAL = 2


class Connection(BaseModel):
ids: list[int] = Field(
default_factory=list,
description="list of neuron IDs involved in this connection",
)
post: str # the name of a neuron as defined in "neurons.json"
post_tid: list[int] = Field(
default_factory=list,
description="list of neuron IDs of a post synapse for a dedicated post neuron",
)
pre: str # the name of a neuron as defined in "neurons.json"
pre_tid: list[int] = Field(
default_factory=list,
description="list of neuron IDs of a pre synapse for a dedicated pre neuron",
)
syn: list[int] = Field(
...,
description="list of weights of a post or pre synapses (indice matches the neuron in pre/post_tid)",
)
typ: ConnectionType # the type of connection ("electrical" or "chemical")

@model_validator(mode="after")
def check_same_size_elements(self):
if len(self.ids) != 0:
assert (
len(self.ids)
== len(self.post_tid)
== len(self.pre_tid)
== len(self.syn)
), "ids, post_tid, pre_tid and syn must have the same number of elements"

return self


class Annotation(RootModel):
root: dict[
Literal["increase", "variable", "postembryonic", "decrease", "stable"],
list[
tuple[ # the type of annotation
str, # pre, the ID/name of a neuron from "neurons.json"
str, # post, the ID/name of the other neuron from "neurons.json" that is part of the couple
]
],
] = {}


class Data(BaseModel):
neurons: list[Neuron]
datasets: list[Dataset]
connections: dict[str, list[Connection]] = {}
annotations: dict[Literal["head", "complete", "tail"], Annotation] = {}

@model_validator(mode="after")
def check_connection_dataset_exists(self):
existing_datasets = [dt.id for dt in self.datasets]
assert all(
dataset_id in existing_datasets for dataset_id in self.connections.keys()
), "missing dataset definition for connection"
return self
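The final cross-file check in `Data` (every key of `connections` must name a dataset defined in `datasets.json`) can be illustrated without pydantic. This is a hedged sketch over plain dicts, not the project's API; the dataset id `witness1` is made up for the example:

```python
def missing_connection_datasets(datasets: list[dict], connections: dict) -> list[str]:
    """Return the connection keys with no matching dataset "id".

    An empty result means every connections/xxx.json file maps to a
    dataset declared in datasets.json, mirroring the model validator above.
    """
    known_ids = {d["id"] for d in datasets}
    return [key for key in connections if key not in known_ids]

datasets = [{"id": "witness1"}]
assert missing_connection_datasets(datasets, {"witness1": []}) == []
assert missing_connection_datasets(datasets, {"witness2": []}) == ["witness2"]
```

In the pydantic version this same condition is asserted inside a `@model_validator(mode="after")`, so a bad key surfaces as a `ValidationError` instead of a returned list.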
72 changes: 72 additions & 0 deletions ingestion/pyproject.toml
@@ -0,0 +1,72 @@
[build-system]
requires = ["setuptools", "setuptools-scm"]
build-backend = "setuptools.build_meta"

[project]
name = "ingestion"
version = "0.0.1"
description = "CLI tool to ingest c-elegans data"
readme = "README.md"
requires-python = ">=3.10"
authors = [
{ name = "Vincent Aranega", email = "[email protected]" },
{ name = "Diogo Correia", email = "[email protected]" },
]
maintainers = [
{ name = "Vincent Aranega", email = "[email protected]" },
{ name = "Diogo Correia", email = "[email protected]" },
]
dependencies = [
"pydantic==2.8.2",
]

[project.optional-dependencies]
dev = [
"black>=24.8.0",
"coverage>=7.6.1",
"isort>=5.13.2",
"mypy==1.11.1", # lock version: manual upgrade is advised
"pytest>=8",
"pytest-asyncio",
]

[tool.setuptools.packages.find]
where = ["."] # list of folders that contain the packages (["."] by default)
include = ["*"] # package names should match these glob patterns (["*"] by default)
exclude = [
"tests*",
] # exclude packages matching these glob patterns (empty by default)
namespaces = false # false to disable scanning PEP 420 namespaces (true by default)

[tool.black]
line-length = 88
target-version = ['py310']
include = '\.pyi?$'

[tool.isort]
profile = "black"
line_length = 88
src_paths = ["ingestion", "tests"]
add_imports = ["from __future__ import annotations"]

[tool.mypy]
python_version = "3.10"

[tool.pytest.ini_options]
minversion = "8.0"
addopts = "-v"
asyncio_mode = "strict"
testpaths = ["tests"]

[tool.coverage.run]
branch = true
source = ["ingestion"]
omit = [
"venv/*",
".venv/*",
"tests/*",
]

[tool.coverage.report]
show_missing = true
fail_under = 0