Feature/cele 78 #30

Merged on Oct 11, 2024 (72 commits)
Changes from 2 commits
Commits
5c77304
CELE-78 Work on the ingestion logic
dvcorreia Aug 28, 2024
e144b3d
CELE-78 Move extraction in to celegans ingestion CLI
dvcorreia Aug 29, 2024
68be4cd
CELE-78 Separate subcommands in its own files
dvcorreia Aug 29, 2024
cf7b93f
CELE-78 Add segmentation file finder and tile grid metadata
dvcorreia Sep 2, 2024
e379eeb
CELE-78 Code format
dvcorreia Sep 2, 2024
e6843bd
CELE-78 Segmentations and tiles GCP storage upload
dvcorreia Sep 3, 2024
06032ad
CELE-78 Add suggestions from Vincent
dvcorreia Sep 5, 2024
afedf7e
CELE-78 Fix regression on data ingestion directory
dvcorreia Sep 5, 2024
917d0ea
CELE-78 Unit tests for cli data validation and small code tweaks
dvcorreia Sep 5, 2024
4da554e
CELE-78 Unit tests for cli segmentation upload to GCS
dvcorreia Sep 6, 2024
9a6393f
CELE-78 Add RFC bucket ingestion storage pattern docs
dvcorreia Sep 9, 2024
647da8f
CELE-78 Add ingestion POC for 3D files
dvcorreia Sep 9, 2024
7327788
CELE-78 Some more format ingestion reviews
dvcorreia Sep 10, 2024
1df9667
CELE-78 Extract tileset metadata
dvcorreia Sep 10, 2024
953a59e
CELE-78 Add new blob names and unit tests
dvcorreia Sep 10, 2024
cd42ff1
CELE-78 If the prunning of the bucket is interrupted it removes the l…
dvcorreia Sep 11, 2024
09e5087
CELE-78 Code tweaks and remove of unused code
dvcorreia Sep 11, 2024
f659884
CELE-78 Upload EM tile metadata do GCP bucket
dvcorreia Sep 11, 2024
6af78bf
CELE-78 Code format
dvcorreia Sep 11, 2024
b04d79f
CELE-78 Ingestion CLI review changes: multiple datasets
dvcorreia Sep 12, 2024
e22207e
CELE-78 Fix ingestion CLI bug
dvcorreia Sep 13, 2024
2c676a6
CELE-78 Enforce dataset id to be present in datasets.json
dvcorreia Sep 13, 2024
8526960
CELE-78 Tweaks to EM tiles metadata
dvcorreia Sep 13, 2024
5113fb9
CELE-78 Fix bug in unit test
dvcorreia Sep 13, 2024
4d7747b
CELE-78 Upload EM tiles metadata
dvcorreia Sep 17, 2024
1295251
CELE-78 Upload segmentation image resolution metadata
dvcorreia Sep 17, 2024
45b660e
CELE-78 Fix ingestion uploading synapses STL files when it should not
dvcorreia Sep 19, 2024
0cdc3a4
CELE-78 Display help when root command is called with no args
dvcorreia Sep 20, 2024
15057c5
CELE-78 Clarify ingestion overwrite flag usage
dvcorreia Sep 20, 2024
c16af28
CELE-78 Add dry run ingestion flag
dvcorreia Sep 20, 2024
db52b6b
CELE-78 Upload ingestion raw data to GCP bucket
dvcorreia Sep 23, 2024
dccc13e
CELE-78 Fix --data ingestion flag typo
dvcorreia Sep 23, 2024
80b039a
CELE-78 Update dry-run option
aranega Sep 23, 2024
9368346
CELE-78 Change the raw-directory localisation
aranega Sep 24, 2024
585031a
CELE-78 Move up the argument for raw-data ingestion
aranega Sep 24, 2024
a96c37b
CELE-78 Fix some debug message
aranega Sep 24, 2024
91cf7c6
CELE-78 Remove debug level from gcp module
aranega Sep 25, 2024
0046eeb
CELE-99 Add way of pulling information from local FS to populate the DB
aranega Sep 26, 2024
9b4d122
CELE-99 Add cleaning of the DB before DB population
aranega Sep 26, 2024
5fcc821
CELE-78 Change directory for db-raw-data on the bucket
aranega Sep 26, 2024
d104bf2
CELE-99 Add fetching of raw db files from bucket
aranega Sep 26, 2024
4bd83d5
Merge branch 'feature/CELE-78' into feature/CELE-99
aranega Sep 26, 2024
44265f9
CELE-99 Add summary.txt generation
aranega Sep 26, 2024
64e8738
CELE-99 Update .gitignore for ingestion
aranega Sep 26, 2024
55bcf93
Merge branch 'develop' of github.com:MetaCell/c-elegans-app into feat…
aranega Sep 26, 2024
291c66e
Merge branch 'feature/CELE-78' into feature/CELE-99
aranega Sep 26, 2024
dea760a
CELE-78 Documentation for the ingestion script
dvcorreia Sep 26, 2024
1796d8e
CELE-78 Fix ingestion prune logic
dvcorreia Sep 26, 2024
028e072
CELE-78 Ignore mypy error in FakeBucket
dvcorreia Sep 26, 2024
d783bec
CELE-78 Fix bug in check prune rule
dvcorreia Sep 26, 2024
a89de2f
CELE-78 Code review changes
dvcorreia Sep 27, 2024
71f3a9c
CELE-78 Improve prune_bucket doc comment
dvcorreia Sep 27, 2024
be9e958
CELE-78 Update description on bucket file structure
dvcorreia Oct 1, 2024
0346ecf
CELE-97 Fix EM viewer sliding window undefined reference
dvcorreia Oct 3, 2024
03a3ff7
CELE-97 Fix request for tiles at inexistent minzoom
dvcorreia Oct 3, 2024
2a9d1a0
CELE-99 Add missing library for the backend
aranega Oct 8, 2024
0d9d652
CELE-78 Add github action tests for ingestion
aranega Oct 8, 2024
7960e00
CELE-78 Fix bad import
aranega Oct 8, 2024
ad44b6d
CELE-78 Change the way metadata are represented for EM slices
aranega Oct 8, 2024
28da8e0
Merge branch 'feature/CELE-99' into feature/em-viewer-config
aranega Oct 8, 2024
a00d7f2
Add transmission of EM data information to the frontend
aranega Oct 8, 2024
5f24887
Change the way missing em metadata are handled
aranega Oct 8, 2024
3e53729
Generate new typescript binding
aranega Oct 8, 2024
15e6b3b
Merge branch 'bug/CELE-97' into feature/em-viewer-config
aranega Oct 8, 2024
c1c1a0b
Plug backend information to the EM viewer
aranega Oct 8, 2024
6e58e31
Merge branch 'develop' of github.com:MetaCell/c-elegans-app into feat…
aranega Oct 8, 2024
b4849cb
Merge branch 'feature/CELE-78' into feature/em-viewer-config
aranega Oct 8, 2024
e2f7004
Fix linting issue
aranega Oct 8, 2024
92d6edd
Merge pull request #48 from MetaCell/feature/CELE-99
aranega Oct 9, 2024
00895c0
Merge branch 'feature/CELE-78' into feature/em-viewer-config
aranega Oct 9, 2024
3b8946e
Merge pull request #59 from MetaCell/feature/em-viewer-config
ddelpiano Oct 10, 2024
e44249a
CELE-78 Multithreading blob upload and retry
dvcorreia Oct 11, 2024
57 changes: 34 additions & 23 deletions ingestion/ingestion/ingest.py
@@ -5,8 +5,8 @@
 import os
 import sys
 from argparse import ArgumentParser, Namespace
+from datetime import datetime, timedelta, timezone
 from pathlib import Path
-from time import sleep

 from google.cloud import storage
 from pydantic import ValidationError
@@ -181,24 +181,35 @@ def validate_and_upload_data(
 def prune_bucket(bucket: storage.Bucket | FakeBucket):
     """Prune the bucket and waits until the bucket is empty by checking it periodically."""

-    bucket.lifecycle_rules = [{"action": {"type": "Delete"}, "condition": {"age": 0}}]
-    bucket.patch()
+    yesterday = datetime.now(timezone.utc) - timedelta(days=1)
+    prune_rule = {
+        "action": {"type": "Delete"},
+        "condition": {"createdBefore": yesterday.strftime("%Y-%m-%d")},
+    }

+    def is_prune_lifecycle(rule: dict) -> bool:
+        return (
+            rule["action"]["type"] == "Delete"
+            and "createdBefore" in rule["condition"]
+            and len(rule["condition"].keys()) == 1  # createdBefore is the only key
+        )
+
-    try:
-        sleep_interval = 10
-        while True:
-            has_blobs = len(list(bucket.list_blobs(max_results=1))) != 0
-            if not has_blobs:
-                break
-
-            logger.info(f"bucket '{bucket.name}' is not yet empty. waiting...")
-            sleep(sleep_interval)
-    except Exception as e:
-        raise
-    finally:
-        # ensure that the lifecycle rule is removed
-        bucket.lifecycle_rules = []
-        bucket.patch()
+    lifecycle_rules = list(bucket.lifecycle_rules)
+
+    prune_lifecycle_already_exists: bool = False
+    for lifecycle in lifecycle_rules:
+        if is_prune_lifecycle(lifecycle):
+            lifecycle["condition"]["createdBefore"] = prune_rule["condition"][
+                "createdBefore"
+            ]
+            prune_lifecycle_already_exists = True
+            break
+
+    if not prune_lifecycle_already_exists:
+        lifecycle_rules = lifecycle_rules + [prune_rule]
+
+    bucket.lifecycle_rules = lifecycle_rules
+    bucket.patch()

     logger.info(f"bucket '{bucket.name}' was pruned successfully!")

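The rule-merging step in the `prune_bucket` hunk above can be exercised without a real bucket. The following is a minimal sketch, not the code from this PR: `make_prune_rule`, `is_prune_rule`, and `upsert_prune_rule` are hypothetical helper names, and rules are assumed to be plain dicts shaped like the ones in the diff. It also assumes the intent is to treat a rule as "the prune rule" only when `createdBefore` is its sole condition key.

```python
from datetime import datetime, timedelta, timezone


def make_prune_rule(now: datetime) -> dict:
    """Build a Delete rule matching objects created before yesterday's date."""
    yesterday = now - timedelta(days=1)
    return {
        "action": {"type": "Delete"},
        "condition": {"createdBefore": yesterday.strftime("%Y-%m-%d")},
    }


def is_prune_rule(rule: dict) -> bool:
    """True for a Delete rule whose only condition key is createdBefore."""
    condition = rule.get("condition", {})
    return (
        rule.get("action", {}).get("type") == "Delete"
        and set(condition) == {"createdBefore"}
    )


def upsert_prune_rule(rules: list[dict], prune_rule: dict) -> list[dict]:
    """Return a new list with the prune rule refreshed in place or appended."""
    # Copy each rule and its condition so the caller's list is not mutated.
    merged = [dict(r, condition=dict(r.get("condition", {}))) for r in rules]
    for rule in merged:
        if is_prune_rule(rule):
            rule["condition"]["createdBefore"] = prune_rule["condition"]["createdBefore"]
            return merged
    return merged + [prune_rule]


# Hypothetical existing rules: one unrelated rule and one stale prune rule.
now = datetime(2024, 9, 26, 12, 0, tzinfo=timezone.utc)
existing = [
    {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
     "condition": {"age": 30}},
    {"action": {"type": "Delete"}, "condition": {"createdBefore": "2024-01-01"}},
]
updated = upsert_prune_rule(existing, make_prune_rule(now))
fresh = upsert_prune_rule([], make_prune_rule(now))
```

Keeping the merge as a pure function means unrelated lifecycle rules on the bucket are preserved and the prune rule is never duplicated across runs; the result would then be assigned to `bucket.lifecycle_rules` and followed by `bucket.patch()`, as in the diff.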
@@ -368,11 +379,6 @@ def ingest_cmd(args: Namespace):
     dataset_id = args.id
     overwrite = args.overwrite

-    if args.data:
-        validate_and_upload_data(dataset_id, args.data, rs, overwrite=overwrite)
-    elif dry_run:
-        logger.warning(f"skipping neurons data validation and upload")
-
     if args.prune:
         prune = args.y or ask(
             "Are you sure you want to delete all files on the bucket?"
@@ -384,6 +390,11 @@ def ingest_cmd(args: Namespace):
     elif dry_run:
         logger.info(f"skipped prunning files from the bucket")

+    if args.data:
+        validate_and_upload_data(dataset_id, args.data, rs, overwrite=overwrite)
+    elif dry_run:
+        logger.warning(f"skipping neurons data validation and upload")
+
     if args.segmentation:
         upload_segmentations(dataset_id, args.segmentation, rs, overwrite=overwrite)
     elif dry_run:
2 changes: 1 addition & 1 deletion ingestion/ingestion/storage/gcp.py
@@ -25,7 +25,7 @@ def blob(self, *_) -> Blob:
         return FakeBlob()  # type:ignore

     def patch(self, *_): ...
-    def list_blobs(self, *_, **kwargs) -> Iterable[Any]: ...
+    def list_blobs(self, *_, **kwargs) -> Iterable[Any]: ...  # type:ignore


 class RemoteStorage: