This issue starts the process of defining the file format named in the title.
foreword
With the objective of easing analytics on large graphs and widening support for a standard way of storing graph data in a binary format (with the same scope of functionality as already proposed by pynock for storing graph data in parquet files), @SultanOrazbayev proposed binarisation support for the Zarr v3 specification (developed at QuantStack and funded by the Chan Zuckerberg Initiative through NumFOCUS). This format is already supported by Dask, so it provides an easy entry point to scalable distributed computing via, for example, RAPIDS tools and Ray.
forethoughts
Zarr is a storage format for chunked N-dimensional arrays, flexible enough to store graphs as matrices and N-dimensional arrays for potentially billions of nodes and edges.
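To make the chunking idea concrete, here is a minimal sketch of how a cell of a large adjacency matrix maps onto a regular chunk grid. It uses plain Python arithmetic only and does not call the Zarr library; the function names and the chunk shape are assumptions chosen for illustration.

```python
from typing import Tuple


def locate_cell(i: int, j: int, chunk_shape: Tuple[int, int]) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Return (chunk grid coordinates, offset inside that chunk) for matrix cell (i, j)."""
    rows, cols = chunk_shape
    chunk_coords = (i // rows, j // cols)
    offset = (i % rows, j % cols)
    return chunk_coords, offset


def chunk_grid_shape(n_nodes: int, chunk_shape: Tuple[int, int]) -> Tuple[int, int]:
    """Number of chunks along each axis of an n_nodes x n_nodes adjacency matrix."""
    rows, cols = chunk_shape
    # Ceiling division: a partially filled chunk still counts as one chunk.
    return (-(-n_nodes // rows), -(-n_nodes // cols))


if __name__ == "__main__":
    # Hypothetical graph with 2500 nodes, stored in 1000 x 1000 chunks.
    print(chunk_grid_shape(2500, (1000, 1000)))
    print(locate_cell(2500 - 1, 999, (1000, 1000)))
```

Reading or writing the edge between two nodes then only touches the single chunk that holds that cell, which is what makes very large matrices manageable.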
We have pinpointed some critical points that require attention in the development of the convention:
A. how to design a proper representation of the graph/subgraph characteristics in a metadata file in terms of names, labels and mapping to indices
B. how to store node attributes
C. how to store numerical data (initially in the form of a weighted adjacency matrix)
Briefly, on each point (to be extended below):
A. graph metadata: an efficient representation of graph characteristics in a metadata file, allowing mapping between the original node names and the indices in the matrix, plus any other mapping required to translate between triples and matrices (lossless or selective, depending on the usage: lossless if used as storage, selective if used as an analytical intermediate representation). Points to address: which format, stored where, and how to link the metadata to the data file; is it possible to store the metadata in a Zarr file itself?
B. attributes metadata: if multiple attributes are needed for each node-node relation, the list and format of the attributes have to be stored in an efficient format and point to the data files that store the actual data. We have identified the best option to be storing any dimension beyond the second (N>2) in separate files (Sultan, please provide the rationale for this approach), keeping the order of indices as defined by the .
C. data: this is the more technical matter of storing the elements of a weighted adjacency matrix once A. and B. are defined.
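The three points can be sketched together on a toy graph. This is only an illustration under stated assumptions, not the proposed convention: node indices are assigned in order of first appearance, each predicate is treated as one separate 2-D layer sharing the same index order, weights are kept in a sparse dictionary keyed by (row, column), and the metadata is serialised as JSON. All names (`build_node_index`, `to_layers`, `to_triples`) are hypothetical.

```python
import json
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)
Layer = Dict[Tuple[int, int], float]  # sparse cells of one 2-D weighted adjacency matrix


def build_node_index(triples: List[Triple]) -> Dict[str, int]:
    """A. Map each node name to a row/column index, in order of first appearance."""
    index: Dict[str, int] = {}
    for s, _, o in triples:
        for name in (s, o):
            if name not in index:
                index[name] = len(index)
    return index


def to_layers(triples: List[Triple], weights: List[float], index: Dict[str, int]) -> Dict[str, Layer]:
    """B./C. Build one sparse 2-D layer per predicate; all layers share the node index."""
    layers: Dict[str, Layer] = {}
    for (s, p, o), w in zip(triples, weights):
        layers.setdefault(p, {})[(index[s], index[o])] = w
    return layers


def to_triples(layers: Dict[str, Layer], index: Dict[str, int]) -> List[Triple]:
    """Translate matrices back to triples using the inverse of the node index."""
    names = {i: name for name, i in index.items()}
    return sorted((names[i], p, names[j]) for p, cells in layers.items() for (i, j) in cells)


triples = [("alice", "knows", "bob"), ("bob", "knows", "carol"), ("alice", "cites", "carol")]
weights = [1.0, 0.5, 2.0]

index = build_node_index(triples)
layers = to_layers(triples, weights, index)
# JSON-serialisable metadata: the node index plus the list of layers that depend on it.
metadata = json.dumps({"nodes": index, "layers": sorted(layers)})
```

Because every layer reuses the same `index`, the round trip through `to_triples` recovers the original triples, which is the lossless case described in A.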
expected output
Output of graph data binarisation: the final form will be an object that holds references to the files providing A., B. and C.
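One possible shape for such an object is a small manifest that records where A., B. and C. live and can be written next to the data. The class name, field names and file names below are assumptions made for illustration; the actual formats are still open questions in this issue.

```python
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class GraphBundle:
    """Hypothetical manifest holding references to the files for A., B. and C."""

    metadata_path: str             # A. graph metadata
    attributes_path: str           # B. attributes metadata
    array_paths: Tuple[str, ...]   # C. one array store per 2-D matrix

    def to_json(self) -> str:
        """Serialise the manifest; tuples become JSON lists."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "GraphBundle":
        """Rebuild the manifest, restoring the tuple of array paths."""
        raw = json.loads(text)
        return cls(raw["metadata_path"], raw["attributes_path"], tuple(raw["array_paths"]))


# Placeholder file names, not a proposed naming scheme.
bundle = GraphBundle("graph.meta.json", "graph.attrs.parquet", ("knows.zarr", "cites.zarr"))
restored = GraphBundle.from_json(bundle.to_json())
```

Keeping the manifest as plain JSON means it could equally be embedded in another container later, if the metadata ends up stored inside a Zarr file itself.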
initial example
Overview of the procedure to be implemented, in pseudo-python with comments addressing open questions:
from pathlib import Path
from typing import Tuple

import rdflib

rdf_graph = rdflib.Graph().parse(..., format="turtle")  # an in-memory graph as `rdflib.Graph` instance


def rdf_to_zarr(rdf_graph: rdflib.Graph) -> Tuple[Path, Path, Tuple[Path, ...]]:
    """ General procedure """
    path_to_a = create_metadata_object(rdf_graph)
    # which convention, how to leverage pynock
    # which format?
    path_to_b = create_attributes_object(rdf_graph)
    # store them in a separate parquet file as they could leverage a columnar representation?
    # pynock convention for attributes?
    # which format?
    paths_to_c = create_array_representations(rdf_graph)
    # which metrics: adjacency with weights
    # multidimensional in separate files (no 3rd dimension unless we find a good way of managing different types) ?
    # file.
    return path_to_a, path_to_b, paths_to_c


def create_metadata_object(g: rdflib.Graph) -> Path:
    """ define procedure for A. """
    ...


def create_attributes_object(g: rdflib.Graph) -> Path:
    """ define procedure for B. """
    ...


def create_array_representations(g: rdflib.Graph) -> Tuple[Path, ...]:
    """ define procedure for C. """
    ...
@ceteri please provide your inputs