Defining a convention for the storage of graph data as N-dimensional arrays in a standard format #14

Open
Mec-iS opened this issue Dec 5, 2022 · 0 comments


This issue starts the process of defining the file format described in the title.

foreword

To ease analytics on large graphs and widen support for a standard binary format for graph data (with the same scope of functionality already proposed by pynock for storing graph data in Parquet files), @SultanOrazbayev proposed adding binarisation support for the Zarr v3 specification (developed at QuantStack and funded by the Chan Zuckerberg Initiative through NumFOCUS). Dask already supports this format, so it provides an easy entry point to scalable distributed computing via, for example, RAPIDS tools and Ray.

forethoughts

Zarr is a storage format for chunked N-dimensional arrays. It is flexible enough to store graphs as matrices and N-dimensional arrays, for potentially billions of nodes and edges.
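To make the chunking idea concrete, here is a minimal sketch in plain Python, not using the Zarr API. It assumes a regular chunk grid over a 2-D adjacency matrix, where the chunk holding cell (i, j) is found by integer division by the chunk shape. The names `chunk_of` and `group_edges_by_chunk` are hypothetical and only serve the illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[int, int, float]  # (source index, target index, weight)


def chunk_of(i: int, j: int, chunk_shape: Tuple[int, int]) -> Tuple[int, int]:
    """Return the grid coordinates of the chunk holding matrix cell (i, j)."""
    return i // chunk_shape[0], j // chunk_shape[1]


def group_edges_by_chunk(
    edges: Iterable[Edge], chunk_shape: Tuple[int, int]
) -> Dict[Tuple[int, int], List[Edge]]:
    """Bucket edges by the chunk their adjacency-matrix cell falls into."""
    buckets: Dict[Tuple[int, int], List[Edge]] = defaultdict(list)
    for i, j, w in edges:
        buckets[chunk_of(i, j, chunk_shape)].append((i, j, w))
    return dict(buckets)


# Three edges of a 6x6 adjacency matrix, split into 4x4 chunks (a 2x2 grid).
edges = [(0, 1, 1.0), (2, 3, 0.5), (5, 0, 2.0)]
chunks = group_edges_by_chunk(edges, chunk_shape=(4, 4))
```

In this toy case only two of the four chunks contain edges, which suggests why a chunked layout may suit sparse adjacency data: chunks with no edges would not need to be touched by a reader.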
We have identified three points that need attention while developing the convention:
A. how to represent the graph/subgraph characteristics in a metadata file, in terms of names, labels and mapping to indices
B. how to store node attributes
C. how to store numerical data (initially as a weighted adjacency matrix)

Briefly, on each point (to be extended below):
A. graph metadata: an efficient representation of graph characteristics in a metadata file, allowing mapping between the original node names and the indices in the matrix, plus any other mapping required to translate between triples and matrices in both directions (lossless or selective, depending on the usage: lossless if used as storage, selective if used as an analytical intermediate representation). Points to address: which format, stored where, and how to link the metadata to the data file; can the metadata be stored in a Zarr file itself?
B. attributes metadata: if multiple attributes are needed for each node-node relation, the list and format of the attributes have to be stored somewhere in an efficient format and point to the data files that store the actual data. The best option we have identified is to store any third dimension (N>2) in separate files (Sultan, please provide the rationale for this approach), keeping the order of indices as defined by the graph metadata (A.).
C. data: this is the more technical matter of storing the elements of a weighted adjacency matrix once A. and B. are defined.
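As a rough illustration of points A. to C., the sketch below builds a name-to-index mapping from a list of triples, stores one 2-D matrix per predicate that shares that mapping (in the spirit of "separate files" for extra dimensions), and decodes the matrices back into triples to show a lossless round trip. It uses only plain Python lists and an unweighted value of 1.0 per edge; the function names and the sample triples are assumptions made for this example, not part of any agreed convention.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)
Matrix = List[List[float]]


def build_node_index(triples: List[Triple]) -> Dict[str, int]:
    """A: map each node name to a row/column index, in first-seen order."""
    index: Dict[str, int] = {}
    for s, _, o in triples:
        for name in (s, o):
            if name not in index:
                index[name] = len(index)
    return index


def to_matrices(triples: List[Triple], index: Dict[str, int]) -> Dict[str, Matrix]:
    """B/C: one adjacency matrix per predicate, all sharing the same index."""
    n = len(index)
    matrices: Dict[str, Matrix] = {}
    for s, p, o in triples:
        matrix = matrices.setdefault(p, [[0.0] * n for _ in range(n)])
        matrix[index[s]][index[o]] = 1.0  # a real weight would go here
    return matrices


def to_triples(matrices: Dict[str, Matrix], index: Dict[str, int]) -> Set[Triple]:
    """Decode the matrices back into triples using the inverted index."""
    names = {i: name for name, i in index.items()}
    return {
        (names[i], p, names[j])
        for p, matrix in matrices.items()
        for i, row in enumerate(matrix)
        for j, weight in enumerate(row)
        if weight != 0.0
    }


triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "worksWith", "carol"),
]
index = build_node_index(triples)
matrices = to_matrices(triples, index)
```

Keeping every predicate matrix makes the encoding lossless for this toy graph; dropping some of them would give the "selective" variant mentioned in A.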

expected output

Output of graph data binarisation: the final form will be an object that holds references to the files providing A., B. and C.
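One possible shape for such an object, as a sketch only: a small dataclass that records where A., B. and C. live and can be saved to and reloaded from a JSON manifest. The class name `GraphBundle`, the manifest layout and the file names are all hypothetical; the format of the referenced files is still an open question in this issue.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Tuple


@dataclass(frozen=True)
class GraphBundle:
    """References to the files providing A. (metadata), B. (attributes), C. (arrays)."""

    metadata: str
    attributes: str
    arrays: Tuple[str, ...]

    def save(self, manifest: Path) -> None:
        """Write the references to a JSON manifest file."""
        manifest.write_text(json.dumps(asdict(self), indent=2))

    @classmethod
    def load(cls, manifest: Path) -> "GraphBundle":
        """Rebuild the bundle from a JSON manifest file."""
        raw = json.loads(manifest.read_text())
        return cls(raw["metadata"], raw["attributes"], tuple(raw["arrays"]))


# Hypothetical file names, used only to exercise the round trip.
bundle = GraphBundle(
    metadata="graph.meta.json",
    attributes="graph.attrs.parquet",
    arrays=("graph.adjacency.zarr",),
)
with tempfile.TemporaryDirectory() as tmp:
    manifest_path = Path(tmp) / "manifest.json"
    bundle.save(manifest_path)
    restored = GraphBundle.load(manifest_path)
```

Storing the references as relative strings rather than absolute paths is one way to keep the bundle relocatable, though that choice is open to discussion.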

initial example

Overview of the procedure to be implemented, in pseudo-python with comments addressing open questions:

from pathlib import Path
from typing import Tuple

import rdflib

# an in-memory graph as an `rdflib.Graph` instance; the Turtle source is left unspecified
rdf_graph = rdflib.Graph().parse(..., format="turtle")


def rdf_to_zarr(rdf_graph: rdflib.Graph) -> Tuple[Path, Path, Tuple[Path, ...]]:
    """General procedure: returns the paths to A., B. and C."""
    path_to_a = create_metadata_object(rdf_graph)
    # which convention, how to leverage pynock?
    # which format?

    path_to_b = create_attributes_object(rdf_graph)
    # store them in a separate parquet file, as they could leverage a columnar representation?
    # pynock convention for attributes?
    # which format?

    paths_to_c = create_array_representations(rdf_graph)
    # which metrics: adjacency with weights
    # multidimensional data in separate files (no 3rd dimension unless we find
    # a good way of managing different types)?
    return path_to_a, path_to_b, paths_to_c


def create_metadata_object(g: rdflib.Graph) -> Path:
    """Define the procedure for A."""
    ...


def create_attributes_object(g: rdflib.Graph) -> Path:
    """Define the procedure for B."""
    ...


def create_array_representations(g: rdflib.Graph) -> Tuple[Path, ...]:
    """Define the procedure for C."""
    ...

@ceteri please provide your input
