GraphScope models graph data as Property Graph, in which the edges/vertices are labeled and each label may have many properties.
To load a property graph to GraphScope, we provide a function:
load_from(edges, vertices)
This function helps users to construct the schema of the property graph. edges is a Dict. Each pair item in the dict determines a label for the edges. More specifically, the key of the pair item is the label name, the value of the pair is a configuration Tuple or List, which contains:
- a :ref:`Loader Object` for data source, it tells graphscope where to find the data for this label, it can be a file location, or a numpy, etc.
- a list of properties, the names should consistent to the header_row of the data source file or pandas. This list is optional. When it omitted or empty, all columns except the src/dst columns will be added as properties.
- a pair of str for the edge source, in the format of (column_name_for_src, label_of_src);
- a pair of str for the edge destination, in the format of (column_name_for_dst, label_of_dst);
Let's see an example:
edges={
# a kind of edge with label "group"
"group": (
# the data source, in this case, is a file location.
"file:///home/admin/group.e",
# selected column names in group.e, will load as properties
["group_id", "member_size"],
# use 'leader_student_id' column as src id, the src label should be 'student'
("leader_student_id", "student"),
# use 'member_student_id' column as dst id, the dst label is 'student'
("member_student_id", "student")
)
}
Alternatively, the configuration can be a Dict, The reserved keys of the Dict are "loader", "properties", "source" and "destination". This configuration for edges are exactly the same to the above configuration.
edges = {
"group": {
"loader": "file:///home/admin/group.e",
"properties": ["group_id", "member_size"],
"source": ("leader_teacher_id", "teacher"),
"destination": ("member_teacher_id", "teacher"),
},
}
In some cases, an edge label may connect two kinds of vertices. For example, in a graph, two kinds of edges are labeled with group but represents two relations. i.e., teacher-group-> student and student-group-> student. In this case, a group key follows a list of configurations.
edges={
# a kind of edge with label "group"
"group": [
(
"file:///home/admin/group.e",
["group_id", "member_size"],
("leader_student_id", "student"),
("member_student_id", "student")
),
(
"file:///home/admin/group_for_teacher_student.e",
["group_id", "group_name", "establish_date"],
("teacher_in_charge_id", "teacher"),
("member_student_id", "student")
)
]
}
Some configurations can omit for edges. e.g., properties can be empty, which means to select all columns
edges={
"group": (
"file:///home/admin/group.e",
[],
("leader_student_id", "student"),
("member_student_id", "student")
)
}
Alternatively, all column names can be assigned with index. For example, the number in the src/dst assigned the first column is used as src_id and the second column is used as dst_vid:
edges={
"group": (
"/home/admin/group.e",
["group_id", "member_size"],
# 0 represents the first column.
(0, "student"),
# second column used as dst.
(1, "student"),
)
}
If there is only one label in the graph, the label of vertices can be omitted.
edges={
"group": (
"file:///home/admin/group.e",
["group_id", "member_size",]
# vertex labels in the two ends of the edges are omitted.
"leader_student_id",
"member_student_id",
)
}
In the simplest case, the configuration can only assign a loader with path. By default, the first column will be used as src_id, the second column will be used as dst_id. all the rest columns in the file are parsed as properties.
edges={
"group": "file:///home/admin/group.e"
}
Similar to edges, a vertex Dict contains a key as the label, and a set of configuration for the label. The configurations contain:
- a loader for data source, which can be a file location, or a numpy, etc. See more details in :ref:`Loader Object`.
- a list of properties, the names should consistent to the header_row of the data source file or pandas. This list is optional. When it omitted, all columns except the vertex_id column will be added as properties.
- the column used as vertex_id. The value in this column of the data source will be used for src/dst when loading edges.
Here is an example for vertices:
vertices={
"student": (
# source file for vertices labeled as student;
"file:///home/admin/student.v",
# columns loaded as property
["name", "lesson_number", "avg_score"],
# the column used for vertex_id
"student_id"
)
}
Like the edges, the configuration for vertices can also be a Dict, in which the keys are "loader", "properties" and "vid"
vertices={
"student": {
"loader": "file:///home/admin/student.v",
"properties": ["name", "lesson_nums", "avg_score"],
"vid": "student_id",
},
},
We can also omit certain configurations for vertices.
- properties can be empty, which means that all columns are selected as properties;
- vid can be represented by a number of index,
- In the simplest case, the configuration can only contains a loader. In this case, the first column is used as vid, and the rest columns are used as properties.
vertices={
"student": "file:///home/admin/student.v"
}
Moreover, the vertices can be totally omitted. graphscope will extract vertices ids from edges, and a default label _ will assigned to all vertices in this case.
g = graphscope_session.load_from(
edges={
"group": "file:///home/admin/group.e"
}
)
Let's make the example complete:
g = graphscope_session.load_from(
edges={
"group": [
(
"file:///home/admin/group.e",
["group_id", "member_size"],
("leader_student_id", "student"),
("member_student_id", "student"),
),
(
"file:///home/admin/group_for_teacher_student.e",
["group_id", "group_name", "establish_date"],
("teacher_in_charge_id", "teacher"),
("member_student_id", "student"),
),
]
},
vertices={
"student": (
"/home/admin/student.v",
["name", "lesson_nums", "avg_score"],
"student_id",
),
"teacher": (
"/home/admin/teacher.v",
["name", "salary", "age"],
"teacher_id",
),
},
)
A more complex example to load LDBC snb graph can be find here.
The datasource aforementioned is an object of :ref`Loader`. A loader wraps a location or the data itself. graphscope supports load a graph from pandas dataframes or numpy ndarrays.
import pandas as pd
df_e = pd.read_csv('group.e', sep=',',
usecols=['leader_student_id', 'member_student_id', 'member_size'])
df_v = pd.read_csv('student.v', sep=',', usecols=['student_id', 'lesson_nums', 'avg_score'])
# use a dataframe as datasource, properties omitted, col_0/col_1 will be used as src/dst by default.
# (for vertices, col_0 will be used as vertex_id by default)
g1 = sess.load_graph(edges=df_e, vertices=df_v)
Or load from numpy ndarrays
import numpy
array_e = [df_e[col].values for col in ['leader_student_id', 'member_student_id', 'member_size']]
array_v = [df_v[col].values for col in ['student_id', 'lesson_nums', 'avg_score']]
g2 = sess.load_graph(edges=array_e, vertices=array_v)
When a loader wraps a location, it may only contains a str. The string follows the standard of URI. When receiving a request for loading graph from a location, graphscope will parse the URI and invoke corresponding loader according to the schema.
Currently, graphscope supports loaders for these locations:
from graphscope import Loader
ds1 = Loader("file:///var/datafiles/group.e")
ds2 = Loader("oss://graphscope_bucket/datafiles/group.e")
ds3 = Loader("hdfs://datafiles/group.e")
:meth:`graphscope.load_from` | Loading from local filesystem, OSS, or ODPS |