
Model accuracy differs between distributed and single-machine training with identical parameters #273

Open
LucasTsui0725 opened this issue Aug 3, 2023 · 1 comment
@LucasTsui0725

Following the reference code in graphlearn v1.1.0, I moved the model-training part of train_supervised into the worker task of dist_train to test a distributed supervised-learning job. The training dataset is ogbn-arxiv; for distributed training, the nodes and edges were each split evenly into two files, and the cluster was configured as 2 PS + 2 workers. All other code and model hyperparameters were kept unchanged. The distributed training loss drops to around 1.6 and then oscillates, while single-machine training reaches around 1. How can this be resolved?

@LucasTsui0725 (Author)

I then tried distributed training with a single PS and multiple workers, loading the full data on every worker:

    # Every worker loads the full node and edge tables.
    train_table = os.path.join(data_folder, 'train_table')
    test_table = os.path.join(data_folder, 'test_table')
    valid_table = os.path.join(data_folder, 'valid_table')
    node_table = os.path.join(data_folder, node_table)
    edge_table = os.path.join(data_folder, edge_table)
    g = gl.Graph() \
          .node(node_table, node_type=node_type, decoder=gl.Decoder(labeled=True, attr_types=['float'] * args.input_dim, attr_delimiter=":")) \
          .edge(edge_table, edge_type=(node_type, node_type, edge_type), decoder=gl.Decoder(weighted=True), directed=False) \
          .node(train_table, node_type=node_type, decoder=gl.Decoder(weighted=True), mask=gl.Mask.TRAIN) \
          .node(valid_table, node_type=node_type, decoder=gl.Decoder(weighted=True), mask=gl.Mask.VAL) \
          .node(test_table, node_type=node_type, decoder=gl.Decoder(weighted=True), mask=gl.Mask.TEST)

With this setup, the results are essentially the same as single-machine training.
For multi-PS multi-worker distributed training, I instead used the following:

    # Node and edge tables are pre-split per worker (suffixed with task_index),
    # but the train/valid/test tables are still loaded in full by every worker.
    train_table = os.path.join(data_folder, 'train_table')
    test_table = os.path.join(data_folder, 'test_table')
    valid_table = os.path.join(data_folder, 'valid_table')
    node_table = os.path.join(data_folder, node_table + str(task_index))
    edge_table = os.path.join(data_folder, edge_table + str(task_index))
    g = gl.Graph() \
          .node(node_table, node_type=node_type, decoder=gl.Decoder(labeled=True, attr_types=['float'] * args.input_dim, attr_delimiter=":")) \
          .edge(edge_table, edge_type=(node_type, node_type, edge_type), decoder=gl.Decoder(weighted=True), directed=False) \
          .node(train_table, node_type=node_type, decoder=gl.Decoder(weighted=True), mask=gl.Mask.TRAIN) \
          .node(valid_table, node_type=node_type, decoder=gl.Decoder(weighted=True), mask=gl.Mask.VAL) \
          .node(test_table, node_type=node_type, decoder=gl.Decoder(weighted=True), mask=gl.Mask.TEST)

This is the setup that produces the discrepancy described above.
Should train_table, valid_table, and test_table also be pre-split per worker, like the node and edge tables, i.e.:

    train_table = os.path.join(data_folder, 'train_table_' + str(task_index))
    test_table = os.path.join(data_folder, 'test_table_' + str(task_index))
    valid_table = os.path.join(data_folder, 'valid_table_' + str(task_index))
    node_table = os.path.join(data_folder, node_table + str(task_index))
    edge_table = os.path.join(data_folder, edge_table + str(task_index))
    g = gl.Graph() \
          .node(node_table, node_type=node_type, decoder=gl.Decoder(labeled=True, attr_types=['float'] * args.input_dim, attr_delimiter=":")) \
          .edge(edge_table, edge_type=(node_type, node_type, edge_type), decoder=gl.Decoder(weighted=True), directed=False) \
          .node(train_table, node_type=node_type, decoder=gl.Decoder(weighted=True), mask=gl.Mask.TRAIN) \
          .node(valid_table, node_type=node_type, decoder=gl.Decoder(weighted=True), mask=gl.Mask.VAL) \
          .node(test_table, node_type=node_type, decoder=gl.Decoder(weighted=True), mask=gl.Mask.TEST)
