Distributed Pytorch implementation #1

tlin-taolin · 2019-03-04T23:17:06Z

Here is my simplified PyTorch code for distributed training. MLBench was built on top of it and has evolved quickly. MLBench is now a bit hard for beginners to start with; this code may have the same issue, but I will continue to simplify it.

So in the next few days, I will try to introduce some of the new features and code design choices from MLBench (e.g., different sparsification/quantization schemes), and make this code as simple as possible.

Also, I still need to figure out how to merge this code with Thijs's single-worker code.
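
For readers unfamiliar with the sparsification/quantization schemes mentioned above, here is a minimal sketch of two common variants: top-k sparsification and scaled sign quantization. It is not the code from this PR or from MLBench; the function names (`top_k_sparsify`, `densify`, `sign_quantize`) are hypothetical, and plain Python lists stand in for gradient tensors so the example runs without PyTorch.

```python
from typing import List, Tuple


def top_k_sparsify(grad: List[float], k: int) -> List[Tuple[int, float]]:
    """Keep the k largest-magnitude entries as (index, value) pairs."""
    k = max(0, min(k, len(grad)))
    # Sort indices by descending magnitude; the sort is stable, so ties
    # keep the lower index first.
    order = sorted(range(len(grad)), key=lambda i: -abs(grad[i]))
    # Return the kept entries ordered by index.
    return sorted((i, grad[i]) for i in order[:k])


def densify(pairs: List[Tuple[int, float]], size: int) -> List[float]:
    """Rebuild a dense vector from (index, value) pairs; other entries are 0."""
    dense = [0.0] * size
    for i, v in pairs:
        dense[i] = v
    return dense


def sign_quantize(grad: List[float]) -> List[float]:
    """Replace each entry by +/- mean(|g|), i.e. one sign bit plus one scale."""
    if not grad:
        return []
    scale = sum(abs(g) for g in grad) / len(grad)
    return [scale if g >= 0 else -scale for g in grad]


if __name__ == "__main__":
    g = [0.5, -2.0, 0.25, 1.0]
    sparse = top_k_sparsify(g, k=2)
    print(sparse)
    print(densify(sparse, len(g)))
    print(sign_quantize(g))
```

In a distributed setting, each worker would typically send only the sparse pairs or the sign bits plus the scale, instead of the full gradient, which reduces communication at the cost of a lossy update.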

martinjaggi · 2019-03-13T17:01:50Z

@iamtao, can you simplify the code structure a bit in view of Thijs's single-machine code? https://github.com/epfml/cifar/blob/master/train.py
Or, even better, could it look a bit like this smaller one here?
https://github.com/epfml/autoTrain/blob/master/src/train_adam.py

tlin-taolin · 2019-03-13T17:27:09Z

Yes. I will try to simplify the current code and bring it closer to Thijs's scheme.

tlin-taolin added 5 commits March 4, 2019 23:31

mv Thijs's code to bsingle-worker folder first.

dcd0e8b

simply multi-worker code.

1307f63

slightly simplify the func.

9a69c6c

add hyper-parameter for different models.

2958c43

add a missing file.

871131f

tlin-taolin added the enhancement (New feature or request) label Mar 4, 2019

tlin-taolin assigned tlin-taolin and negar-foroutan Mar 4, 2019

tlin-taolin requested a review from tvogels March 4, 2019 23:17

tlin-taolin and others added 10 commits March 7, 2019 13:09

clean the code a bit; merge the aggregation to the optimizer.

016ae15

minor.

55f94dd

try to make the scheduler more clean.

52d7cde

add the env setup for distributed training.

f26829d

fix the bug for single-worker scenario.

3359407

minor

d5accef

minor adjust the code layout.

74df150

minor

a7d1189

minor

e30181c

Add sparsified/quantized sgd

5396cec

tlin-taolin added 9 commits March 13, 2019 21:26

rename platform to environment.

7b35361

start the simplification; only adjust the structure.

9aad118

improve the code design choice by improving metrics/logging/stat_trackere.

546097a

simplify data loader.

74f2961

try to simplify scheduler.

446de3b

minor

a88b1bc

rename one variable over the files; try to fix some newly introduced bugs.

78c4461

minor improve.

43900a3

minor fix.

998936e

tlin-taolin added 9 commits March 17, 2019 18:28

minor improve

894b901

minor.

620265c

update the entry file.

2fe98f9

minor improve.

57c4b6b

minor.

e36f198

improve the network topology.

a6d5e33

minor improve.

b67d3e2

minor improve.

59ca75e

improve the structure a bit.

c92c329

martinjaggi unassigned negar-foroutan May 6, 2020
