Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow decentralized users to join late and catch up #775

Merged
merged 56 commits into from
Oct 14, 2024

Conversation

JulienVig
Copy link
Collaborator

@JulienVig JulienVig commented Sep 11, 2024

Closes #718

Decentralized issues

Currently, we can only train decentralized with the exact number of peers specified in minNbOfParticipants.

  • No peer can leave because leaving mid training makes the round drop below the min number of participant and fail (have to abort the round)
  • No more than minNbOfParticipants peers can join one round because the server sends the list of peers as soon as the threshold is reached, so the minNbOfParticipants + 1nth peer joining doesn't get the current round's peer list but is included in the next round's
    Essentially, once the peers list has been sent, joining and leaving is not possible anymore.

Solution Implemented

  1. During onRoundBeginCommunication, peers manifest their interest to join the current round: they send a PeerJoinsRound to the server. The server keeps a list of peers wanting to join (but doesn't reply with a peer list as it currently does).
  2. The peers start training locally without the round's peer list
  3. When peers are done training locally, they notify the server with a PeerIsReady, i.e. ready to exchange weight updates. The server waits until all the peers that sent a PeerJoinsRound sends their PeerIsReady and then send the round's peer list.
    i. This allows for some time for peers to join the round. To prevent new peers from continually joining and waiting for new peers to be ready, the server can stop including peers in this round as soon as one peer is ready (and include them in the next).
    ii. Peers can leave and notify the server before. As long as the peer list hasn't been sent, peers can join and leave without it being a problem.
  4. Upon receiving the peer list, peers establish p2p connection and start exchanging weight updates.
  • Allow a peer to join a session that has already started.
  • If the number of peers drops below the minNbOfParticipants threshold, the peers wait for more participants
  • Peers leaving notifies the server which in turn notifies the remaining peers to wait.
  • More than exactly minNbOfParticipants peers can participate in the same round (because the PeerJoinsRound step allows some time for participants to join, instead of directly starting the weight update when minNbOfParticipants peers joined)

Refactoring

  • aggregator.add returns a promise for the aggregated weights (no more need for the perpetual promise loop for the server controller) + new aggregator tests
  • TaskInitializer implements discojs utils's EventEmitter
  • Drop unused trainingInformation.decentralizedSecure and add aggregationStrategy ('mean' or 'secure')

@JulienVig JulienVig marked this pull request as ready for review September 24, 2024 12:32
Copy link
Collaborator

@tharvik tharvik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huge work, thank you! finally got rid of the mess in the server with promises 🤩

some potential improvement but nothing serious

discojs/src/aggregator/aggregator.ts Outdated Show resolved Hide resolved
discojs/src/aggregator/aggregator.ts Outdated Show resolved Hide resolved
discojs/src/aggregator/aggregator.ts Outdated Show resolved Hide resolved
discojs/src/aggregator/aggregator.ts Outdated Show resolved Hide resolved
discojs/src/aggregator/mean.spec.ts Outdated Show resolved Hide resolved
discojs/src/training/disco.ts Outdated Show resolved Hide resolved
server/tests/e2e/decentralized.spec.ts Show resolved Hide resolved
server/tests/e2e/decentralized.spec.ts Outdated Show resolved Hide resolved
server/tests/e2e/decentralized.spec.ts Outdated Show resolved Hide resolved
webapp/src/components/training/Trainer.vue Outdated Show resolved Hide resolved
…ndParticipants and stop resetting the nodes list at the end of round
…rs" before waiting for participants if needed
…f the constructor and get rid of the static async server constructor
…regation' event to get the aggregated weights
Copy link
Collaborator

@tharvik tharvik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

superbe, thanks for the hard work!

@JulienVig JulienVig merged commit b95004b into develop Oct 14, 2024
23 checks passed
@JulienVig JulienVig deleted the 718-decentralized-julien branch October 14, 2024 09:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

New users are not able to join once training started
2 participants