Does end_on_device make sense? #213

Open
DilipSequeira opened this issue Apr 21, 2021 · 14 comments

@DilipSequeira
Contributor

DilipSequeira commented Apr 21, 2021

The rationale for start_from_device is that submissions should not need to incur the overhead of transfer from system DRAM if there is a mechanism whereby network inputs can be delivered directly into accelerator memory.

Is end_on_device symmetric in this regard - e.g. submitters should not have to incur an overhead for transfer to system DRAM if the accelerator has the equivalent outbound capability?
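For concreteness, the overhead in question is the staging copy from system DRAM into accelerator memory. A minimal sketch, assuming a PyTorch/CUDA system (illustrative only, not taken from any MLPerf harness):

```python
# Illustrative only: the host->device transfer that start_from_device lets a
# submitter skip when the NIC can DMA network input straight into accelerator
# memory. Assumes PyTorch with a CUDA device; not from any MLPerf harness.
import torch

# Input that has landed in (pinned) system DRAM, e.g. copied out of a socket buffer.
batch = torch.empty(8, 3, 224, 224, pin_memory=True)

# Baseline path: an explicit PCIe copy must complete before inference can start.
device_batch = batch.to("cuda", non_blocking=True)
torch.cuda.synchronize()  # this copy is part of the measured latency here
```

The end_on_device question is whether the symmetric copy of the output back to system DRAM should likewise be optional when the accelerator can push results out directly.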

@tjablin opinion?

@tjablin
Collaborator

tjablin commented Apr 21, 2021

Thinking about real applications, start_on_device makes sense because you can imagine streaming images or text or other inputs directly from the network in a real application. There's some subtlety in that a real application would probably stream compressed images, and decompression might have to use the CPU. For end_on_device to make sense, there would have to be real applications where the output of an inference streams directly out to the network, but for most real applications, inference is not the last step in a pipeline before sending data back to a user.

@DilipSequeira
Contributor Author

I agree inference is rarely the last pipeline step. However, if your accelerator is a general-purpose programmable device, it's realistic for it to run post-processing too - for example, in the current proposal for 3D UNet, recombining the overlapping 128x128x128 tiles into a full segmented image would best be done on the accelerator. (This is mainly an issue for segmentation workloads, since those are the ones with large data outputs.) And then that combined image might indeed go straight to the network.

@tjablin
Collaborator

tjablin commented Apr 21, 2021

For 3D-UNet, didn't we agree that the Server scenario made no sense, which is why it is Offline only? I think the start_on_device rule is getting unwieldy to enforce. We should just move to injecting queries over the network; then submitters that implement NIC-to-accelerator DMA will be able to measure the benefit directly.

@DilipSequeira
Contributor Author

The timeline for getting that into 1.1 seems quite short, given there's no proposal yet.

@DilipSequeira
Contributor Author

And regarding 3DUNet not being in server... that's correct, but latency is still relevant for 3DUNet in Edge Single Stream.

@tjablin
Collaborator

tjablin commented Apr 22, 2021

Is this issue 3D-UNet specific?

@DilipSequeira
Contributor Author

It's significant only for benchmarks where the output size is large. Today, that's only segmentation.

@tjablin
Collaborator

tjablin commented Apr 24, 2021

Can we get an opinion from the medical advisory board?

@DilipSequeira
Contributor Author

DilipSequeira commented Apr 24, 2021

I'm sure we can, but what are we looking for, and how would we act on it?

MLPerf has, historically, set some fairly arbitrary bounds on the timed portion of the benchmark. One thing we could meaningfully ask is for them to suggest what should be timed, and then address the question "what does the post-processing after the timed portion look like for this model?"

Then there are three cases:

  1. There is no post-processing; the answer goes straight to the network or storage.
  2. There is post-processing that cannot reasonably be done on the accelerator.
  3. There is, at least sometimes, post-processing that can be done on the accelerator.

In case (1), end_on_device can use the same rules as start_from_device. Case (2) is straightforwardly "no". Case (3), which I expect is going to be true in at least some use cases, requires us to make rules to determine whether an accelerator can do the post-processing. Given that the biggest difficulty we struggled with in 1.0 was the tension between rules that are simple to arbitrate and rules that don't force costs on submitters that they wouldn't incur in production, this doesn't seem like it will help.

How else could we frame the question to the board?

@tjablin
Collaborator

tjablin commented Apr 26, 2021

"requires us to make rules to determine whether an accelerator can do the post-processing"

Do we need to make a rule? We should just add the post-processing to the benchmark. Then submitters can implement the post-processing on the host or device depending on the capabilities of their system.

@DilipSequeira
Contributor Author

DilipSequeira commented Apr 26, 2021

That would be my preference regardless of this question. If we do that, does that mean we should assume the answer is (1) above?

@tjablin
Collaborator

tjablin commented Apr 26, 2021

I think we should ask the Medical Advisory Board:

  1. What does the post-processing after the timed portion look like for this model?
  2. What typically happens to the inference results for 3D-UNet? Are they sent to the screen, network, storage, or somewhere else?

My current thinking is that end_on_host is probably appropriate for 3D-UNet if we add timed post-processing, but I would like to have confirmation from an expert. Unlike most of the other applications in MLPerf, there's not a good 3D-UNet analogue at Google, so I am very reluctant to make changes without consulting an expert.

@alexkarargyris

alexkarargyris commented May 6, 2021

For clarification purposes, I want to share here (thanks @pabloribalta) what the current reference implementation for 3D-UNet in the Training Benchmark does:

  1. Get a scan (3D image) at certain resolution
  2. Resample to a common voxel spacing
  3. Pad every volume so it is equal to or larger than 128
  4. Crop volumes so they are divisible by 64
  5. If a given edge length modulo 64 is larger than 32, it is constant-padded; if it is less than 32, it is cropped
  6. Split volumes: a 128x128x128 window slides over the pre-processed volume with a stride of 64, i.e. an overlap of 0.5
  7. Predict
  8. Stitch and produce the final segmentation:
    a. Each result is multiplied by a patch normalizing matrix - a Gaussian kernel
    b. The results are stacked by adding them together
    c. A global normalizing matrix is obtained by stacking the patch normalizing matrices
    d. At the end, the result is divided by the global normalizing matrix
    e. To obtain the final labels, an argmax is used

Notes:

  • Steps 1-5 are preprocessing steps that need to take place. Think of these steps as data preparation (e.g. normalization and size adjustment).
  • Step 6 is where the processed images (after step 5) are split into tiles so they can be fed into the model. Why is it 128x128x128? Because this is the size that gave the best accuracy when training the model.
  • Step 7 is where the model makes a prediction on each tile (i.e. 128x128x128).
  • Step 8 is the post-processing step: the tiles need to be stitched together to produce the original-size segmentation (see the sketch at the end of this comment).

We believe that steps 1-5 should not count against benchmark time, and steps 6-8 could be left to the submitters to optimize (i.e. change the tile size) as long as they meet or exceed the expected accuracy. Indeed, step 8 can probably take place on an accelerator. However, it is my understanding that the Inference closed division doesn't allow hyperparameter changes, so submitters have to go with 128x128x128. Is this true?

@tjablin the resulting stitched output may be displayed on the screen, sent to the network, or stored for later viewing.
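To make steps 6-8 concrete, here is a minimal NumPy sketch of the sliding-window split and Gaussian-weighted stitching described above. The shapes, stride, sigma, and the `predict` callable are illustrative assumptions, not the reference code:

```python
# Minimal sketch of steps 6-8 above: sliding-window split plus Gaussian-weighted
# stitching. Illustrative only; the actual reference code lives in the MLPerf repos.
import numpy as np

ROI, STRIDE = 128, 64  # 128x128x128 tiles with a stride of 64, i.e. 0.5 overlap

def gaussian_kernel(size=ROI, sigma=0.125):
    # Patch normalizing matrix: a separable 3D Gaussian peaking at the tile centre.
    coords = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(coords ** 2) / (2.0 * (sigma * size) ** 2))
    return g[:, None, None] * g[None, :, None] * g[None, None, :]

def sliding_window_inference(volume, predict, num_classes):
    # volume: (D, H, W) pre-processed scan whose edges are multiples of 64 (steps 1-5).
    # predict: callable returning (num_classes, ROI, ROI, ROI) logits for one tile (step 7).
    kernel = gaussian_kernel()
    result = np.zeros((num_classes,) + volume.shape, dtype=np.float32)
    norm = np.zeros(volume.shape, dtype=np.float32)
    d, h, w = volume.shape
    for z in range(0, d - ROI + 1, STRIDE):          # step 6: slide the window
        for y in range(0, h - ROI + 1, STRIDE):
            for x in range(0, w - ROI + 1, STRIDE):
                tile = volume[z:z+ROI, y:y+ROI, x:x+ROI]
                logits = predict(tile)
                result[:, z:z+ROI, y:y+ROI, x:x+ROI] += logits * kernel  # 8a/8b
                norm[z:z+ROI, y:y+ROI, x:x+ROI] += kernel                # 8c
    result /= norm                # 8d: divide by the global normalizing matrix
    return result.argmax(axis=0)  # 8e: final labels
```

Everything from the `predict` call through the final argmax is the portion that could plausibly run entirely on the accelerator, which is what makes end_on_device interesting for this workload.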

@DilipSequeira
Contributor Author

The hyperparameter question is somewhat off-topic here: I've opened a new issue #216

DilipSequeira added a commit to DilipSequeira/inference_policies that referenced this issue May 11, 2021
Update rules to allow end-on-device for 3DUNet as per discussion in mlcommons#213.