Save the docker with Resnet50 running successfully and load it on another system with the same config but fail to run Resnet50 #2023

Open
Bob123Yang opened this issue Jan 9, 2025 · 1 comment

Comments

@Bob123Yang

Hi @arjunsuresh, I have run into an issue when migrating a Docker image.

I ran the command below on system A. It built the Docker image successfully, and the Resnet50 inference ran successfully inside the container. I then saved the image as docker-with-test-successfully-1.tar.

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=resnet50 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=edge \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda \
   --docker --quiet \
   --test_query_count=1000

I loaded the image on another system, B, which has almost the same configuration, and ran the Resnet50 inference again as shown below, but it failed with the following log. Is there any limitation on Docker image migration that I should be aware of?

bob@Bob-Tomcat-Product:~$ docker images
REPOSITORY                        TAG       IMAGE ID       CREATED      SIZE
docker-with-test-successfully-1   latest    2b63d4ccc258   9 days ago   35.5GB
bob@Bob-Tomcat-Product:~$ docker run -it docker-with-test-successfully-1:latest /bin/bash
cmuser@d37b940a1f0a:~$ ls
CM  cm-run-script-versions.json  configs  hardware  version_info.json
cmuser@d37b940a1f0a:~$    cm run script --tags=run-mlperf,inference,_r4.1-dev \
>    --model=resnet50 \
>    --implementation=nvidia \
>    --framework=tensorrt \
>    --category=edge \
>    --scenario=Offline \
>    --execution_mode=valid \
>    --device=cuda \
>    --division=closed \
>    --rerun \
>    --quiet
INFO:root:* cm run script "run-mlperf inference _r4.1-dev"
INFO:root:  * cm run script "detect os"
INFO:root:         ! cd /home/cmuser
INFO:root:         ! call /home/cmuser/CM/repos/mlcommons@cm4mlops/script/detect-os/run.sh from tmp-run.sh
INFO:root:         ! call "postprocess" from /home/cmuser/CM/repos/mlcommons@cm4mlops/script/detect-os/customize.py
INFO:root:  * cm run script "detect cpu"
INFO:root:    * cm run script "detect os"
INFO:root:           ! cd /home/cmuser
INFO:root:           ! call /home/cmuser/CM/repos/mlcommons@cm4mlops/script/detect-os/run.sh from tmp-run.sh
INFO:root:           ! call "postprocess" from /home/cmuser/CM/repos/mlcommons@cm4mlops/script/detect-os/customize.py
INFO:root:         ! cd /home/cmuser
INFO:root:         ! call /home/cmuser/CM/repos/mlcommons@cm4mlops/script/detect-cpu/run.sh from tmp-run.sh
INFO:root:         ! call "postprocess" from /home/cmuser/CM/repos/mlcommons@cm4mlops/script/detect-cpu/customize.py
INFO:root:  * cm run script "get python3"
INFO:root:       ! load /home/cmuser/CM/repos/local/cache/bba8cf8097b64518/cm-cached-state.json
INFO:root:Path to Python: /usr/bin/python3
INFO:root:Python version: 3.8.10
INFO:root:  * cm run script "get mlcommons inference src"
INFO:root:       ! load /home/cmuser/CM/repos/local/cache/21f79a83541549b7/cm-cached-state.json
INFO:root:  * cm run script "get sut description"
INFO:root:    * cm run script "detect os"
INFO:root:           ! cd /home/cmuser
INFO:root:           ! call /home/cmuser/CM/repos/mlcommons@cm4mlops/script/detect-os/run.sh from tmp-run.sh
INFO:root:           ! call "postprocess" from /home/cmuser/CM/repos/mlcommons@cm4mlops/script/detect-os/customize.py
INFO:root:    * cm run script "detect cpu"
INFO:root:      * cm run script "detect os"
INFO:root:             ! cd /home/cmuser
INFO:root:             ! call /home/cmuser/CM/repos/mlcommons@cm4mlops/script/detect-os/run.sh from tmp-run.sh
INFO:root:             ! call "postprocess" from /home/cmuser/CM/repos/mlcommons@cm4mlops/script/detect-os/customize.py
INFO:root:           ! cd /home/cmuser
INFO:root:           ! call /home/cmuser/CM/repos/mlcommons@cm4mlops/script/detect-cpu/run.sh from tmp-run.sh
INFO:root:           ! call "postprocess" from /home/cmuser/CM/repos/mlcommons@cm4mlops/script/detect-cpu/customize.py
INFO:root:    * cm run script "get python3"
INFO:root:         ! load /home/cmuser/CM/repos/local/cache/bba8cf8097b64518/cm-cached-state.json
INFO:root:Path to Python: /usr/bin/python3
INFO:root:Python version: 3.8.10
INFO:root:    * cm run script "get compiler"
INFO:root:         ! load /home/cmuser/CM/repos/local/cache/6285b87ff0f74d8a/cm-cached-state.json
INFO:root:    * cm run script "get cuda-devices _with-pycuda"
INFO:root:      * cm run script "get cuda _toolkit"
INFO:root:           ! load /home/cmuser/CM/repos/local/cache/b5a3a8af88c14cc7/cm-cached-state.json
INFO:root:ENV[CM_CUDA_PATH_LIB_CUDNN_EXISTS]: no
INFO:root:ENV[CM_CUDA_VERSION]: 12.2
INFO:root:ENV[CM_CUDA_VERSION_STRING]: cu122
INFO:root:ENV[CM_NVCC_BIN_WITH_PATH]: /usr/local/cuda/bin/nvcc
INFO:root:ENV[CUDA_HOME]: /usr/local/cuda
INFO:root:      * cm run script "get python3"
INFO:root:           ! load /home/cmuser/CM/repos/local/cache/bba8cf8097b64518/cm-cached-state.json
INFO:root:Path to Python: /usr/bin/python3
INFO:root:Python version: 3.8.10
INFO:root:      * cm run script "get generic-python-lib _package.pycuda"
INFO:root:        * cm run script "get python3"
INFO:root:             ! load /home/cmuser/CM/repos/local/cache/bba8cf8097b64518/cm-cached-state.json
INFO:root:Path to Python: /usr/bin/python3
INFO:root:Python version: 3.8.10
INFO:root:             ! cd /home/cmuser
INFO:root:             ! call /home/cmuser/CM/repos/mlcommons@cm4mlops/script/get-generic-python-lib/validate_cache.sh from tmp-run.sh
INFO:root:             ! call "detect_version" from /home/cmuser/CM/repos/mlcommons@cm4mlops/script/get-generic-python-lib/customize.py
            Detected version: 2022.2.2
INFO:root:        * cm run script "get python3"
INFO:root:             ! load /home/cmuser/CM/repos/local/cache/bba8cf8097b64518/cm-cached-state.json
INFO:root:Path to Python: /usr/bin/python3
INFO:root:Python version: 3.8.10
INFO:root:           ! load /home/cmuser/CM/repos/local/cache/a29ea6efe3564a4b/cm-cached-state.json
INFO:root:      * cm run script "get generic-python-lib _package.numpy"
INFO:root:        * cm run script "get python3"
INFO:root:             ! load /home/cmuser/CM/repos/local/cache/bba8cf8097b64518/cm-cached-state.json
INFO:root:Path to Python: /usr/bin/python3
INFO:root:Python version: 3.8.10
INFO:root:             ! cd /home/cmuser
INFO:root:             ! call /home/cmuser/CM/repos/mlcommons@cm4mlops/script/get-generic-python-lib/validate_cache.sh from tmp-run.sh
INFO:root:             ! call "detect_version" from /home/cmuser/CM/repos/mlcommons@cm4mlops/script/get-generic-python-lib/customize.py
            Detected version: 1.23.5
INFO:root:        * cm run script "get python3"
INFO:root:             ! load /home/cmuser/CM/repos/local/cache/bba8cf8097b64518/cm-cached-state.json
INFO:root:Path to Python: /usr/bin/python3
INFO:root:Python version: 3.8.10
INFO:root:           ! load /home/cmuser/CM/repos/local/cache/19ca7b3b57a74cd2/cm-cached-state.json
INFO:root:           ! cd /home/cmuser
INFO:root:           ! call /home/cmuser/CM/repos/mlcommons@cm4mlops/script/get-cuda-devices/detect.sh from tmp-run.sh
Traceback (most recent call last):
  File "/home/cmuser/CM/repos/mlcommons@cm4mlops/script/get-cuda-devices/detect.py", line 1, in <module>
    import pycuda.driver as cuda
  File "/home/cmuser/.local/lib/python3.8/site-packages/pycuda/driver.py", line 66, in <module>
    from pycuda._driver import *  # noqa
ImportError: /lib/x86_64-linux-gnu/libcuda.so.1: file too short

CM error: Portable CM script failed (name = get-cuda-devices, return code = 256)


^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note that it is often a portability issue of a third-party tool or a native script
wrapped and unified by this CM script (automation recipe). Please re-run
this script with --repro flag and report this issue with the original
command line, cm-repro directory and full log here:

https://github.com/mlcommons/cm4mlops/issues

The CM concept is to collaboratively fix such issues inside portable CM scripts
to make existing tools and native scripts more portable, interoperable
and deterministic. Thank you!
cmuser@d37b940a1f0a:~$ 
@arjunsuresh
Contributor

arjunsuresh commented Jan 9, 2025

@Bob123Yang Was porting the Docker image working for you before? We normally don't test this, and we recommend launching the Docker image only through the CM command. The reason is that any required options (such as --gpus=all for Nvidia) and mounts (for models, datasets, results, etc.) are added to the docker run command only by CM. Also, the Docker build for the Nvidia container now takes only about 15-20 minutes, because a prebuilt PyTorch whl is used.
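
For context, the "file too short" message in the traceback is what the dynamic loader prints when a shared library file is empty or truncated. One possible explanation, not confirmed in this thread, is that the container was started without GPU options, so /lib/x86_64-linux-gnu/libcuda.so.1 is only a placeholder rather than the host driver library. The sketch below is one way to check this inside the container; the function name check_libcuda is made up for illustration, and the manual docker run line in the comments is an assumption about how the image could be started by hand, not the command CM generates.

```shell
#!/usr/bin/env bash
# Report whether a libcuda.so.1 candidate is a real file or an empty placeholder.
# The default path is the one shown in the traceback above.
check_libcuda() {
    local lib="${1:-/lib/x86_64-linux-gnu/libcuda.so.1}"
    if [ ! -e "$lib" ]; then
        echo "missing: $lib"
        return 2
    elif [ ! -s "$lib" ]; then
        # Zero-length file: the loader would fail with "file too short".
        echo "empty: $lib"
        return 1
    fi
    echo "ok: $lib"
}

# Hypothetical manual launch with GPU access (assumes the NVIDIA Container
# Toolkit is installed on the host):
#   docker run -it --gpus all docker-with-test-successfully-1:latest /bin/bash
# Then, inside the container:
#   check_libcuda
```

If the check prints "empty" in a container started without GPU options and "ok" in one started with them, that would point to the launch flags rather than the image contents. Launching through the CM command remains the recommended route, since it also adds the required mounts.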

Another option is to copy the Docker image to the other system under the same name; the CM command should then automatically pick it up instead of recreating it.
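
A minimal sketch of copying an image under the same name, using the standard docker save and docker load commands. The IMAGE and ARCHIVE values are placeholders, not names taken from this thread: substitute the repository:tag that docker images shows on system A for the container CM built. The function names are also made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: move an image between hosts without changing its repository:tag.
# Both defaults are hypothetical placeholders; override them before use.
IMAGE="${IMAGE:-example/cm-image:latest}"
ARCHIVE="${ARCHIVE:-cm-image.tar}"

# Run on system A: write the image, including its name and tag, to a tar archive.
export_image() {
    docker save -o "$ARCHIVE" "$IMAGE"
}

# Run on system B after copying the archive over: restore the image and
# confirm it is present under the same name by printing its ID.
import_image() {
    docker load -i "$ARCHIVE"
    docker image inspect --format '{{.Id}}' "$IMAGE"
}
```

After import_image succeeds on system B, rerunning the original cm run script command with --docker there would let CM add the GPU options and mounts itself, provided the image name matches what CM expects.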
