RAPIDS accelerated UDF examples build environment does not match spark-rapids-jni environment #362

Open
jlowe opened this issue Feb 7, 2024 · 9 comments

@jlowe
Contributor

jlowe commented Feb 7, 2024

The Dockerfile used for the RAPIDS accelerated native UDF example build environment uses Ubuntu 18.04, but the build environment used by spark-rapids-jni for the libcudf.so that is placed in the RAPIDS Accelerator jar uses CentOS 7 with devtoolset. That means the UDF code and libcudf.so could end up on opposite sides of the GCC CXX11 ABI, which leads to symbol lookup failures at runtime when loading the native UDF shared library, e.g.:

/usr/lib/jvm/java-1.8.0-openjdk-amd64/bin/java: symbol lookup error: /tmp/nativeudfjni8442648179436293266.so: undefined symbol: _ZN4cudf13string_scalarC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEbN3rmm16cuda_stream_viewEPNS9_2mr22device_memory_resourceE

which when run through cu++filt shows this is a failure to find:

cudf::string_scalar::string_scalar(const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> &, bool, rmm::cuda_stream_view, rmm::mr::device_memory_resource *)

The Dockerfile used by the examples should be using the same setup as spark-rapids-jni to avoid this. We should also add a RAPIDS Accelerated native UDF that uses a string_scalar with a std::string argument to help catch this ABI mismatch in the future.
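
As a rough illustration of how this kind of mismatch could be spotted before runtime, the sketch below scans symbol listings for the `std::__cxx11` inline namespace, which libstdc++ uses for the new-ABI `std::string` and which mangles to `St7__cxx11`. The function names are hypothetical and not part of either repository, and the input is assumed to be text in the format printed by GNU `nm -D --undefined-only`.

```python
from typing import List

# Itanium mangling of the inline namespace std::__cxx11 (new libstdc++ ABI).
CXX11_TAG = "St7__cxx11"


def uses_cxx11_abi(mangled_symbol: str) -> bool:
    """Return True if the mangled name refers to a std::__cxx11 type."""
    return CXX11_TAG in mangled_symbol


def undefined_symbols(nm_output: str) -> List[str]:
    """Extract undefined symbol names from nm-style text (assumed format)."""
    symbols = []
    for line in nm_output.splitlines():
        fields = line.split()
        # Undefined entries look like "U name" with no address column.
        if len(fields) >= 2 and fields[-2] == "U":
            # Drop any symbol version suffix such as "@GLIBC_2.2.5".
            symbols.append(fields[-1].split("@")[0])
    return symbols


def cxx11_abi_undefined(nm_output: str) -> List[str]:
    """Undefined symbols that require a new-ABI provider at load time."""
    return [s for s in undefined_symbols(nm_output) if uses_cxx11_abi(s)]
```

If the native UDF library lists any such undefined symbols while libcudf.so was built with the old ABI, the loader fails in the way shown above. The output of `nm -D --undefined-only` on the UDF shared library could be passed to `cxx11_abi_undefined` as a quick check in CI.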

@sameerz
Collaborator

sameerz commented Feb 15, 2024

RAPIDS may drop support for CentOS 7 in the upcoming release, and has Ubuntu 20.04 as its minimum required version (https://docs.rapids.ai/install#system-req). Does that change what we need to do here?

Or do we still need to ensure the Dockerfile used by the examples is using the same setup as spark-rapids-jni, and we need to update the spark-rapids-jni setup to account for the new minimum OS versions?

Ref: https://endoflife.software/operating-systems/linux/red-hat-enterprise-linux-rhel

@jlowe
Contributor Author

jlowe commented Feb 15, 2024

> Or do we still need to ensure the Dockerfile used by the examples is using the same setup as spark-rapids-jni, and we need to update the spark-rapids-jni setup to account for the new minimum OS versions?

This. The bottom line is that the examples need to build in the same environment as spark-rapids-jni, regardless of what that environment actually is. Note that we still want to build spark-rapids-jni in a way that allows a single binary to run on all supported OSes, and I doubt we can satisfy that requirement by simply building with Ubuntu 20.04's default toolchain.

@GaryShen2008
Collaborator

If we plan to update the spark-rapids-jni build setup in 24.04, we can address this issue after changing spark-rapids-jni.
Have we already decided to drop CentOS 7 in 24.04? If so, let's file another issue in spark-rapids-jni to decide which environment to use for compiling, so that a single binary can run on all supported OSes.

@sameerz
Collaborator

sameerz commented Feb 16, 2024

> Have we already decided to drop centos 7 in 24.04? If so, let's file another issue in spark-rapids-jni to decide which environment to be used for compiling to meet the requirement of single binary to run on all supported OS.

It looks like RAPIDS will deprecate CentOS7 in 24.04 and stop support in 24.06, per rapidsai/docs#475

For 24.04 we should make sure the Dockerfile used for the examples matches the one used for spark-rapids-jni (CentOS 7 + devtoolset).

In parallel we should figure out what our minimum toolchain will be so we are ready in 24.06.

@GaryShen2008
Collaborator

Hi @YanxuanLiu, is it possible to build the UDF examples with the same Dockerfile that the JNI uses?

@YanxuanLiu
Collaborator

> Hi @YanxuanLiu, is it possible to use the same docker file to build UDF example as the JNI?

Sorry, but I think @NvTimLiu could help with this. I haven't dealt with this issue.

GaryShen2008 assigned NvTimLiu and unassigned YanxuanLiu on Aug 8, 2024
@GaryShen2008
Collaborator

Hi @NvTimLiu, can you check whether it's possible to build the UDF examples with the same Docker image that the JNI uses?

@NvTimLiu
Collaborator

NvTimLiu commented Aug 14, 2024

It's good for CI to use the same Docker image as the RAPIDS JNI to build the UDF examples.

We have a dedicated Dockerfile for building the UDF examples: https://github.com/NVIDIA/spark-rapids-examples/blob/main/examples/UDF-Examples/RAPIDS-accelerated-UDFs/Dockerfile

Shall we remove it and document that we build the UDF examples with the RAPIDS JNI Docker image?

@NvTimLiu
Collaborator

Discussed with Gary: we'll use the same Docker image in the CI job and document the link to the Dockerfile in JNI.

I'll handle it.
