
[FEA] Replace the parallel worlds with generated package names #11926

Open

gerashegalov opened this issue Jan 7, 2025 · 3 comments
Labels
feature request New feature or request

Comments

@gerashegalov
Collaborator

gerashegalov commented Jan 7, 2025

Is your feature request related to a problem? Please describe.
There are use cases that have been difficult to accommodate with our custom classloader because they rely on custom classloading too: typically any multi-tenant server, such as Connect #11902 or HiveThriftServer, and collaborative notebook environments such as Databricks.

Our standard answer to issues like this is to recommend building a custom single-shim jar for the target environment, without relying on parallel worlds: https://github.com/NVIDIA/spark-rapids/blob/branch-25.02/CONTRIBUTING.md#building-a-distribution-for-a-single-spark-release

Describe the solution you'd like

#11665 shows the direction for replacing the current JarURL-based classloader with explicitly generated package names.

So in the most naive form, we consider all classes in the sql-plugin module as requiring shimming for API or ABI compatibility reasons.

A class like com.nvidia.spark.rapids.GpuJsonTuple becomes something like spark351.com.nvidia.spark.rapids.GpuJsonTuple during the Maven generate-sources phase. If a class is loaded via reflection, such call sites also need to be processed to reference the shimified package name.
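As a rough illustration of what such a reflective call site could look like after the adjustment, here is a minimal Scala sketch; the prefix scheme and the `resolveShimClass` helper are assumptions for illustration, not the actual spark-rapids API:

```scala
// Hypothetical sketch: derive the generated package prefix from the Spark
// version and prepend it before handing the name to the classloader.
object ShimReflection {
  // e.g. "3.5.1" -> "spark351" (assumed naming scheme)
  private def shimPrefix(sparkVersion: String): String =
    "spark" + sparkVersion.replace(".", "")

  def resolveShimClass(originalName: String, sparkVersion: String): Class[_] = {
    val shimifiedName = s"${shimPrefix(sparkVersion)}.$originalName"
    Class.forName(shimifiedName)
  }
}

// A call site that previously did
//   Class.forName("com.nvidia.spark.rapids.GpuJsonTuple")
// would instead resolve the generated name, e.g.
//   ShimReflection.resolveShimClass("com.nvidia.spark.rapids.GpuJsonTuple", "3.5.1")
// yielding spark351.com.nvidia.spark.rapids.GpuJsonTuple.
```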

Our current approach is to presume that every class needs shimming; we then rely on binary-dedupe to catch classes where shimming is not actually required, to avoid jar bloat. This automatic deduplication will no longer be possible because the bytecode will never be bitwise-identical across shims, given the difference in package names.
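For context, binary-dedupe conceptually compares the compiled class file bytes across the per-shim directories and hoists bitwise-identical copies into a shared location. The Scala sketch below (directory layout and helper names are assumptions, not the actual binary-dedupe script) shows why generated package names defeat that comparison:

```scala
import java.nio.file.{Files, Path}
import java.security.MessageDigest

// Minimal sketch of the dedupe idea: hash each class file per shim
// directory; a class whose hash is identical across all shims can be
// kept once in a shared location instead of once per shim.
object DedupeSketch {
  private def sha256(p: Path): String =
    MessageDigest.getInstance("SHA-256")
      .digest(Files.readAllBytes(p))
      .map("%02x".format(_))
      .mkString

  def isShareable(relativeClassPath: String, shimDirs: Seq[Path]): Boolean = {
    val hashes = shimDirs.map(d => sha256(d.resolve(relativeClassPath)))
    hashes.distinct.size == 1 // identical bytes across every shim
  }
}

// With generated package names the bytecode embeds the prefix
// (spark351..., spark330..., ...) in its constant pool, so the hashes
// can never match, even for classes that are otherwise identical.
```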

Thus we should work on improving code discipline to reduce the surface of shimmable code. The current content of the spark-shared folder in the rapids-4-spark jar and https://github.com/NVIDIA/spark-rapids/blob/branch-25.02/docs/dev/shims.md#how-to-externalize-an-internal-class-as-a-compile-time-dependency should give a clue how to reduce the number of unnecessarily shimmed classes.
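As a rough sketch of that externalization pattern (all names here are illustrative, not the actual spark-rapids API): keep a version-agnostic interface in an unshimmed package, and confine Spark-version-specific code to a thin shimmed implementation behind it, so only the thin class gets a generated package name:

```scala
// Unshimmed, shared across all Spark versions: compiled once, no prefix.
package com.nvidia.spark.rapids.api {
  trait GpuJsonSupport {
    def evalJsonTuple(json: String, fields: Seq[String]): Seq[Option[String]]
  }
}

// Shimmed: only this thin class would be duplicated per Spark version,
// e.g. under a generated package like spark351.com.nvidia.spark.rapids.shims.
package com.nvidia.spark.rapids.shims {
  import com.nvidia.spark.rapids.api.GpuJsonSupport

  class GpuJsonSupportImpl extends GpuJsonSupport {
    // Version-specific logic would live here; stubbed for the sketch.
    def evalJsonTuple(json: String, fields: Seq[String]): Seq[Option[String]] =
      fields.map(_ => None)
  }
}
```

The more code migrates behind such interfaces, the fewer classes need a generated package name, which bounds the jar-size growth discussed below.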

Describe alternatives you've considered

  • Carry on as is
  • Release many single-shim artifacts instead of a single multi-shim artifact
@gerashegalov added the "? - Needs Triage" and "feature request" labels Jan 7, 2025
@mattahrens removed the "? - Needs Triage" label Jan 7, 2025
@gerashegalov
Collaborator Author

We can estimate the upper bound of the size of the future jar by disabling binary-dedupe; the real increase will be smaller.

The default buildall yields:

```
$ du -h dist/target/rapids-4-spark_2.12-24.12.0-cuda11.jar 
523M    dist/target/rapids-4-spark_2.12-24.12.0-cuda11.jar
```

buildall without binary-dedupe:

```
$ SKIP_BINARY_DEDUPE=1 mvn package -pl dist -PnoSnapshots
$ du -h dist/target/rapids-4-spark_2.12-24.12.0-cuda11.jar 
738M    dist/target/rapids-4-spark_2.12-24.12.0-cuda11.jar
```

@revans2
Collaborator

revans2 commented Jan 9, 2025

That is a 41% increase in size ((738 − 523) / 523 ≈ 0.41). That makes a choice like this much harder.

@gerashegalov
Collaborator Author

It is a very conservative upper bound, the best I could produce without going too far.
