
[FEA] Replace the parallel worlds with generated package names #11926

Open

gerashegalov opened this issue Jan 7, 2025 · 3 comments
Labels
feature request New feature or request

Comments

@gerashegalov
Collaborator

gerashegalov commented Jan 7, 2025

Is your feature request related to a problem? Please describe.
There are use cases that have been difficult to accommodate with our custom classloader because they rely on custom classloading too: typically any multi-tenant server, such as Connect #11902 or HiveThriftServer, and collaborative notebook environments such as Databricks.

Our standard answer to issues like this is to recommend building a custom single-shim jar for the target environment, without relying on parallel worlds: https://github.com/NVIDIA/spark-rapids/blob/branch-25.02/CONTRIBUTING.md#building-a-distribution-for-a-single-spark-release

Describe the solution you'd like

#11665 shows the direction for replacing the current JarURL-based classloader with explicitly generated package names.

So in the most naive form, we consider all classes in the sql-plugin module as requiring shimming for API or ABI compatibility reasons.

A class like com.nvidia.spark.rapids.GpuJsonTuple becomes something like spark351.com.nvidia.spark.rapids.GpuJsonTuple during the Maven generate-sources phase. If a class is loaded via reflection, such call sites also need to be processed to reference the shimified package name.
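As a rough illustration of what such a reflective call site could look like after the adjustment, here is a minimal Scala sketch; the prefix scheme and the `resolveShimClass` helper are assumptions for illustration, not the actual spark-rapids API:

```scala
// Hypothetical sketch: derive the generated package prefix from the Spark
// version and prepend it before handing the name to the classloader.
object ShimReflection {
  // e.g. "3.5.1" -> "spark351" (assumed naming scheme)
  private def shimPrefix(sparkVersion: String): String =
    "spark" + sparkVersion.replace(".", "")

  def resolveShimClass(originalName: String, sparkVersion: String): Class[_] = {
    val shimifiedName = s"${shimPrefix(sparkVersion)}.$originalName"
    Class.forName(shimifiedName)
  }
}

// A call site that previously did
//   Class.forName("com.nvidia.spark.rapids.GpuJsonTuple")
// would instead resolve the generated name, e.g.
//   ShimReflection.resolveShimClass("com.nvidia.spark.rapids.GpuJsonTuple", "3.5.1")
// yielding spark351.com.nvidia.spark.rapids.GpuJsonTuple.
```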

Our current approach is to presume that every class needs shimming; we then rely on binary-dedupe to catch classes where shimming is not actually required, to avoid jar bloat. This automatic deduplication will no longer be possible because the bytecode will never be bitwise-identical across shims, given the difference in package names.
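For context, binary-dedupe conceptually compares the compiled class file bytes across the per-shim directories and hoists bitwise-identical copies into a shared location. The Scala sketch below (directory layout and helper names are assumptions, not the actual binary-dedupe script) shows why generated package names defeat that comparison:

```scala
import java.nio.file.{Files, Path}
import java.security.MessageDigest

// Minimal sketch of the dedupe idea: hash each class file per shim
// directory; a class whose hash is identical across all shims can be
// kept once in a shared location instead of once per shim.
object DedupeSketch {
  private def sha256(p: Path): String =
    MessageDigest.getInstance("SHA-256")
      .digest(Files.readAllBytes(p))
      .map("%02x".format(_))
      .mkString

  def isShareable(relativeClassPath: String, shimDirs: Seq[Path]): Boolean = {
    val hashes = shimDirs.map(d => sha256(d.resolve(relativeClassPath)))
    hashes.distinct.size == 1 // identical bytes across every shim
  }
}

// With generated package names the bytecode embeds the prefix
// (spark351..., spark330..., ...) in its constant pool, so the hashes
// can never match, even for classes that are otherwise identical.
```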

Thus we should work on improving code discipline to reduce the surface of shimmable code. The current content of the spark-shared folder in the rapids-4-spark jar and https://github.com/NVIDIA/spark-rapids/blob/branch-25.02/docs/dev/shims.md#how-to-externalize-an-internal-class-as-a-compile-time-dependency should give a clue how to reduce the number of unnecessarily shimmed classes.
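As a rough sketch of that externalization pattern (all names here are illustrative, not the actual spark-rapids API): keep a version-agnostic interface in an unshimmed package, and confine Spark-version-specific code to a thin shimmed implementation behind it, so only the thin class gets a generated package name:

```scala
// Unshimmed, shared across all Spark versions: compiled once, no prefix.
package com.nvidia.spark.rapids.api {
  trait GpuJsonSupport {
    def evalJsonTuple(json: String, fields: Seq[String]): Seq[Option[String]]
  }
}

// Shimmed: only this thin class would be duplicated per Spark version,
// e.g. under a generated package like spark351.com.nvidia.spark.rapids.shims.
package com.nvidia.spark.rapids.shims {
  import com.nvidia.spark.rapids.api.GpuJsonSupport

  class GpuJsonSupportImpl extends GpuJsonSupport {
    // Version-specific logic would live here; stubbed for the sketch.
    def evalJsonTuple(json: String, fields: Seq[String]): Seq[Option[String]] =
      fields.map(_ => None)
  }
}
```

The more code migrates behind such interfaces, the fewer classes need a generated package name, which bounds the jar-size growth discussed below.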

Describe alternatives you've considered

  • Carry on as is
  • Release many single-shim artifacts instead of a single multi-shim artifact
@gerashegalov added the "? - Needs Triage" and "feature request" labels Jan 7, 2025
@mattahrens removed the "? - Needs Triage" label Jan 7, 2025
@gerashegalov
Collaborator Author

We can estimate the upper bound of the size of the future jar by disabling binary-dedupe; the real increase will be smaller.

The default buildall yields:

```
$ du -h dist/target/rapids-4-spark_2.12-24.12.0-cuda11.jar 
523M    dist/target/rapids-4-spark_2.12-24.12.0-cuda11.jar
```

buildall without binary-dedupe:

```
$ SKIP_BINARY_DEDUPE=1 mvn package -pl dist -PnoSnapshots
$ du -h dist/target/rapids-4-spark_2.12-24.12.0-cuda11.jar 
738M    dist/target/rapids-4-spark_2.12-24.12.0-cuda11.jar
```

@revans2
Collaborator

revans2 commented Jan 9, 2025

That is a 41% increase in size ((738 − 523) / 523 ≈ 0.41). That makes a choice like this much harder.

@gerashegalov
Collaborator Author

It is a very conservative upper bound, the best I could produce without going too far.
