
Assembly shade rules break serialization/deserialization with Dataset and DataFrame #279

Closed
oroundtree opened this issue Aug 23, 2022 · 8 comments


@oroundtree

I've been working on an issue for a while now where certain features of sparksql-scalapb don't work correctly, mostly related to encoders. Creating a DataFrame or Dataset of serialized protobuf data fails with the following error:
Unable to find encoder for type Array[Byte]. An implicit Encoder[Array[Byte]] is needed to store Array[Byte] instances in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
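
For context, a minimal sketch of the kind of code that hits this (Person stands in for one of my generated messages; the import is sparksql-scalapb's standard one):

    import org.apache.spark.sql.{Dataset, SparkSession}
    import scalapb.spark.Implicits._ // sparksql-scalapb's encoder derivation

    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // Person is a ScalaPB-generated case class. Creating a Dataset of it
    // triggers the shapeless-based encoder derivation; with the shaded
    // assembly on the classpath, this is where the error above appears.
    val ds: Dataset[Person] = spark.createDataset(Seq(Person(name = "test")))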

My scalatests for serialization and deserialization pass when they are run in the same project that the protobuf messages are in, using the compiled code. However, they fail if I'm using the assembled jar, unless I remove the following shade rule from build.sbt:
ShadeRule.rename("shapeless.**" -> "shadeshapeless.@1").inAll
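
For reference, that rule sits in my assembly settings roughly the way the sparksql-scalapb docs suggest (a sketch; the other rules don't seem to be involved):

    // build.sbt (sketch): shading applied by sbt-assembly when building the
    // fat jar; removing only the shapeless rule is what makes the tests pass.
    assembly / assemblyShadeRules := Seq(
      ShadeRule.rename("com.google.protobuf.**" -> "shadeproto.@1").inAll,
      ShadeRule.rename("scala.collection.compat.**" -> "scalacompat.@1").inAll,
      ShadeRule.rename("shapeless.**" -> "shadeshapeless.@1").inAll
    )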

I've also tested this and found the same results when running a class without scalatest dependencies.

I haven't yet seen any issues from removing the above shade rule, but I'm also not sure why it is there and what the implications of removing it are...

@thesamet
Contributor

Hi @oroundtree, thanks for reporting. It indeed sounds very strange that the presence of the shading rule creates a problem. Can you provide a minimal example to reproduce this, including instructions? You can start by forking https://github.com/thesamet/sparksql-scalapb-test

@oroundtree
Author

oroundtree commented Aug 23, 2022

Here you go: https://github.com/oroundtree/sparksql-scalapb-test-oroundtree

Master has the shade rule and the tests present, and the no-shaderule branch has the shade rule removed and the version number bumped so I can test adding the two jars as unmanaged dependencies separately. Here are the steps I follow to reproduce it:

  1. Pull master from https://github.com/oroundtree/sparksql-scalapb-test-oroundtree
  2. Build the jar using sbt assembly (note that all the tests pass in the process)
  3. Pull master from the project we'll use to test pulling the jar as a dependency (https://github.com/oroundtree/sparksql-scalapb-import-oroundtree)
  4. Place sparksql-scalapb-test-oroundtree-assembly-1.0.0.jar into the /lib folder in the project
  5. Run sbt test
  6. You'll get an encoder not found error and the tests will not compile

After that, you can give the non-shaded jar a try using the same steps as above, except:

  1. Use the no-shaderule branch of https://github.com/oroundtree/sparksql-scalapb-test-oroundtree
  2. Assemble the jar and replace sparksql-scalapb-test-oroundtree-assembly-1.0.0.jar with sparksql-scalapb-test-oroundtree-assembly-1.0.1.jar
  3. Run sbt test
  4. The tests should run and pass

EDIT: Also worth noting that I get the same results in both cases if I'm pulling the jar as a managed dependency in sbt or maven (i.e. from a private maven repository).
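
A sketch of the managed variant, for completeness (the resolver URL and coordinates below are made up):

    // build.sbt sketch: pulling the assembled jar from a private maven
    // repository instead of dropping it into lib/ (coordinates hypothetical)
    resolvers += "private-repo" at "https://repo.example.com/maven"
    libraryDependencies +=
      "com.example" % "sparksql-scalapb-test-oroundtree" % "1.0.0"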

EDIT x2: If you are using IntelliJ IDEA, the IDE may complain that the imports from your unmanaged sbt dependency are not found. You can safely ignore the syntax highlighting.

@thesamet
Contributor

Thanks, I quickly read through. For step 3, can you provide that "another project" as well, and make the edits in your message above, just so the issue is self-contained?

@oroundtree
Author

I've updated the steps with the small example project and more exact steps on how to reproduce the error. Hope it helps!

@thesamet
Contributor

Thanks for providing the detailed example. I was able to follow the instructions and see the issue. The example gets us into a situation that's a little tricky to reason about: the assembled jar brings in a shaded copy of shapeless, while the parent project brings in another, unshaded copy. I think it was unintended, but the shaded jar also brings in scalatest. The practice I want to encourage is to perform the assembly and shading as the final packaging step, just before the jar is shipped to a Spark cluster (see the sketch after the questions below).

  1. Is it possible to reproduce this problem with the assembled jar causing the failure directly when submitted to Spark? (I haven't tried)
  2. Is there a reason you need this specific setup to work (by that I mean having an assembled jar used as a dependency)?
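
To make the suggested layout concrete, here's a rough build.sbt sketch (project names and the version are illustrative): the proto library is published unshaded, and only the final Spark job is assembled and shaded:

    // Intermediate library with the generated proto code: published unshaded,
    // so every downstream project resolves one ordinary copy of shapeless.
    lazy val protos = (project in file("protos"))
      .settings(
        libraryDependencies +=
          "com.thesamet.scalapb" %% "sparksql-scalapb" % "1.0.1"
      )

    // Deployable Spark job: depends on protos normally; assembly and shading
    // happen only here, as the last step before spark-submit.
    lazy val sparkJob = (project in file("spark-job"))
      .dependsOn(protos)
      .settings(
        assembly / assemblyShadeRules := Seq(
          ShadeRule.rename("shapeless.**" -> "shadeshapeless.@1").inAll
        )
      )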

@oroundtree
Author

  1. I was able to confirm that including the serialization/deserialization code in the demo and then submitting the proto jar directly to a local cluster using spark-submit works.
  2. Basically I've got a complex project with lots of proto definitions, including gRPC services. These are kept in a repo which is automatically assembled and pushed to an artifact repository whenever changes are made, which ensures that all the projects importing and using these proto definitions are working from the same definitions.

If I didn't do this, every project that uses the proto definitions would need to have their individual .proto files edited when a change is made to a message definition

@thesamet
Contributor

If I didn't do this, every project that uses the proto definitions would need to have their individual .proto files edited when a change is made to a message definition

Trying to understand the above. The suggested practice is to have all the intermediate dependencies (which can contain protos) remain unshaded, and only perform the assembly/shading for the final artifacts you deploy. You write that this would lead to editing of protos that import other protos upon their change - I'm not following this part - can you explain in more detail? What edits would be necessary?

I would suggest seeing how you can adapt your build to support the suggested practice of shading at the last step. sbt-assembly also calls out that introducing fat jars as dependencies is not a great idea.

Having said that, I did look deeper and it looks like the first failure that happens in the encoder derivation involves invoking a macro in the shaded copy of shapeless. I've filed a bug with sbt-assembly along with a reproducible example.
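
For anyone following along, a rough illustration of the moving part (this is my reading of the failure, not a confirmed diagnosis): shapeless's Generic instances are materialized by a compile-time macro, so renaming the classes in the jar's bytecode doesn't carry over cleanly to macro expansion in a downstream compile. The encoder derivation ultimately bottoms out in calls like:

    import shapeless.Generic

    case class Point(x: Int, y: Int)

    // Generic[Point] is materialized by a whitebox macro at compile time.
    // Once shapeless is renamed to shadeshapeless inside the fat jar, this
    // materialization presumably no longer lines up for code compiled against
    // the jar, so the implicit Encoder is never derived.
    val gen = Generic[Point]
    val repr = gen.to(Point(1, 2)) // 1 :: 2 :: HNil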

@thesamet
Contributor

Closing due to inactivity.
