Support Spark including Spark SQL #297
@anjackson we moved over to using Sparkling for our processing earlier this year. It's been really great for us. ...and I'm reminded of conversations we had 5 or 6 years ago about our projects working towards some common code base.
Thanks @ruebot - I have spent some time looking at Sparkling, but right now it feels like the gap is too big for me to switch over. Our current implementation is based on locally caching and repeatedly processing the payload (without passing large objects around). As I don't have much time to work on this, I'm starting off by learning how to use Spark and porting the current process over with minimal changes. As I learn more I can hopefully bring things closer together, but it looks like being a long road.
Some experimentation with how to set up the extraction without knowing all fields ahead of time... see 62de913. As per https://spark.apache.org/docs/latest/sql-getting-started.html#programmatically-specifying-the-schema this can work, but needs a bit of help. The composite Map function or functions that get applied can be wrapped: the wrapper can then declare the schema, and the WARC-based RDD can be transformed to an RDD of `Row` objects that matches it. The current implementation hardcodes the mapping and stores the types as metadata classes, so it can cope when the values are null. It seems that a more elegant implementation would require changes to the current indexer so the analysis declares the expected fields ahead of time. Or we could just declare them all for now.
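A minimal sketch of the programmatic-schema route from the linked Spark guide, assuming the expected fields are declared up front. The field names and values here are illustrative stand-ins, not the indexer's real output:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class ProgrammaticSchemaSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("programmatic-schema-sketch").master("local[*]").getOrCreate();

        // Declare the expected fields ahead of time; nullable = true means the
        // schema copes when an analyser did not produce a value.
        StructType schema = DataTypes.createStructType(new StructField[] {
                DataTypes.createStructField("url", DataTypes.StringType, true),
                DataTypes.createStructField("crawl_date", DataTypes.StringType, true),
                DataTypes.createStructField("content_type", DataTypes.StringType, true),
                DataTypes.createStructField("content_length", DataTypes.LongType, true)
        });

        // Stand-in for the RDD of extracted metadata the analyser would produce.
        List<Row> rows = Arrays.asList(
                RowFactory.create("http://example.org/", "2023-01-01T00:00:00Z", "text/html", 1234L),
                RowFactory.create("http://example.org/missing", null, null, null));
        JavaRDD<Row> rowRdd = new JavaSparkContext(spark.sparkContext()).parallelize(rows);

        // Apply the declared schema to the Row RDD, as in the Spark SQL guide.
        Dataset<Row> df = spark.createDataFrame(rowRdd, schema);
        df.createOrReplaceTempView("mementos");
        spark.sql("SELECT content_type, COUNT(*) FROM mementos GROUP BY content_type").show();

        spark.stop();
    }
}
```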
I realised older Hadoop support was needed, so I experimented with some of these ideas. But after hacking things together, it's clear the Parquet writer is going to be painful to run under old Hadoop.

So the idea is that old Hadoop will just output JSONL, and this can then be transferred to the newer cluster as needed.
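A sketch of that hand-off on the newer cluster, assuming the old-Hadoop job has already written JSON Lines output to a reachable location (the paths here are made up):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JsonlToParquet {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("jsonl-to-parquet")
                .getOrCreate();

        // Each input line is one JSON record, as produced by the old-Hadoop job.
        Dataset<Row> df = spark.read().json("hdfs:///data/mementos/*.jsonl");

        // Re-save as Parquet so later Spark SQL queries are fast and columnar.
        df.write().mode("overwrite").parquet("hdfs:///data/mementos-parquet/");

        spark.stop();
    }
}
```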
To support more modern patterns of usage, and more complex processing, it would be good to support Spark.
Long term, this should likely integrate with the Archives Unleashed Toolkit, but at the moment it is not easy for us to transition to using it. This is mostly due to how it handles record contents, which get embedded in the data frames, leading to some heavy memory pressure (TBA some notes).
The current `hadoop3` branch `WarcLoader` provides an initial implementation (see `loadAndAnalyse` and `createDataFrame`). It works by building an RDD stream of WARC records, but also supports running the analyser on that stream, which is able to work on the full 'local' byte streams as long as no re-partitioning has happened. This then outputs a stream of objects that contain the extracted metadata fields and are no longer tied to the original WARC input streams. This can then be turned into a DataFrame and SQL can be run on it, and it can also be exported as Parquet etc. (short term).

Example of usage: `df.createOrReplaceTempView("mementos")`
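Roughly, the intended flow looks like the sketch below. This is a hedged illustration only: the exact `WarcLoader.loadAndAnalyse` signature, and the `MementoRecord` bean it is shown returning, are assumptions rather than the actual `hadoop3` branch API.

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class WarcSparkSqlSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("warc-spark-sql")
                .getOrCreate();

        // Assumed call: run the analyser over the WARCs while the 'local' byte
        // streams are still available, returning detached metadata beans.
        JavaRDD<MementoRecord> records =
                WarcLoader.loadAndAnalyse("hdfs:///warcs/*.warc.gz", spark);

        // Bean-based createDataFrame infers the (wide, nullable) schema from the POJO.
        Dataset<Row> df = spark.createDataFrame(records, MementoRecord.class);

        // Register the DataFrame so plain Spark SQL can be run against it...
        df.createOrReplaceTempView("mementos");
        spark.sql("SELECT content_type, COUNT(*) AS n FROM mementos GROUP BY content_type").show();

        // ...and/or export it as Parquet for later use.
        df.write().parquet("hdfs:///analysis/mementos-parquet/");

        spark.stop();
    }
}
```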
The POJO approach does mean we end up with a very wide schema with a lot of nulls if there's not been much analysis. Supporting a more dynamic schema would be nice, but then again fixing the schema aligns with the Solr schema.
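To make the wide-schema point concrete, here is a small self-contained sketch (the `Memento` bean and its fields are illustrative, not the project's actual classes). Using boxed field types such as `Long` keeps every column nullable, so records with little or no analysis simply show up as rows of nulls:

```java
import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class WideSchemaSketch {

    // Illustrative bean: boxed types (Long, not long) so unset fields can be null.
    public static class Memento implements Serializable {
        private String url;
        private String contentType;
        private Long contentLength;   // null when the analyser did not run

        public Memento() {}
        public Memento(String url, String contentType, Long contentLength) {
            this.url = url;
            this.contentType = contentType;
            this.contentLength = contentLength;
        }
        public String getUrl() { return url; }
        public void setUrl(String url) { this.url = url; }
        public String getContentType() { return contentType; }
        public void setContentType(String contentType) { this.contentType = contentType; }
        public Long getContentLength() { return contentLength; }
        public void setContentLength(Long contentLength) { this.contentLength = contentLength; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("wide-schema-sketch").master("local[*]").getOrCreate();

        // One fully-analysed record and one that only has a URL.
        Dataset<Row> df = spark.createDataFrame(Arrays.asList(
                new Memento("http://example.org/", "text/html", 1234L),
                new Memento("http://example.org/raw", null, null)),
                Memento.class);

        df.printSchema();   // every column is nullable
        df.show();          // the second row is mostly nulls

        spark.stop();
    }
}
```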