Add support for running the same template over a stream of JSON and output individual files #1248

Closed
cipriancraciun opened this issue Nov 18, 2021 · 8 comments

Comments

@cipriancraciun

(This feature request is somewhat related to #485 and #197 although not quite.)

For example, suppose there is a stream of JSON objects, perhaps obtained from jq, and one wants to run the same template over each of those objects and write each result to its own file.

I would expect to be able to run something like this:

cat ./some-large-json-stream \
| jq '...' \
| gomplate -f ./some-template.html -c .=stdin:///.json-sequence --output-map './out/{{ .token }}.html'

Explaining the snippet above:

  • jq ... would produce one JSON object per line (although ideally gomplate should read an entire JSON object regardless of whitespace, and then continue to the next one);
  • gomplate sees the .json-sequence extension (or perhaps a dedicated flag such as --context-sequence), and reads one such JSON object at a time;
  • gomplate then runs the template on that JSON object, evaluates the output filename generator, and writes the result to that particular file;
  • gomplate loops until the input is exhausted.
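The loop described above can be sketched in plain Go using only the standard library. This is not gomplate's implementation, and the flags proposed in this issue do not exist in gomplate; the helper names renderStream and renderString, and the title field, are made up for illustration. The sketch relies on encoding/json's Decoder, which reads one JSON value at a time and skips the whitespace between values.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"os"
	"strings"
	"text/template"
)

// renderStream decodes one JSON value at a time from r, executes the name
// and body templates with that value as context, and passes both results
// to emit. Only one decoded value is held in memory at a time.
func renderStream(r io.Reader, body, name *template.Template, emit func(path, content string) error) error {
	dec := json.NewDecoder(r)
	for {
		var ctx any
		err := dec.Decode(&ctx)
		if errors.Is(err, io.EOF) {
			return nil // stream exhausted
		}
		if err != nil {
			return fmt.Errorf("decode: %w", err)
		}
		var path, content strings.Builder
		if err := name.Execute(&path, ctx); err != nil {
			return fmt.Errorf("output name: %w", err)
		}
		if err := body.Execute(&content, ctx); err != nil {
			return fmt.Errorf("template: %w", err)
		}
		if err := emit(path.String(), content.String()); err != nil {
			return err
		}
	}
}

// renderString is a hypothetical convenience wrapper for experiments: it
// collects the outputs in a map instead of writing files.
func renderString(input, bodySrc, nameSrc string) (map[string]string, error) {
	body, err := template.New("body").Parse(bodySrc)
	if err != nil {
		return nil, err
	}
	name, err := template.New("name").Parse(nameSrc)
	if err != nil {
		return nil, err
	}
	out := map[string]string{}
	err = renderStream(strings.NewReader(input), body, name, func(p, c string) error {
		out[p] = c
		return nil
	})
	return out, err
}

func main() {
	input := `{"token":"a","title":"First"} {"token":"b","title":"Second"}`
	out, err := renderString(input, "<h1>{{ .title }}</h1>\n", "out/{{ .token }}.html")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for p, c := range out {
		fmt.Printf("%s: %s", p, c)
	}
}
```

In a real tool the emit callback would create the file named by path and write content to it; passing os.Stdin as the reader would give the piped behaviour shown in the snippet above.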

Additionally, if one doesn't specify --output-map, but only --out or nothing at all, then gomplate would execute the same template for each JSON object and just append everything together either to the single output file or to stdout.


Additionally (this might be harder to implement, but it would be orthogonal to multiple template files): if one specifies multiple --in files or an --in-dir folder together with --context-sequence (for example), then each template is executed for each JSON object. In this case --output-map is mandatory.

@cipriancraciun
Author

Also, I've read your blog post https://blog.hairyhenderson.ca/post/one_template_many_outputs/, and although "technically" it can be a solution to what I'm proposing, it's not quite the same.

For example, in my use-case I just had to generate ~600K files from an input JSON file of ~22 GiB. I would assume that loading that JSON into memory would consume considerably more than that, so applying your suggestion is just not feasible.

Indeed, I could chunk the input file, but given it is actually compressed, it would be so much easier to just be able to run over that stream.

@hairyhenderson
Owner

Hi @cipriancraciun - thanks for filing this issue!

Just so I'm understanding this correctly - is this essentially asking for JSONL support? This also seems similar (but not directly related) to #534. I've thought about this in the past and I've also encountered a few separate standards - NDJSON (which seems identical to JSONL), and application/json-seq (which is slightly different, using the <RS> character instead of a newline).

However what I had been considering was essentially supporting JSONL/multi-doc YAML streams as arrays - in other words, a JSONL datasource would be parsed as a whole first and then accessible in the template for looping or indexing. Certainly not very useful for your particular use, especially if you're talking about multi-GiB inputs!
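For contrast, the "parse as a whole first" approach could look roughly like the following standard-library sketch. It is not gomplate code; slurp and renderAll are hypothetical names. Every document ends up in one slice before the template runs, which is why memory use grows with the total input size.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"os"
	"strings"
	"text/template"
)

// slurp reads every JSON document from r into a single slice, so the
// whole input is held in memory at once.
func slurp(r io.Reader) ([]any, error) {
	dec := json.NewDecoder(r)
	var docs []any
	for {
		var doc any
		err := dec.Decode(&doc)
		if errors.Is(err, io.EOF) {
			return docs, nil
		}
		if err != nil {
			return nil, err
		}
		docs = append(docs, doc)
	}
}

// renderAll (hypothetical) parses JSONL text and executes the template
// once, with the whole slice as its context for looping or indexing.
func renderAll(jsonl, tmplSrc string) (string, error) {
	docs, err := slurp(strings.NewReader(jsonl))
	if err != nil {
		return "", err
	}
	tmpl, err := template.New("t").Parse(tmplSrc)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tmpl.Execute(&sb, docs); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := renderAll("{\"name\":\"a\"}\n{\"name\":\"b\"}\n", "{{ range . }}{{ .name }};{{ end }}|{{ len . }}")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(out)
}
```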

Just thinking very briefly about this I could imagine some sort of --stream option that would cause gomplate to repeat the template for every JSON document read from datasources that support streaming. Something like:

$ gomplate --stream -d stream=./stream.jsonl -f template.tmpl --output-map 'out/{{ .name }}'

All this said, this sort of change in gomplate's behaviour would likely be quite complex - there are a bunch of assumptions made about how it processes datasources that would need to be totally re-worked. And, there's the matter of time as well - I don't have a lot of free time these days to work on gomplate, so this would likely take quite a while to implement...

@cipriancraciun
Author

Just so I'm understanding this correctly - is this essentially asking for JSONL support? This also seems similar (but not directly related) to #534. I've thought about this in the past and I've also encountered a few separate standards - NDJSON (which seems identical to JSONL), and application/json-seq (which is slightly different, using the <RS> character instead of a newline).

@hairyhenderson, indeed I'm asking for support for one of those formats.

Now regarding the exact format, I would list them in order by preference:

  • just expect one JSON "term" after another (be it an object, array, number, string, boolean, etc.), and ignore any whitespace between them (be it \n, \r, space or \t, the record separator character, the form feed character, the vertical tab, etc.);
  • just expect a sequence of "lines" separated by one (or perhaps a run of) occurrences of a given character, say \n, <rs> or \0, as the user specifies; by default expect \n (\0 is especially useful in some situations);
  • just expect a sequence of "lines", whatever the Go language considers a "line" when using the ReadLine function.

(In fact option one and two are useful in different cases, thus perhaps supporting both would be useful.)
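As a rough illustration of the second option (records separated by a user-chosen character), here is a standard-library sketch built on bufio.Scanner with a custom split function. None of this is gomplate code; splitOn, readRecords and collect are hypothetical names, and the 16 MiB record limit is an arbitrary assumption. Empty records are skipped, so a run of delimiters, or the leading <RS> of application/json-seq, is tolerated.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"os"
	"strings"
)

// splitOn returns a bufio.SplitFunc that cuts the input at each delim byte.
func splitOn(delim byte) bufio.SplitFunc {
	return func(data []byte, atEOF bool) (advance int, token []byte, err error) {
		if i := bytes.IndexByte(data, delim); i >= 0 {
			return i + 1, data[:i], nil
		}
		if atEOF && len(data) > 0 {
			return len(data), data, nil // final record without a trailing delimiter
		}
		return 0, nil, nil // ask the scanner for more data
	}
}

// readRecords decodes one JSON value per delimiter-separated record and
// skips records that are empty after trimming whitespace.
func readRecords(r io.Reader, delim byte, handle func(v any) error) error {
	sc := bufio.NewScanner(r)
	sc.Buffer(make([]byte, 0, 64*1024), 16*1024*1024) // arbitrary 16 MiB cap per record
	sc.Split(splitOn(delim))
	for sc.Scan() {
		rec := bytes.TrimSpace(sc.Bytes())
		if len(rec) == 0 {
			continue
		}
		var v any
		if err := json.Unmarshal(rec, &v); err != nil {
			return fmt.Errorf("record %q: %w", rec, err)
		}
		if err := handle(v); err != nil {
			return err
		}
	}
	return sc.Err()
}

// collect (hypothetical) re-encodes every record as compact JSON, which
// makes the result easy to inspect.
func collect(input string, delim byte) ([]string, error) {
	var out []string
	err := readRecords(strings.NewReader(input), delim, func(v any) error {
		b, err := json.Marshal(v)
		if err != nil {
			return err
		}
		out = append(out, string(b))
		return nil
	})
	return out, err
}

func main() {
	recs, err := collect("{\"a\":1}\x00[1, 2]\x00\"x\"", 0) // NUL-separated records
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(recs)
}
```

The first option (any whitespace between terms) needs no splitting at all when the whitespace is limited to what JSON itself allows, since json.Decoder already reads one value after another; other separator characters would have to be handled separately.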


However what I had been considering was essentially supporting JSONL/multi-doc YAML streams as arrays - in other words, a JSONL datasource would be parsed as a whole first and then accessible in the template for looping or indexing. Certainly not very useful for your particular use, especially if you're talking about multi-GiB inputs!

Indeed, when one has sequences of JSON terms, most likely they are quite large, and couldn't have been provided as a single JSON array in the first place. (In fact, if that were the case, a simple jq --slurp . would convert a sequence of such JSON terms into a single array.)

@hairyhenderson
Owner

Thanks @cipriancraciun, and my apologies for the slow response. This makes sense, and I'm tentatively interested in adding this to gomplate. However just be aware that this will take some time as the changes are complex and I don't work on gomplate full-time.

@cipriancraciun
Author

However just be aware that this will take some time as the changes are complex and I don't work on gomplate full-time.

@hairyhenderson, I understand completely, take your time (if you decide to implement this). (I understand this is an open-source project, and if I were familiar with the code, I would have tried it myself.)

@github-actions

This issue is stale because it has been open for 60 days with no activity. Remove stale label or comment or this will be automatically closed in a few days.

github-actions bot added the Stale label Apr 10, 2023

@cipriancraciun
Author

I am still interested in this feature (for future use of gomplate in such scenarios). :)

Granted, at this time I don't have the time to implement it myself, thus I'm OK with this feature request being closed with "won't implement".

@hairyhenderson
Owner

Thanks for the feedback @cipriancraciun! I'm going to close this issue for now, since I don't have the time to implement this, and nobody else seems to have any interest in implementing it.


2 participants