Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merging simulation files loses the information on total number of simulated events #177

Open
gabemery opened this issue Nov 13, 2023 · 6 comments

Comments

@gabemery
Copy link
Collaborator

Current hdf5 file merging script copies the simulation information of only the first file (and assumes it is consistent with following files). This can be error inducing but is coherent with current productions.

Then in usage the number of simulation file is needed to correctly weight MC data. It is thus extracted from events list by finding unique 'obs_id'.
This can lead to underestimated results if the number of events per simulation file is low (and thus cuts remove some obs_id fully).

Example : processing of Helium from this production: /fefs/aswg/data/mc/DL0/LSTProd2/TestDataset/Helium/sim_telarray/node_corsika_theta_68.284_az_3*
More than 10% obs_id are removed during the DL1->DL1_stereo step. With an 'intensity > 50 p.e.' quality cut applied.

@Elisa-Visentin
Copy link
Collaborator

Do you mean the num_showers parameter ('simulation/config' key)? This is never changed (only copied). And the total number of simulated events is num_showersshower_reusenumber_of_runs. So I think the issue does not concern the number of simulated events (copied during the merge) but how we evaluate number_of_runs. You can evaluate it (as unique obs_id) from DL1 files (instead of DL1 stereo/DL2), but this is not so nice, or we can add a column to merged (MC) files (simulation/config table) where we save the number of merged files (maybe just adding a counter into the for loop or using the length of the input_files = glob.glob(input_file_mask) array), but the data format will have to be slightly changed, or the number of files could be stored in the file name (then you can read it by splitting the string)

@gabemery
Copy link
Collaborator Author

Hi @Elisa-Visentin ,
"And the total number of simulated events is num_showers shower_reuse number_of_runs."
Yes this is how things are currently handled. But in the end, number_of_runs is not an information stored in the data. And while currently the obs_id changes with runs, I can see a world where it may not always be the case.

In lstchain, the n(um)_showers parameter is stacked during merging. Which should be correct since it indeed represents the total number of individual event simulated.
Your proposal of storing the number of file could work but indeed would require a change of data format. Plus I am not sure how useful this information is standalone (versus updating num_shower).
I am not a fan of storing information in file names but it could also be a solution.

Another possibility may be to store the simulation config of all files instead of only the first one. I saw that somewhere once (forgot when).

@Elisa-Visentin
Copy link
Collaborator

If we stack the num_showers while merging (by adding over the files or by multiplying for the number of files) we should also change the other informations in the simulation config (i.e., average or sum of the other parameters over the files): keeping all but one parameters equal to the ones of the first file and stacking the number of showers could be a bit misleading (in particular if you use processed data without knowing the code details). Not sure about storing one line per file (I should check the whole code, but the file must be read appropriately everywhere).

BTW, you must merge DL1, or the code changes will be useless (maybe to be added in the doc). And, according to the chosen parameters, some runs could not survive even the dl0-dl1 step🤔 (so you should count the number of simtel files)

@gabemery
Copy link
Collaborator Author

gabemery commented Nov 14, 2023

"we should also change the other informations in the simulation config"
I am assuming that other parameters stay the same. Currently we have all parameters for each run the same for a particle/pointing. We simply split in runs to have manageable jobs.

If we want to have other settings changing I think the only option would be to store all the (unique) configs (stacking the num_showers). But it would also require to find a way to tag events associated to a given config. Maybe by adding the obs_id associated to each. But this is way more complex and maybe not that useful.

BTW, you must merge DL1, or the code changes will be useless (maybe to be added in the doc). And, according to the chosen parameters, some runs could not survive even the dl0-dl1 step🤔 (so you should count the number of simtel files)

Right, we would need to either do that or keep files with 0 events...

@Elisa-Visentin
Copy link
Collaborator

To evaluate sensitivity you use a (set of) merged file(s) (one per pointing direction): I think the info about the obs_id stored in a single run is not needed because you will use all the runs (so, no matter if run 1 has 2000 showers and run 2 1000, after you merge them you will have 3000 simulated showers). I should check the configuration entries: if none of them (except num_showers) depends on the run/number of runs, we can change num_showers while iterating over the files to store the total number of simulated showers, otherwise we can do this but changing the other parameters too (e.g., if stored, the area over which showers are simulated, which could be an average over the runs).

As for files with 0 events, we should check what can happen in the merge (are they skipped or not? any crash?). And we should evaluate the needed space and number of files: empty tables won't be 'huge', hopefully, but maybe it is not worth to store them, maybe we can just create (somewhere in the analysis folder) a summary file where a few numbers are stored during the analysis (number of DL0 runs, runs lost due to cuts, runs lost due to bugs/crashes...), maybe we can just count the number of DL0s (can change over time, so maybe we should use the number of Dl1 log files) when needed (manual, boring solution)

@gabemery
Copy link
Collaborator Author

To evaluate sensitivity you use a (set of) merged file(s) (one per pointing direction): I think the info about the obs_id stored in a single run is not needed because you will use all the runs (so, no matter if run 1 has 2000 showers and run 2 1000, after you merge them you will have 3000 simulated showers).

Yes

I really would avoid averaging. As it would lead to incorrect relative weights on events simulated with different conditions. e.g. if the energy range changes we can ignore events with energy below the average minimum. At this point it would be preferable to keep separated files in my opinion.

I don't see a perfect solution to handle 0 events file (especially if it occurs before DL1). Except having an additional log file/database keeping tabs on the number of successfully simulated simtel files, and transitions to higher levels.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants