Merging simulation files loses the information on total number of simulated events #177

gabemery · 2023-11-13T10:42:10Z

Current hdf5 file merging script copies the simulation information of only the first file (and assumes it is consistent with following files). This can be error inducing but is coherent with current productions.

Then in usage the number of simulation file is needed to correctly weight MC data. It is thus extracted from events list by finding unique 'obs_id'.
This can lead to underestimated results if the number of events per simulation file is low (and thus cuts remove some obs_id fully).

Example : processing of Helium from this production: /fefs/aswg/data/mc/DL0/LSTProd2/TestDataset/Helium/sim_telarray/node_corsika_theta_68.284_az_3*
More than 10% obs_id are removed during the DL1->DL1_stereo step. With an 'intensity > 50 p.e.' quality cut applied.

Elisa-Visentin · 2023-11-13T15:20:29Z

Do you mean the num_showers parameter ('simulation/config' key)? This is never changed (only copied). And the total number of simulated events is num_showersshower_reusenumber_of_runs. So I think the issue does not concern the number of simulated events (copied during the merge) but how we evaluate number_of_runs. You can evaluate it (as unique obs_id) from DL1 files (instead of DL1 stereo/DL2), but this is not so nice, or we can add a column to merged (MC) files (simulation/config table) where we save the number of merged files (maybe just adding a counter into the for loop or using the length of the input_files = glob.glob(input_file_mask) array), but the data format will have to be slightly changed, or the number of files could be stored in the file name (then you can read it by splitting the string)

gabemery · 2023-11-13T15:50:19Z

Hi @Elisa-Visentin ,
"And the total number of simulated events is num_showers shower_reuse number_of_runs."
Yes this is how things are currently handled. But in the end, number_of_runs is not an information stored in the data. And while currently the obs_id changes with runs, I can see a world where it may not always be the case.

In lstchain, the n(um)_showers parameter is stacked during merging. Which should be correct since it indeed represents the total number of individual event simulated.
Your proposal of storing the number of file could work but indeed would require a change of data format. Plus I am not sure how useful this information is standalone (versus updating num_shower).
I am not a fan of storing information in file names but it could also be a solution.

Another possibility may be to store the simulation config of all files instead of only the first one. I saw that somewhere once (forgot when).

Elisa-Visentin · 2023-11-13T16:20:32Z

If we stack the num_showers while merging (by adding over the files or by multiplying for the number of files) we should also change the other informations in the simulation config (i.e., average or sum of the other parameters over the files): keeping all but one parameters equal to the ones of the first file and stacking the number of showers could be a bit misleading (in particular if you use processed data without knowing the code details). Not sure about storing one line per file (I should check the whole code, but the file must be read appropriately everywhere).

BTW, you must merge DL1, or the code changes will be useless (maybe to be added in the doc). And, according to the chosen parameters, some runs could not survive even the dl0-dl1 step🤔 (so you should count the number of simtel files)

gabemery · 2023-11-14T09:46:54Z

"we should also change the other informations in the simulation config"
I am assuming that other parameters stay the same. Currently we have all parameters for each run the same for a particle/pointing. We simply split in runs to have manageable jobs.

If we want to have other settings changing I think the only option would be to store all the (unique) configs (stacking the num_showers). But it would also require to find a way to tag events associated to a given config. Maybe by adding the obs_id associated to each. But this is way more complex and maybe not that useful.

BTW, you must merge DL1, or the code changes will be useless (maybe to be added in the doc). And, according to the chosen parameters, some runs could not survive even the dl0-dl1 step🤔 (so you should count the number of simtel files)

Right, we would need to either do that or keep files with 0 events...

Elisa-Visentin · 2023-11-14T15:08:41Z

To evaluate sensitivity you use a (set of) merged file(s) (one per pointing direction): I think the info about the obs_id stored in a single run is not needed because you will use all the runs (so, no matter if run 1 has 2000 showers and run 2 1000, after you merge them you will have 3000 simulated showers). I should check the configuration entries: if none of them (except num_showers) depends on the run/number of runs, we can change num_showers while iterating over the files to store the total number of simulated showers, otherwise we can do this but changing the other parameters too (e.g., if stored, the area over which showers are simulated, which could be an average over the runs).

As for files with 0 events, we should check what can happen in the merge (are they skipped or not? any crash?). And we should evaluate the needed space and number of files: empty tables won't be 'huge', hopefully, but maybe it is not worth to store them, maybe we can just create (somewhere in the analysis folder) a summary file where a few numbers are stored during the analysis (number of DL0 runs, runs lost due to cuts, runs lost due to bugs/crashes...), maybe we can just count the number of DL0s (can change over time, so maybe we should use the number of Dl1 log files) when needed (manual, boring solution)

gabemery · 2023-11-14T15:37:05Z

To evaluate sensitivity you use a (set of) merged file(s) (one per pointing direction): I think the info about the obs_id stored in a single run is not needed because you will use all the runs (so, no matter if run 1 has 2000 showers and run 2 1000, after you merge them you will have 3000 simulated showers).

Yes

I really would avoid averaging. As it would lead to incorrect relative weights on events simulated with different conditions. e.g. if the energy range changes we can ignore events with energy below the average minimum. At this point it would be preferable to keep separated files in my opinion.

I don't see a perfect solution to handle 0 events file (especially if it occurs before DL1). Except having an additional log file/database keeping tabs on the number of successfully simulated simtel files, and transitions to higher levels.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Merging simulation files loses the information on total number of simulated events #177

Merging simulation files loses the information on total number of simulated events #177

gabemery commented Nov 13, 2023

Elisa-Visentin commented Nov 13, 2023

gabemery commented Nov 13, 2023

Elisa-Visentin commented Nov 13, 2023

gabemery commented Nov 14, 2023 •

edited

Loading

Elisa-Visentin commented Nov 14, 2023

gabemery commented Nov 14, 2023

Merging simulation files loses the information on total number of simulated events #177

Merging simulation files loses the information on total number of simulated events #177

Comments

gabemery commented Nov 13, 2023

Elisa-Visentin commented Nov 13, 2023

gabemery commented Nov 13, 2023

Elisa-Visentin commented Nov 13, 2023

gabemery commented Nov 14, 2023 • edited Loading

Elisa-Visentin commented Nov 14, 2023

gabemery commented Nov 14, 2023

gabemery commented Nov 14, 2023 •

edited

Loading