Improve performance of VLSV format in post-processing #19
It would be nice to have some benchmarks. For example, how much memory is required to fetch all data in one cell? How much of the file has to be read to fetch all data from one cell? How does the required CPU time to fetch M variables from N cells scale? In files written by dccrg, cells are not guaranteed to be in any particular order, but writing a post-processing tool that sorts the cells (and their data, for faster sequential access) by ID to make them faster to find seems almost trivial. |
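As a starting point for the scaling question above, a minimal timing harness might look like the following sketch; `read_cell_data` is a hypothetical stand-in for whichever reader is being benchmarked.

```python
import time

def bench(read_cell_data, cellids, variables):
    # read_cell_data is a hypothetical callable: (variable_name, cellid) -> data.
    t0 = time.time()
    for cid in cellids:            # N cells
        for var in variables:      # M variables
            read_cell_data(var, cid)
    dt = time.time() - t0
    per_fetch_ms = 1e3 * dt / (len(cellids) * len(variables))
    print("%d cells x %d vars: %.2f s total, %.2f ms per fetch"
          % (len(cellids), len(variables), dt, per_fetch_ms))
```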
In general VLSV writes data out so that the data from each process is in order: data from rank 0 comes first, then rank 1, and so on. With dynamic load balancing the data is not in any particular order when looking at the IDs of individual cells. This means that to read the data from a particular cell one first needs to read in all cell IDs so that the location can be found. The overhead when reading data from a single point is thus very large. For example, reading rho from one particular point in a Vlasiator simulation with 1000 files, each with 4000 x 2000 cells, means reading in 64 GB of data, while the actual rho data is only 8 kB in size. There are in general two different solutions, discussed below.
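To make that overhead concrete, here is a minimal sketch of the lookup pattern just described; the offsets and flat layout are assumptions for illustration, not the actual VLSV on-disk format. The arithmetic matches the example: 4000 x 2000 cells x 8 bytes per cell ID is 64 MB of IDs per file, so 64 GB over 1000 files, all to retrieve 1000 x 8 B = 8 kB of rho.

```python
import numpy as np

def read_rho_at_cell(f, cellid, n_cells, ids_offset, rho_offset):
    # All cell IDs (8 bytes each) must be read just to locate one cell:
    f.seek(ids_offset)
    cellids = np.fromfile(f, dtype=np.uint64, count=n_cells)
    idx = int(np.nonzero(cellids == cellid)[0][0])   # linear search, O(N)
    # Only now can we seek to the 8 bytes of rho we actually wanted:
    f.seek(rho_offset + 8 * idx)
    return np.fromfile(f, dtype=np.float64, count=1)[0]
```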
I would probably start by testing what the performance penalty would be when using a custom fileview to write data in order. |
@rjarvinen Any chance you could toss me a sample VLSV file via Dropbox, for example, that is too slow to analyze? Also, check out the pull request. |
@galfthan Yup, a bounding box per domain that limits the cell IDs would indeed make things faster. I have a much bigger update coming for VLSV where I might implement this. |
Here it would save a lot if cell IDs were in order (regardless of the number of processes), so a cell could be found in log(N) reads, and also, without AMR, cell IDs would be in the same spot in all files. Even with AMR the spot in the previous file could serve as a hint on where to start searching for the same cell.
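A minimal sketch of that log(N) lookup, assuming the IDs are stored sorted as contiguous 8-byte integers at a known offset (an assumption for illustration, not the current VLSV layout):

```python
import numpy as np

def find_cell_sorted(f, cellid, n_cells, ids_offset, id_size=8):
    # Binary search directly against the on-disk, sorted cell ID array.
    lo, hi = 0, n_cells - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        f.seek(ids_offset + mid * id_size)
        val = np.fromfile(f, dtype=np.uint64, count=1)[0]
        if val == cellid:
            return mid                    # position of the cell's data
        elif val < cellid:
            lo = mid + 1
        else:
            hi = mid - 1
    raise KeyError(cellid)

# ~23 reads of 8 bytes each for 8e6 cells, instead of one 64 MB scan.
```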
1) Sort cells while writing or in post-processing. I think it would be best to do it while writing, to avoid an annoying post-processing step that is potentially very slow and requires buffer space. The all-to-all-like communication step, or a complex fileview, would not be free either. How would writing cell data in cell ID order work in parallel? It seems like an all-to-all would still be involved, at least to find out who has which cells.
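For a full regular mesh with no holes (the case noted in a later comment), each rank can compute a cell's target offset directly from its ID, so no all-to-all is needed. A toy mpi4py sketch of that idea, with a made-up round-robin decomposition and dummy data; a real writer would use a fileview or collective writes rather than many small Write_at calls:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
n_global = 1000                                   # toy mesh size
# Hypothetical round-robin stand-in for this rank's dccrg cells:
local_ids = np.arange(comm.rank + 1, n_global + 1, comm.size, dtype=np.uint64)
local_rho = local_ids.astype(np.float64)          # dummy data

fh = MPI.File.Open(comm, "rho_sorted.bin",
                   MPI.MODE_CREATE | MPI.MODE_WRONLY)
for cid, value in zip(local_ids, local_rho):
    # Target offset follows directly from the cell ID; with holes or AMR
    # this shortcut breaks and communication would be needed.
    fh.Write_at(int(cid - 1) * 8, np.array([value]))
fh.Close()
```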
A converter for dccrg-based file formats that sorts cell IDs would use as much memory as is needed to sort N cell IDs. If only the cell IDs have to be sorted, and not their data (to e.g. make sequential access faster), then the file can be processed in place instead of writing a new one. Sorting the data of all cells in place would require on the order of as much memory as the largest amount of data in any one cell; sorting all data in a file would probably be easiest by writing a new one.
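The core of such a converter could be as small as the following sketch; the flat in-memory arrays are an assumption for illustration, and a real tool would go through the VLSV reader/writer API:

```python
import numpy as np

def sort_cells(cellids, variables):
    """cellids: (N,) uint64; variables: dict of name -> (N, ...) arrays."""
    perm = np.argsort(cellids)         # needs ~N * 8 B of working memory
    sorted_ids = cellids[perm]
    sorted_vars = {name: data[perm] for name, data in variables.items()}
    return sorted_ids, sorted_vars
```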
|
@rjarvinen Actually, just giving the mesh dimensions (xcells, ycells, zcells), the number of domains (roughly), and more information on what kind of data analysis you're doing might be sufficient for me to check what I can do. |
For AMR, yes; for regular meshes each process can calculate the correct offsets in the output file, assuming there are no holes in the mesh. I'm not a big fan of the idea of sorting data in VLSV files; however, it would be possible to add indexing data as a post-processing step to speed up random accesses. |
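A minimal sketch of that indexing idea: build a cellid-to-position index once in post-processing and store it alongside the .vlsv file (the sidecar naming is an assumption), so later single-cell reads do a binary search instead of a full cell ID scan:

```python
import numpy as np

def build_index(cellids, path):
    order = np.argsort(cellids)
    # Two rows: sorted cell IDs and their positions in the unsorted file.
    np.save(path + ".idx.npy",
            np.stack([cellids[order], order.astype(np.uint64)]))

def lookup(path, cellid):
    ids, pos = np.load(path + ".idx.npy")
    i = int(np.searchsorted(ids, cellid))   # binary search in sorted IDs
    if i == len(ids) or ids[i] != cellid:
        raise KeyError(cellid)
    return int(pos[i])                      # index into the data arrays
```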
I will soon prepare a benchmark. |
Here's a quick VLSV/VisIt plugin performance test with a nominal Venus run. Compared are a VTK file from the HYB simulation and a VLSV file from a Corsair/RHybrid run. Both files have the same number of scalar and vector variables and the same grid size of 120x160x160 (+-1 cell). VTK uses the STRUCTURED_POINTS grid structure. Data files are available here (file sizes: VLSV 1.3G, VTK 610M): https://dl.dropboxusercontent.com/u/8446786/vlsv_perf_test_data_files.zip The comparison uses the attached VisIt Python script and a shell script to run it (provided that VisIt is installed). The script opens the VLSV/VTK file, creates plots of 6 different scalar variables and exits. VLSV takes more than twice the time VTK does to complete the script. I don't know if the performance difference comes from the VLSV format itself, the grid type used in the VLSV file, or the plugin code. I didn't test the pull request with the new optimizations for the UCD multimesh reader yet and don't know if it affects this test.
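The attached script isn't reproduced in the thread; a hedged reconstruction of what such a timing script might look like with VisIt's CLI Python API (run with visit -cli -nowin -s script.py; the file name and variable names below are placeholders, not the actual ones used):

```python
import sys
import time

db = "venus_run.vlsv"                       # placeholder; or the .vtk file

t0 = time.time()
OpenDatabase(db)                            # VisIt CLI built-in
print("OpenDatabase: %.1f s" % (time.time() - t0))

# Placeholder names standing in for the six scalars actually plotted:
for var in ["rho", "rhov", "Bx", "Ex", "pressure", "temperature"]:
    t0 = time.time()
    AddPlot("Pseudocolor", var)
    DrawPlots()
    print("%s: %.1f s" % (var, time.time() - t0))
    DeleteAllPlots()

sys.exit(0)
```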
|
@rjarvinen One additional question: how many domains (= MPI procs) do you have in a nominal Venus run? Please test the version in the pull request as it may give a major performance boost in VisIt. |
720 PEs using 60 nodes on Voima. Thanks, I'll check that patch! |
VTK files are still faster compared to the VLSV in the pull request; the speed difference mainly comes from the mesh formats. A structured grid is much easier to generate than a mesh where the cells appear in random order. I'll take a look at whether I can speed things up more, but in the meantime you can also do parallel visualization on Voima. I'm sure @ykempf can help you out if you weren't already using Voima for remote visualization. |
The performance difference seems to come from creating individual plots. The OpenDatabase(db) command runs faster on VLSV (3 seconds) than on VTK (11 seconds). Plotting only one parameter takes roughly the same amount of time for both formats (30 seconds). Additional plots increase the running time almost linearly on VLSV but not considerably on VTK. Maybe VTK has buffering or something that makes it faster to use once the file is opened. |
After checking the memory usage with the resource monitor, it does indeed seem that the VTK plugin caches the whole file in memory, and that's why changing variables is faster. I suppose running an expression in VisIt on a VLSV file may be quite slow if it reads in variable data multiple times (although reading variables shouldn't be that slow), and optimizing this a bit might be a good idea. I'm not sure what the best way to do it for multi-domain data is, though, since there are no guarantees that the same MPI processes read the same domains every time. |
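A minimal sketch of that caching idea; read_variable is a hypothetical stand-in for the plugin's actual read path, and as noted above, a real multi-domain version would have to handle the fact that domains are not pinned to processes:

```python
_cache = {}

def cached_read(filename, variable, domain, read_variable):
    # read_variable is a hypothetical callable doing the actual disk read.
    key = (filename, variable, domain)
    if key not in _cache:
        _cache[key] = read_variable(filename, variable, domain)
    return _cache[key]
```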
Study whether the performance of VLSV can be improved for post-processing. Currently, ~5-10 GB VLSV files are not feasible to analyze on laptop computers. This may not be due to a RAM limit but could be more an issue with the performance of the VLSV reader and the VLSV VisIt plugin. For example, could stored cells be sorted for faster access by a separate post-processing tool?