Performance regression in collapsing blocked subspace #3282

Open
jorgensd opened this issue Jun 25, 2024 · 11 comments
@jorgensd (Member)

Summarize the issue

Collapsing a blocked subspace is orders of magnitude slower than collapsing its scalar counterpart, and the cost grows dramatically with polynomial order.

First reported at: https://fenicsproject.discourse.group/t/collapsing-mixed-space-takes-unreasonably-long/15071

How to reproduce the bug

Run the MWE below.

Minimal Example (Python)

from mpi4py import MPI

from dolfinx import mesh, fem
from basix.ufl import element, mixed_element
import time

N = 5
P = 3

# Create a unit cube mesh of hexahedral cells
start = time.perf_counter()
domain = mesh.create_unit_cube(MPI.COMM_WORLD, N, N, N, mesh.CellType.hexahedron)
Ve = element("Lagrange", domain.basix_cell(), P, shape=(domain.geometry.dim,))
Qe = element("Lagrange", domain.basix_cell(), P)
W_el = mixed_element([Ve, Qe])
W = fem.functionspace(domain, W_el)
print(f"Meshing and creating mixed space: {time.perf_counter()-start}")
start2 = time.perf_counter()

V, WV_map = W.sub(0).collapse()
print(f"Collapsing V: {time.perf_counter()-start2}")
start3 = time.perf_counter()

Q, WQ_map = W.sub(1).collapse()
print(f"Collapsing Q: {time.perf_counter()-start3}")

Output (Python)

Meshing and creating mixed space: 0.00762048000001414
Collapsing V: 0.8559996170000659
Collapsing Q: 0.0018857679999655375

Version

main branch

DOLFINx git commit

9f33dab

Installation

Docker using: ghcr.io/fenics/dolfinx/dolfinx:nightly

Additional information

No response

@jorgensd added the bug, performance and high-priority labels on Jun 25, 2024
@jorgensd (Member, Author)

dolfinx::fem::DofMap::collapse() is the problem.
In particular, it is the compute_reordering_map call in build_dofmap_data that is slow (and whose cost grows rapidly with polynomial order).
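A quick way to confirm the time is spent inside the compiled extension rather than on the Python side is to profile the collapse call from the MWE above with the standard-library profiler; this is a diagnostic sketch, not part of the original report:

import cProfile
import pstats

# Profile only the collapse. Essentially all of the cumulative time should
# be attributed to the single call into the dolfinx C++ extension, pointing
# at DofMap::collapse rather than any Python-side work.
with cProfile.Profile() as profiler:
    V, WV_map = W.sub(0).collapse()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)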

@jorgensd (Member, Author)

jorgensd commented Jun 25, 2024

I think we should turn off dof reordering in serial.
Running:

from mpi4py import MPI

from dolfinx import mesh, fem, la, log
from basix.ufl import element, mixed_element
import time

# log.set_log_level(log.LogLevel.INFO)
N = 5
P = 4

# Create a unit cube mesh of hexahedral cells
start = time.perf_counter()
domain = mesh.create_unit_cube(MPI.COMM_WORLD, N, N, N, mesh.CellType.hexahedron)
Ve = element("Lagrange", domain.basix_cell(), P, shape=(domain.geometry.dim,))
Qe = element("Lagrange", domain.basix_cell(), P)
W_el = mixed_element([Ve, Qe])
W = fem.functionspace(domain, W_el)
print(f"Meshing and creating mixed space: {time.perf_counter()-start}")
start2 = time.perf_counter()
V, WV_map = W.sub(0).collapse()
print(f"Collapsing V: {time.perf_counter()-start2}")

start3 = time.perf_counter()

Q, WQ_map = W.sub(1).collapse()
print(f"Collapsing Q: {time.perf_counter()-start3}")

returns the following in serial:

Meshing and creating mixed space: 0.013051556999926106
Collapsing V: 7.033872703999805
Collapsing Q: 0.0023294779998650483

and

Meshing and creating mixed space: 0.020997950000037235
Collapsing V: 0.70820857300032
Collapsing Q: 0.003953482000270014
Meshing and creating mixed space: 0.020988049000152387
Collapsing V: 0.7082019069998751
Collapsing Q: 0.003954929999963497

with 2 processes, i.e. a 10x speedup!
Three processes:

Meshing and creating mixed space: 0.02864133799994306
Collapsing V: 0.34778579499970874
Collapsing Q: 0.003480805999970471
Meshing and creating mixed space: 0.027170727000338957
Collapsing V: 0.34781397200003994
Collapsing Q: 0.0034789139999702456
Meshing and creating mixed space: 0.03079914899990399
Collapsing V: 0.3478091659999336
Collapsing Q: 0.0034698189997470763

I still think it is remarkable that creating the collapsed space is more than 10x as expensive as creating the mixed space itself.
Any comments, @chrisrichardson @mscroggs @garth-wells?

@garth-wells (Member)

We shouldn't turn off re-ordering in serial.

Can someone bisect to find out when the regression was introduced? @chrisrichardson could it be related to recent changes for mixed topology?

@jorgensd (Member, Author)

> We shouldn't turn off re-ordering in serial.
>
> Can someone bisect to find out when the regression was introduced? @chrisrichardson could it be related to recent changes for mixed topology?

It is prior to the 0.7 release.
Haven't tested 0.6 yet.

@chrisrichardson (Contributor)

I don't think mixed topology changes would affect this.
list_timings() reveals:

GPS: create_level_structure                                                 |  4351  0.001285  5.590000
Gibbs-Poole-Stockmeyer ordering                                             |     2  2.885000  5.770000

which seems strange, and looks like some kind of internal issue with the reordering: GPS is only called twice, but builds the level structure thousands of times.
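For context, GPS starts by searching for a pseudo-peripheral node, and that search builds one BFS level structure per candidate root, which would explain the call count above. A minimal, illustrative level-structure routine (plain Python, not the dolfinx implementation):

def level_structure(adjacency, root):
    # BFS from `root`, returning the levels (distance classes). GPS builds
    # one of these per candidate root while hunting for a pseudo-peripheral
    # node, so a large, dense graph means many such traversals.
    levels, seen, frontier = [], {root}, [root]
    while frontier:
        levels.append(frontier)
        nxt = []
        for u in frontier:
            for v in adjacency[u]:
                if v not in seen:
                    seen.add(v)
                    nxt.append(v)
        frontier = nxt
    return levels

# Example: a path graph 0-1-2-3 rooted at 0 gives four singleton levels.
print(level_structure({0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}, 0))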

@jorgensd (Member, Author)

It is present in 0.6 as well:
P=3

Meshing and creating mixed space: 0.5712962300000015
Collapsing V: 0.7795458059999874
Collapsing Q: 0.0005138619999911498

P=4

Meshing and creating mixed space: 0.6977521319999767
Collapsing V: 7.012910768000012
Collapsing Q: 0.0005572910000068987

@chrisrichardson (Contributor)

I added some logging in DofMap.cpp:

if (!reorder_fn)
{
  reorder_fn = [](const graph::AdjacencyList<std::int32_t>& g)
  {
    double nnz_per_row
        = g.offsets().back() / (double)(g.offsets().size() - 1);

    spdlog::info("DofMap collapse: reordering graph with {} nnz per row",
                 nnz_per_row);

    return graph::reorder_gps(g);
  };
}

[2024-06-26 11:30:44.174] [info] DofMap collapse: reordering graph with 106.171875 nnz per row

so it is quite a dense graph.

@chrisrichardson (Contributor)

@jorgensd Re: the speedup in parallel - you should do a weak-scaling test (e.g. double the mesh size with -np 2).
I think you will then find it is not much different.
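A possible weak-scaling setup for the MWE above, keeping the cell count per process roughly constant (the base N=5 and the N ∝ nprocs**(1/3) scaling rule are illustrative assumptions, not from the thread):

from mpi4py import MPI

# create_unit_cube with hexahedra yields N**3 cells, so growing N with the
# cube root of the process count keeps cells-per-rank roughly constant.
nprocs = MPI.COMM_WORLD.size
N = round(5 * nprocs ** (1 / 3))  # N=5 in serial, ~6 on 2 ranks, ...

If the collapse time then stays flat as ranks are added, the earlier "speedup" was just strong scaling of a fixed problem rather than a genuine reduction in work.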

@garth-wells (Member)

We probably build a sparsity graph for the dofs, which isn't a good use of compute effort for high-order elements. We should probably create a graph of vertices, facets, etc, and then lay the dofs for each entity 'on top'.
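A sketch of that idea under stated assumptions (plain numpy, hypothetical helper name, not dolfinx API): reorder the much smaller entity graph first, then expand the entity permutation to a dof permutation, since the dofs attached to one entity can stay contiguous:

import numpy as np

def expand_entity_permutation(entity_perm, dofs_per_entity):
    # entity_perm[old] = new position of each mesh entity, obtained by
    # reordering the entity connectivity graph (vertices, facets, ...).
    # The dof-level graph is never built: dofs simply follow their entity.
    dof_perm = np.empty(len(entity_perm) * dofs_per_entity, dtype=np.int32)
    for old, new in enumerate(entity_perm):
        for k in range(dofs_per_entity):
            dof_perm[old * dofs_per_entity + k] = new * dofs_per_entity + k
    return dof_perm

# Example: 3 entities with 2 dofs each; entity 2 is moved to the front.
print(expand_entity_permutation(np.array([1, 2, 0]), 2))  # [2 3 4 5 0 1]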

@jorgensd (Member, Author)

> @jorgensd Re: the speedup in parallel - you should do a weak-scaling test (e.g. double the mesh size with -np 2). I think you will then find it is not much different.

As in not improving further?

@jorgensd (Member, Author)

jorgensd commented Jun 26, 2024

> We probably build a sparsity graph for the dofs, which isn't a good use of compute effort for high-order elements. We should probably create a graph of vertices, facets, etc, and then lay the dofs for each entity 'on top'.

I think one of the main factors could also be the block size: as you can see, an unblocked (scalar) element is orders of magnitude faster than its vector counterpart.
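In that case the reordering could run on the graph of nodes (blocks) and be expanded by the block size afterwards; a minimal numpy sketch under that assumption (illustrative, not the dolfinx implementation):

import numpy as np

def expand_block_permutation(node_perm, bs):
    # Reorder the node graph once (ndofs / bs entries), then map each
    # node's bs components together. For the vector element above (bs=3)
    # this shrinks the graph handed to the reordering by a factor of 3.
    dof_perm = np.empty(len(node_perm) * bs, dtype=np.int32)
    for old, new in enumerate(node_perm):
        dof_perm[old * bs : (old + 1) * bs] = np.arange(new * bs, (new + 1) * bs)
    return dof_perm

# Example: 3 nodes, block size 3 (a 3D vector element).
print(expand_block_permutation(np.array([2, 0, 1]), 3))  # [6 7 8 0 1 2 3 4 5]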

@garth-wells removed the bug label on Jul 4, 2024
@garth-wells changed the title from "Regression in collapsing blocked subspace" to "Performance regression in collapsing blocked subspace" on Jul 4, 2024
@garth-wells added this to the 0.10.0 milestone on Oct 2, 2024