Performance regression in collapsing blocked subspace #3282

Open
jorgensd opened this issue Jun 25, 2024 · 11 comments
@jorgensd (Member)

Summarize the issue

Collapsing a blocked subspace is orders of magnitude slower than collapsing its scalar counterpart, and the cost grows dramatically with polynomial order.

First reported at: https://fenicsproject.discourse.group/t/collapsing-mixed-space-takes-unreasonably-long/15071

How to reproduce the bug

Run the MWE below.

Minimal Example (Python)

from mpi4py import MPI

from dolfinx import mesh, fem
from basix.ufl import element, mixed_element
import time

N = 5
P = 3

# Create a unit cube mesh of hexahedral cells
start = time.perf_counter()
domain = mesh.create_unit_cube(MPI.COMM_WORLD, N, N, N, mesh.CellType.hexahedron)
Ve = element("Lagrange", domain.basix_cell(), P, shape=(domain.geometry.dim,))
Qe = element("Lagrange", domain.basix_cell(), P)
W_el = mixed_element([Ve, Qe])
W = fem.functionspace(domain, W_el)
print(f"Meshing and creating mixed space: {time.perf_counter()-start}")
start2 = time.perf_counter()

V, WV_map = W.sub(0).collapse()
print(f"Collapsing V: {time.perf_counter()-start2}")
start3 = time.perf_counter()

Q, WQ_map = W.sub(1).collapse()
print(f"Collapsing Q: {time.perf_counter()-start3}")

Output (Python)

Meshing and creating mixed space: 0.00762048000001414
Collapsing V: 0.8559996170000659
Collapsing Q: 0.0018857679999655375

Version

main branch

DOLFINx git commit

9f33dab

Installation

Docker using: ghcr.io/fenics/dolfinx/dolfinx:nightly

Additional information

No response

@jorgensd added the bug, performance and high-priority labels on Jun 25, 2024
@jorgensd (Member, Author)

dolfinx::fem::DofMap::collapse() is the problem.
In particular, it is the compute_reordering_map call in build_dofmap_data that is slow (and whose cost grows rapidly with polynomial order).
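A quick way to confirm the time is spent inside the compiled extension rather than on the Python side is to profile the collapse call from the MWE above with the standard-library profiler; this is a diagnostic sketch, not part of the original report:

import cProfile
import pstats

# Profile only the collapse. Essentially all of the cumulative time should
# be attributed to the single call into the dolfinx C++ extension, pointing
# at DofMap::collapse rather than any Python-side work.
with cProfile.Profile() as profiler:
    V, WV_map = W.sub(0).collapse()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)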

@jorgensd (Member, Author)

jorgensd commented Jun 25, 2024

I think we should turn off dof reordering in serial.
Running:

from mpi4py import MPI

from dolfinx import mesh, fem, la, log
from basix.ufl import element, mixed_element
import time

# log.set_log_level(log.LogLevel.INFO)
N = 5
P = 4

# Create a unit cube mesh of hexahedral cells
start = time.perf_counter()
domain = mesh.create_unit_cube(MPI.COMM_WORLD, N, N, N, mesh.CellType.hexahedron)
Ve = element("Lagrange", domain.basix_cell(), P, shape=(domain.geometry.dim,))
Qe = element("Lagrange", domain.basix_cell(), P)
W_el = mixed_element([Ve, Qe])
W = fem.functionspace(domain, W_el)
print(f"Meshing and creating mixed space: {time.perf_counter()-start}")
start2 = time.perf_counter()
V, WV_map = W.sub(0).collapse()
print(f"Collapsing V: {time.perf_counter()-start2}")

start3 = time.perf_counter()

Q, WQ_map = W.sub(1).collapse()
print(f"Collapsing Q: {time.perf_counter()-start3}")

returns the following in serial:

Meshing and creating mixed space: 0.013051556999926106
Collapsing V: 7.033872703999805
Collapsing Q: 0.0023294779998650483

and

Meshing and creating mixed space: 0.020997950000037235
Collapsing V: 0.70820857300032
Collapsing Q: 0.003953482000270014
Meshing and creating mixed space: 0.020988049000152387
Collapsing V: 0.7082019069998751
Collapsing Q: 0.003954929999963497

with 2 processes, i.e. a 10x speedup!
Three processes:

Meshing and creating mixed space: 0.02864133799994306
Collapsing V: 0.34778579499970874
Collapsing Q: 0.003480805999970471
Meshing and creating mixed space: 0.027170727000338957
Collapsing V: 0.34781397200003994
Collapsing Q: 0.0034789139999702456
Meshing and creating mixed space: 0.03079914899990399
Collapsing V: 0.3478091659999336
Collapsing Q: 0.0034698189997470763

I still think it is remarkable that creating the collapsed space is more than 10x as expensive as creating the mixed space itself.
Any comments, @chrisrichardson @mscroggs @garth-wells?

@garth-wells (Member)

We shouldn't turn off re-ordering in serial.

Can someone bisect to find out when the regression was introduced? @chrisrichardson could it be related to recent changes for mixed topology?

@jorgensd (Member, Author)

> We shouldn't turn off re-ordering in serial.
>
> Can someone bisect to find out when the regression was introduced? @chrisrichardson could it be related to recent changes for mixed topology?

It is prior to the 0.7 release.
Haven't tested 0.6 yet.

@chrisrichardson (Contributor)

I don't think mixed topology changes would affect this.
list_timings() reveals:

GPS: create_level_structure                                                 |  4351  0.001285  5.590000
Gibbs-Poole-Stockmeyer ordering                                             |     2  2.885000  5.770000

which seems strange, and looks like some kind of internal issue with the reordering: GPS is only called twice, but builds the level structure thousands of times.
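For context, GPS starts by searching for a pseudo-peripheral node, and that search builds one BFS level structure per candidate root, which would explain the call count above. A minimal, illustrative level-structure routine (plain Python, not the dolfinx implementation):

def level_structure(adjacency, root):
    # BFS from `root`, returning the levels (distance classes). GPS builds
    # one of these per candidate root while hunting for a pseudo-peripheral
    # node, so a large, dense graph means many such traversals.
    levels, seen, frontier = [], {root}, [root]
    while frontier:
        levels.append(frontier)
        nxt = []
        for u in frontier:
            for v in adjacency[u]:
                if v not in seen:
                    seen.add(v)
                    nxt.append(v)
        frontier = nxt
    return levels

# Example: a path graph 0-1-2-3 rooted at 0 gives four singleton levels.
print(level_structure({0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}, 0))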

@jorgensd (Member, Author)

It is present in 0.6 as well:
P=3

Meshing and creating mixed space: 0.5712962300000015
Collapsing V: 0.7795458059999874
Collapsing Q: 0.0005138619999911498

P=4

Meshing and creating mixed space: 0.6977521319999767
Collapsing V: 7.012910768000012
Collapsing Q: 0.0005572910000068987

@chrisrichardson (Contributor)

I added some logging in DofMap.cpp:

if (!reorder_fn)
{
  reorder_fn = [](const graph::AdjacencyList<std::int32_t>& g)
  {
    double nnz_per_row
        = g.offsets().back() / (double)(g.offsets().size() - 1);

    spdlog::info("DofMap collapse: reordering graph with {} nnz per row",
                 nnz_per_row);

    return graph::reorder_gps(g);
  };
}

[2024-06-26 11:30:44.174] [info] DofMap collapse: reordering graph with 106.171875 nnz per row

so it is quite a dense graph.

@chrisrichardson (Contributor)

@jorgensd Re: the speedup in parallel - you should do a weak-scaling test (e.g. double the mesh size with -np 2).
I think you will then find it is not much different.
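A possible weak-scaling setup for the MWE above, keeping the cell count per process roughly constant (the base N=5 and the N ∝ nprocs**(1/3) scaling rule are illustrative assumptions, not from the thread):

from mpi4py import MPI

# create_unit_cube with hexahedra yields N**3 cells, so growing N with the
# cube root of the process count keeps cells-per-rank roughly constant.
nprocs = MPI.COMM_WORLD.size
N = round(5 * nprocs ** (1 / 3))  # N=5 in serial, ~6 on 2 ranks, ...

If the collapse time then stays flat as ranks are added, the earlier "speedup" was just strong scaling of a fixed problem rather than a genuine reduction in work.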

@garth-wells (Member)

We probably build a sparsity graph for the dofs, which isn't a good use of compute effort for high-order elements. We should probably create a graph of vertices, facets, etc, and then lay the dofs for each entity 'on top'.
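A sketch of that idea under stated assumptions (plain numpy, hypothetical helper name, not dolfinx API): reorder the much smaller entity graph first, then expand the entity permutation to a dof permutation, since the dofs attached to one entity can stay contiguous:

import numpy as np

def expand_entity_permutation(entity_perm, dofs_per_entity):
    # entity_perm[old] = new position of each mesh entity, obtained by
    # reordering the entity connectivity graph (vertices, facets, ...).
    # The dof-level graph is never built: dofs simply follow their entity.
    dof_perm = np.empty(len(entity_perm) * dofs_per_entity, dtype=np.int32)
    for old, new in enumerate(entity_perm):
        for k in range(dofs_per_entity):
            dof_perm[old * dofs_per_entity + k] = new * dofs_per_entity + k
    return dof_perm

# Example: 3 entities with 2 dofs each; entity 2 is moved to the front.
print(expand_entity_permutation(np.array([1, 2, 0]), 2))  # [2 3 4 5 0 1]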

@jorgensd (Member, Author)

> @jorgensd Re: the speedup in parallel - you should do a weak-scaling test (e.g. double the mesh size with -np 2). I think you will then find it is not much different.

As in not improving further?

@jorgensd (Member, Author)

jorgensd commented Jun 26, 2024

> We probably build a sparsity graph for the dofs, which isn't a good use of compute effort for high-order elements. We should probably create a graph of vertices, facets, etc, and then lay the dofs for each entity 'on top'.

I think one of the main factors could also be the block size: as you can see, an unblocked (scalar) element is orders of magnitude faster than its vector counterpart.
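In that case the reordering could run on the graph of nodes (blocks) and be expanded by the block size afterwards; a minimal numpy sketch under that assumption (illustrative, not the dolfinx implementation):

import numpy as np

def expand_block_permutation(node_perm, bs):
    # Reorder the node graph once (ndofs / bs entries), then map each
    # node's bs components together. For the vector element above (bs=3)
    # this shrinks the graph handed to the reordering by a factor of 3.
    dof_perm = np.empty(len(node_perm) * bs, dtype=np.int32)
    for old, new in enumerate(node_perm):
        dof_perm[old * bs : (old + 1) * bs] = np.arange(new * bs, (new + 1) * bs)
    return dof_perm

# Example: 3 nodes, block size 3 (a 3D vector element).
print(expand_block_permutation(np.array([2, 0, 1]), 3))  # [6 7 8 0 1 2 3 4 5]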

@garth-wells removed the bug label on Jul 4, 2024
@garth-wells changed the title from "Regression in collapsing blocked subspace" to "Performance regression in collapsing blocked subspace" on Jul 4, 2024
@garth-wells added this to the 0.10.0 milestone on Oct 2, 2024