Add a PROJ_DB_FAST_BUILD=ON/OFF CMake option (default OFF) #4279
"Trigger" for this (pun intended) is that most of the time spent building the GDAL Docker image when cross-building to arm64 goes into building proj.db (close to 7.5h for a target Ubuntu 24.04 arm64!). Setting this new option should cut that to a few minutes.

```
.. option:: PROJ_DB_FAST_BUILD=OFF

    .. versionadded:: 9.5.1

    By default, creation of :file:`proj.db` involves inserting consistency check
    triggers before inserting data records, to be able to catch potential
    inconsistencies. Such checks are useful for core PROJ developers when they
    update the database content, or for advanced PROJ users that customize the
    content of the database. However those checks come with a non-negligible
    cost. On modern hardware, building :file:`proj.db` with those checks enabled
    takes about 50 to 60 seconds (and on scenarios where PROJ is built for other
    architectures with full emulation, several hours). When setting this option
    to ON, those triggers are inserted after data records, which decreases the
    build time to about 3 seconds.
    In short, setting this option to ON is safe if you do not customize yourself
    the .sql files used to build :file:`proj.db`
```

Timings on my machine:

- before this PR:
```
$ time make generate_proj_db
[100%] Generating proj.db
[100%] Built target generate_proj_db

real    0m54,752s
user    0m53,968s
sys     0m0,648s

$ md5sum data/proj.db
beecdc018b4a5131229709b3c7747036  data/proj.db

$ echo ".dump" | sqlite3 data/proj.db | md5sum
64e446efdc5c18e398cc7b6b2e4b3086  -
```

- with this PR, not setting PROJ_DB_FAST_BUILD (so OFF): Same as above

- with this PR, setting PROJ_DB_FAST_BUILD=ON
```
$ cmake .. -DPROJ_DB_FAST_BUILD=ON
$ time make generate_proj_db
[100%] Generating proj.db
[100%] Built target generate_proj_db

real    0m3,243s
user    0m2,876s
sys     0m0,204s

$ md5sum data/proj.db
1955dfdc3f7abada3890bf9b7592770a  data/proj.db

$ echo ".dump" | sqlite3 data/proj.db | md5sum
64e446efdc5c18e398cc7b6b2e4b3086  -
```

One can notice that the binary content of proj.db is not exactly the same; however, the result of dumping it to SQL is exactly the same. The reason for the slight difference is that with PROJ_DB_FAST_BUILD=ON we also skip creating a fake table and trigger, which influences the "schema version number" of the SQLite3 database. That difference is not significant.

Cf. the diff of the ``od -x`` output, which shows that only a few bytes in the SQLite3 header are different.

```
$ diff -u proj.db.slow.txt proj.db.fast.txt
--- proj.db.slow.txt	2024-10-16 08:50:07.211601573 +0200
+++ proj.db.fast.txt	2024-10-16 08:50:16.155615860 +0200
@@ -1,9 +1,9 @@
 0000000 5153 694c 6574 6620 726f 616d 2074 0033
-0000020 0010 0101 4000 2020 0000 1100 0000 d208
-0000040 0000 0000 0000 0000 0000 6700 0000 0400
+0000020 0010 0101 4000 2020 0000 2500 0000 d208
+0000040 0000 0000 0000 0000 0000 6300 0000 0400
 0000060 0000 0000 0000 0000 0000 0100 0000 0000
 0000100 0000 0000 0000 0000 0000 0000 0000 0000
-0000120 0000 0000 0000 0000 0000 0000 0000 1100
+0000120 0000 0000 0000 0000 0000 0000 0000 2500
 0000140 2e00 d93f 0005 0000 0f1a 007e 0000 d208
 0000160 fb0f f60f f10f ec0f e70f e20f dd0f d80f
 0000200 d30f ce0f c90f c40f bf0f ba0f b50f b00f
```
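To make the header difference easier to read than raw ``od -x`` words, here is a minimal Python sketch that decodes three 4-byte big-endian fields of the 100-byte SQLite header. The offsets (24, 40 and 92) follow the published SQLite file format; the function names and the choice of fields are mine and not part of this PR.

```python
import struct

# Offsets of three 4-byte big-endian integers in the 100-byte SQLite header,
# per the SQLite file format documentation. Field names are my own labels.
HEADER_FIELDS = {
    "file_change_counter": 24,
    "schema_cookie": 40,
    "version_valid_for": 92,
}


def read_header_fields(header: bytes) -> dict:
    """Decode the selected header fields from the first 100 bytes of a database."""
    if len(header) < 100 or header[:16] != b"SQLite format 3\x00":
        raise ValueError("not a SQLite 3 database header")
    return {
        name: struct.unpack_from(">I", header, offset)[0]
        for name, offset in HEADER_FIELDS.items()
    }


def differing_fields(path_a: str, path_b: str) -> dict:
    """Return {field: (value_in_a, value_in_b)} for the fields that differ."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        a = read_header_fields(fa.read(100))
        b = read_header_fields(fb.read(100))
    return {name: (a[name], b[name]) for name in a if a[name] != b[name]}
```

Calling `differing_fields("proj.db.slow", "proj.db.fast")` (hypothetical file names) on the two builds should list only these counters. Reading the diff above by hand, the changed words sit at byte offsets 24, 40 and 92, i.e. 0x11 vs 0x25 for the file change counter and version-valid-for number, and 0x67 vs 0x63 for the schema cookie.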
Force-pushed from ee22be4 to 97c9547.
Why would we add an option called
As explained in the doc ;-) "Such checks are useful for core PROJ developers when they update the database content, or for advanced PROJ users that customize the content of the database". We could change the default, but that would mean that when integrating a new EPSG / ESRI / whatever release we must remember to do the build and a test run at least once with the checks enabled. That said, it could also be the job of a CI configuration to have that turned on.
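As a toy illustration of why trigger ordering matters, the sketch below uses Python's built-in `sqlite3` module with a made-up `unit` table and a made-up check. None of this is PROJ's actual schema or trigger set. It shows that a trigger created before the inserts rejects a bad row, while the same trigger created afterwards never sees it.

```python
import sqlite3

# Hypothetical schema, trigger and data; not taken from PROJ's .sql files.
SCHEMA = "CREATE TABLE unit(code TEXT PRIMARY KEY, conv_factor REAL)"
TRIGGER = """
CREATE TRIGGER unit_insert_check BEFORE INSERT ON unit
BEGIN
    SELECT RAISE(ABORT, 'conv_factor must be positive')
    WHERE NEW.conv_factor <= 0;
END
"""
ROWS = [("metre", 1.0), ("bad", -1.0)]


def build(trigger_first: bool) -> str:
    """Build an in-memory database, creating the trigger before or after the data."""
    con = sqlite3.connect(":memory:")
    con.execute(SCHEMA)
    if trigger_first:
        con.execute(TRIGGER)  # checked build: every inserted row is validated
    try:
        con.executemany("INSERT INTO unit VALUES (?, ?)", ROWS)
        outcome = "accepted"
    except sqlite3.DatabaseError as exc:
        outcome = f"rejected: {exc}"
    if not trigger_first:
        con.execute(TRIGGER)  # fast build: existing rows are not re-checked
    con.close()
    return outcome
```

With `trigger_first=True` the second row is rejected; with `trigger_first=False` both rows go in unchecked. That is the trade-off being discussed: the fast ordering is only safe when the data is already known to be consistent.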
My point was to ask why we should make this an option that users would have to make a decision about. I wonder if the behavior should be:
I think I have been both things already, and those checkers saved my life in both cases.
What could potentially be done is to check the md5sum of the concatenated all.sql.in file against a reference value. If it matches, we use the fast way. If it doesn't match, we run once with the slow checks, and once proj.db has been successfully built with them, we output the new md5sum so the maintainer can update it in data/CMakeLists.txt. That way we would have the best of both worlds.
Just did that. Works just fine. Closing this PR as superseded by #4280.
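The checksum idea above can be sketched roughly as follows. This is only an illustration in Python under my own assumptions: the function names, the "fast"/"checked" labels and the way the reference value is passed in are hypothetical, and the actual implementation in #4280 lives in the CMake build and may differ.

```python
import hashlib
from pathlib import Path


def md5_of_files(paths) -> str:
    """MD5 of the concatenated contents of the given files, in order."""
    digest = hashlib.md5()
    for path in paths:
        digest.update(Path(path).read_bytes())
    return digest.hexdigest()


def choose_build_mode(sql_paths, reference_md5: str):
    """Return (mode, actual_md5).

    "fast" when the SQL input matches the reference checksum, otherwise
    "checked", meaning the consistency triggers should run before the inserts.
    """
    actual = md5_of_files(sql_paths)
    mode = "fast" if actual == reference_md5 else "checked"
    return mode, actual
```

When the mode is "checked" and the build succeeds, the returned `actual` value is what a maintainer would record as the new reference.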