Osmium Changeset Boundary Types

This document is reproduced from the OSM Wiki for completeness.

Time-aligned versus Transaction-aligned

Osmosis uses two very different techniques to define changeset boundaries. It may use time boundaries capturing all changes occurring between two points in time, or it may use transaction boundaries representing all changes between two transaction commit points in the database.

Time-aligned replication boundaries are the original mechanism and the simplest to both implement and understand. Identifying changeset data requires a simple query of historical tables based on date ranges. The downside of this method is that each time interval must be queried a long time after the fact to avoid missing data. This is because a database record may be created with a timestamp matching the current time, but the database transaction may not be committed until several minutes or even hours later. Because of these delayed commits, a replication job reading data based on timestamps can only safely query time ranges that are many hours in the past. In practice this typically requires a time lag of over 24 hours, and even a 24 hour delay provides no guarantees. The advantage of time-aligned replication is that each changeset covers a well-defined time range, which makes it easy to identify and consume the correct changesets. Each file can be given a filename based on the time period it represents.
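As a rough illustration of the lag described above, the following Python sketch picks the newest fixed-length interval that ended at least a given lag ago, and derives a date-based filename for it. The function names, the one-hour interval, the 24 hour default lag, and the filename pattern are all assumptions made for this example; they are not taken from Osmosis.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def latest_safe_interval(now, interval=timedelta(hours=1), lag=timedelta(hours=24)):
    """Return (start, end) of the newest interval that ended at least `lag` ago.

    Hypothetical helper: the lag only reduces the risk of missing
    late-committed records, it does not remove it.
    """
    cutoff = now - lag
    # Floor the cutoff to an interval boundary counted from the epoch.
    periods = (cutoff - EPOCH) // interval
    end = EPOCH + periods * interval
    return end - interval, end


def interval_filename(start, end):
    """Build a date-based filename (the pattern is an assumption for this sketch)."""
    fmt = "%Y%m%d%H%M"
    return f"{start.strftime(fmt)}-{end.strftime(fmt)}.osc.gz"


now = datetime(2024, 5, 2, 10, 37, tzinfo=timezone.utc)
start, end = latest_safe_interval(now)
print(interval_filename(start, end))  # 202405010900-202405011000.osc.gz
```

A query for this changeset would then select historical rows whose timestamp falls in the half-open range from `start` to `end`. Any record committed after the job runs but stamped inside that range would still be missed.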

Transaction-aligned replication boundaries are a newer technique that is more complicated, but more powerful. Changeset data is identified by querying for it based on which transactions created it. Only transactions that are known to have committed are queried, and those still in-flight are picked up in later changesets once they commit. This results in changesets that contain data with non-deterministic timestamps. Each changeset will contain data with timestamps anywhere from the present to many hours in the past, depending on how long transactions were in-flight. Transaction-aligned changesets cannot be given nice date-based filenames, and are instead named using monotonically increasing numbers. Each changeset file is given a number that is one greater than the previous changeset. Each changeset is accompanied by a state file that includes information such as the timestamp of the newest record in the changeset, to make it easier to identify which changeset should be used. The downside of transactional changesets is that fine-grained changesets covering an approximate time period cannot be generated after the fact. In other words, minute changesets can only be produced with reasonable accuracy if they are created every minute as transactions are committed.
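The behaviour described above can be sketched with a small in-memory model in Python. This is an illustration only, not how Osmosis is implemented: the `Txn` and `Replicator` classes and the state dictionary keys are hypothetical, and a real job would obtain commit information from the database rather than from a flag.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Txn:
    txid: int
    record_time: datetime  # timestamp stored on the records it wrote
    committed: bool = False


@dataclass
class Replicator:
    sequence: int = 0                        # number of the last changeset written
    done: set = field(default_factory=set)   # transaction ids already replicated

    def run(self, txns):
        """Emit one changeset holding every committed, not yet replicated transaction."""
        ready = [t for t in txns if t.committed and t.txid not in self.done]
        if not ready:
            return None  # nothing new has committed, so no changeset is produced
        self.done.update(t.txid for t in ready)
        self.sequence += 1  # one greater than the previous changeset
        state = {
            "sequenceNumber": self.sequence,
            # Timestamp of the newest record in this changeset.
            "timestamp": max(t.record_time for t in ready),
        }
        return state, sorted(t.txid for t in ready)


def at(hour):
    return datetime(2024, 5, 1, hour, 0, tzinfo=timezone.utc)


txns = [Txn(1, at(9), True), Txn(2, at(10), False), Txn(3, at(11), True)]
rep = Replicator()
first = rep.run(txns)    # changeset 1 holds transactions 1 and 3
txns[1].committed = True
second = rep.run(txns)   # changeset 2 holds the late transaction 2
```

In this run the second changeset carries an older newest-record timestamp (10:00) than the first (11:00), which is why the files are numbered sequentially rather than named by date.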

The majority of replication jobs should now use transaction-aligned changeset boundaries. The exception is historical changesets produced well after the fact, where long-running transactions can be ignored.