CouchDB Architecture

There are several queries we need to put in the Couch GTFS:

Get trips that serve stop A before stop B

In relational terms, this requires joining a table to itself, and is relatively easy to do in SQL:

SELECT trip_id 
FROM vta.stop_times s1
JOIN vta.stop_times s2 USING (trip_id)
JOIN vta.trips USING (trip_id)
WHERE s1.stop_id = '5228'
      AND s2.stop_id = '4543'
      AND s2.stop_sequence > s1.stop_sequence
      AND route_id = '10';

(replace vta with your transit agency, of course). In Couch, this is tricky; somehow we need to pass parameters for the stop_ids into our view, but that will break indexing. In Postgres on my 2.8GHz P4, that query runs in 40ms on the VTA database, which has 365,000 rows. Also, I think we would want to scan trips and then pull in the stop times; I think we'd use a map to find all the trips (like this query does) by reading every trip, and trying to retrieve the trip_id,stop_id pair (using a second view?) and then a reduce to find the last trip. Here's the EXPLAIN ... output, for reference:

 Nested Loop  (cost=1327.81..1346.68 rows=2 width=8)
   ->  Merge Join  (cost=1327.81..1330.11 rows=2 width=16)
         Merge Cond: ((s1.trip_id)::text = (s2.trip_id)::text)
         Join Filter: (s2.stop_sequence > s1.stop_sequence)
         ->  Sort  (cost=999.40..1000.28 rows=353 width=12)
               Sort Key: s1.trip_id
               ->  Bitmap Heap Scan on stop_times s1  (cost=11.04..984.46 rows=353 width=12)
                     Recheck Cond: ((stop_id)::text = '5228'::text)
                     ->  Bitmap Index Scan on stop_times_ix1  (cost=0.00..10.95 rows=353 width=0)
                           Index Cond: ((stop_id)::text = '5228'::text)
         ->  Sort  (cost=328.42..328.65 rows=94 width=12)
               Sort Key: s2.trip_id
               ->  Bitmap Heap Scan on stop_times s2  (cost=5.03..325.34 rows=94 width=12)
                     Recheck Cond: ((stop_id)::text = '4543'::text)
                     ->  Bitmap Index Scan on stop_times_ix1  (cost=0.00..5.00 rows=94 width=0)
                           Index Cond: ((stop_id)::text = '4543'::text)
   ->  Index Scan using trips_pkey on trips  (cost=0.00..8.27 rows=1 width=8)
         Index Cond: ((trips.trip_id)::text = (s1.trip_id)::text)
         Filter: ((trips.route_id)::text = '10'::text)

Get the nearest stop that is served by a particular route

(note: the example query does not take direction into account, because the sample data I have in Postgres (Santa Clara VTA) does not have directionality. Adding a filter on the direction_id column is trivial)

SELECT DISTINCT route_id, stop_id, ST_Distance(transform(stops.geom, 2227), transform(ST_GeomFromEWKT('SRID=4326;POINT(-122.04227 37.37852)'), 2227)) as distance
FROM vta.routes
JOIN vta.trips USING (route_id)
JOIN vta.stop_times USING (trip_id)
JOIN vta.stops USING (stop_id)
WHERE ST_Distance(transform(stops.geom, 2227), transform(ST_GeomFromEWKT('SRID=4326;POINT(-122.04227 37.37852)'), 2227)) < 5280
  AND route_id = '22'
ORDER BY 3 ASC
LIMIT 1;

There's a lot of stuff to unpack in there... first of all, there's a lot of tables pulled in; trips is used only to join routes to stop_times, while stops is used only to calculate distances. Figuring out how to make all that happen in non-relational land is going to be a task. Also, the use of SELECT DISTINCT will require a reduce function that deduplicates. There's a lot of spatial SQL in there (the long ST_Distance(...) transforms the stop geometry to EPSG:2227 (California State Plane North, survey feet, the most appropriate projection for VTA data; for TriMet it'll be 2913 or 2269) and then gets the distance to a know lat/lon in feet. If that's less than 5280' (one mile), the row is considered). The distance is used to prune rows beyond the scope of the search (for instance, we needn't consider stops in Tualatin as possible return points from a trip that terminated in Pioneer Square). The route_id is also considered. ORDER BY 3 ASC orders by distance ascending, and LIMIT 1 returns only the closest candidate.

This query was very slow when I first ran it, taking over 11 seconds to complete, but once I filtered by distance to only consider nearby stops, it now runs in 479 ms in a reasonably dense area. There are probably ways to speed it up; decreasing the pruning distance is one. My machine isn't particularly fast, either.

I don't even know how to begin to implement this in Couch... there's so much spatial SQL, and pulling in the lat, lon, route and direction of course breaks indexing...

The obligatory EXPLAIN ... output


 Limit  (cost=13665.71..13665.71 rows=1 width=107)
   ->  Sort  (cost=13665.71..13686.45 rows=8295 width=107)
         Sort Key: (st_distance(transform(stops.geom, 2227), '0101000020B308000050877BB3455257414690408AD2F73D41'::geometry))
         ->  HashAggregate  (cost=11446.80..13624.23 rows=8295 width=107)
               ->  Hash Join  (cost=1276.57..11384.58 rows=8295 width=107)
                     Hash Cond: ((stop_times.stop_id)::text = (stops.stop_id)::text)
                     ->  Hash Join  (cost=191.93..8037.02 rows=22798 width=7)
                           Hash Cond: ((stop_times.trip_id)::text = (trips.trip_id)::text)
                           ->  Seq Scan on stop_times  (cost=0.00..6246.08 rows=365608 width=12)
                           ->  Hash  (cost=186.21..186.21 rows=458 width=11)
                                 ->  Nested Loop  (cost=0.00..186.21 rows=458 width=11)
                                       ->  Seq Scan on routes  (cost=0.00..2.01 rows=1 width=3)
                                             Filter: ((route_id)::text = '22'::text)
                                       ->  Seq Scan on trips  (cost=0.00..179.61 rows=458 width=11)
                                             Filter: ((trips.route_id)::text = '22'::text)
                     ->  Hash  (cost=1068.68..1068.68 rows=1276 width=104)
                           ->  Seq Scan on stops  (cost=0.00..1068.68 rows=1276 width=104)
                                 Filter: (st_distance(transform(geom, 2227), '0101000020B308000050877BB3455257414690408AD2F73D41'::geometry
) < 5280::double precision)

Matt's Solution 11/11/11

Here's my suggestion, which requires moving quite a bit of stuff from the DB to the client. For the first query (trips that serve A before B), we already know that we are looking for stop_times that are at a or b, so we just make a view that indexes stop_times by [stop_id, route, direction(?), time] and then we can use startkey and endkey for each stop, which will pull down all the stop times for the given route between now and a time to be decided. We then put all of these into a hash client-side by trip ID and throw out trips that are not in both stops or have b.stop_sequence < a.stop_sequence; then we sort by time to get the latest. If we support transfers (ie. a -> b -> c), we do a reverse search, pulling out the last trip that goes b -> c, then the last trip that goes a -> b before (with a transfer window) the last trip for b -> c.

If a bus runs every 5 minutes 24 hours/day (and we probably wouldn't request 24 hours, probably just from about 8pm-3am; need to check if any lines run 24h), there are 12 runs/hr or 12*24 = 288 runs/day. For two stops, we get a theoretical max of 576 records returned, which does not seem unreasonable.

The only issue is that we need to deal with the calendar.

Update: this won't work, because stop_times don't have a route_id, just a trip_id

Getting arbitrary documents all at once

The docs seems to indicate this request won't work, but it does. http://transitappliance.couchone.com/destinations/_all_docs?keys=[%227%22]&include_docs=true&callback=cb

CC-BY 3.0 by Portland Transport.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly