74 130 clean feed and route type cleaner #195
Conversation
…ction to get errors
…idate_route_type_warnings
…ement core cleaner
… table (in line with gtfs-kit)
…for _gtfs_defence; defence for _type param in _add_validation_row; warning for non-valid trip_id
…onal param as not relevant!
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##              dev     #195      +/-   ##
==========================================
+ Coverage   98.17%   98.67%   +0.50%
==========================================
  Files          21       21
  Lines        1915     2034     +119
==========================================
+ Hits         1880     2007     +127
+ Misses         35       27       -8
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Review started but not finished. Since we're delaying this PR to complete others, just pushing the comments I have so far in case I don't come back to this PR in the future.
```python
def _function_pipeline(
    gtfs, func_map: dict, operations: Union[dict, type(None)]
) -> None:
    """Iterate through and act on a functional pipeline."""
```
Missing docstring.
```python
from transport_performance.gtfs.routes import (
    scrape_route_type_lookup,
    get_saved_route_type_lookup,
)

from transport_performance.gtfs.gtfs_utils import _function_pipeline
from transport_performance.utils.defence import (
```
Would it be easier to just import defence with an alias instead of this many individual functions?
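For illustration, something along these lines (a sketch only; the alias name is arbitrary):

```python
# Import the defence module under an alias instead of listing each helper.
from transport_performance.utils import defence as d

# Checks are then referenced through the namespace, e.g.:
# d._gtfs_defence(gtfs, "gtfs")
# d._type_defence(message, "message", str)
```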
```python
self.validity_df = pd.DataFrame(
    columns=["type", "message", "table", "rows"]
)
_function_pipeline(
```
See https://docs.python.org/3/faq/programming.html#how-do-i-use-strings-to-call-functions-methods

You have chosen the first ("best") option: a dictionary mapping names to functions so that those functions can be called by name later. The issue I see with this in this instance is that `_function_pipeline` needs two arguments: first the hardcoded dict (which will need to be updated whenever the functions change), and then another dict with the function names and arguments. When calling a function with no arguments, `None` has to be used.

A potential alternative is the second option in the link. That way, a toml file with the function names and arguments can be loaded into the script, removing the need for a hardcoded dict; users would only need to change or add toml files to get different behaviours. The function can then be called like, e.g.:

`getattr(module, "function_name")(**kwargs)`
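For concreteness, a minimal sketch of this toml-driven approach (the config layout, file name and module name here are assumptions, not code from this PR):

```python
import importlib
import tomllib  # Python 3.11+; the third-party toml package behaves similarly


def run_pipeline_from_toml(gtfs, config_pth: str, module_name: str) -> None:
    """Run each function named in a toml file against a GtfsInstance.

    Illustrative only: assumes each toml table is a function name whose
    keys/values become keyword arguments, e.g. a [core_validation] table.
    """
    with open(config_pth, "rb") as f:
        operations = tomllib.load(f)
    module = importlib.import_module(module_name)
    for func_name, kwargs in operations.items():
        # Look up the function by name and pass any configured kwargs.
        getattr(module, func_name)(gtfs, **kwargs)
```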
```diff
@@ -163,7 +170,7 @@ def test__add_validation_row_defence(self):
     def test__add_validation_row_on_pass(self):
         """General tests for _add_test_validation_row()."""
         gtfs = GtfsInstance(gtfs_pth=GTFS_FIX_PTH)
-        gtfs.is_valid(far_stops=False)
+        gtfs.is_valid(validators={"core_validation": None})
```
This kind of behaviour is not documented. According to the `is_valid` docstring, `validators` is "a dictionary of function name to kwargs mappings". However, in this case it is a dictionary of function name to `None`, which is not a keyword argument and is not a valid argument for `core_validation`.

I can see some logic in `_function_pipeline`, but I would not anticipate this behaviour without checking the source code.

If I understand correctly, when `None` is passed, no keywords are passed into the function apart from `gtfs`. This is used as a way to select validation methods: instead of selecting validation functions by changing the `func_map` argument in `_function_pipeline`, only the functions passed to `validators` are run. Because none of the validators have arguments other than `gtfs`, `None` must be passed, and some logic in `_function_pipeline` handles that behaviour. If my understanding is correct, this is a very convoluted way of selecting which validations to run. This would need to be clearly documented, and is related to my other comment about `func_map`.
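To make the reading above concrete, the described behaviour amounts to roughly the following (a sketch of how the dispatch appears to work, not the actual implementation):

```python
from typing import Union


def _dispatch_sketch(gtfs, func_map: dict, operations: Union[dict, None]) -> None:
    """Sketch of the apparent behaviour; names mirror _function_pipeline."""
    # With no selection given, run everything registered in the hardcoded map.
    if operations is None:
        operations = {name: None for name in func_map}
    for name, kwargs in operations.items():
        func = func_map[name]
        if kwargs is None:
            # None means "call with no keyword arguments beyond gtfs".
            func(gtfs)
        else:
            func(gtfs, **kwargs)
```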
```
raise AttributeError(
_gtfs_defence(gtfs, "gtfs")
_type_defence(_type, "_type", str)
_type_defence(message, "message", str)
```
Missing defence for table
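Presumably a check mirroring the existing `_type_defence` calls shown above would do, e.g.:

```python
# Suggested addition alongside the other checks (illustrative):
_type_defence(table, "table", str)
```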
```python
    if validate:
        gtfs.is_valid()


def clean_consecutive_stop_fast_travel_warnings(gtfs) -> None:
    """Clean 'Fast Travel Between Consecutive Stops' warnings from validity_df.
```
When running this function on the Wales GTFS, it seems that it also purges instances of fast travel between multiple stops. So when running:

```python
gtfs.is_valid()
clean_consecutive_stop_fast_travel_warnings(gtfs)
gtfs.is_valid()
```

it gets rid of all `Fast Travel Between Consecutive Stops` warnings as well as all `Fast Travel Over Multiple Stops` warnings. As a result, running `clean_multiple_stop_fast_travel_warnings` after `clean_consecutive_stop_fast_travel_warnings` has no effect. I'm not sure if there may be instances where this is not the case, but I've tried it with GTFS from Wales, Scotland and London.

Considering this, it is unlikely that we'll run `clean_consecutive_stop_fast_travel_warnings` in isolation, as it could leave some problematic trips undetected. Would it make sense to merge them into a single function and apply them sequentially? Or to run `clean_multiple_stop_fast_travel_warnings` by default, with an optional flag to also run the other step if requested, always after the former? This way we would be refactoring these two functions into a single one and ensuring the cleaning steps are applied in the right order.
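A rough sketch of what the merged cleaner could look like (the function name and flag are illustrative, not code from this PR):

```python
def clean_fast_travel_warnings(gtfs, include_consecutive: bool = True) -> None:
    """Illustrative merged cleaner enforcing the order discussed above.

    Always cleans 'Fast Travel Over Multiple Stops' warnings first, then
    optionally the 'Fast Travel Between Consecutive Stops' warnings.
    """
    # Both helpers already exist in this PR; import path omitted here.
    clean_multiple_stop_fast_travel_warnings(gtfs)
    if include_consecutive:
        clean_consecutive_stop_fast_travel_warnings(gtfs)
```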
```diff
@@ -182,7 +193,6 @@ def filter_gtfs_around_trip(
     gtfs,
     trip_id: str,
     buffer_dist: int = 10000,
```
If this function is not using meters anymore, this default needs to be changed or removed. Currently it's using 10,000 km.
```diff
@@ -182,7 +193,6 @@ def filter_gtfs_around_trip(
     gtfs,
```
This function is not being used anywhere in cleaners or validators. What is its purpose?
```diff
@@ -26,6 +31,28 @@
     200: 120,
```
Probably worth adding a warning when route type not found and using the 200 km/h default?
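Something along these lines, perhaps (the 200 km/h fallback is taken from this comment; the helper and lookup names are illustrative):

```python
import warnings


def _lookup_speed(route_type: int, speed_lookup: dict) -> int:
    """Illustrative helper: warn when falling back to the default speed."""
    if route_type not in speed_lookup:
        warnings.warn(
            f"Route type {route_type} not found in the speed lookup; "
            "falling back to the 200 km/h default.",
            UserWarning,
        )
        return 200
    return speed_lookup[route_type]
```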
Migrated to datasciencecampus/assess_gtfs#20
Description
This PR includes major improvements to the cleaning and validation of a GTFS file.
There are three main aspects of this PR:
GtfsInstance
Fixes #130
Fixes #74
Fixes #181
Fixes #134
Fixes #72
Fixes #182
Fixes #180
Fixes #73
Issue Distribution
After reviewing the issues fixed in this PR, I have decided to assign the issues to the following people. My main approach for this is limiting @ethan-moss to only the pipelined approach to cleaners/validators, along with tech debt issues raised by himself. The rest of the issues will be assigned to @r-leyshon. This list is subject to change, however.
Ethan
`clean_feed()` when shape_id not available #74 (intertwined with pipelined approach)
`clean_feed()` pipeline #181

Rich
`gtfs` cleaner for `Unrecognised column` warnings #72
`gtfs` cleaner to handle `repeated pair (trip_id, departure_time)` warnings #73

I also advise that Rich pick up lines that may not directly relate to the issue descriptions, such as additional utilities.
File Distribution
Similar to above, the files/functions/classes must be allocated to certain reviewers.
Ethan
`_function_pipeline` (and some fixed tech debt that is mentioned in the issues)
`is_valid` and `clean_feed` pipeline approach

Rich
Everything else (excluding `_function_pipeline` and tech debt)
`html_report` and `_extended_validation`
Type of change
How Has This Been Tested?
Test configuration details:
Checklist:
Additional comments