Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

74 130 clean feed and route type cleaner #195

Closed
wants to merge 39 commits into from

Conversation

CBROWN-ONS
Copy link
Collaborator

@CBROWN-ONS CBROWN-ONS commented Oct 30, 2023

Description

This PR includes huge improvements to the cleaning/validating of a GTFS file.

There are three main aspects of this PR:

  1. A pipelined approach to cleaning and validating the GTFS
  2. New cleaners and validators for GtfsInstance
  3. GTFS validation/cleaning tech debt fixes

Fixes #130
Fixes #74
Fixes #181
Fixes #134
Fixes #72
Fixes #182
Fixes #180
Fixes #73

Issue Distribution

After reviewing the issues fixed in this PR, I have decided to assign the issues to the following people. My main approach for this is limiting @ethan-moss to only the pipelined approach to cleaners/validators, along with tech debt issues raised by himself. The rest of the issues with be assigned to @r-leyshon . This list is subject to change however.

Ethan

Rich

I also advise that Rich pick up lines that my not directly relate to the issue descriptions, such as additional utilities.

File Distribution

Similar to above, the files/functions/classes must be allocated to certain reviewers.

Ethan

  • gtfs_utils.py (only function _function_pipeline and some fixed tech debt that is mentioned in the issues)
  • validation.py, some tech debt resolved, is_valid and clean_feed pipeline approach

Rich

  • cleaners.py
  • gtfs_utils.py (all but the function _function_pipeline and tech debt)
  • validation.py, changes to html_report and _extended_validation
  • validators.py
  • constants.py
  • test_defence.py - added unit test requested in tech debt ticket

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected) (MAYBE?)

How Has This Been Tested?

Test configuration details:

  • OS: Windows 10
  • Python version: 3.9.13
  • Java version: N/A
  • Python management system: Conda

Checklist:

  • My code follows the intended structure of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules

Additional comments

…for _gtfs_defence; defence for _type param in _add_validation_row; warning for non-valid trip_id
@codecov-commenter
Copy link

codecov-commenter commented Oct 30, 2023

Codecov Report

Attention: Patch coverage is 99.41520% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 98.67%. Comparing base (7bb4863) to head (f9ad23d).

Files Patch % Lines
src/transport_performance/gtfs/validation.py 95.23% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #195      +/-   ##
==========================================
+ Coverage   98.17%   98.67%   +0.50%     
==========================================
  Files          21       21              
  Lines        1915     2034     +119     
==========================================
+ Hits         1880     2007     +127     
+ Misses         35       27       -8     
Flag Coverage Δ
unittests 98.67% <99.41%> (+0.50%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ethan-moss ethan-moss removed their request for review November 14, 2023 12:04
@SergioRec SergioRec self-assigned this Nov 14, 2023
@SergioRec SergioRec self-requested a review November 14, 2023 14:02
Copy link
Contributor

@SergioRec SergioRec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review started but not finished. Since we're delaying this PR to complete others, just pushing the comments I have so far in case I don't come back to this PR in the future.

def _function_pipeline(
gtfs, func_map: dict, operations: Union[dict, type(None)]
) -> None:
"""Iterate through and act on a functional pipeline."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing docstring.

from transport_performance.gtfs.routes import (
scrape_route_type_lookup,
get_saved_route_type_lookup,
)

from transport_performance.gtfs.gtfs_utils import _function_pipeline
from transport_performance.utils.defence import (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be easier to just import defence with an alias instead of this many individual functions?

self.validity_df = pd.DataFrame(
columns=["type", "message", "table", "rows"]
)
_function_pipeline(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See https://docs.python.org/3/faq/programming.html#how-do-i-use-strings-to-call-functions-methods

You have chosen the first ("best") option, a dictionary assigning names to function so you can call those functions by name later. The issue I see with this in this instance is that _function_pipeline needs two arguments, first the hardcoded dict (which will need to be updated when functions change) and then another dict with the function name and arguments. When calling a function with no arguments, then None has to be used.

A potential way could be to use the second option in the link. This way, a toml file with the function names and arguments can be loaded into the script. There would be no need of hardcoded dict. Users would only need to change or add other toml files to get different behaviours. The function can be called like, e.g.:

getattr(module, "function_name")(**kwargs)

@@ -163,7 +170,7 @@ def test__add_validation_row_defence(self):
def test__add_validation_row_on_pass(self):
"""General tests for _add_test_validation_row()."""
gtfs = GtfsInstance(gtfs_pth=GTFS_FIX_PTH)
gtfs.is_valid(far_stops=False)
gtfs.is_valid(validators={"core_validation": None})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This kind of behaviour is not documented. According to is_valid docstrings, validators is "a dictionary of function name to kwargs mappings". However, in this case this is a dictionary of function name and None, which is not a keyword argument and is not a valid argument for core_validation.

I can see some logic in _function_pipeline, but I would not anticipate this behaviour without checking the source code.

If I understand correctly, if None is passed, then no keywords are passed into the function apart from gtfs. This is used as a way to select validation methods. So instead of selecting validation functions by changing the func_map argument in _function_pipeline, only functions passed to validators are run. Because none of the validators have arguments other than gtfs, then None must be passed, and some logic in _function_pipeline handles that behaviour. If my understanding is correct, this is a very convoluted way of selecting which validations to run. This would need to be clearly documented, and is related to my other comment about func_map.

raise AttributeError(
_gtfs_defence(gtfs, "gtfs")
_type_defence(_type, "_type", str)
_type_defence(message, "message", str)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing defence for table

if validate:
gtfs.is_valid()
def clean_consecutive_stop_fast_travel_warnings(gtfs) -> None:
"""Clean 'Fast Travel Between Consecutive Stops' warnings from validity_df.
Copy link
Contributor

@SergioRec SergioRec Nov 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When running this function in the Wales GTFS, it seems that it also purges instances of fast travel between multiple stops.

So when running:

gtfs.is_valid()
clean_consecutive_stop_fast_travel_warnings(gtfs)
gtfs.is_valid()

It gets rid of all Fast Travel Between Consecutive Stops warnings as well as all Fast Travel Over Multiple Stops warnings. This causes that running clean_multiple_stop_fast_travel_warnings after clean_consecutive_stop_fast_travel_warnings has no effect. I'm not sure if there may be instances where this is not the case, but I've tried it with GTFS from Wales, Scotland and London.

Considering this, it is unlikely that we'll run clean_consecutive_stop_fast_travel_warnings in isolation, as it could leave some problematic trips undetected. Would it make sense to merge them into a single function and apply them sequentially? Or run clean_multiple_stop_fast_travel_warnings by default, and have an optional flag to also run the other step if requested, always after the former? This way we would be refactoring these two functions into a single one and ensuring cleaning steps are applied in the right order.

@@ -182,7 +193,6 @@ def filter_gtfs_around_trip(
gtfs,
trip_id: str,
buffer_dist: int = 10000,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this function is not using meters anymore, this default needs to be changed or removed. Currently it's using 10,000 km.

@@ -182,7 +193,6 @@ def filter_gtfs_around_trip(
gtfs,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is not being used anywhere in cleaners or validators. What is its purpose?

tests/utils/test_defence.py Show resolved Hide resolved
@@ -26,6 +31,28 @@
200: 120,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth adding a warning when route type not found and using the 200 km/h default?

@r-leyshon
Copy link
Contributor

Migrated to datasciencecampus/assess_gtfs#20

@r-leyshon r-leyshon closed this Aug 26, 2024
@r-leyshon r-leyshon deleted the 74-130-clean-feed-and-route-type-cleaner branch August 26, 2024 08:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment