74 130 clean feed and route type cleaner #195

CBROWN-ONS · 2023-10-30T15:20:06Z

Description

This PR includes huge improvements to the cleaning/validating of a GTFS file.

There are three main aspects of this PR:

A pipelined approach to cleaning and validating the GTFS
New cleaners and validators for GtfsInstance
GTFS validation/cleaning tech debt fixes

Fixes #130
Fixes #74
Fixes #181
Fixes #134
Fixes #72
Fixes #182
Fixes #180
Fixes #73

Issue Distribution

After reviewing the issues fixed in this PR, I have decided to assign them to the following people. My main approach is to limit @ethan-moss to the pipelined approach to cleaners/validators, along with the tech debt issues he raised himself. The rest of the issues will be assigned to @r-leyshon. This list is subject to change, however.

Ethan

Improve clean_feed() when shape_id not available #74 (intertwined with pipelined approach)
gtfs clean_feed() pipeline #181
close out fast travel TODO comments #182
fast travel cleaners tech-debt #180

Rich

I also advise that Rich pick up lines that may not directly relate to the issue descriptions, such as additional utilities.

File Distribution

Similar to above, the files/functions/classes must be allocated to certain reviewers.

Ethan

gtfs_utils.py (only function _function_pipeline and some fixed tech debt that is mentioned in the issues)
validation.py, some tech debt resolved, is_valid and clean_feed pipeline approach

Rich

cleaners.py
gtfs_utils.py (all but the function _function_pipeline and tech debt)
validation.py, changes to html_report and _extended_validation
validators.py
constants.py
test_defence.py - added unit test requested in tech debt ticket

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected) (MAYBE?)

How Has This Been Tested?

Test configuration details:

OS: Windows 10
Python version: 3.9.13
Java version: N/A
Python management system: Conda

Checklist:

My code follows the intended structure of this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
Any dependent changes have been merged and published in downstream modules

Additional comments

codecov-commenter · 2023-10-30T15:28:12Z

Codecov Report

Attention: Patch coverage is 99.41520%, with 1 line in your changes missing coverage. Please review.

Project coverage is 98.67%. Comparing base (7bb4863) to head (f9ad23d).

Files	Patch %	Lines
src/transport_performance/gtfs/validation.py	95.23%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##              dev     #195      +/-   ##
==========================================
+ Coverage   98.17%   98.67%   +0.50%     
==========================================
  Files          21       21              
  Lines        1915     2034     +119     
==========================================
+ Hits         1880     2007     +127     
+ Misses         35       27       -8

Flag	Coverage Δ
unittests	`98.67% <99.41%> (+0.50%)`	⬆️

SergioRec

Review started but not finished. Since we're delaying this PR to complete others, just pushing the comments I have so far in case I don't come back to this PR in the future.

SergioRec · 2023-11-14T09:32:08Z

src/transport_performance/gtfs/gtfs_utils.py

+def _function_pipeline(
+    gtfs, func_map: dict, operations: Union[dict, type(None)]
+) -> None:
+    """Iterate through and act on a functional pipeline."""


Missing docstring.

SergioRec · 2023-11-14T09:38:08Z

src/transport_performance/gtfs/validation.py

 from transport_performance.gtfs.routes import (
    scrape_route_type_lookup,
    get_saved_route_type_lookup,
 )
+
+from transport_performance.gtfs.gtfs_utils import _function_pipeline
 from transport_performance.utils.defence import (


Would it be easier to just import defence with an alias instead of this many individual functions?

SergioRec · 2023-11-14T10:15:29Z

src/transport_performance/gtfs/validation.py

+        self.validity_df = pd.DataFrame(
+            columns=["type", "message", "table", "rows"]
+        )
+        _function_pipeline(


See https://docs.python.org/3/faq/programming.html#how-do-i-use-strings-to-call-functions-methods

You have chosen the first ("best") option, a dictionary assigning names to functions so you can call those functions by name later. The issue I see here is that _function_pipeline needs two arguments: first the hardcoded dict (which will need to be updated when functions change), and then another dict with the function names and arguments. When calling a function with no arguments, None has to be used.

An alternative would be to use the second option in the link. This way, a toml file with the function names and arguments can be loaded into the script. There would be no need for a hardcoded dict. Users would only need to change or add toml files to get different behaviours. The function can be called like this, e.g.:

getattr(module, "function_name")(**kwargs)
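A minimal sketch of that second option, assuming the configuration has already been parsed into a dict (the shape a TOML loader would return). The `cleaners` namespace, `run_from_config`, and the two toy functions are hypothetical stand-ins for illustration, not code from this PR.

```python
from types import SimpleNamespace


def drop_empty_rows(rows):
    """Hypothetical cleaner: remove empty entries."""
    return [r for r in rows if r]


def drop_short_names(rows, min_len=3):
    """Hypothetical cleaner: remove entries shorter than min_len."""
    return [r for r in rows if len(r) >= min_len]


# Stand-in for a real module such as cleaners.py (assumption for this sketch).
cleaners = SimpleNamespace(
    drop_empty_rows=drop_empty_rows,
    drop_short_names=drop_short_names,
)

# Stand-in for a parsed TOML file: one table of kwargs per function name.
# A function with no arguments gets an empty table, so None is never needed.
config = {
    "drop_empty_rows": {},
    "drop_short_names": {"min_len": 4},
}


def run_from_config(module, data, config):
    """Look up each configured function by name and call it with its kwargs."""
    for func_name, kwargs in config.items():
        func = getattr(module, func_name, None)
        if not callable(func):
            raise ValueError(f"Unknown function in config: {func_name}")
        data = func(data, **kwargs)
    return data


result = run_from_config(cleaners, ["", "bus", "tram", "rail"], config)
```

Unknown names are rejected explicitly here, since `getattr` on a module would otherwise accept any attribute, including ones that are not cleaners.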

SergioRec · 2023-11-14T13:19:34Z

tests/gtfs/test_gtfs_utils.py

@@ -163,7 +170,7 @@ def test__add_validation_row_defence(self):
    def test__add_validation_row_on_pass(self):
        """General tests for _add_test_validation_row()."""
        gtfs = GtfsInstance(gtfs_pth=GTFS_FIX_PTH)
-        gtfs.is_valid(far_stops=False)
+        gtfs.is_valid(validators={"core_validation": None})


This kind of behaviour is not documented. According to is_valid docstrings, validators is "a dictionary of function name to kwargs mappings". However, in this case this is a dictionary of function name and None, which is not a keyword argument and is not a valid argument for core_validation.

I can see some logic in _function_pipeline, but I would not anticipate this behaviour without checking the source code.

If I understand correctly, if None is passed, then no keywords are passed into the function apart from gtfs. This is used as a way to select validation methods. So instead of selecting validation functions by changing the func_map argument in _function_pipeline, only functions passed to validators are run. Because none of the validators have arguments other than gtfs, then None must be passed, and some logic in _function_pipeline handles that behaviour. If my understanding is correct, this is a very convoluted way of selecting which validations to run. This would need to be clearly documented, and is related to my other comment about func_map.
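The behaviour described above can be sketched as follows. This illustrates the reading given in this comment and is not the code from the PR: `run_pipeline`, `FUNC_MAP`, and `toy_speed_check` are hypothetical, and running every function when `operations` is None is an assumption.

```python
from typing import Callable, Dict, Optional


def core_validation(gtfs: dict) -> None:
    """Toy validator that takes no keyword arguments."""
    gtfs["log"].append("core_validation")


def toy_speed_check(gtfs: dict, max_speed: int = 200) -> None:
    """Hypothetical validator that accepts one keyword argument."""
    gtfs["log"].append(f"toy_speed_check(max_speed={max_speed})")


# Hardcoded name -> function mapping (the first argument discussed above).
FUNC_MAP: Dict[str, Callable] = {
    "core_validation": core_validation,
    "toy_speed_check": toy_speed_check,
}


def run_pipeline(
    gtfs: dict,
    func_map: Dict[str, Callable],
    operations: Optional[Dict[str, Optional[dict]]] = None,
) -> None:
    """Run the selected functions; a None value means 'no extra kwargs'."""
    if operations is None:
        # Assumption: with no selection, run everything with defaults.
        operations = {name: None for name in func_map}
    for name, kwargs in operations.items():
        if name not in func_map:
            raise KeyError(f"'{name}' is not a known function")
        # None is normalised to an empty dict before unpacking.
        func_map[name](gtfs, **(kwargs or {}))


feed = {"log": []}
run_pipeline(feed, FUNC_MAP, {"core_validation": None})

feed_kwargs = {"log": []}
run_pipeline(feed_kwargs, FUNC_MAP, {"toy_speed_check": {"max_speed": 120}})
```

In this sketch the keys of `operations` act as the selection mechanism, which is the part that would need documenting in the is_valid docstring.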

SergioRec · 2023-11-14T13:27:27Z

src/transport_performance/gtfs/gtfs_utils.py

-        raise AttributeError(
+    _gtfs_defence(gtfs, "gtfs")
+    _type_defence(_type, "_type", str)
+    _type_defence(message, "message", str)


Missing defence for table

SergioRec · 2023-11-15T09:03:47Z

src/transport_performance/gtfs/cleaners.py

-    if validate:
-        gtfs.is_valid()
+def clean_consecutive_stop_fast_travel_warnings(gtfs) -> None:
+    """Clean 'Fast Travel Between Consecutive Stops' warnings from validity_df.


When running this function on the Wales GTFS, it seems that it also purges instances of fast travel between multiple stops.

So when running:

gtfs.is_valid()
clean_consecutive_stop_fast_travel_warnings(gtfs)
gtfs.is_valid()

It gets rid of all Fast Travel Between Consecutive Stops warnings as well as all Fast Travel Over Multiple Stops warnings. As a result, running clean_multiple_stop_fast_travel_warnings after clean_consecutive_stop_fast_travel_warnings has no effect. I'm not sure if there may be instances where this is not the case, but I've tried it with GTFS feeds from Wales, Scotland and London.

Considering this, it is unlikely that we'll run clean_consecutive_stop_fast_travel_warnings in isolation, as it could leave some problematic trips undetected. Would it make sense to merge them into a single function and apply them sequentially? Or run clean_multiple_stop_fast_travel_warnings by default, and have an optional flag to also run the other step if requested, always after the former? This way we would be refactoring these two functions into a single one and ensuring cleaning steps are applied in the right order.
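One possible shape for the second suggestion, as a rough sketch only. The two step functions below are stubs that record the order in which they run; the combined function name and the `also_consecutive` flag are hypothetical, and the real cleaning logic is not reproduced here.

```python
def _clean_multiple_stub(gtfs: dict) -> None:
    """Stub standing in for clean_multiple_stop_fast_travel_warnings."""
    gtfs["steps"].append("multiple")


def _clean_consecutive_stub(gtfs: dict) -> None:
    """Stub standing in for clean_consecutive_stop_fast_travel_warnings."""
    gtfs["steps"].append("consecutive")


def clean_fast_travel_warnings(gtfs: dict, also_consecutive: bool = False) -> None:
    """Always clean multiple-stop warnings first, then optionally the rest."""
    _clean_multiple_stub(gtfs)
    if also_consecutive:
        # The consecutive-stop step only ever runs after the multiple-stop step.
        _clean_consecutive_stub(gtfs)


default_feed = {"steps": []}
clean_fast_travel_warnings(default_feed)

full_feed = {"steps": []}
clean_fast_travel_warnings(full_feed, also_consecutive=True)
```

Keeping the order inside one function means callers cannot apply the steps in the wrong sequence.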

SergioRec · 2023-11-15T09:19:35Z

src/transport_performance/gtfs/gtfs_utils.py

@@ -182,7 +193,6 @@ def filter_gtfs_around_trip(
    gtfs,
    trip_id: str,
    buffer_dist: int = 10000,


If this function is not using meters anymore, this default needs to be changed or removed. Currently it's using 10,000 km.

SergioRec · 2023-11-15T09:22:29Z

src/transport_performance/gtfs/gtfs_utils.py

@@ -182,7 +193,6 @@ def filter_gtfs_around_trip(
    gtfs,


This function is not being used anywhere in cleaners or validators. What is its purpose?

tests/utils/test_defence.py

SergioRec · 2023-11-16T08:34:57Z

src/transport_performance/gtfs/validators.py

@@ -26,6 +31,28 @@
    200: 120,


Probably worth adding a warning when route type not found and using the 200 km/h default?
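A minimal sketch of such a warning, assuming a plain dict lookup. Only the `200: 120` entry comes from the diff above; `SPEED_BOUNDS`, `DEFAULT_SPEED_BOUND`, and `get_speed_bound` are hypothetical names for illustration.

```python
import warnings

# Route type -> speed bound in km/h. Only the 200: 120 entry appears in the
# diff; the real lookup has more entries.
SPEED_BOUNDS = {200: 120}

# Default bound (km/h) mentioned in this comment.
DEFAULT_SPEED_BOUND = 200


def get_speed_bound(route_type: int) -> int:
    """Return the speed bound for a route type, warning on the fallback."""
    if route_type not in SPEED_BOUNDS:
        warnings.warn(
            f"Unrecognised route_type {route_type}; using the default "
            f"speed bound of {DEFAULT_SPEED_BOUND} km/h.",
            UserWarning,
        )
        return DEFAULT_SPEED_BOUND
    return SPEED_BOUNDS[route_type]
```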

r-leyshon · 2024-08-26T08:04:37Z

Migrated to datasciencecampus/assess_gtfs#20

CBROWN-ONS added 30 commits October 16, 2023 16:41

Add validation for invalid route type warnings; update tests; add fun…

06aa699

…ction to get errors

Add docstring to private function

a420098

added _remove_validation_row function to gtfs_utils

37df68a

Resolve merge conflicts

a4f452b

Add tests for _get_validation_warnings

70ef720

Added tests for _remove_validation_row

449b445

Add tests for validate_route_type_warnings

85e6631

Add pipeline model for validating gtfs

e9e3c9c

Merge branch 'dev' into 74-130-clean-feed-and-route-type-cleaner

4ec8fcd

Fix tests to align with new validation pipeline

4f9802f

Add tests for new is_valid functionality; add additional test for val…

9131720

…idate_route_type_warnings

Add addtional test for is_valid()

cd2a5c3

Add core cleaner that utilises gtfs-kit cleaners

5f97b53

add _function_pipeline(); updaet clean_feed to use;update tests; impl…

ad38cac

…ement core cleaner

Add tests for core_cleaner

3cb7443

Update core_validation docstrings and type hinting; add accepted gtfs…

d6e73d2

… table (in line with gtfs-kit)

add code for validate_gtfs_files()

3beafa3

add tests for validte_gtfs_file()

160d7b5

Add clean_unrecognised_column_warnings function

54acd6a

add tests for clean_unregnised_column_warnings; add new cleaner to fu…

be28415

…nc maap

patch bug in html report generation

4ce54c3

merge with dev

a9ea856

182: add type defences to gtfs_utils; add fast travel tables to GtfsI…

3fa8aa2

…ntance.table_map

182: cleaners tech debt; add GtfsInstance attr 'units'; passing test …

606e25d

…for _gtfs_defence; defence for _type param in _add_validation_row; warning for non-valid trip_id

182: cleaners tech debt; use pd.isna() instead of math.isnan()

5c2cd9b

Add tests for returning speed bound of unrecognised route_type

d324dfc

180: Functionalise part of fast travel cleaners; remove validate opti…

c679d23

…onal param as not relevant!

fix tests

8c16268

Add tests for clean_duplicate_stop_times

eb1e7c3

Improve coverage

539ff66

CBROWN-ONS added the GTFS label Oct 30, 2023

CBROWN-ONS requested review from r-leyshon and ethan-moss October 30, 2023 15:20

CBROWN-ONS and others added 3 commits November 6, 2023 15:10

Merge with dev and resolve conflicts

728f128

chore: Up to date with dev

9f3cc25

chore: minor typos in gtfs_utils.py

b2a2175

ethan-moss removed their request for review November 14, 2023 12:04

SergioRec added 2 commits November 14, 2023 13:48

chore: minor typos in Test_AddValidationRow and Test_FilterGtfsAround…

a82c8f7

…Trip

fix: change wrong path in test

0068a61

SergioRec self-assigned this Nov 14, 2023

SergioRec self-requested a review November 14, 2023 14:02

SergioRec reviewed Nov 17, 2023

CBROWN-ONS added 4 commits February 1, 2024 11:29

fix: merge with dev

f6f2b09

fix: update tests to better align with new cleaning/validation pipelines

034c6fb

fix: update typo in function name

dd1c2a3

Merge branch 'dev' into 74-130-clean-feed-and-route-type-cleaner

f9ad23d

This was referenced Feb 26, 2024

fast travel cleaners tech-debt #180

Closed

180 cleanup for 195 #254

Merged

Advanced cleaning and validating for GTFS #256

Open

r-leyshon mentioned this pull request Aug 26, 2024

Advanced validation and cleaning datasciencecampus/assess_gtfs#20

Open

11 tasks

r-leyshon added the wontfix This will not be worked on label Aug 26, 2024

r-leyshon closed this Aug 26, 2024

r-leyshon deleted the 74-130-clean-feed-and-route-type-cleaner branch August 26, 2024 08:04
