Store R5 gtfs parse errors #516

mattwigway · 2025-09-12T19:55:16Z

This stores parse errors that R5 encountered when parsing GTFS, which are accessible by running gtfs_errors(r5r_network). For example, with the network we've been using to debug:

> r5r_network <- here("network") |>
+   r5r::build_network(overwrite=T)
✖ High priority errors found; network will not be usable. Use gtfs_errors(r5r_network) to see them.
> gtfs_errors(r5r_network)
         file  line                      type   field     id priority
       <char> <int>                    <char>  <char> <char>   <char>
1: stop_times 12890 ReferentialIntegrityError stop_id   <NA>     HIGH

There are a few more things to think about though:

Currently if you load an invalid network (one that would trigger the error in Fail fast on broken feeds, work around conveyal/r5#978 #514), I print an error to the console, but build_network still succeeds—if you want to be able to see what the errors were using gtfs_errors we need to actually return the network. Ideally we'd have some way to prevent this network from actually being used for any r5r functions other than gtfs_errors though.
The errors aren't serialized with the network, so are only visible when the network is first built.

mattwigway · 2025-09-12T20:25:49Z

Also, for reasons not clear to me, the temporary files used to parse the GTFS are not getting deleted even though they are set as deleteOnExit

mattwigway · 2025-09-12T21:05:48Z

Okay, the latest commit has a kind of hacky solution to points 1 and (partially) 2 above. If there are critical errors, it just sets routingProperties to null, and does not serialize the network. This addresses (1) because if you try to do anything with the network, it won't work, because there isn't actually a network (though the error message will be that "routingProperties is null" but hopefully the previous error message that the network is unusable will lead them to figure out the issue. It addresses (2) because if there are critical errors, the network will never get written out at all - though if there are less critical errors, those will still only be accessible when the network is first built.

rafapereirabr · 2025-09-12T22:46:29Z

Hi @mattwigway, I'm glad to see you're making quick progress with this.

I think setting routingProperties to null is problably not the best approach, though. Instead, we can check for GTFS errors within the build_network() network function and throw an error if the GTFS feed has a critical problem. We already to this to detect whether the transport network is larger than the max bounding box permitted by R5. See here. My suggestion is that we use a similar approach with the GTFS critical errors.

mattwigway · 2025-09-12T23:43:49Z

That was my first thought - the only reason I didn't do that is that we need some way to return the errors to the user, and if we throw an error we can't return anything. I don't think just printing the errors is a good strategy as depending on the feed in some cases there might be thousands of them.

Maybe we could just put them in a CSV in the network directory?

rafapereirabr · 2025-09-13T01:10:41Z

A CSV in the network directory could be one option. Alternatives would be to save the errors as a text / pdf an open it on the browser, but the best strategy here depends on the format of the error messages / data. Do you have an example ?

mattwigway · 2025-09-13T14:22:49Z

Here's an example of the errors present in the GTFS feeds for the various operators here in the Research Triangle of NC:

              file  line            type  field     id priority
            <char> <int>          <char> <char> <char>   <char>
1:     frequencies     0 EmptyTableError   <NA>   <NA>   MEDIUM
2:       transfers     0 EmptyTableError   <NA>   <NA>   MEDIUM
3: fare_attributes     0 EmptyTableError   <NA>   <NA>   MEDIUM
4:      fare_rules     0 EmptyTableError   <NA>   <NA>   MEDIUM
5: fare_attributes     0 EmptyTableError   <NA>   <NA>   MEDIUM
6:      fare_rules     0 EmptyTableError   <NA>   <NA>   MEDIUM
7:       transfers     0 EmptyTableError   <NA>   <NA>   MEDIUM
8:     frequencies     0 EmptyTableError   <NA>   <NA>   MEDIUM
9:     frequencies     0 EmptyTableError   <NA>   <NA>   MEDIUM

BardyBard · 2025-09-15T08:51:49Z

I personally like the idea of saving the errors to a CSV, and then also including a helper like get_gtfs_errors() which would load the errors to a df for the user to investigate.

e-kotov · 2025-09-20T20:07:08Z

I have just test this with https://api.transitous.org/gtfs/fi_fintraffic.gtfs.zip (Finland-wide GTFS file that I am having issues with).

High priority errors found; network will not be usable. Use gtfs_errors(r5r_network) to see them.

ge <- r5r:::gtfs_errors(r5r_network = r5net)
ge

#            file  line                      type        field     id priority
#          <char> <int>                    <char>       <char> <char>   <char>
#    1: transfers  2128 ReferentialIntegrityError from_trip_id   <NA>     HIGH
#    2:     stops 83945  SuspectStopLocationError      stop_id 336844   MEDIUM
#    3: transfers   116 ReferentialIntegrityError from_trip_id   <NA>     HIGH
#    4:     stops 68751  SuspectStopLocationError      stop_id 368979   MEDIUM
#    5: transfers   800 ReferentialIntegrityError from_trip_id   <NA>     HIGH
#   ---                                                                       
# 8903:     stops 22802  SuspectStopLocationError      stop_id 337228   MEDIUM
# 8904:     stops 30847  SuspectStopLocationError      stop_id 354136   MEDIUM
# 8905:     stops 48251  SuspectStopLocationError      stop_id 353817   MEDIUM
# 8906:     stops 85672  SuspectStopLocationError      stop_id 366390   MEDIUM
# 8907:     stops 66636  SuspectStopLocationError      stop_id 368984   MEDIUM

Seems to work really well.

Two notes:

Jar needs to be rebuild. I have rebuild it in a copy of @mattwigway 's branch here https://github.com/e-kotov/r5r/tree/gtfs-errors
@mattwigway I guess you should export the gtfs_errors function, as currently it is unexported.

e-kotov · 2025-09-20T20:08:15Z

By the way, anyone has any idea how to fix these GTFS errors, preferably quickly and automatically (I would even say auto-magically)? 😉

mattwigway · 2025-09-21T10:27:19Z

@e-kotov thanks for the feedback! I need to look into why the r5 jar isn't being rebuilt automatically.

In terms of fixing your feed, the referential integrity errors are what's causing the network build to fail. It looks like (some of) the trip IDs in transfers.txt don't exist. A quick fix would be to just remove transfers.txt and see if that fixes the problem.

e-kotov · 2025-09-21T12:13:10Z

I need to look into why the r5 jar isn't being rebuilt automatically.

I'm not sure it is supposed to be built. The build jar action in this repo is separate from the R CMD check. In a recent PR I made to r5r I also rebuilt the jar manually.

A quick fix would be to just remove transfers.txt and see if that fixes the problem.

Thanks for the tip! UPDATE: that actually helped!

BardyBard · 2025-09-22T22:16:23Z

Yeah @mattwigway @e-kotov the jar gets rebuilt by the bot but the tests run before the jar gets rebuilt. So they run on the old jar unless you push your own rebuilt jar. Perhaps that's something I could fix 🫣

mattwigway · 2025-09-23T14:30:01Z

Maybe we should just build the JAR every time the tests run to ensure it's up to date? I think it's a good practice to only build the JAR only via Github Actions—that ensures reproducible and stable builds, and since JARs can't be easily reviewed helps ensure that the JARs for a pull request match the code changes.

e-kotov · 2025-09-23T16:44:13Z

Also, even though the JAR is small, is it sustainable in the long term to keep updating a binary file in the repo? Perhaps it should be moved to a new repo and connected to the main one as a submodule.

mattwigway · 2025-09-23T17:40:23Z

I wonder if we could build the JAR as part of the package build process before submitting to CRAN rather than having it in the repo. I guess that complicates installing the development version but maybe there's some workaround - I believe r5r already has a requirement to have a jdk rather than jre installed, so every machine running r5r should be capable of building it too.

…

On Tue, Sep 23, 2025, 12:48 PM Egor Kotov ***@***.***> wrote: *e-kotov* left a comment (ipeaGIT/r5r#516) <#516 (comment)> Also, even though the JAR is small, is it sustainable in the long term to keep updating a binary file in the repo? Perhaps it should be moved to a new repo and connected to the main one as a submodule. — Reply to this email directly, view it on GitHub <#516 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAEKNLWZPHIZYZZLVJSPXLL3UF2HFAVCNFSM6AAAAACGL6VKI6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTGMRUHAYDMOBUGI> . You are receiving this because you were mentioned.Message ID: ***@***.***>

BardyBard · 2025-09-26T22:47:09Z

Maybe we should just build the JAR every time the tests run to ensure it's up to date? I think it's a good practice to only build the JAR only via Github Actions—that ensures reproducible and stable builds, and since JARs can't be easily reviewed helps ensure that the JARs for a pull request match the code changes.

Yep, I completely agree, the jar should be rebuilt before the tests are executed, not after. I assume that was the intended design of the CI but I've just never gotten around to looking at it.

As for having every machine build r5r from source that's starting to sound like an ambitious feature. But if someone knows how to do it, it sounds like a sound idea.

mattwigway · 2025-10-06T16:28:42Z

@BardyBard just wanted to clarify one thing - I don't think we should have every machine build from source, but rather have the build happen during the package build process. So if you install from CRAN binaries you get a prebuilt version, but if installing from GH it gets built from source.

BardyBard · 2025-10-06T20:45:32Z

@mattwigway the jar CI doesn't even get run on your branch which is super odd. Perhaps its to do with the on: conditions, and you merging from a branch outside of the repo.

mattwigway · 2025-10-06T20:52:22Z

@BardyBard I was actually just looking at this. It turns out it does, it just shows up in the actions tab on my fork rather than the main repo. Something is wrong with the cran mirrors that I'm trying to figure out which is why it's not getting built.

mattwigway · 2025-10-06T22:55:23Z

@BardyBard I got the JAR build working on my branch again. A combination of the use_public_rspm: true option evidently no longer working (specified the P3M binary repo manually), and failed gradle-wrapper verification (a version mismatch, not a malicious/corrupted wrapper file tbc - though the whole reason for needing wrapper checksum verification is a good argument for moving the jar building to the package build stage rather than checking the JAR into the repo).

BardyBard · 2025-10-06T23:00:23Z

Wow, I'm glad you got it working. I was playing with it earlier today and only got past the cran mirror part.

I'll talk to Rafa about the jar building on Wednesday.

For now I'll take your CI fixes and merge them with another branch I'm working on where the coverage and cran checks only run after the jar is complied. That way the tests execute with the newly compiled jar not the old version.

rafapereirabr · 2025-10-15T18:42:27Z

Also, even though the JAR is small, is it sustainable in the long term to keep updating a binary file in the repo? Perhaps it should be moved to a new repo and connected to the main one as a submodule.

I agree this would be ideal in the long run, but I think this could make the package development quite incovenient from the R side. Speaking for myself, the R developers of {r5r} don't know Java or how to compile Java code (sorry). So keeping the 'small' jar inside the repo makes it super convenient to develop the package . Also, we don't have that many {r5r} developers, and even fewer developers from the Java side, so these changes tend to be more rare. I'm open to change my mind on this issue, but for now I would say this is very low priority

store R5 gtfs parse errors

cbb128c

mattwigway mentioned this pull request Sep 12, 2025

Fail fast on broken feeds, work around conveyal/r5#978 #514

Closed

clean up all files, use temporary files safely

a88fb71

set routingProperties to null if the network failed to build

06a5ede

fix cran mirror in jar build

85a781a

mattwigway added 6 commits October 6, 2025 18:03

fix cran url

02095d5

explicitly specify CRAN repo

3b89a16

fix typo

7e38741

fix cran url

3162fb0

use P3M explicitly

046e58f

update gradle-wrapper to v8.5 to get gradle wrapper verification working

c768d98

Rebuild JAR for commit c768d98

56839fc

Store R5 gtfs parse errors #516

Are you sure you want to change the base?

Store R5 gtfs parse errors #516

Uh oh!

Conversation

mattwigway commented Sep 12, 2025

Uh oh!

mattwigway commented Sep 12, 2025

Uh oh!

mattwigway commented Sep 12, 2025

Uh oh!

rafapereirabr commented Sep 12, 2025

Uh oh!

mattwigway commented Sep 12, 2025

Uh oh!

rafapereirabr commented Sep 13, 2025

Uh oh!

mattwigway commented Sep 13, 2025

Uh oh!

BardyBard commented Sep 15, 2025

Uh oh!

e-kotov commented Sep 20, 2025

Uh oh!

e-kotov commented Sep 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mattwigway commented Sep 21, 2025

Uh oh!

e-kotov commented Sep 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

BardyBard commented Sep 22, 2025

Uh oh!

mattwigway commented Sep 23, 2025

Uh oh!

e-kotov commented Sep 23, 2025

Uh oh!

mattwigway commented Sep 23, 2025 via email

Uh oh!

BardyBard commented Sep 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mattwigway commented Oct 6, 2025

Uh oh!

BardyBard commented Oct 6, 2025

Uh oh!

mattwigway commented Oct 6, 2025

Uh oh!

mattwigway commented Oct 6, 2025

Uh oh!

BardyBard commented Oct 6, 2025

Uh oh!

rafapereirabr commented Oct 15, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

e-kotov commented Sep 20, 2025 •

edited

Loading

e-kotov commented Sep 21, 2025 •

edited

Loading

BardyBard commented Sep 26, 2025 •

edited

Loading