Skip to content

Conversation

@mattwigway
Copy link
Contributor

This stores parse errors that R5 encountered when parsing GTFS, which are accessible by running gtfs_errors(r5r_network). For example, with the network we've been using to debug:

> r5r_network <- here("network") |>
+   r5r::build_network(overwrite=T)
✖ High priority errors found; network will not be usable. Use gtfs_errors(r5r_network) to see them.
> gtfs_errors(r5r_network)
         file  line                      type   field     id priority
       <char> <int>                    <char>  <char> <char>   <char>
1: stop_times 12890 ReferentialIntegrityError stop_id   <NA>     HIGH

There are a few more things to think about though:

  1. Currently if you load an invalid network (one that would trigger the error in Fail fast on broken feeds, work around conveyal/r5#978 #514), I print an error to the console, but build_network still succeeds—if you want to be able to see what the errors were using gtfs_errors we need to actually return the network. Ideally we'd have some way to prevent this network from actually being used for any r5r functions other than gtfs_errors though.
  2. The errors aren't serialized with the network, so are only visible when the network is first built.

@mattwigway
Copy link
Contributor Author

Also, for reasons not clear to me, the temporary files used to parse the GTFS are not getting deleted even though they are set as deleteOnExit

@mattwigway
Copy link
Contributor Author

Okay, the latest commit has a kind of hacky solution to points 1 and (partially) 2 above. If there are critical errors, it just sets routingProperties to null, and does not serialize the network. This addresses (1) because if you try to do anything with the network, it won't work, because there isn't actually a network (though the error message will be that "routingProperties is null" but hopefully the previous error message that the network is unusable will lead them to figure out the issue. It addresses (2) because if there are critical errors, the network will never get written out at all - though if there are less critical errors, those will still only be accessible when the network is first built.

@rafapereirabr
Copy link
Member

Hi @mattwigway, I'm glad to see you're making quick progress with this.

I think setting routingProperties to null is problably not the best approach, though. Instead, we can check for GTFS errors within the build_network() network function and throw an error if the GTFS feed has a critical problem. We already to this to detect whether the transport network is larger than the max bounding box permitted by R5. See here. My suggestion is that we use a similar approach with the GTFS critical errors.

@mattwigway
Copy link
Contributor Author

That was my first thought - the only reason I didn't do that is that we need some way to return the errors to the user, and if we throw an error we can't return anything. I don't think just printing the errors is a good strategy as depending on the feed in some cases there might be thousands of them.

Maybe we could just put them in a CSV in the network directory?

@rafapereirabr
Copy link
Member

A CSV in the network directory could be one option. Alternatives would be to save the errors as a text / pdf an open it on the browser, but the best strategy here depends on the format of the error messages / data. Do you have an example ?

@mattwigway
Copy link
Contributor Author

Here's an example of the errors present in the GTFS feeds for the various operators here in the Research Triangle of NC:

              file  line            type  field     id priority
            <char> <int>          <char> <char> <char>   <char>
1:     frequencies     0 EmptyTableError   <NA>   <NA>   MEDIUM
2:       transfers     0 EmptyTableError   <NA>   <NA>   MEDIUM
3: fare_attributes     0 EmptyTableError   <NA>   <NA>   MEDIUM
4:      fare_rules     0 EmptyTableError   <NA>   <NA>   MEDIUM
5: fare_attributes     0 EmptyTableError   <NA>   <NA>   MEDIUM
6:      fare_rules     0 EmptyTableError   <NA>   <NA>   MEDIUM
7:       transfers     0 EmptyTableError   <NA>   <NA>   MEDIUM
8:     frequencies     0 EmptyTableError   <NA>   <NA>   MEDIUM
9:     frequencies     0 EmptyTableError   <NA>   <NA>   MEDIUM

@BardyBard
Copy link
Collaborator

I personally like the idea of saving the errors to a CSV, and then also including a helper like get_gtfs_errors() which would load the errors to a df for the user to investigate.

@e-kotov
Copy link
Contributor

e-kotov commented Sep 20, 2025

I have just test this with https://api.transitous.org/gtfs/fi_fintraffic.gtfs.zip (Finland-wide GTFS file that I am having issues with).

High priority errors found; network will not be usable. Use gtfs_errors(r5r_network) to see them.

ge <- r5r:::gtfs_errors(r5r_network = r5net)
ge

#            file  line                      type        field     id priority
#          <char> <int>                    <char>       <char> <char>   <char>
#    1: transfers  2128 ReferentialIntegrityError from_trip_id   <NA>     HIGH
#    2:     stops 83945  SuspectStopLocationError      stop_id 336844   MEDIUM
#    3: transfers   116 ReferentialIntegrityError from_trip_id   <NA>     HIGH
#    4:     stops 68751  SuspectStopLocationError      stop_id 368979   MEDIUM
#    5: transfers   800 ReferentialIntegrityError from_trip_id   <NA>     HIGH
#   ---                                                                       
# 8903:     stops 22802  SuspectStopLocationError      stop_id 337228   MEDIUM
# 8904:     stops 30847  SuspectStopLocationError      stop_id 354136   MEDIUM
# 8905:     stops 48251  SuspectStopLocationError      stop_id 353817   MEDIUM
# 8906:     stops 85672  SuspectStopLocationError      stop_id 366390   MEDIUM
# 8907:     stops 66636  SuspectStopLocationError      stop_id 368984   MEDIUM

Seems to work really well.

Two notes:

  1. Jar needs to be rebuild. I have rebuild it in a copy of @mattwigway 's branch here https://github.com/e-kotov/r5r/tree/gtfs-errors
  2. @mattwigway I guess you should export the gtfs_errors function, as currently it is unexported.

@e-kotov
Copy link
Contributor

e-kotov commented Sep 20, 2025

By the way, anyone has any idea how to fix these GTFS errors, preferably quickly and automatically (I would even say auto-magically)? 😉

@mattwigway
Copy link
Contributor Author

@e-kotov thanks for the feedback! I need to look into why the r5 jar isn't being rebuilt automatically.

In terms of fixing your feed, the referential integrity errors are what's causing the network build to fail. It looks like (some of) the trip IDs in transfers.txt don't exist. A quick fix would be to just remove transfers.txt and see if that fixes the problem.

@e-kotov
Copy link
Contributor

e-kotov commented Sep 21, 2025

I need to look into why the r5 jar isn't being rebuilt automatically.

I'm not sure it is supposed to be built. The build jar action in this repo is separate from the R CMD check. In a recent PR I made to r5r I also rebuilt the jar manually.

A quick fix would be to just remove transfers.txt and see if that fixes the problem.

Thanks for the tip! UPDATE: that actually helped!

@BardyBard
Copy link
Collaborator

Yeah @mattwigway @e-kotov the jar gets rebuilt by the bot but the tests run before the jar gets rebuilt. So they run on the old jar unless you push your own rebuilt jar. Perhaps that's something I could fix 🫣

@mattwigway
Copy link
Contributor Author

Maybe we should just build the JAR every time the tests run to ensure it's up to date? I think it's a good practice to only build the JAR only via Github Actions—that ensures reproducible and stable builds, and since JARs can't be easily reviewed helps ensure that the JARs for a pull request match the code changes.

@e-kotov
Copy link
Contributor

e-kotov commented Sep 23, 2025

Also, even though the JAR is small, is it sustainable in the long term to keep updating a binary file in the repo? Perhaps it should be moved to a new repo and connected to the main one as a submodule.

@mattwigway
Copy link
Contributor Author

mattwigway commented Sep 23, 2025 via email

@BardyBard
Copy link
Collaborator

BardyBard commented Sep 26, 2025

Maybe we should just build the JAR every time the tests run to ensure it's up to date? I think it's a good practice to only build the JAR only via Github Actions—that ensures reproducible and stable builds, and since JARs can't be easily reviewed helps ensure that the JARs for a pull request match the code changes.

Yep, I completely agree, the jar should be rebuilt before the tests are executed, not after. I assume that was the intended design of the CI but I've just never gotten around to looking at it.

As for having every machine build r5r from source that's starting to sound like an ambitious feature. But if someone knows how to do it, it sounds like a sound idea.

@mattwigway
Copy link
Contributor Author

@BardyBard just wanted to clarify one thing - I don't think we should have every machine build from source, but rather have the build happen during the package build process. So if you install from CRAN binaries you get a prebuilt version, but if installing from GH it gets built from source.

@BardyBard
Copy link
Collaborator

@mattwigway the jar CI doesn't even get run on your branch which is super odd. Perhaps its to do with the on: conditions, and you merging from a branch outside of the repo.

@mattwigway
Copy link
Contributor Author

@BardyBard I was actually just looking at this. It turns out it does, it just shows up in the actions tab on my fork rather than the main repo. Something is wrong with the cran mirrors that I'm trying to figure out which is why it's not getting built.

@mattwigway
Copy link
Contributor Author

@BardyBard I got the JAR build working on my branch again. A combination of the use_public_rspm: true option evidently no longer working (specified the P3M binary repo manually), and failed gradle-wrapper verification (a version mismatch, not a malicious/corrupted wrapper file tbc - though the whole reason for needing wrapper checksum verification is a good argument for moving the jar building to the package build stage rather than checking the JAR into the repo).

@BardyBard
Copy link
Collaborator

Wow, I'm glad you got it working. I was playing with it earlier today and only got past the cran mirror part.

I'll talk to Rafa about the jar building on Wednesday.

For now I'll take your CI fixes and merge them with another branch I'm working on where the coverage and cran checks only run after the jar is complied. That way the tests execute with the newly compiled jar not the old version.

@rafapereirabr
Copy link
Member

Also, even though the JAR is small, is it sustainable in the long term to keep updating a binary file in the repo? Perhaps it should be moved to a new repo and connected to the main one as a submodule.

I agree this would be ideal in the long run, but I think this could make the package development quite incovenient from the R side. Speaking for myself, the R developers of {r5r} don't know Java or how to compile Java code (sorry). So keeping the 'small' jar inside the repo makes it super convenient to develop the package . Also, we don't have that many {r5r} developers, and even fewer developers from the Java side, so these changes tend to be more rare. I'm open to change my mind on this issue, but for now I would say this is very low priority

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants