Store R5 gtfs parse errors #516
|
Also, for reasons not clear to me, the temporary files used to parse the GTFS are not getting deleted even though they are set as |
|
Okay, the latest commit has a kind of hacky solution to points 1 and (partially) 2 above. If there are critical errors, it just sets |
|
Hi @mattwigway, I'm glad to see you're making quick progress with this. I think setting routingProperties to null is probably not the best approach, though. Instead, we can check for GTFS errors within the |
|
That was my first thought - the only reason I didn't do that is that we need some way to return the errors to the user, and if we throw an error we can't return anything. I don't think just printing the errors is a good strategy as depending on the feed in some cases there might be thousands of them. Maybe we could just put them in a CSV in the network directory? |
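One way to reconcile the two positions (fail the build, but still hand the errors back) could be to write the errors to disk before throwing. The sketch below is only an illustration: `write_gtfs_errors()`, the `gtfs_errors.csv` file name, and the shape of the `errors` data frame are assumptions, not part of r5r.

```r
# Hypothetical helper, not part of r5r. `errors` is assumed to be a data
# frame with a `priority` column holding values such as "HIGH" or "MEDIUM".
write_gtfs_errors <- function(errors, network_dir,
                              file_name = "gtfs_errors.csv") {
  # Persist every error first, so nothing is lost when we stop() below
  path <- file.path(network_dir, file_name)
  utils::write.csv(errors, path, row.names = FALSE)

  # Fail only when at least one error is high priority
  n_high <- sum(errors$priority == "HIGH", na.rm = TRUE)
  if (n_high > 0) {
    stop(
      n_high, " high priority GTFS error(s) found; see ", path,
      call. = FALSE
    )
  }
  invisible(path)
}

# Example with two made-up rows, written to a temporary directory
errors <- data.frame(
  file = c("transfers", "stops"),
  line = c(2128L, 83945L),
  type = c("ReferentialIntegrityError", "SuspectStopLocationError"),
  priority = c("HIGH", "MEDIUM")
)
result <- tryCatch(
  write_gtfs_errors(errors, tempdir()),
  error = function(e) conditionMessage(e)
)
```

The error message stays short no matter how many thousands of rows the feed produces, because it only points at the file.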
|
A CSV in the network directory could be one option. Alternatives would be to save the errors as a text file or PDF and open it in the browser, but the best strategy here depends on the format of the error messages / data. Do you have an example? |
|
Here's an example of the errors present in the GTFS feeds for the various operators here in the Research Triangle of NC: |
|
I personally like the idea of saving the errors to a CSV, and then also including a helper like |
|
I have just tested this with https://api.transitous.org/gtfs/fi_fintraffic.gtfs.zip (Finland-wide GTFS file that I am having issues with).
High priority errors found; network will not be usable. Use gtfs_errors(r5r_network) to see them.
ge <- r5r:::gtfs_errors(r5r_network = r5net)
ge
# file line type field id priority
# <char> <int> <char> <char> <char> <char>
# 1: transfers 2128 ReferentialIntegrityError from_trip_id <NA> HIGH
# 2: stops 83945 SuspectStopLocationError stop_id 336844 MEDIUM
# 3: transfers 116 ReferentialIntegrityError from_trip_id <NA> HIGH
# 4: stops 68751 SuspectStopLocationError stop_id 368979 MEDIUM
# 5: transfers 800 ReferentialIntegrityError from_trip_id <NA> HIGH
# ---
# 8903: stops 22802 SuspectStopLocationError stop_id 337228 MEDIUM
# 8904: stops 30847 SuspectStopLocationError stop_id 354136 MEDIUM
# 8905: stops 48251 SuspectStopLocationError stop_id 353817 MEDIUM
# 8906: stops 85672 SuspectStopLocationError stop_id 366390 MEDIUM
# 8907: stops 66636 SuspectStopLocationError stop_id 368984 MEDIUM
Seems to work really well. Two notes:
|
|
By the way, does anyone have any idea how to fix these GTFS errors, preferably quickly and automatically (I would even say auto-magically)? 😉 |
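Before fixing anything, it may help to see which kinds of errors dominate. The sketch below counts errors per file, type, field and priority using base R. The rows are illustrative, in the shape of the `gtfs_errors()` output above, and the assumption that only `HIGH` priority errors block the build is taken from the warning message, not from R5's code.

```r
# Illustrative rows in the shape of the gtfs_errors() output
ge <- data.frame(
  file = c("transfers", "stops", "transfers", "stops", "transfers"),
  type = c("ReferentialIntegrityError", "SuspectStopLocationError",
           "ReferentialIntegrityError", "SuspectStopLocationError",
           "ReferentialIntegrityError"),
  field = c("from_trip_id", "stop_id", "from_trip_id", "stop_id",
            "from_trip_id"),
  priority = c("HIGH", "MEDIUM", "HIGH", "MEDIUM", "HIGH")
)

# Count errors in each (file, type, field, priority) group
counts <- aggregate(
  list(n = rep(1L, nrow(ge))),
  by = ge[c("file", "type", "field", "priority")],
  FUN = sum
)

# Assumption: only HIGH priority groups make the network unusable
blocking <- counts[counts$priority == "HIGH", ]
```

On a feed with thousands of rows this reduces the table to a handful of groups, which makes it easier to decide where to start.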
|
@e-kotov thanks for the feedback! I need to look into why the r5 jar isn't being rebuilt automatically. In terms of fixing your feed, the referential integrity errors are what's causing the network build to fail. It looks like (some of) the trip IDs in transfers.txt don't exist. A quick fix would be to just remove transfers.txt and see if that fixes the problem. |
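A gentler variant of that quick fix, sketched here as an assumption rather than a tested recipe, would be to keep transfers.txt but drop only the rows whose trip IDs are missing from trips.txt. `drop_dangling_transfers()` is a hypothetical helper operating on plain data frames; it treats empty trip IDs as valid because those fields are generally optional in transfers.txt.

```r
# Hypothetical helper, not part of r5r: keep only transfers whose trip IDs
# exist in trips.txt. Empty or missing trip IDs are kept.
drop_dangling_transfers <- function(transfers, trips) {
  known <- function(ids) {
    is.na(ids) | ids == "" | ids %in% trips$trip_id
  }
  keep <- rep(TRUE, nrow(transfers))
  # Check whichever of the two trip ID columns the feed actually has
  for (col in intersect(c("from_trip_id", "to_trip_id"), names(transfers))) {
    keep <- keep & known(transfers[[col]])
  }
  transfers[keep, , drop = FALSE]
}

# Made-up example: the second transfer refers to a trip that does not exist
trips <- data.frame(trip_id = c("t1", "t2"))
transfers <- data.frame(
  from_stop_id = c("s1", "s2", "s3"),
  to_stop_id = c("s2", "s3", "s4"),
  from_trip_id = c("t1", "t9", ""),
  to_trip_id = c("t2", "t2", "")
)
cleaned <- drop_dangling_transfers(transfers, trips)
```

In practice the two tables would come from the unzipped feed, for example via `read.csv(..., colClasses = "character")`, and the cleaned table would be written back as transfers.txt before re-zipping.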
I'm not sure it is supposed to be built. The build jar action in this repo is separate from the R CMD check. In a recent PR I made to r5r I also rebuilt the jar manually.
Thanks for the tip! UPDATE: that actually helped! |
|
Yeah @mattwigway @e-kotov the jar gets rebuilt by the bot but the tests run before the jar gets rebuilt. So they run on the old jar unless you push your own rebuilt jar. Perhaps that's something I could fix 🫣 |
|
Maybe we should just build the JAR every time the tests run to ensure it's up to date? I think it's good practice to build the JAR only via GitHub Actions: that ensures reproducible and stable builds and, since JARs can't be easily reviewed, helps ensure that the JARs for a pull request match the code changes. |
|
Also, even though the JAR is small, is it sustainable in the long term to keep updating a binary file in the repo? Perhaps it should be moved to a new repo and connected to the main one as a submodule. |
|
I wonder if we could build the JAR as part of the package build process
before submitting to CRAN rather than having it in the repo. I guess that
complicates installing the development version but maybe there's some
workaround - I believe r5r already has a requirement to have a jdk rather
than jre installed, so every machine running r5r should be capable of
building it too.
|
Yep, I completely agree, the jar should be rebuilt before the tests are executed, not after. I assume that was the intended design of the CI but I've just never gotten around to looking at it. As for having every machine build r5r from source that's starting to sound like an ambitious feature. But if someone knows how to do it, it sounds like a sound idea. |
|
@BardyBard just wanted to clarify one thing - I don't think we should have every machine build from source, but rather have the build happen during the package build process. So if you install from CRAN binaries you get a prebuilt version, but if installing from GH it gets built from source. |
|
@mattwigway the jar CI doesn't even get run on your branch, which is super odd. Perhaps it's to do with the |
|
@BardyBard I was actually just looking at this. It turns out it does, it just shows up in the actions tab on my fork rather than the main repo. Something is wrong with the cran mirrors that I'm trying to figure out which is why it's not getting built. |
|
@BardyBard I got the JAR build working on my branch again. A combination of the |
|
Wow, I'm glad you got it working. I was playing with it earlier today and only got past the cran mirror part. I'll talk to Rafa about the jar building on Wednesday. For now I'll take your CI fixes and merge them with another branch I'm working on where the coverage and cran checks only run after the jar is compiled. That way the tests execute with the newly compiled jar, not the old version. |
I agree this would be ideal in the long run, but I think this could make package development quite inconvenient from the R side. Speaking for myself, the R developers of {r5r} don't know Java or how to compile Java code (sorry). So keeping the 'small' jar inside the repo makes it super convenient to develop the package. Also, we don't have that many {r5r} developers, and even fewer developers on the Java side, so these changes tend to be rarer. I'm open to changing my mind on this issue, but for now I would say this is very low priority.
This stores parse errors that R5 encountered when parsing GTFS, which are accessible by running
gtfs_errors(r5r_network). For example, with the network we've been using to debug:
There are a few more things to think about though:
gtfs_errors we need to actually return the network. Ideally we'd have some way to prevent this network from actually being used for any r5r functions other than gtfs_errors though.
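One possible shape for such a guard is a small check that every routing function calls first. This is only a sketch: `assert_network_usable()` and the `has_critical_gtfs_errors` field are hypothetical names, and the real r5r network object may store this state differently.

```r
# Hypothetical guard, not part of r5r. The `has_critical_gtfs_errors` field
# is an assumption about how the network object could flag a failed parse.
assert_network_usable <- function(r5r_network) {
  if (isTRUE(r5r_network$has_critical_gtfs_errors)) {
    stop(
      "High priority GTFS errors found; network is not usable. ",
      "Use gtfs_errors(r5r_network) to see them.",
      call. = FALSE
    )
  }
  invisible(r5r_network)
}

# A flagged network is rejected, an unflagged one passes through unchanged
broken <- list(has_critical_gtfs_errors = TRUE)
msg <- tryCatch(
  assert_network_usable(broken),
  error = function(e) conditionMessage(e)
)
ok <- assert_network_usable(list(has_critical_gtfs_errors = FALSE))
```

With a check like this, `gtfs_errors()` would simply skip the guard, so it remains the one function that still accepts a flagged network.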