I understand a common problem is having failures on some, but not all, source files. It is nearly impossible to run a massively parallel job and not face some sort of connection issue or other unexpected error when opening a file.

It would be great if there were a way to skip over failures, perhaps by writing NaNs for the expected dimensions, logging the failure, and then running a retry version of the same recipe that tries to fill in those gaps.
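For illustration, here is a minimal sketch of what that skip-and-log step could look like, assuming xarray/numpy. `open_or_nan`, `FAILURE_LOG`, and the expected-dims bookkeeping are all hypothetical, not part of pangeo-forge-recipes:

```python
import json
import logging

import numpy as np
import xarray as xr

logger = logging.getLogger("skipsies")
FAILURE_LOG = "failed_files.jsonl"  # hypothetical sidecar log for the retry pass


def open_or_nan(url, expected_dims, var_names):
    """Open `url`; on any error, log it and return an all-NaN placeholder
    with the expected dimensions so downstream writes can proceed."""
    try:
        return xr.open_dataset(url)  # engine/auth options elided
    except Exception as exc:
        logger.warning("skipping %s: %s", url, exc)
        with open(FAILURE_LOG, "a") as f:
            f.write(json.dumps({"url": url, "error": repr(exc)}) + "\n")
        shape = tuple(expected_dims.values())
        return xr.Dataset(
            {name: (tuple(expected_dims), np.full(shape, np.nan))
             for name in var_names}
        )
```

A retry pass could then read the log, build a recipe over just the failed URLs, and overwrite the NaN regions in the target store.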
Julius Busecke has been running a bunch of the CMIP6 archive through pangeo-forge-recipes (on Dataflow). I can ask him if he has found any good ways to re-run failed jobs and keep track of them.
Later today I plan to crosswalk what Flink/Beam offer for checkpointing (which is another way to solve this), though it depends on the runner. Running with LocalDirectBakery on a decent-sized machine still produces network issues for an auth-fronted S3 bucket; I'll compare against a public bucket as well.
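Since checkpointing is runner-dependent, a runner-agnostic alternative is Beam's dead-letter pattern: recent Python SDKs let you split a step into good and failed outputs via `with_exception_handling()`. A sketch only; `open_source` and the paths here are made up, not the recipe's actual open step:

```python
import apache_beam as beam


def open_source(url: str) -> str:
    # Stand-in for the recipe's open step; raises the way a transient
    # network/auth failure would.
    if "bad" in url:
        raise IOError(f"simulated connection error for {url}")
    return url


with beam.Pipeline() as p:  # DirectRunner by default
    urls = p | beam.Create(["s3://bucket/a.nc", "s3://bucket/bad.nc"])
    good, bad = urls | beam.Map(open_source).with_exception_handling()
    # `bad` pairs each failing element with its exception info; persist
    # the URLs so a follow-up pipeline can retry only the failures.
    _ = (
        bad
        | beam.Map(lambda pair: pair[0])
        | beam.io.WriteToText("failed_urls")
    )
```

This keeps the main job running to completion while the failures accumulate in their own output, which maps naturally onto the skip-then-retry idea above.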
cc @ranchodeluxe @norlandrhagen @sharkinsspatial (who came up with the name "skipsies").