
Add bulk reprocessing tool #112

Draft

jeremyh wants to merge 41 commits into master from jez/bulk-process
Conversation

@jeremyh (Member) commented on Jun 28, 2024

This is deployed along with scene select, so the module building docs still apply: https://github.com/GeoscienceAustralia/dea-ard-scene-select#dass-module-creation

An example is deployed to module ard-scene-select-py3-pipeline/20240628b-bulk
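
To try it on NCI you would typically load it through environment modules first. A minimal sketch, assuming the modulefiles are published under a project-specific path (the "module use" path below is illustrative only, not taken from this PR):

    # Make the project's modulefiles visible (example path, adjust to your project)
    module use /g/data/<project>/modules/modulefiles

    # Load the example build mentioned above
    module load ard-scene-select-py3-pipeline/20240628b-bulk

    # The new command should now be on PATH
    ard-bulk-reprocess --help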

Some usage examples are in the help text:

$ ard-bulk-reprocess --help
Usage: ard-bulk-reprocess [OPTIONS] PREFIX [EXPRESSIONS]...

  Create a PBS Job to (re)process ARD datasets in bulk

  This is a simpler alternative to scene_select, intended for bulk jobs that
  replace existing data.

  (It is intended to have good defaults, so you don't need to remember our
  AOI, where our Level 1 lives on NCI, or where logs should go, etc... )

  This will search, filter the scenes, and create the ard_pbs script. You can
  then run the result if it looks good.

  The first argument is which "collections" to process. This can be "ls",
  "s2", or more specific matches like "ls7" or "s2a"*

   (*prefix is globbed on ARD product names as "ga_{prefix}*", so "ls" is
   expanded to "ga_ls*", etc)

  Nothing is actually run by default: it simply creates a work directory and
  the scripts to kick it off.

  Optionally, you can provide standard ODC search expressions to limit the
  scenes chosen.

  Normal ODC expressions are passed to the ODC search function, except for any
  field with the suffix `_version`, which is compared against the software
  versions recorded in the proc-info file.

  Expression examples:

      platform = LANDSAT_8

      time in 2014-03-02
      time in 2014-3-2
      time in 2014-3
      time > 2014
      time in [2014, 2014]
      time in [2014-03-01, 2014-04-01]

      lat in [4, 6] time in 2014-03-02
      lat in [4, 6]

      wagl_version in ["1.2.3", "3.4.5"]
      wagl_version < "1.2.3.dev4"
      fmask_version = "4.2.0"

      platform=LS8 lat in [-14, -23.5] instrument="OTHER"

  Examples:

      # Any five sentinel2 scenes

      ard-bulk-reprocess s2  --max-count 5

      # Landsat scenes for a given month that are below a gqa value

      ard-bulk-reprocess ls  'time in 2023-04'  'wagl_version>"0.1.3"'  --max-count 1000

  Logs are printed to stderr by default. They will be in readable (coloured)
  form when writing to a live terminal, or JSON otherwise.

  You can redirect stderr if you want to record logs:

      ard-bulk-reprocess s2  --max-count 500 2> bulk-reprocess.jsonl

Options:
  -E, --env TEXT
  -C, --config, --config_file TEXT
  --max-count INTEGER             Maximum number of scenes to process
  --work-dir PATH                 Base folder for working files (will create
                                  subfolders for each job)
  --pkg-dir PATH                  Output package base path (default: work-dir/pkg)
  --workers-per-node INTEGER      Workers per node
  --help                          Show this message and exit.
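
A rough end-to-end flow based on the help text above (the generated per-job subfolder and kick-off script names are placeholders; this PR doesn't pin them down):

    # 1. Search and filter scenes, and generate the work directory + ard_pbs
    #    script without running anything; keep the JSON logs from stderr
    ard-bulk-reprocess ls 'time in 2023-04' 'wagl_version>"0.1.3"' \
        --max-count 1000 --work-dir /path/to/work 2> bulk-reprocess.jsonl

    # 2. Inspect what was generated before kicking anything off
    ls /path/to/work            # one subfolder is created per job
    less bulk-reprocess.jsonl

    # 3. If it looks good, run the generated kick-off script
    #    (script name is illustrative only)
    sh /path/to/work/<job-subfolder>/kickoff.sh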

@jeremyh force-pushed the jez/bulk-process branch from a18b0c4 to da4ffbf on June 28, 2024 04:51
@jeremyh force-pushed the jez/bulk-process branch 4 times, most recently from 03f1d79 to c287fe4 on October 1, 2024 02:56