
CUMULUS-3757: Update granule upsert logic to allow updating collection info #3887

Closed · 8 commits

Conversation

@Nnaga1 (Contributor) commented on Dec 19, 2024:

Summary

Addresses CUMULUS-3757: Update granule upsert logic to allow updating collection info

Changes

  • Detailed list or prose of changes
  • ...

PR Checklist

  • Update CHANGELOG
  • Unit tests
  • Ad-hoc testing - Deploy changes and test manually
  • Integration tests

@Jkovarik (Member) left a comment:

Hey @Nnaga1 - did a once-over; the approach looks good 🎉, but I have a few implementation questions for discussion. Take a look when you have a moment and let's chat!

@@ -354,3 +360,61 @@ export const getGranulesByGranuleId = async (
.where({ granule_id: granuleId });
return records;
};

/**
* Change a granules' PG record and its files' PG record based on collection move
@Jkovarik (Member):

Nit: formatting. "Change granules PG record and files" or something like that.

@Jkovarik (Member):

We need to better define what 'change' means. Is this an upsert, a bounded modification, etc.

/**
* Change a granules' PG record and its files' PG record based on collection move
*
* @param {Knex | Knex.Transaction} knexOrTransaction - DB client or transaction
@Jkovarik (Member):

Nit: Should this be either? I think we probably don't want to allow a transaction to create a subtransaction here.
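
For illustration, the narrowing might look like the following; the import paths and the internal transaction call are assumptions based on the surrounding discussion, not the PR's code:

```ts
import { Knex } from 'knex';
import { ApiGranuleRecord } from '@cumulus/types/api/granules';

// Accepting only a Knex client (not Knex | Knex.Transaction) makes the
// transaction boundary explicit: the function opens its own transaction,
// so passing an existing transaction in would nest a subtransaction.
export const updateGranulesAndFiles = async (
  knex: Knex,
  granules: ApiGranuleRecord[]
): Promise<void> => {
  await knex.transaction(async (trx) => {
    // ... perform granule and file updates with trx ...
  });
};
```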

* Change a granules' PG record and its files' PG record based on collection move
*
* @param {Knex | Knex.Transaction} knexOrTransaction - DB client or transaction
* @param {Array<ApiGranuleRecord>} [granules] - updated ApiGranule records
@Jkovarik (Member):

Are these going to be full ApiGranuleRecords, or is a partial allowed? This feels like you might want to consider ApiGranule instead, or else enforce a new type that has update fields if they're required.
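
For discussion, a narrower contract could be expressed as a dedicated type; `GranuleCollectionMoveUpdate` is a hypothetical name and the import path is an assumption:

```ts
import { ApiGranule } from '@cumulus/types/api/granules';

// Require only the fields a collection move actually changes, instead of
// forcing callers to construct a full ApiGranuleRecord.
type GranuleCollectionMoveUpdate = Required<
  Pick<ApiGranule, 'granuleId' | 'collectionId'>
> & Pick<ApiGranule, 'files'>;
```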

@Nnaga1 (Author):

In the master PR you mentioned making them ApiGranuleRecords: #3872 (comment)

last_update_date_time: pgGranule.last_update_date_time,
}
);
if (granule.files) {
@Jkovarik (Member):

This doesn't look like it's handling file object set mismatches 'correctly'. If we're going to expose this business logic as a DB package method, we need to make sure it handles that case correctly, or put a logic bound on it so the behavior is understood.

E.g., what does updateGranulesAndFiles's behavior look like in the case where you have a files mismatch?

Based on a first read-through, it looks like this does a .update on only the file records that match (e.g. already exist), which means that if you pass in a file object with fewer or more files, the difference is ignored.

Documentation (or documentation via unit tests) and guard-rails are probably needed here if we're writing a method that specific; one option is sketched below.
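
One way to make the behavior explicit, sketched with assumed table and column names (this is not the PR's implementation):

```ts
// Reconcile the granule's file set wholesale inside the transaction, so a
// payload with fewer or more files is applied rather than silently ignored.
await trx('files')
  .where({ granule_cumulus_id: granuleCumulusId })
  .del();

if (granule.files && granule.files.length > 0) {
  await trx('files').insert(
    granule.files.map((file) => ({
      granule_cumulus_id: granuleCumulusId,
      bucket: file.bucket,
      key: file.key,
      file_name: file.fileName,
    }))
  );
}
```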

} = TableNames;

try {
await createRejectableTransaction(knexOrTransaction, async (trx) => {
@Jkovarik (Member):

Is it appropriate/intended for the batch update to be all or nothing?

Nit: This should probably be documented in the header, for example:
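
A possible header note, with illustrative phrasing:

```ts
/**
 * Update granule PG records and their file PG records for a collection move.
 *
 * Runs as a single rejectable transaction: if any granule or file update
 * fails, the entire batch is rolled back and the error is rethrown.
 */
```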

} = req.testContext || {};
const payload = req.body;
try {
await updateGranulesAndFiles(knex, payload);
@Jkovarik (Member):

🤔

tl;dr: We should probably move the concurrency logic to the API package, handle bounding/configuring concurrency there, and address validation via something like zod.

Details/concerns:

  1. We should consider implementing a concurrency or size bound somehow. The API is limited to 2 DB connections by design; a large enough request to this endpoint will either time out, overwhelm the knex connection pool, or both. That's probably not appropriate for the db package (and is why this sort of logic wound up in the api/lib write 'stuff' in the first place).

  2. Users aren't going to use this only for the approach intended for 3757. If it's part of the public API, we likely should use zod to validate inputs/schemas/etc. before passing a payload to a @cumulus/db method; see the sketch below.
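
A sketch of what that validation might look like at the endpoint; the schema fields, the batch-size bound, and the res.boom usage are illustrative assumptions:

```ts
import { z } from 'zod';

// Validate payload shape and bound the batch size before anything
// reaches the @cumulus/db method.
const fileSchema = z.object({
  bucket: z.string(),
  key: z.string(),
}).passthrough();

const granuleUpdateSchema = z.object({
  granuleId: z.string(),
  collectionId: z.string(),
  files: z.array(fileSchema).optional(),
}).passthrough();

const payloadSchema = z.array(granuleUpdateSchema).max(100); // bound is illustrative

// In the endpoint handler:
const parsed = payloadSchema.safeParse(req.body);
if (!parsed.success) {
  return res.boom.badRequest(parsed.error.message);
}
await updateGranulesAndFiles(knex, parsed.data);
```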

@Nnaga1 (Author) commented on Dec 19, 2024:

So this should probably just update one granule and its files, while the api/lib (or something else) would do the Promise.all/concurrency stuff.

I'm all for changing the approach. I was really hoping the review would make this better than what I had, because there's stuff I wasn't sure about (concurrency, API calls, the API timing out, how to call this...), so whatever would be better works for me. The expected approach has also changed a lot along the way, which has thrown me off as well.

@Nnaga1 (Author):

So should I do it like this: make this endpoint update just one granule and its files in PG and ES, add an api-client call for that, and then, elsewhere, call the api-client function concurrently with the list of target_granules (the changes to make)?

@Jkovarik (Member):

I don't think we want a bulk update feature like this to use a single-granule call - that would force scaling of lambda invocations per granule, and is super inefficient.

There's room for various approaches here, but in particular we need to put configuration/limits on the concurrency around `await Promise.all(granules.map(async (granule) => {` so that we don't overwhelm the limited knex connection pool - or make an exception for this bulk call and increase the pool, though that has user monitoring/tuning concerns.

As an opinionated take: I'd expect the best version of this has the DB package implement a single granule + all files update with associated tests, plus an api/lib method that applies it to a batch of granules with concurrency controls. That allows for solid testing around a single granule update, and decouples concurrency concerns from the DB package. If the team feels strongly that we should support bulk updates within the domain of the DB package, we could take that same approach entirely inside the DB package.
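
A sketch of that split, assuming a hypothetical single-granule `updateGranuleAndFiles` method in `@cumulus/db` and using `p-map` for the concurrency bound:

```ts
import pMap from 'p-map';
import { Knex } from 'knex';
import { ApiGranuleRecord } from '@cumulus/types/api/granules';
// Hypothetical single-granule method exported by @cumulus/db:
import { updateGranuleAndFiles } from '@cumulus/db';

// api/lib layer: fan the batch out with a concurrency bound so the
// API's small knex connection pool is never overwhelmed.
export const updateGranulesAndFiles = async (
  knex: Knex,
  granules: ApiGranuleRecord[],
  concurrency: number = 2 // default mirrors the API's 2-connection limit
): Promise<void> => {
  await pMap(
    granules,
    (granule) => updateGranuleAndFiles(knex, granule),
    { concurrency }
  );
};
```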

@@ -115,7 +115,7 @@ async function genericRecordUpdate(esClient, id, doc, index, type, parent) {
* @param {string} type - Elasticsearch type
* @returns {Promise} Elasticsearch response
*/
async function updateExistingRecord(esClient, id, doc, index, type) {
async function updateExistingRecord(esClient, id, doc, index, type, parent = undefined) {
@Jkovarik (Member):

Why was this change needed?

@Nnaga1 closed this on Jan 1, 2025.
@Nnaga1 deleted the es_for_move_collections_backport branch on Jan 1, 2025 at 22:44.