
CUMULUS-3757: Update granule upsert logic to allow updating collection info #3887

Closed · 8 commits

Conversation

@Nnaga1 (Contributor) commented on Dec 19, 2024:

Summary

Addresses CUMULUS-3757: Update granule upsert logic to allow updating collection info

Changes

  • Detailed list or prose of changes
  • ...

PR Checklist

  • Update CHANGELOG
  • Unit tests
  • Ad-hoc testing - Deploy changes and test manually
  • Integration tests

@Jkovarik (Member) left a comment:

Hey @Nnaga1 - did a once-over; the approach looks good 🎉, but I have a few implementation questions for discussion. Take a look when you have a moment and let's chat!

@@ -354,3 +360,61 @@ export const getGranulesByGranuleId = async (
.where({ granule_id: granuleId });
return records;
};

/**
* Change a granules' PG record and its files' PG record based on collection move
@Jkovarik (Member):

Nit: formatting. "Change granules PG record and files" or something like that.

@Jkovarik (Member):

We need to better define what 'change' means. Is this an upsert, a bounded modification, etc.

/**
* Change a granules' PG record and its files' PG record based on collection move
*
* @param {Knex | Knex.Transaction} knexOrTransaction - DB client or transaction
@Jkovarik (Member):

Nit: Should this be either? I think we probably don't want to allow a transaction to create a subtransaction here.
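
For illustration, the narrowing might look like the following; the import paths and the internal transaction call are assumptions based on the surrounding discussion, not the PR's code:

```ts
import { Knex } from 'knex';
import { ApiGranuleRecord } from '@cumulus/types/api/granules';

// Accepting only a Knex client (not Knex | Knex.Transaction) makes the
// transaction boundary explicit: the function opens its own transaction,
// so passing an existing transaction in would nest a subtransaction.
export const updateGranulesAndFiles = async (
  knex: Knex,
  granules: ApiGranuleRecord[]
): Promise<void> => {
  await knex.transaction(async (trx) => {
    // ... perform granule and file updates with trx ...
  });
};
```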

* Change a granules' PG record and its files' PG record based on collection move
*
* @param {Knex | Knex.Transaction} knexOrTransaction - DB client or transaction
* @param {Array<ApiGranuleRecord>} [granules] - updated ApiGranule records
@Jkovarik (Member):

Are these going to be full ApiGranuleRecords, or is a partial allowed? This feels like you might want to consider ApiGranule instead, or else enforce a new type that has update fields if they're required.
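
For discussion, a narrower contract could be expressed as a dedicated type; `GranuleCollectionMoveUpdate` is a hypothetical name and the import path is an assumption:

```ts
import { ApiGranule } from '@cumulus/types/api/granules';

// Require only the fields a collection move actually changes, instead of
// forcing callers to construct a full ApiGranuleRecord.
type GranuleCollectionMoveUpdate = Required<
  Pick<ApiGranule, 'granuleId' | 'collectionId'>
> & Pick<ApiGranule, 'files'>;
```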

@Nnaga1 (Author):

In the master PR you mentioned making them ApiGranuleRecords: #3872 (comment)

last_update_date_time: pgGranule.last_update_date_time,
}
);
if (granule.files) {
@Jkovarik (Member):

This doesn't look like it's handling file object set mismatches 'correctly'. If we're going to expose this business logic as a DB package method, we need to make sure it handles that case correctly, or put a logic bound on it so the behavior is understood.

E.g., what does updateGranulesAndFiles's behavior look like in the case where you have a files mismatch?

Based on a first read-through, it looks like this does a .update on only the file records that match (e.g. already exist), which means that if you pass in a file object with fewer or more files, the difference is ignored.

Documentation (or documentation via unit tests) and guard-rails are probably needed here if we're writing a method that specific; one option is sketched below.
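
One way to make the behavior explicit, sketched with assumed table and column names (this is not the PR's implementation):

```ts
// Reconcile the granule's file set wholesale inside the transaction, so a
// payload with fewer or more files is applied rather than silently ignored.
await trx('files')
  .where({ granule_cumulus_id: granuleCumulusId })
  .del();

if (granule.files && granule.files.length > 0) {
  await trx('files').insert(
    granule.files.map((file) => ({
      granule_cumulus_id: granuleCumulusId,
      bucket: file.bucket,
      key: file.key,
      file_name: file.fileName,
    }))
  );
}
```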

} = TableNames;

try {
await createRejectableTransaction(knexOrTransaction, async (trx) => {
@Jkovarik (Member):

Is it appropriate/intended for the batch update to be all or nothing?

Nit: This should probably be documented in the header, for example:
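
A possible header note, with illustrative phrasing:

```ts
/**
 * Update granule PG records and their file PG records for a collection move.
 *
 * Runs as a single rejectable transaction: if any granule or file update
 * fails, the entire batch is rolled back and the error is rethrown.
 */
```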

} = req.testContext || {};
const payload = req.body;
try {
await updateGranulesAndFiles(knex, payload);
@Jkovarik (Member):

🤔

tl;dr: We should probably move the concurrency logic to the API package, handle bounding/configuring concurrency there, and address validation via something like zod.

Details/concerns:

  1. We should consider implementing a concurrency or size bound somehow. The API is limited to 2 DB connections by design; a large enough request to this endpoint will either time out, overwhelm the knex connection pool, or both. That's probably not appropriate for the db package (and is why this sort of logic wound up in the api/lib write 'stuff' in the first place).

  2. Users aren't going to use this only for the approach intended for 3757. If it's part of the public API, we likely should use zod to validate inputs/schemas/etc. before passing a payload to a @cumulus/db method; see the sketch below.
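
A sketch of what that validation might look like at the endpoint; the schema fields, the batch-size bound, and the res.boom usage are illustrative assumptions:

```ts
import { z } from 'zod';

// Validate payload shape and bound the batch size before anything
// reaches the @cumulus/db method.
const fileSchema = z.object({
  bucket: z.string(),
  key: z.string(),
}).passthrough();

const granuleUpdateSchema = z.object({
  granuleId: z.string(),
  collectionId: z.string(),
  files: z.array(fileSchema).optional(),
}).passthrough();

const payloadSchema = z.array(granuleUpdateSchema).max(100); // bound is illustrative

// In the endpoint handler:
const parsed = payloadSchema.safeParse(req.body);
if (!parsed.success) {
  return res.boom.badRequest(parsed.error.message);
}
await updateGranulesAndFiles(knex, parsed.data);
```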

@Nnaga1 (Author) commented on Dec 19, 2024:

So this should probably just update one granule and its files, while the api/lib (or something else) would do the Promise.all/concurrency stuff.

I'm all for changing the approach. I was really hoping the review would make this better than what I had, because there's stuff I wasn't sure about (concurrency, API calls, the API timing out, how to call this...), so whatever would be better works for me. The expected approach has also changed a lot along the way, which has thrown me off as well.

@Nnaga1 (Author):

So should I do it like this: make this endpoint update just one granule and its files in PG and ES, add an api-client call for that, and then, elsewhere, call the api-client function concurrently with the list of target_granules (the changes to make)?

@Jkovarik (Member):

I don't think we want a bulk update feature like this to use a single-granule call - that would force scaling of lambda invocations per granule, and is super inefficient.

There's room for various approaches here, but in particular we need to put configuration/limits on the concurrency around `await Promise.all(granules.map(async (granule) => {` so that we don't overwhelm the limited knex connection pool - or make an exception for this bulk call and increase the pool, though that has user monitoring/tuning concerns.

As an opinionated take: I'd expect the best version of this has the DB package implement a single granule + all files update with associated tests, plus an api/lib method that applies it to a batch of granules with concurrency controls. That allows for solid testing around a single granule update, and decouples concurrency concerns from the DB package. If the team feels strongly that we should support bulk updates within the domain of the DB package, we could take that same approach entirely inside the DB package.
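
A sketch of that split, assuming a hypothetical single-granule `updateGranuleAndFiles` method in `@cumulus/db` and using `p-map` for the concurrency bound:

```ts
import pMap from 'p-map';
import { Knex } from 'knex';
import { ApiGranuleRecord } from '@cumulus/types/api/granules';
// Hypothetical single-granule method exported by @cumulus/db:
import { updateGranuleAndFiles } from '@cumulus/db';

// api/lib layer: fan the batch out with a concurrency bound so the
// API's small knex connection pool is never overwhelmed.
export const updateGranulesAndFiles = async (
  knex: Knex,
  granules: ApiGranuleRecord[],
  concurrency: number = 2 // default mirrors the API's 2-connection limit
): Promise<void> => {
  await pMap(
    granules,
    (granule) => updateGranuleAndFiles(knex, granule),
    { concurrency }
  );
};
```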

@@ -115,7 +115,7 @@ async function genericRecordUpdate(esClient, id, doc, index, type, parent) {
* @param {string} type - Elasticsearch type
* @returns {Promise} Elasticsearch response
*/
async function updateExistingRecord(esClient, id, doc, index, type) {
async function updateExistingRecord(esClient, id, doc, index, type, parent = undefined) {
@Jkovarik (Member):

Why was this change needed?

@Nnaga1 closed this on Jan 1, 2025.
@Nnaga1 deleted the es_for_move_collections_backport branch on Jan 1, 2025 at 22:44.