Calculate number of spares #417 #434
base: develop
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@            Coverage Diff            @@
##           develop     #434    +/-   ##
==========================================
+ Coverage    97.91%   98.00%   +0.08%
==========================================
  Files           48       48
  Lines         1723     1800      +77
==========================================
+ Hits          1687     1764      +77
  Misses          36       36
```

☔ View full report in Codecov by Sentry.
@VKTB @joshuadkitenge @asuresh-code Tagging you all just to say feel free to test this PR and see if you can think of any other cases I missed in the description.
I have just tried testing an alternative method of using an aggregate query on the list endpoint, in the catalogue item repo `list` method instead of the `find`. (This is not using the spares definition usage status array though.) This took too long for Swagger to complete, and was well over 5 minutes for the case described in the description of setting the spares definition (6427 catalogue items, 9684 items). While limited by pagination and querying by catalogue item ID, the 100 MB stage limit would be a bigger problem with a `$lookup` stage, as I believe it would be a combined limit for the catalogue item and item documents that would have to be in memory at the same time.
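For reference, the general shape of such a pipeline would be something like the sketch below; the collection and field names (`catalogue_items`, `items`, `catalogue_item_id`) are assumptions for illustration, not the PR's code:

```python
# Join every catalogue item to its items and derive the count from the size of
# the joined array. The joined array is exactly what risks the 100 MB per-stage
# limit, since all matched item documents are materialised per catalogue item.
pipeline = [
    {
        "$lookup": {
            "from": "items",
            "localField": "_id",
            "foreignField": "catalogue_item_id",
            "as": "matched_items",
        }
    },
    {"$addFields": {"number_of_spares": {"$size": "$matched_items"}}},
    {"$project": {"matched_items": 0}},  # drop the heavy joined array
]
results = list(catalogue_items.aggregate(pipeline))
```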
I tried using Python's `ThreadPoolExecutor` to concurrently update the catalogue items, to see if it would improve performance. I only changed one function in `setting.py` in the services layer.

Using Postman, I got the following results:
- With 104 catalogue items, 159 items: 207 ms w/ multithreading, 222 ms w/o
- With 1194 catalogue items, 1946 items: 1.8 seconds w/, 2.33 seconds w/o
- With 4.7k catalogue items, 7.6k items: 17.92 seconds w/, 20.08 seconds w/o
```python
from concurrent.futures import ThreadPoolExecutor


def update_spares_definition(self, spares_definition: SparesDefinitionPutSchema) -> SparesDefinitionOut:
    """
    Updates the spares definition to a new value.

    :param spares_definition: The new spares definition.
    :return: The updated spares definition.
    :raises MissingRecordError: If any of the usage statuses specified by the given IDs don't exist.
    """
    # Ensure all the given usage statuses exist
    for usage_status in spares_definition.usage_statuses:
        if not self._usage_status_repository.get(usage_status.id):
            raise MissingRecordError(f"No usage status found with ID: {usage_status.id}")

    # Begin a session for transactional updates
    with start_session_transaction("updating spares definition") as session:
        # Upsert the new spares definition
        new_spares_definition = self._setting_repository.upsert(
            SparesDefinitionIn(**spares_definition.model_dump()), SparesDefinitionOut, session=session
        )

        # Lock catalogue items for updates
        utils.prepare_for_number_of_spares_recalculation(None, self._catalogue_item_repository, session)

        # Obtain all catalogue item IDs
        catalogue_item_ids = self._catalogue_item_repository.list_ids()

        # Precompute usage status IDs that define a spare
        usage_status_ids = utils.get_usage_status_ids_from_spares_definition(new_spares_definition)

        # Define the worker function for recalculations
        def recalculate_spares(catalogue_item_id):
            utils.perform_number_of_spares_recalculation(
                catalogue_item_id, usage_status_ids, self._catalogue_item_repository, self._item_repository, session
            )

        # Use ThreadPoolExecutor for concurrent recalculations
        logger.info("Updating the number of spares for all catalogue items concurrently")
        with ThreadPoolExecutor(max_workers=10) as executor:  # May need to experiment w/ max workers
            executor.map(recalculate_spares, catalogue_item_ids)

    return new_spares_definition
```
Thanks for testing this. Another good potential improvement. My main concern with this would be interference with FastAPI's own threading behaviour: as we don't use async functions, FastAPI itself is already using a thread pool to handle multiple requests at once, so I'm not sure it's wise to do this again in code without making other things async. I am surprised that the performance increase was even that much, given I would have thought the database connection is the same across these threads and we don't use async PyMongo. Unless maybe the executor is executing the loop in compiled code instead of Python? I am not sure; perhaps it even spawns multiple database connections.
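(On the database connection point: PyMongo's `MongoClient` is thread-safe and maintains its own connection pool, and the GIL is released while a thread waits on the server, which would explain an I/O-bound speed-up without async. A `ClientSession`, however, is not thread-safe, so sharing one `session` across the executor's workers as in the snippet above is itself risky. A minimal illustration of the pooling side:)

```python
from pymongo import MongoClient

# A single client per process; it pools connections internally (maxPoolSize
# defaults to 100), so up to that many threads can each have an operation in
# flight at once while the GIL is released during network waits.
client = MongoClient("mongodb://localhost:27017", maxPoolSize=10)
```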
I have just tested this and also got a performance improvement. According to
Context:
The current implementation of the usage status in IMS_API prevents users from deleting a usage status if it is used in an item within the Items collection. I had assumed that the same logic would be applied for editing (which has not been implemented yet).

The way the spares definition is structured effectively assigns a type (i.e., spares) to the usage status. However, this type is stored globally, rather than within the usage status collection itself. This design means that usage statuses can be edited indirectly, which I don't believe should be allowed. Allowing such edits would introduce unnecessary complexity, as it could redefine what items are without considering their position, context, or other related factors.

For example, consider the usage statuses: new, used, scrapped, and inUse. If the type is modified globally and affects the status of items, it could lead to significant issues. Specifically, if an item was initially defined as a spare and the spares definition is then modified, the system would need to update all related item data before allowing the change. This would require careful modification of the items that are still linked to the spare status, which introduces complexity and potential for errors. Allowing this kind of change without ensuring all related items are updated first would cause data inconsistencies.

Storing the type within the usage status collection would make this more static and controlled, as any changes would then have to follow the predefined logic within that collection. This ensures consistency across the system and prevents the accidental redefinition or removal of statuses that could break existing item definitions.
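For illustration, the alternative suggested here would store the flag on each usage status document itself; a hypothetical shape (the field names are mine, not IMS_API's):

```python
# Hypothetical usage status document with the type stored per status, rather
# than in a single global spares definition setting
usage_status = {
    "value": "used",
    "type": "spare",  # or None for statuses that should not count as spares
}
```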
We don't allow edits for usage statuses here. Even if we did, e.g. to the name, it's using an aggregate query, so it shouldn't care.

Not sure I follow? It's either a spare because of its usage status or it's not. I thought we already discussed this part.

Yes, that is what this implementation is doing.

Again, this is doing the update in a way that ensures they are all updated if the spares definition changes.

I don't understand this logic? We can't just store the spares definition inside the usage statuses: it's a setting, not a specific usage status. Moving it there wouldn't change any of this logic; if it were updated it would still need to recalculate.
The spares definition is effectively assigning a type to the usage status (indirectly). An item is either a … Moving this logic to the …
This is what I'm talking about: spares_def_concern.mp4
Referenced code:

```python
    recalculating
    """

    with start_session_transaction(action_description) as session:
```
This means that items for that specific catalogue item (but not other catalogue items) cannot be created, edited or deleted during a transaction?
That is correct; to avoid another spares calculation from happening in between this one (leading to an invalid number of spares being stored in the database), it will block it. However, like the case above, the count query is very fast, so I have been unable to cause it in practice when generating many at once. (The mock data script is also making it past these when generating data as well.)
And to double-check again, would that only be the case for that specific catalogue item? Or would it prevent creating items in other catalogue items too?
Just within the specific catalogue item, as it's that item that has the number of spares updated.
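For context, one common way such a per-document write lock works in MongoDB transactions (a sketch of the general technique under assumed names, not necessarily the exact query this PR uses):

```python
# Any write to the catalogue item document inside the transaction takes a write
# lock on that single document: a concurrent transaction writing the same
# document hits a write conflict, while other catalogue items are unaffected.
catalogue_items.update_one(
    {"_id": catalogue_item_id},
    {"$set": {"number_of_spares": None}},  # placeholder; recalculated before commit
    session=session,
)
```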
Referenced code:

```python
            catalogue_item_id, self._catalogue_item_repository, session
        )

        yield session
```
Not quite sure what the purpose of the `yield` is and what it accomplishes here; the method docs just refer to it as yielding. Do you mind explaining please?
I must admit I have always found it awkward to try and explain what it does 😄; I am more familiar with its use. The best example I can think of is `open`: a simplified, generator-based definition of it would be something like

```python
from contextlib import contextmanager

@contextmanager
def open_file(path):
    file = open(path)  # runs when entering the `with` block
    yield file         # hands the file object to the `with` block
    file.close()       # runs when the `with` block is exited
```
Then when using `with open_file(...)`, it executes everything up to the `yield`, which returns the file object (allowing it to be used), but when the Python interpreter leaves the `with` block it also executes the part after the `yield`, closing the file.
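For instance, using the sketch above:

```python
with open_file("example.txt") as file:  # executes up to the yield
    print(file.read())                  # the file is open inside the block
# by here the code after the yield has run, so the file is closed
```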
In this case I am using it just to extend the behaviour of an existing one, `start_session_transaction`, but I want it to only close when the `with` block is exited, so I use the `yield`, effectively allowing `_start_transaction_impacting_number_of_spares` to behave in the same way as `start_session_transaction`.
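To make that concrete, here is a rough sketch of the shape being described, inferred from the quoted diff above (any names and parameters beyond those visible there are assumptions, not the PR's exact code):

```python
from contextlib import contextmanager

@contextmanager
def _start_transaction_impacting_number_of_spares(catalogue_item_id, catalogue_item_repository, action_description):
    # Reuse the existing context manager rather than reimplementing it
    with start_session_transaction(action_description) as session:
        # Write lock the catalogue item first so similar updates can't interfere
        prepare_for_number_of_spares_recalculation(
            catalogue_item_id, catalogue_item_repository, session
        )
        # Hand the session to the caller's `with` block; when that block exits,
        # control returns here and start_session_transaction's own exit logic
        # (commit/abort) runs
        yield session
```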
Referenced code:

```python
Should after `perform_number_of_spares_recalculation` in order to ensure the catalogue item is write locked to avoid
other similar updates interfering.
```
Suggested change:

```python
Should be called after `perform_number_of_spares_recalculation` in order to ensure the catalogue item is write
locked to avoid other similar updates interfering.
```
Description
See #417. Leaves modified time unchanged.
Concurrency notes
There are multiple cases where concurrency can potentially cause a problem in this PR; I have attempted to mitigate these. Here are some particular cases to mention.
Performance tests
Setting the spares definition (using Postman)

This is much worse for high numbers of catalogue items as it iterates through them. I did look at aggregate queries but couldn't find examples close to what would be needed here. Still potentially worth investigating further. The main limitation would be the stage memory limit for a large number of items, as the count would likely have to come from the `$size` of a `$lookup` stage. (This would also have been the case for using aggregate queries in all catalogue item requests.)

Testing instructions
Agile board tracking
Closes #417