Status Quo
We encode Bluesky documents directly into MongoDB documents. There is one MongoDB collection per document type, and each Bluesky document becomes one MongoDB document. The MongoDB representation picks up an additional internal key (_id) that is used internally but not exposed by Databroker.
This system has the benefit of being conceptually straightforward. But it has disadvantages:
Loading a Run requires at least three database hits (start, stop, descriptors).
MongoDB does not support anything like a database join. This makes it impossible to efficiently support queries on anything in 'stop' or 'descriptor' documents. We would be forced to query on start (to filter for access control) and then either do a join or a filter in Python.
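To make the two disadvantages concrete, here is a minimal sketch. It uses plain Python lists of dicts as stand-ins for the MongoDB collections, and the collection names 'start', 'stop', and 'descriptor' and both helper functions are assumptions made for illustration, not Databroker's actual code. Each lookup models one database hit, and the second function shows the kind of Python-side join that a query on a 'stop' field would need.

```python
def load_header(collections, uid):
    """Assemble the header documents for one Run.

    `collections` maps a (hypothetical) collection name to a list of dicts.
    Each lookup below models one separate database hit.
    """
    hits = 0
    start = next(d for d in collections["start"] if d["uid"] == uid)
    hits += 1
    # A stop document may not exist yet if the Run is still in progress.
    stop = next((d for d in collections["stop"] if d["run_start"] == uid), None)
    hits += 1
    descriptors = [d for d in collections["descriptor"] if d["run_start"] == uid]
    hits += 1
    return {"start": start, "stop": stop, "descriptors": descriptors}, hits


def find_runs_by_exit_status(collections, exit_status):
    """Find start documents whose stop document has the given exit_status.

    The matching happens in Python: first collect the run_start uids from
    'stop', then filter 'start' by those uids.
    """
    matching = {
        d["run_start"]
        for d in collections["stop"]
        if d.get("exit_status") == exit_status
    }
    return [d for d in collections["start"] if d["uid"] in matching]
```

With real data, the second function would also have to apply access control to the 'start' documents before or after the join, which is the inefficiency described above.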
Approaches considered and attempted
We considered two approaches:
Consolidate "header" Bluesky documents into one MongoDB document. Consolidate 'event' and 'datum' documents into paged representations. This is what is implemented, but not yet thoroughly tested, in the mongo_embedded implementation in suitcase_mongo.
Consolidate "header" Bluesky documents into one MongoDB document. Store 'event' and 'datum' documents using a different, column-oriented, technology. This is what we planned to do after mongo_embedded was validated.
New proposal
Years on from that decision, I think we should have separated the changes to "header" documents from the changes to 'event' and 'datum' documents. Appending chunks of paged 'event' and 'datum' documents added a lot of complexity, and I think it's arguable whether the effort to make this work in MongoDB was worthwhile. But consolidating 'start', 'stop', and 'descriptor' documents into one MongoDB document is completely straightforward and would advance the kinds of queries we can run efficiently.
A simple migration tool could look for 'start', 'stop', and 'descriptor' collections and consolidate their contents into a new 'header' collection. This would be safe to run multiple times if some writers take more time to switch to the new writing schema.
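As a rough sketch of what such a migration could look like, the function below consolidates the three collections into one 'header' collection. It again uses in-memory lists of dicts in place of MongoDB, and the shape of the consolidated document (keys 'uid', 'start', 'stop', 'descriptors') and the function name are assumptions for illustration only; no schema has been decided. Each header is rebuilt from its sources and keyed by the start uid, which is what makes repeated runs safe.

```python
def consolidate_headers(collections):
    """Build or refresh a 'header' collection from 'start', 'stop', 'descriptor'.

    `collections` maps a (hypothetical) collection name to a list of dicts.
    Headers are keyed by the start document's uid and fully rebuilt on each
    run, so running this again after slow writers add more documents simply
    overwrites the affected headers instead of duplicating them.
    Returns the number of header documents after consolidation.
    """
    # Index existing headers by uid so reruns replace rather than append.
    headers = {h["uid"]: h for h in collections.get("header", [])}

    for start in collections.get("start", []):
        uid = start["uid"]
        stop = next(
            (d for d in collections.get("stop", []) if d["run_start"] == uid),
            None,  # the Run may not have finished yet
        )
        descriptors = [
            d for d in collections.get("descriptor", []) if d["run_start"] == uid
        ]
        headers[uid] = {
            "uid": uid,
            "start": start,
            "stop": stop,
            "descriptors": descriptors,
        }

    collections["header"] = list(headers.values())
    return len(headers)
```

Against a real database, the same idea would presumably be expressed as an upsert keyed on the start uid, so that a second pass picks up 'stop' or 'descriptor' documents written in the old layout after the first pass.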
Then, having done that, we can separately consider whether to switch to a mongo_normalized approach with 'event' and 'datum' or to move them to a separate column-based technology.