@NGrech
Collaborator

@NGrech NGrech commented Aug 29, 2025

This PR adds a new getBatchForStudyDeployments() endpoint to DataStreamService that enables efficient retrieval of data from multiple study deployments with optional filtering capabilities.

Changes

New Endpoint

  • getBatchForStudyDeployments(): Retrieve data for multiple deployments in a single call
    • Filter by device role names
    • Filter by data types
    • Filter by time range (inclusive start, exclusive end)
    • Returns aggregated DataStreamBatch with non-overlapping sequences per stream

Implementation Details

  • Time-range filtering using inclusive lower and exclusive upper bound ([from, to))
  • Adjusts firstSequenceId when measurements are filtered from the beginning
  • Maintains sequence ordering and non-overlap guarantees per data stream
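A minimal sketch of the clipping behavior described above, not the code in this PR. `Measurement`, `DataSequence`, and `clip` are hypothetical, simplified stand-ins, and the bounds are assumed to be `Long` values in the same unit as `sensorStartTime`. Because timestamps may be non-monotonic, the sketch splits the kept measurements into contiguous runs so that sequence IDs stay consecutive within each result.

```kotlin
// Hypothetical, simplified stand-ins; not the actual data types used in this PR.
data class Measurement(val sensorStartTime: Long)
data class DataSequence(val firstSequenceId: Long, val measurements: List<Measurement>)

/**
 * Keep only measurements with `from <= sensorStartTime < to`.
 * A null bound is treated as unbounded on that side.
 * Each contiguous run of kept measurements becomes its own sequence, whose
 * `firstSequenceId` is shifted by the number of measurements preceding the run.
 */
fun clip(sequence: DataSequence, from: Long?, to: Long?): List<DataSequence> {
    val result = mutableListOf<DataSequence>()
    var runStart = -1 // Index of the first measurement in the open run; -1 when no run is open.

    // Emit the open run (if any) covering indices [runStart, endExclusive).
    fun closeRun(endExclusive: Int) {
        if (runStart >= 0) {
            result += DataSequence(
                sequence.firstSequenceId + runStart,
                sequence.measurements.subList(runStart, endExclusive).toList()
            )
            runStart = -1
        }
    }

    sequence.measurements.forEachIndexed { index, measurement ->
        val time = measurement.sensorStartTime
        val inRange = (from == null || time >= from) && (to == null || time < to)
        if (inRange) {
            if (runStart < 0) runStart = index
        } else {
            closeRun(index)
        }
    }
    closeRun(sequence.measurements.size)

    return result
}
```

With this shape, clipping a sequence that starts at ID 10 and drops only its first measurement yields a single sequence starting at ID 11.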

Bug Fixes

  • Fixed time filtering to properly adjust sequence IDs when clipping measurements
  • Updated RPC example requests to match ParticipantGroupStatus constructor changes

Testing

  • Added comprehensive test suite in InMemoryDataStreamServiceBatchRetrievalTest
  • Tests cover filtering, aggregation, edge cases, and non-monotonic timestamps

@NGrech NGrech added the feature New functionality. label Aug 29, 2025
@NGrech NGrech self-assigned this Aug 29, 2025
@NGrech NGrech added this to the 2.0.0 milestone Aug 29, 2025

This comment was marked as outdated.

@Whathecode

This comment was marked as resolved.

@NGrech NGrech force-pushed the feature/data-batch-endpoint-syncpoint branch from d2b7a73 to eca28e5 Compare September 1, 2025 12:07
@NGrech
Collaborator Author

NGrech commented Sep 1, 2025

@Whathecode I squashed the new fixes into the original one.
Note that the reason the generated test files were not in the original commit was that there is no mention of that requirement in the CONTRIBUTING.md.
I think we should also fix this:

You can also run detekt separately through gradle detekt

by changing it to gradle detektPasses, since that is the command run by the code analysis check when committing, and (at least on Windows) gradle detekt builds successfully even when there are issues that gradle detektPasses fails on.

Whathecode

This comment was marked as outdated.

@NGrech

This comment was marked as outdated.

@Whathecode
Member

I see you added this to the 2.0.0 milestone instead of 1.3. Any reason you expect this to be a breaking change, i.e., warranting a new major release?

@NGrech NGrech force-pushed the feature/data-batch-endpoint-syncpoint branch from eca28e5 to e96bce2 Compare September 17, 2025 11:09
@NGrech
Collaborator Author

NGrech commented Sep 17, 2025

@Whathecode I have updated the code based on the last discussion.

Member

@Whathecode Whathecode left a comment

Still an incomplete review, but I started looking at why you added ImmutableDataStreamBatch and ...Sequence. The PR description does not clarify why you are adding these. Have a look at some of the questions I asked and see whether you can clarify things.

I also have the impression that adding these changes can easily be done as a separate commit (and even PR). You don't need those for your updates to DataStreamService, and as far as I can tell, the existing data structures would work just fine.

While looking at the changes, I noticed some whitespace that doesn't follow the code style. I added a commit which you can squash.

@NGrech NGrech force-pushed the feature/data-batch-endpoint-syncpoint branch from abce685 to 7435edb Compare October 28, 2025 09:51
@NGrech NGrech changed the title from "feat(data): add getBatchForStudyDeployments endpoint" to "feat(data): add getBatchForStudyDeployments endpoint with filtering" Oct 28, 2025
@NGrech NGrech requested a review from Whathecode October 28, 2025 10:10
Add new DataStreamService.getBatchForStudyDeployments() endpoint to retrieve data for multiple deployments with optional filters for device roles, data types, and time ranges.

- Add getBatchForStudyDeployments to DataStreamService interface

- Implement time-range filtering with exclusive upper bound

- Adjust firstSequenceId when filtering removes measurements

- Add comprehensive tests for filtering and batch retrieval

- Update RPC examples for ParticipantGroupStatus changes

- Add documentation and test request snapshots
@NGrech NGrech force-pushed the feature/data-batch-endpoint-syncpoint branch from 7435edb to 30be1a8 Compare October 28, 2025 10:44
@NGrech
Collaborator Author

NGrech commented Oct 28, 2025

@Whathecode & @yuanchen233 I have updated the PR. I think it is in a good state now and ready for review.
The main thing to note compared to the last version is that I dropped the immutable class approach so as not to violate LSP, and added specific tests to ensure that the returned sequences are sequential and non-overlapping.
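As an illustration of that invariant, here is a minimal sketch of the kind of check such a test could make. It is not the test code in this PR: `IdRange` and `isSequentialAndNonOverlapping` are hypothetical names, and it assumes that "non-overlapping" is judged on the range of sequence IDs each sequence occupies within one data stream.

```kotlin
// Hypothetical stand-in for a sequence: only the occupied range of sequence IDs is modeled.
data class IdRange(val firstSequenceId: Long, val size: Int) {
    val lastSequenceId: Long get() = firstSequenceId + size - 1
}

/** True when the ranges are in ascending order and no sequence ID is covered twice. */
fun isSequentialAndNonOverlapping(ranges: List<IdRange>): Boolean =
    ranges.zipWithNext().all { (previous, next) ->
        next.firstSequenceId > previous.lastSequenceId
    }
```

Two adjacent ranges such as IDs 0 to 2 followed by IDs 3 to 4 pass the check, while a second range starting at ID 2 would fail it.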

Member

@Whathecode Whathecode left a comment

Only reviewed the DataStreamService.getBatchForStudyDeployments() contract for now.

Comment on lines +10 to +11
import kotlinx.serialization.Required
import kotlinx.serialization.Serializable
Member

I don't think it's configured, but wildcard imports are used for serialization across the codebase.

Comment on lines +78 to +79
* @param deviceRoleNames Optional device role name filter (e.g., "phone"). If null or empty, all are included.
* @param dataTypes Optional data type filter. If null or empty, all are included.
Member

If null or empty, all are included.

That's counterintuitive. If empty, just filter out everything. A good API doesn't give two ways to do the same thing.

On why it matters: suppose a caller sets up a dynamic filter determining the set of device role names they are interested in, which ends up being empty. Now the caller will get all data, instead of no data, as expected.
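A minimal sketch of the semantics suggested here, where `null` means "no filter" and an empty set matches nothing. `matchesFilter` is a hypothetical helper used only for illustration; it is not part of the PR.

```kotlin
/**
 * Hypothetical helper: a null filter places no restriction on the value,
 * whereas an empty filter matches no value at all.
 */
fun <T> matchesFilter(filter: Set<T>?, value: T): Boolean =
    filter == null || value in filter
```

With this, a dynamically built set of device role names that ends up empty yields no data, while passing `null` keeps everything.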

Comment on lines +70 to +72
* The response is a canonical [DataStreamBatch]: for each [DataStreamId], sequences are
* ordered by start time and non-overlapping (contract preserved). No derived/secondary
* indexing is applied in this API; analytics-specific projections are out of scope here.
Member

Drop this; all of these are implementation/design details, not API documentation. The contract (API) of DataStreamBatch is already documented on DataStreamBatch.

Comment on lines +74 to +75
* Time range semantics: if [from] or [to] are specified, sequences are clipped to the
* half-open interval [from, to) (inclusive start, exclusive end).
Member

That only works if both from and to are specified. But it looks like you can omit this and instead just document the inclusive/exclusive nature on the corresponding from/to parameters. As is, this raises more questions about edge cases than it answers.

Instead, I'm more surprised about how Instant comes into the picture here. The data subsystem only has Long's for sensorStartTime and sensorEndTime. So ... what is happening here? How do I know what to pass?

* Time range semantics: if [from] or [to] are specified, sequences are clipped to the
* half-open interval [from, to) (inclusive start, exclusive end).
*
* @param studyDeploymentIds Study deployments to query. Must not be empty.
Member

Must not be empty.

Why? It seems like an overly strict contract. You can easily return nothing if you pass nothing, which would cause less additional handling for this edge case by the caller if they don't care about optimization/saving a roundtrip.
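A minimal sketch of why no special case is needed, assuming a hypothetical in-memory store keyed by study deployment ID (`getData` and the `String` payloads are illustrative only): iterating over an empty set of IDs naturally produces an empty result.

```kotlin
/**
 * Hypothetical lookup: collects the data of every requested deployment.
 * An empty set of IDs, or an unknown ID, simply contributes nothing.
 */
fun getData(
    store: Map<String, List<String>>,
    studyDeploymentIds: Set<String>
): List<String> =
    studyDeploymentIds.flatMap { store[it].orEmpty() }
```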

* @param to Optional absolute end time (exclusive). If null, no upper bound.
* @return A [DataStreamBatch] containing matching data sequences, preserving per-stream invariants.
*/
suspend fun getBatchForStudyDeployments(
Member

I'm not certain about the naming of this. Maybe simply getData? But, it will depend a bit on what actually comes out. It still seems like some synchronization is bound to happen (which would need documentation!), given the from and to Instant parameters, in which case getSynchronizedData or similar could be more appropriate.
