-
Notifications
You must be signed in to change notification settings - Fork 3
New worker event assembly #41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
036340e
b2c1d5b
b84a7c0
0df066a
b02582e
83c48d3
3d99bf7
b71d072
db04bb0
f33f68f
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,96 @@ | ||
| import asyncio | ||
| from typing import AsyncGenerator, Optional | ||
|
|
||
| import zmq | ||
|
|
||
|
|
||
| from dranspose.data.stream1 import Stream1Packet, Stream1End | ||
| from dranspose.event import StreamData, InternalWorkerMessage | ||
| from dranspose.ingester import Ingester, IngesterSettings | ||
| from dranspose.protocol import ( | ||
| StreamName, | ||
| ZmqUrl, | ||
| WorkAssignment, | ||
| WorkerName, | ||
| ) | ||
|
|
||
|
|
||
| class StinsParallelSettings(IngesterSettings): | ||
| upstream_url: ZmqUrl | ||
|
|
||
|
|
||
| class StinsParallelIngester(Ingester): | ||
| """ | ||
| A simple ingester class to comsume a stream from the streaming-receiver repub port | ||
| """ | ||
|
|
||
| def __init__(self, settings: Optional[StinsParallelSettings] = None) -> None: | ||
| if settings is not None: | ||
| self._streaming_single_settings = settings | ||
| else: | ||
| self._streaming_single_settings = StinsParallelSettings() | ||
|
|
||
| super().__init__(settings=self._streaming_single_settings) | ||
| self.in_socket: Optional[zmq._future._AsyncSocket] = None | ||
|
|
||
| async def work(self) -> None: | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Summary (to see if I get it):
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Quoting @felix-engelmann "The magic is that the parallel [ingester] can't use simple counting and has to rely on data in the stream ie msg_numbe[r]"
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I was thinking about this, and I started wondering if it would not make more sense to move the responsibility of providing the message number from Ingester.work() to Ingester.run_source(). This would cause some code duplication, as all existing ingester would need a few lines to count the current message number, but would open the possibility for future ingesters to decide how to compute the event number. Maybe we can just keep this in mind if this becomes an issue for other ingesters in the future.
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I've changed my mind! I just went back to the replay code, and it looks to me like this would just work fine! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think I like having a shadow of work() which does mostly but not entirely the same thing. Partly because it's duplicate code, and it also makes it very unclear what is different about this ingester unless you read both implementations of work() side by side like I am doing now. I am still not sure what is going on here.
Owner
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Dumping from multiple parallel ingesters would be a difficult problem, especially if two scans overlap and end up in the same files. |
||
| self._logger.info("started stins ingester work task") | ||
|
|
||
| if len(self.active_streams) == 0: | ||
| self._logger.warning("this ingester has no active streams, stopping worker") | ||
| return | ||
| sourcegen = self.run_source_part(self.active_streams[0]) | ||
| try: | ||
| while True: | ||
| nextiwm: InternalWorkerMessage = await anext(sourcegen) | ||
|
|
||
| work_assignment: WorkAssignment = await self.assignment_queue.get() | ||
| while work_assignment.event_number < nextiwm.event_number: | ||
| work_assignment = await self.assignment_queue.get() | ||
|
|
||
| workermessages: dict[WorkerName, InternalWorkerMessage] = {} | ||
| for stream, workers in work_assignment.assignments.items(): | ||
| for worker in workers: | ||
| if worker not in workermessages: | ||
| workermessages[worker] = nextiwm | ||
| self._logger.debug("workermessages %s", workermessages) | ||
| await self._send_workermessages(workermessages) | ||
| self.state.processed_events += 1 | ||
| except asyncio.exceptions.CancelledError: | ||
| self._logger.info("stopping worker") | ||
| for stream in self.active_streams: | ||
| await self.stop_source(stream) | ||
|
|
||
| async def run_source_part( | ||
| self, stream: StreamName | ||
| ) -> AsyncGenerator[InternalWorkerMessage, None]: | ||
| self.in_socket = self.ctx.socket(zmq.PULL) | ||
| self.in_socket.connect(str(self._streaming_single_settings.upstream_url)) | ||
| self._logger.info( | ||
| "pulling from %s", self._streaming_single_settings.upstream_url | ||
| ) | ||
|
|
||
| while True: | ||
| parts = await self.in_socket.recv_multipart(copy=False) | ||
| try: | ||
| packet = Stream1Packet.validate_json(parts[0].bytes) | ||
| except Exception as e: | ||
| self._logger.error("packet not valid %s", e.__repr__()) | ||
| continue | ||
| self._logger.debug("msg number %d", packet.msg_number) | ||
| yield InternalWorkerMessage( | ||
| event_number=packet.msg_number, | ||
| streams={stream: StreamData(typ="STINS", frames=parts)}, | ||
| ) | ||
|
|
||
| if isinstance(packet, Stream1End): | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. the parallel ingester that does not get the Stream1End will keep running and consuming messages. In general, we are walking away from the notion of a stream with a well-defined start and end; is this ingester more fragile? Or were we just more paranoid before? Would it make sense to have only one partial ingester have an open socket until it receives a Stream1Start, and then notify the others? If some delay is ok, one could even just use Redis for that.
Owner
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We are assuming more implicit usage contracts, yes. When only consuming with one stream, it is easy to enforce a strict sequence from start until end. The parallel ingester in this PR does not have any of these guarantees. I like the idea of having a single ingester connected, waiting for the start and once received notifying others to start "helping out" with consuming messages from their PULL sockets. Then once some ingester got an end message, it notifies all others to disconnect except the "main" ingester which is again waiting for the next start. For communication, I would suggest redis, as it already is the bottleneck on round trip times and we don't need more dependencies. A possible design could be that the main ingester has a special flag and all ingesters for a stream share a redis-stream to which they post the start message and the end message and then act accordingly. We need to loop in the logic of restarting a stuck scan where there is no end message etc. but that should be possible to model. |
||
| break | ||
| while True: | ||
| self._logger.debug("discarding messages until next run") | ||
| await self.in_socket.recv_multipart(copy=False) | ||
|
|
||
| async def stop_source(self, stream: StreamName) -> None: | ||
| if self.in_socket: | ||
| self._logger.info("closing socket without linger") | ||
| self.in_socket.close(linger=0) | ||
| self.in_socket = None | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,55 @@ | ||
| from typing import Optional, AsyncGenerator | ||
| import zmq | ||
|
|
||
| from dranspose.data.eiger_legacy import ( | ||
| EigerLegacyPacket, | ||
| EigerLegacyEnd, | ||
| EigerLegacyImage, | ||
| EigerLegacyHeader, | ||
| ) | ||
| from dranspose.event import InternalWorkerMessage, StreamData | ||
| from dranspose.ingesters.stins_parallel import StinsParallelIngester | ||
| from dranspose.ingesters.zmqpull_eiger_legacy import ZmqPullEigerLegacySettings | ||
| from dranspose.protocol import StreamName, EventNumber | ||
|
|
||
|
|
||
| class Stream1ParallelIngester(StinsParallelIngester): | ||
| def __init__(self, settings: Optional[ZmqPullEigerLegacySettings] = None) -> None: | ||
| if settings is not None: | ||
| self._streaming_settings = settings | ||
| else: | ||
| self._streaming_settings = ZmqPullEigerLegacySettings() | ||
|
|
||
| super().__init__(settings=self._streaming_settings) | ||
| self.in_socket: Optional[zmq._future._AsyncSocket] = None | ||
|
|
||
| async def run_source_part( | ||
| self, stream: StreamName | ||
| ) -> AsyncGenerator[InternalWorkerMessage, None]: | ||
| self.in_socket = self.ctx.socket(zmq.PULL) | ||
| self.in_socket.connect(str(self._streaming_settings.upstream_url)) | ||
| self._logger.info("pulling from %s", self._streaming_settings.upstream_url) | ||
|
|
||
| while True: | ||
| parts = await self.in_socket.recv_multipart(copy=False) | ||
| try: | ||
| packet = EigerLegacyPacket.validate_json(parts[0].bytes) | ||
| except Exception as e: | ||
| self._logger.error("packet not valid %s", e.__repr__()) | ||
| continue | ||
| msg_number = None | ||
| if isinstance(packet, EigerLegacyImage): | ||
| msg_number = EventNumber(packet.frame + 1) | ||
| elif isinstance(packet, EigerLegacyHeader): | ||
| msg_number = EventNumber(0) | ||
| elif isinstance(packet, EigerLegacyEnd): | ||
| break | ||
| self._logger.debug("msg number %d", msg_number) | ||
| yield InternalWorkerMessage( | ||
| event_number=msg_number, | ||
| streams={stream: StreamData(typ="EIGER_LEGACY", frames=parts)}, | ||
| ) | ||
|
|
||
| while True: | ||
| self._logger.debug("discarding messages until next run") | ||
| await self.in_socket.recv_multipart(copy=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pythons boolean operators are more useful than they first might seem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The danger with implicit bool casting vs None test is visible in this contrived example:
I only want to create an empty Settings object if None was provided as parameter. If the provided value happens to be an implicit false, we get the problem: