perf(ingestor_livetiming): add opt-in concurrency/parallelism for historical data ingestion #305

JeffreyJPZ wants to merge 22 commits into br-g:main
Conversation
```python
parallel: bool = False,
max_workers: int | None = None,
batch_size: int | None = None,
```
I don't exactly like propagating these through all of the functions (very easy to forget), but I'm not sure what other options there are.
```python
if len(line) == 0:
    return None, None
```
Moved the empty line check here, makes it much easier to write the parallel part.
```python
class Typer(typer.Typer):
    """
    Wrapper class for typer.Typer to allow for decorating async functions.
    Adapted from https://github.com/byunjuneseok/async-typer.
    """

    def command(self, *args, **kwargs):
        def decorator(func: Callable):
            @wraps(func)
            def _func(*_args, **_kwargs):
                if iscoroutinefunction(func):
                    # Use current event loop if already running
                    loop = None
                    try:
                        loop = get_running_loop()
                    except RuntimeError:
                        pass

                    if loop is not None:
                        return loop.run_until_complete(func(*_args, **_kwargs))
                    else:
                        return run(func(*_args, **_kwargs))

                return func(*_args, **_kwargs)

            super(Typer, self).command(*args, **kwargs)(_func)

            return func

        return decorator
```
This is pretty similar to the package async-typer, but I'd rather not have the extra dependency.
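The core trick here — a sync wrapper that detects coroutine functions and drives them with `asyncio` — can be exercised without Typer at all. A minimal sketch (the decorator name `as_sync_command` is hypothetical):

```python
from asyncio import get_running_loop, iscoroutinefunction, run
from functools import wraps


def as_sync_command(func):
    """Wrap an async (or plain) function so a sync CLI framework can call it."""

    @wraps(func)
    def _func(*args, **kwargs):
        if iscoroutinefunction(func):
            try:
                loop = get_running_loop()
            except RuntimeError:
                loop = None
            if loop is not None:
                # Already inside an event loop: drive the coroutine on it
                return loop.run_until_complete(func(*args, **kwargs))
            # No loop running: start one just for this call
            return run(func(*args, **kwargs))
        # Plain sync function: call through unchanged
        return func(*args, **kwargs)

    return _func


@as_sync_command
async def greet(name: str) -> str:
    return f"hello {name}"
```

Calling `greet("world")` from sync code now transparently runs the coroutine to completion.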
```python
if parallel:
    await tqdm_async.gather(
        *[
            insert_data_async(
                collection_name=collection,
                docs=[d.to_mongo_doc_sync() for d in docs],
            )
            for collection, docs in docs_by_collection.items()
        ],
        disable=not verbose,
    )
else:
    for collection, docs in tqdm(
        list(docs_by_collection.items()), disable=not verbose
    ):
        docs_mongo = [d.to_mongo_doc_sync() for d in docs]
        insert_data_sync(collection_name=collection, docs=docs_mongo)
```
Not sure if async writes should be done even when --parallel isn't enabled, I left the original code there for now.
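The effect of the parallel branch — overlapping the per-collection writes instead of serializing them — can be sketched with plain `asyncio.gather` and a stand-in for the real insert (the names `insert_data_async` and `flush` below are illustrative, not the PR's actual functions):

```python
import asyncio


async def insert_data_async(collection_name: str, docs: list) -> int:
    """Stand-in for the real async Mongo insert; just simulates I/O."""
    await asyncio.sleep(0.01)
    return len(docs)


async def flush(docs_by_collection: dict[str, list], parallel: bool) -> list[int]:
    if parallel:
        # All per-collection inserts are in flight at once;
        # gather preserves the order of the awaitables it was given
        return list(
            await asyncio.gather(*[
                insert_data_async(collection_name=c, docs=d)
                for c, d in docs_by_collection.items()
            ])
        )
    # Sequential fallback: each insert completes before the next starts
    return [
        await insert_data_async(collection_name=c, docs=d)
        for c, d in docs_by_collection.items()
    ]


counts = asyncio.run(flush({"laps": [1, 2], "weather": [3]}, parallel=True))
# counts == [2, 1]
```

With N collections and per-insert latency t, the gathered version finishes in roughly t rather than N·t, which is why doing the writes asynchronously could pay off even without `--parallel`.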
```diff
 for session_key in session_keys:
     if verbose:
         logger.info(f"Ingesting session {session_key}")
-    ingest_session(
-        year=year, meeting_key=meeting_key, session_key=session_key, verbose=False
-    )
+    # If parallel is set, program is not I/O bound so having await in a for-loop isn't an issue
+    await ingest_session(
+        year=year,
+        meeting_key=meeting_key,
+        session_key=session_key,
+        parallel=parallel,
+        max_workers=max_workers,
+        batch_size=batch_size,
+        verbose=verbose,
+    )
```
Can't really parallelize across meetings AND sessions AND topics as that would mean a lot of processes, so I just focused on the larger bottlenecks.
```python
def batched(iterable, n, strict=False):
    """
    This function is equivalent to Python's itertools.batched:
    https://docs.python.org/3/library/itertools.html#itertools.batched

    Batch data from the iterable into tuples of length n. The last batch may be shorter than n.
    If strict is true, will raise a ValueError if the final batch is shorter than n.
    """
    if n < 1:
        raise ValueError("n must be at least one")
    iterator = iter(iterable)
    while batch := tuple(islice(iterator, n)):
        if strict and len(batch) != n:
            raise ValueError("batched(): incomplete batch")
        yield batch
```
itertools.batched was added in Python 3.12 while this project supports >=3.10.
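For reference, the backport behaves the same as the 3.12 builtin — the last batch is allowed to be short unless `strict=True`:

```python
from itertools import islice


def batched(iterable, n, strict=False):
    """Backport of itertools.batched (Python 3.12) for projects on >=3.10."""
    if n < 1:
        raise ValueError("n must be at least one")
    iterator = iter(iterable)
    while batch := tuple(islice(iterator, n)):
        if strict and len(batch) != n:
            raise ValueError("batched(): incomplete batch")
        yield batch


print(list(batched(range(7), 3)))  # [(0, 1, 2), (3, 4, 5), (6,)]
```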
Hello @JeffreyJPZ, thanks a lot for opening this PR. I have run a few tests, though, and the results seem to be different when `--parallel` is enabled. For example, on my laptop:

I'm not sure exactly why.
Good catch, it's a bit tedious to test everything manually. When I tried to reproduce it on my end, it looks like the correct number of messages are being fetched in both modes, but I got 945 processed laps when using parallel (sequential seems fine). I think I narrowed it down to

Edit: this might be a bit difficult as each message can map to multiple collections, will have to think more about this
I've decided to remove multiprocessing: after some more testing, it doesn't seem to provide any benefit, and it limits concurrency with multiple meetings/sessions. The bottleneck is processing the

This would definitely benefit from a few integration tests #264 (e.g. ensuring that the correct number of documents is processed), but Typer's support for testing is not well-documented or easy to use, and there are external systems to consider (F1 API, Mongo). A custom framework (based on pytest) would be necessary.
Closes #304
Description
Currently, historical data ingestion is synchronous and can take many minutes to hours to complete when ingesting large amounts of data. This PR aims to reduce the time needed by optimizing some of the "hot paths".
Changes

- New `--parallel` and optional `--max-workers`, `--batch-size` CLI flags
- Optimized `_parse_and_decode_topic_content` and `parse_messages`
- Wrapper class for `typer.Typer` to allow for decorating async functions

Results
Best-case data ingestion time should be reduced by ~90%.