[WIP] Add support for parsing PDF pages in parallel #237
Codecov Report
@@            Coverage Diff             @@
##           master     #237      +/-   ##
==========================================
+ Coverage   88.35%   88.42%   +0.07%
==========================================
  Files          14       14
  Lines        1571     1581      +10
  Branches      358      359       +1
==========================================
+ Hits         1388     1398      +10
  Misses        128      128
  Partials       55       55
Continue to review full report at Codecov.
Why do we need to pass
I did this to avoid copying the tests and make sure that the new argument
Since the existing code overwrites `layout` and `dim` on each iteration, it is much more efficient to return the `layout` and `dim` of the first page directly. I tested the difference on a 455-page PDF, and the optimisation reduces the time spent from 50 seconds to 5 seconds.

Signed-off-by: Karl Bonde Torp <[email protected]>
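The difference between the two approaches can be sketched as follows. This is only an illustration, not the code from the commit: `layout_by_overwriting`, `layout_of_first_page`, and the `analyse` callback are hypothetical names standing in for the loop inside `get_page_layout` and the per-page layout analysis.

```python
from typing import Callable, Iterable, Optional, Tuple, TypeVar

Page = TypeVar("Page")
# Hypothetical result type: (layout, (width, height)).
Result = Tuple[Optional[str], Optional[Tuple[float, float]]]


def layout_by_overwriting(
    pages: Iterable[Page], analyse: Callable[[Page], Result]
) -> Result:
    """Analyse every page, keeping only the values from the last iteration."""
    layout, dim = None, None
    for page in pages:
        # Each iteration discards the previous layout and dim.
        layout, dim = analyse(page)
    return layout, dim


def layout_of_first_page(
    pages: Iterable[Page], analyse: Callable[[Page], Result]
) -> Result:
    """Analyse only the first page and return immediately."""
    for page in pages:
        return analyse(page)
    # No pages at all: mirror the defaults of the overwriting loop.
    return None, None
```

Both functions return the same values when the input holds a single page. For a multi-page input the overwriting loop ends on the last page while the early return stops at the first, so the saving comes from skipping the analysis of every remaining page.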
Will this ever be merged, or does camelot already support multiprocessing?
What is missing for this to be merged?
Is this still WIP? What's preventing this from being merged?
Currently, there are merge conflicts that need to be resolved first. #353 might also get merged, and that makes me uncertain about how to continue with all the others.
Hey! As camelot is dead, we are trying to build a maintained fork at Do you want to open the PR against that branch so that we can merge your improvement?
[MRG] Utils: optimise get_page_layout
• Installing authlib (1.3.0)
• Installing marshmallow (3.21.0)
• Installing pydantic (1.10.14)
• Installing safety-schemas (0.0.2)
• Installing typer (0.9.0)
• Removing gitdb (4.0.10)
• Removing gitpython (3.1.37)
• Removing smmap (5.0.0)
• Updating attrs (23.1.0 -> 23.2.0)
• Updating babel (2.12.1 -> 2.14.0)
• Updating bandit (1.7.5 -> 1.7.7)
• Updating beautifulsoup4 (4.12.2 -> 4.12.3)
• Updating black (23.7.0 -> 24.2.0)
• Updating certifi (2023.7.22 -> 2024.2.2)
• Updating cffi (1.15.1 -> 1.16.0)
• Updating cfgv (3.3.1 -> 3.4.0)
• Updating chardet (5.1.0 -> 5.2.0)
• Updating charset-normalizer (3.2.0 -> 3.3.2)
• Updating click (8.1.5 -> 8.1.7)
• Updating contourpy (1.1.0 -> 1.1.1)
• Updating coverage (7.2.7 -> 7.4.3)
• Updating cryptography (41.0.4 -> 42.0.5)
• Updating cycler (0.11.0 -> 0.12.1)
• Updating distlib (0.3.6 -> 0.3.8)
• Updating dparse (0.6.3 -> 0.6.4b0)
• Updating filelock (3.12.4 -> 3.13.1)
• Updating fonttools (4.41.0 -> 4.49.0)
• Updating furo (2023.9.10 -> 2024.1.29)
• Updating identify (2.5.29 -> 2.5.35)
• Updating idna (3.4 -> 3.6)
• Updating isort (5.12.0 -> 5.13.2)
• Updating jinja2 (3.1.2 -> 3.1.3)
• Updating markupsafe (2.1.3 -> 2.1.5)
• Updating matplotlib (3.7.2 -> 3.7.5)
• Updating mypy (1.4.1 -> 1.8.0)
• Updating opencv-python (4.8.1.78 -> 4.9.0.80)
• Updating packaging (23.1 -> 23.2)
• Updating pathspec (0.11.1 -> 0.12.1)
• Updating pbr (5.11.1 -> 6.0.0)
• Updating pillow (10.0.0 -> 10.2.0)
• Updating platformdirs (3.8.1 -> 4.2.0)
• Updating pluggy (1.2.0 -> 1.4.0)
• Updating pre-commit (3.4.0 -> 3.5.0)
• Updating pre-commit-hooks (4.4.0 -> 4.5.0)
• Updating pygments (2.15.1 -> 2.17.2)
• Updating pyparsing (3.0.9 -> 3.1.1)
• Updating pypdf (3.12.1 -> 3.17.4)
• Updating pytest (7.4.0 -> 8.0.2)
• Updating pytz (2023.3 -> 2024.1)
• Updating pyyaml (6.0 -> 6.0.1)
• Updating rich (13.4.2 -> 13.7.0)
• Updating ruamel-yaml (0.17.32 -> 0.18.6)
• Updating ruamel-yaml-clib (0.2.7 -> 0.2.8)
• Updating safety (2.3.4 -> 3.0.1)
• Updating setuptools (68.0.0 -> 69.1.1)
• Updating soupsieve (2.4.1 -> 2.5)
• Updating sphinx (7.0.1 -> 7.1.2)
• Updating sphinx-click (4.4.0 -> 5.1.0)
• Updating stevedore (5.1.0 -> 5.2.0)
• Updating tokenize-rt (5.1.0 -> 5.2.0)
• Updating tornado (6.3.3 -> 6.4)
• Updating typeguard (4.0.0 -> 4.1.5)
• Updating typing-extensions (4.7.1 -> 4.10.0)
• Updating urllib3 (2.0.3 -> 2.2.1)
• Updating virtualenv (20.24.0 -> 20.25.1)
• Updating xdoctest (1.1.1 -> 1.1.3)
Fix the case where pdftopng is not found when Python is executed directly from an un-activated environment.
Fix safety issues by updating the lockfile
428cb18 to a06796b
Parse in parallel with the multiprocessing library, using the available CPUs
a06796b to e3cd4d9
Moved to py-pdf#17
Closes #20
Parse pages in parallel using the multiprocessing library, leveraging all available CPUs.
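As a rough sketch of the idea (not the implementation in this PR), per-page work can be handed to a `multiprocessing.Pool` sized to the CPU count. The names `parse_pages` and `fake_parse_page` are hypothetical; `fake_parse_page` only stands in for whatever function extracts tables from a single page.

```python
import multiprocessing as mp
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def fake_parse_page(page: int) -> str:
    """Hypothetical stand-in for the real per-page parsing function."""
    return "tables from page %d" % page


def parse_pages(
    pages: Sequence[int],
    parse_page: Callable[[int], T],
    processes: Optional[int] = None,
) -> List[T]:
    """Apply parse_page to every page, in parallel when it is worthwhile.

    parse_page must be defined at module level so that it can be pickled
    and sent to the worker processes.
    """
    if processes is None:
        processes = mp.cpu_count()
    # Never start more workers than there are pages, and always at least one.
    processes = max(1, min(processes, len(pages)))
    if processes == 1:
        # A single worker gains nothing from a pool, so run serially.
        return [parse_page(page) for page in pages]
    with mp.Pool(processes=processes) as pool:
        # Pool.map returns results in the same order as the input pages.
        return pool.map(parse_page, pages)
```

A call such as `parse_pages(list(range(1, 456)), fake_parse_page)` should sit under an `if __name__ == "__main__":` guard, because platforms that start workers with the spawn method re-import the main module in every worker.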