[WIP] Add support for parsing PDF pages in parallel #237

Closed
wants to merge 10 commits into from

@phoewass
Contributor

phoewass commented May 1, 2021

Closes #20

Parse pages in parallel using the multiprocessing library, leveraging all available CPUs.

Checklist:

  • Process in parallel using the library
  • Tests to process with and without parallel option
  • Process in parallel using the CLI
  • Update documentation
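
A minimal sketch of the idea described above, not the code in this PR: pages are handed to a `multiprocessing.Pool`, which by default starts one worker per available CPU. The names `parse_page` and `parse_pages` are hypothetical stand-ins, and the page result is a placeholder dictionary; only the `parallel` argument comes from this thread.

```python
import multiprocessing as mp


def parse_page(page_number):
    """Hypothetical stand-in for per-page table extraction."""
    # A real parser would open the PDF and extract tables from this page.
    return {"page": page_number, "tables": []}


def parse_pages(page_numbers, parallel=False):
    """Parse pages one by one, or spread them across worker processes."""
    if parallel and len(page_numbers) > 1:
        # Pool() with no argument uses os.cpu_count() worker processes.
        with mp.Pool() as pool:
            # map() returns results in the same order as the input pages.
            return pool.map(parse_page, page_numbers)
    return [parse_page(p) for p in page_numbers]


if __name__ == "__main__":
    print(parse_pages([1, 2, 3], parallel=True))
```

The worker function is defined at module level so that it can be pickled and sent to the worker processes, and the `__main__` guard keeps the example safe on platforms that start workers by spawning a fresh interpreter.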

@codecov-commenter

Codecov Report

Merging #237 (63161fe) into master (7709e58) will increase coverage by 0.07%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #237      +/-   ##
==========================================
+ Coverage   88.35%   88.42%   +0.07%     
==========================================
  Files          14       14              
  Lines        1571     1581      +10     
  Branches      358      359       +1     
==========================================
+ Hits         1388     1398      +10     
  Misses        128      128              
  Partials       55       55              
| Impacted Files | Coverage Δ |
|---|---|
| camelot/io.py | 100.00% <ø> (ø) |
| camelot/handlers.py | 91.66% <100.00%> (+0.96%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7709e58...63161fe.

@vinayak-mehta
Member

Why do we need to pass `parallel` as an argument to each test?

@phoewass
Contributor Author

I did this to avoid duplicating the tests and to make sure that the new `parallel` argument does not break the API.

Since the existing code overwrites `layout` and `dim` in each iteration,
it is much more efficient to simply return the `layout` and `dim` of the
first page.

I have tested the difference with a 455 page pdf and the optimisation
reduces the time spent from 50 to 5 seconds.

Signed-off-by: Karl Bonde Torp
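
The change described in this commit message can be illustrated with a toy example. This is only a sketch: `compute_layout` is a hypothetical placeholder for an expensive per-page layout analysis, and the returned values are made up for illustration. Note that the "before" version ends up with the values of the last page, while the "after" version returns those of the first page.

```python
def compute_layout(page):
    """Hypothetical stand-in for an expensive per-page layout analysis."""
    return f"layout-of-{page}", (612, 792)


def layout_before(pages):
    # Each iteration overwrites layout and dim, so the work done for
    # every page except the last one is thrown away.
    layout, dim = None, None
    for page in pages:
        layout, dim = compute_layout(page)
    return layout, dim


def layout_after(pages):
    # Analyse only the first page and return its layout and dim.
    return compute_layout(pages[0])
```

With N pages, the first version performs N layout computations and the second performs one, which is consistent with the kind of speed-up reported for the 455 page PDF.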
@maxdd

maxdd commented May 6, 2022

Will this ever be merged, or does camelot already use multiprocessing?

@jgcmarins

What is missing for this to be merged?

@hashangayasri

Is this still WIP? What's preventing this from being merged?

@MartinThoma
Collaborator

Currently, there are merge conflicts that first need to be resolved.

#353 might also get merged, which makes me uncertain how to proceed with all the others.

@MartinThoma
Collaborator

Hey!

As camelot is dead, we are trying to build a maintained fork at pypdf_table_extraction.

Do you want to open the PR against that branch so that we can merge your improvement?

foarsitter and others added 4 commits February 28, 2024 08:38
  • Installing authlib (1.3.0)
  • Installing marshmallow (3.21.0)
  • Installing pydantic (1.10.14)
  • Installing safety-schemas (0.0.2)
  • Installing typer (0.9.0)
  • Removing gitdb (4.0.10)
  • Removing gitpython (3.1.37)
  • Removing smmap (5.0.0)
  • Updating attrs (23.1.0 -> 23.2.0)
  • Updating babel (2.12.1 -> 2.14.0)
  • Updating bandit (1.7.5 -> 1.7.7)
  • Updating beautifulsoup4 (4.12.2 -> 4.12.3)
  • Updating black (23.7.0 -> 24.2.0)
  • Updating certifi (2023.7.22 -> 2024.2.2)
  • Updating cffi (1.15.1 -> 1.16.0)
  • Updating cfgv (3.3.1 -> 3.4.0)
  • Updating chardet (5.1.0 -> 5.2.0)
  • Updating charset-normalizer (3.2.0 -> 3.3.2)
  • Updating click (8.1.5 -> 8.1.7)
  • Updating contourpy (1.1.0 -> 1.1.1)
  • Updating coverage (7.2.7 -> 7.4.3)
  • Updating cryptography (41.0.4 -> 42.0.5)
  • Updating cycler (0.11.0 -> 0.12.1)
  • Updating distlib (0.3.6 -> 0.3.8)
  • Updating dparse (0.6.3 -> 0.6.4b0)
  • Updating filelock (3.12.4 -> 3.13.1)
  • Updating fonttools (4.41.0 -> 4.49.0)
  • Updating furo (2023.9.10 -> 2024.1.29)
  • Updating identify (2.5.29 -> 2.5.35)
  • Updating idna (3.4 -> 3.6)
  • Updating isort (5.12.0 -> 5.13.2)
  • Updating jinja2 (3.1.2 -> 3.1.3)
  • Updating markupsafe (2.1.3 -> 2.1.5)
  • Updating matplotlib (3.7.2 -> 3.7.5)
  • Updating mypy (1.4.1 -> 1.8.0)
  • Updating opencv-python (4.8.1.78 -> 4.9.0.80)
  • Updating packaging (23.1 -> 23.2)
  • Updating pathspec (0.11.1 -> 0.12.1)
  • Updating pbr (5.11.1 -> 6.0.0)
  • Updating pillow (10.0.0 -> 10.2.0)
  • Updating platformdirs (3.8.1 -> 4.2.0)
  • Updating pluggy (1.2.0 -> 1.4.0)
  • Updating pre-commit (3.4.0 -> 3.5.0)
  • Updating pre-commit-hooks (4.4.0 -> 4.5.0)
  • Updating pygments (2.15.1 -> 2.17.2)
  • Updating pyparsing (3.0.9 -> 3.1.1)
  • Updating pypdf (3.12.1 -> 3.17.4)
  • Updating pytest (7.4.0 -> 8.0.2)
  • Updating pytz (2023.3 -> 2024.1)
  • Updating pyyaml (6.0 -> 6.0.1)
  • Updating rich (13.4.2 -> 13.7.0)
  • Updating ruamel-yaml (0.17.32 -> 0.18.6)
  • Updating ruamel-yaml-clib (0.2.7 -> 0.2.8)
  • Updating safety (2.3.4 -> 3.0.1)
  • Updating setuptools (68.0.0 -> 69.1.1)
  • Updating soupsieve (2.4.1 -> 2.5)
  • Updating sphinx (7.0.1 -> 7.1.2)
  • Updating sphinx-click (4.4.0 -> 5.1.0)
  • Updating stevedore (5.1.0 -> 5.2.0)
  • Updating tokenize-rt (5.1.0 -> 5.2.0)
  • Updating tornado (6.3.3 -> 6.4)
  • Updating typeguard (4.0.0 -> 4.1.5)
  • Updating typing-extensions (4.7.1 -> 4.10.0)
  • Updating urllib3 (2.0.3 -> 2.2.1)
  • Updating virtualenv (20.24.0 -> 20.25.1)
  • Updating xdoctest (1.1.1 -> 1.1.3)
Fix situation where pdftopng is not found when executing python directly from an un-activated environment.
Fix safety issues by updating the lockfile
@phoewass force-pushed the feature/parallel branch 2 times, most recently from 428cb18 to a06796b March 29, 2024 03:38
@phoewass
Contributor Author

Moved to py-pdf#17

Development

Successfully merging this pull request may close these issues.

Use multiprocessing to parallely process PDF pages
10 participants