New line #12

Open
mefiu opened this issue Jun 29, 2017 · 1 comment


mefiu commented Jun 29, 2017

While parsing tabular data, a new line is emitted every time this condition is met:
if ( textYPosition > previousTextYPosition )

This is too sensitive when a table row contains two different font sizes.
The difference does not have to be large: a single point of font size is enough for the existing function
getNumberOfNewLinesFromPreviousTextPosition()
to emit a new line, which breaks the text output.

I've modified this function to use a simple threshold when checking for a new line:
if ( textYPosition - previousTextYPosition > newLineHeightThreshold )
and now it works perfectly.
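
To make the difference between the two checks concrete, here is a minimal Java sketch. It only reuses the variable names quoted above; the class name, the method names, the sample Y values and the threshold of 2.0 are assumptions chosen for illustration, not code from the library.

```java
// Hypothetical sketch: only textYPosition, previousTextYPosition and
// newLineHeightThreshold come from the issue; everything else is assumed.
final class NewLineThresholdSketch {

    // Original check quoted in the issue: any increase in Y starts a new line.
    static boolean isNewLineStrict(float textYPosition, float previousTextYPosition) {
        return textYPosition > previousTextYPosition;
    }

    // Modified check: Y must increase by more than the threshold.
    static boolean isNewLineWithThreshold(float textYPosition,
                                          float previousTextYPosition,
                                          float newLineHeightThreshold) {
        return textYPosition - previousTextYPosition > newLineHeightThreshold;
    }

    public static void main(String[] args) {
        float previousY = 100.0f;
        float sameRowOtherFontY = 101.0f; // assumed: same row, slightly shifted Y
        float nextRowY = 112.0f;          // assumed: a real next row
        float threshold = 2.0f;           // assumed value, must be tuned per document

        System.out.println("strict, same row:    " + isNewLineStrict(sameRowOtherFontY, previousY));
        System.out.println("threshold, same row: " + isNewLineWithThreshold(sameRowOtherFontY, previousY, threshold));
        System.out.println("threshold, next row: " + isNewLineWithThreshold(nextRowY, previousY, threshold));
    }
}
```

With a threshold of zero the second check behaves exactly like the first, so the threshold only relaxes the original condition.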

BTW: great job with this little class, Jonathan :)

@JonathanLink
Owner

Thank you very much for reporting this, and I appreciate that you like the class.
I am going to see what I can do. The difficulty is finding a good threshold. Maybe I could add an optional parameter to the PDFLayoutTextStripper constructor to let the user define such a threshold.
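
One way the optional constructor parameter could look is sketched below. This is not the actual PDFLayoutTextStripper API: the class ThresholdedStripperSketch, its field and its isNewLine method are hypothetical stand-ins that only show the overloaded-constructor pattern, with a default of 0 assumed so that existing callers keep the current behavior.

```java
// Hypothetical stand-in for the real class; it shows only the constructor
// pattern and the new-line check, not any PDF parsing.
final class ThresholdedStripperSketch {

    // Assumed default: 0 keeps the original "textY > previousTextY" behavior.
    private static final float DEFAULT_NEW_LINE_HEIGHT_THRESHOLD = 0.0f;

    private final float newLineHeightThreshold;

    // Existing callers would not need to change anything.
    ThresholdedStripperSketch() {
        this(DEFAULT_NEW_LINE_HEIGHT_THRESHOLD);
    }

    // Optional parameter: the user picks a threshold suited to the document.
    ThresholdedStripperSketch(float newLineHeightThreshold) {
        if (newLineHeightThreshold < 0.0f) {
            throw new IllegalArgumentException("threshold must not be negative");
        }
        this.newLineHeightThreshold = newLineHeightThreshold;
    }

    // The check from the issue, using the configured threshold.
    boolean isNewLine(float textYPosition, float previousTextYPosition) {
        return textYPosition - previousTextYPosition > newLineHeightThreshold;
    }

    public static void main(String[] args) {
        ThresholdedStripperSketch strict = new ThresholdedStripperSketch();
        ThresholdedStripperSketch tolerant = new ThresholdedStripperSketch(2.0f); // assumed value
        System.out.println("default:  " + strict.isNewLine(101.0f, 100.0f));
        System.out.println("tolerant: " + tolerant.isNewLine(101.0f, 100.0f));
    }
}
```

Leaving the choice to the caller sidesteps the problem of finding one threshold that fits every document, at the cost of requiring some tuning per layout.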
