Layout issue when characters are not overlapping but below the 5.5 newline threshold #50

Open
KaranLala opened this issue Jul 26, 2024 · 0 comments

Comments

@KaranLala

In some cases, the layout places text out of position, as you can see in the images below. The 68.7% value should appear before the 70.1% value, but it is placed after it.
[Image: Screenshot 2024-07-25 180350]

[Image: test]

I've dug into the code a bit, but not all the way, because I found a solution that is working for me so far, so bear with me.

The TextPosition list returned by PDFTextStripper contains both values in the same list (row) even though they don't overlap, which I assume is due to some height-difference threshold. The TextPosition values for the 7 in 70.1% and the 6 in 68.7% are listed below. Given those values, TextPositionComparator sorts the 68.7% after the 70.1%: the two do not overlap vertically, and their y difference exceeds the comparator's 0.1 threshold, so they are ordered by y rather than by x. PDFLayoutTextStripper, however, only creates a new line when there is a y difference of 5.5 between the y coordinates of two adjacent TextPositions. Here the values do not overlap and their y difference (about 5.0) is greater than 0.1 but less than 5.5, so they end up on the same line, out of position.

c   idx   x         y          xadj      yadj        ystart      height
7   573   573.973   303.4036   573.973   303.4036    298.9431    4.4604654
6   525   525.895   308.4053   525.895   308.40533   303.94485   4.4604654
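
To make the arithmetic concrete, here is a small check of the two rows above against both thresholds. It is only a sketch: the `Glyph` class and the method names are made up for illustration and are not part of PDFBox or PDFLayoutTextStripper, and it assumes that `ystart` is simply `y - height`, which matches the numbers in the table.

```java
class ThresholdCheck {
    // Illustrative holder for one table row; not a PDFBox type.
    static final class Glyph {
        final double x, y, height;
        Glyph(double x, double y, double height) {
            this.x = x;
            this.y = y;
            this.height = height;
        }
        // Assumption: the ystart column equals y - height.
        double yStart() { return y - height; }
    }

    static final double SORT_THRESHOLD = 0.1;      // comparator threshold quoted above
    static final double NEW_LINE_THRESHOLD = 5.5;  // new-line threshold quoted above

    // True when the vertical extents [yStart, y] of the two glyphs intersect.
    static boolean overlapsVertically(Glyph a, Glyph b) {
        return a.yStart() < b.y && b.yStart() < a.y;
    }

    static double yDifference(Glyph a, Glyph b) {
        return Math.abs(a.y - b.y);
    }

    public static void main(String[] args) {
        Glyph seven = new Glyph(573.973, 303.4036, 4.4604654); // the 7 in 70.1%
        Glyph six = new Glyph(525.895, 308.4053, 4.4604654);   // the 6 in 68.7%

        double diff = yDifference(seven, six);
        System.out.println("overlap:             " + overlapsVertically(seven, six));
        System.out.println("above sort threshold: " + (diff > SORT_THRESHOLD));
        System.out.println("below new-line gap:   " + (diff < NEW_LINE_THRESHOLD));
    }
}
```

All three conditions from the description hold for these two characters: no vertical overlap, a y difference of roughly 5.0 that is larger than 0.1, and the same difference being smaller than 5.5.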

Arguably, the real fix belongs in TextLine.computeIndexForCharacter, which should be able to place a value at an earlier index on the same line. However, I was able to solve the problem by changing the TextPositionComparator threshold from 0.1 to 5.5 (comparing with <= 5.5). That way, characters that belong on the same line are sorted correctly by their x values.
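
The effect of the threshold change can be sketched with a simplified comparator. This is not the actual TextPositionComparator code (which I have not reproduced here and which may also take things like text direction into account); `Glyph`, `byLineThenX`, and `order` are hypothetical names, and the coordinates are those of the first character of each value from the table above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ToleranceSortSketch {
    // Illustrative stand-in for a TextPosition; not a PDFBox type.
    static final class Glyph {
        final String label;
        final double x, y;
        Glyph(String label, double x, double y) {
            this.label = label;
            this.x = x;
            this.y = y;
        }
    }

    // Glyphs whose y values differ by at most yTolerance are treated as being
    // on the same line and ordered by x; otherwise they are ordered by y.
    static Comparator<Glyph> byLineThenX(double yTolerance) {
        return (a, b) -> {
            if (Math.abs(a.y - b.y) <= yTolerance) {
                return Double.compare(a.x, b.x);
            }
            return Double.compare(a.y, b.y);
        };
    }

    // Sorts the two values from the issue and returns their labels in order.
    static String order(double yTolerance) {
        List<Glyph> glyphs = new ArrayList<>(Arrays.asList(
                new Glyph("70.1%", 573.973, 303.4036),
                new Glyph("68.7%", 525.895, 308.4053)));
        glyphs.sort(byLineThenX(yTolerance));
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(g.label);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(order(0.1)); // prints "70.1% 68.7%"
        System.out.println(order(5.5)); // prints "68.7% 70.1%"
    }
}
```

With the tolerance at 0.1 the two values are ordered by y and come out reversed; with 5.5 they are treated as one line and ordered by x. One caveat with this kind of comparator: a tolerance-based "same line" test is not transitive in general, so on larger inputs it may not define a consistent ordering.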

I'm not sure if this library is actively maintained, but I hope this helps anyone that runs into the same issue.
