In some cases, the layout seems to place text out of position, as you can see in the images below. The 68.7% value should come before the 70.1% value, but it is placed after it.
I've dug into the code a bit, but not all the way, because I was able to find a solution that is working for me so far, so bear with me.
The TextPosition list returned by PDFTextStripper contains both of these values in the same list (row) even though they don't overlap, I'm assuming because of a height-difference threshold. The TextPosition values for the 7 in 70.1% and the 6 in 68.7% are shown below. Given these values, the TextPositionComparator class sorts the 68.7% after the 70.1% because the two do not overlap and the y threshold for comparison is set to 0.1. PDFLayoutTextStripper creates a new line when there is a y difference of 5.5 between the y coordinates of two adjacent TextPositions. In this case the values do not overlap and their y difference (about 5.0) is greater than 0.1 but less than 5.5, so they end up on the same line, out of position.
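| c | idx | x | y | xadj | yadj | ystart | height |
|---|-----|---|---|------|------|--------|--------|
| 7 | 573 | 573.973 | 303.4036 | 573.973 | 303.4036 | 298.9431 | 4.4604654 |
| 6 | 525 | 525.895 | 308.4053 | 525.895 | 308.40533 | 303.94485 | 4.4604654 |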
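To make the arithmetic concrete, here is a small standalone check using the two y values reported above. The class and method names are hypothetical and nothing here calls PDFBox or PDFLayoutTextStripper; the two constants are simply the thresholds quoted in this issue.

```java
final class YGapCheck {
    // Thresholds as quoted in this issue (not read from the library itself).
    static final float COMPARATOR_Y_THRESHOLD = 0.1f;
    static final float NEW_LINE_Y_THRESHOLD = 5.5f;

    // Absolute vertical distance between two glyph baselines.
    static float yGap(float y1, float y2) {
        return Math.abs(y1 - y2);
    }

    // True when the gap is too large for the comparator to treat the glyphs
    // as one row, yet too small for the layout code to start a new line.
    static boolean fallsBetweenThresholds(float y1, float y2) {
        float gap = yGap(y1, y2);
        return gap > COMPARATOR_Y_THRESHOLD && gap < NEW_LINE_Y_THRESHOLD;
    }

    public static void main(String[] args) {
        float ySeven = 303.4036f; // y of the "7" in 70.1%
        float ySix = 308.4053f;   // y of the "6" in 68.7%
        System.out.println("y gap = " + yGap(ySeven, ySix));
        System.out.println("between thresholds: " + fallsBetweenThresholds(ySeven, ySix));
    }
}
```

The gap comes out at roughly 5.0, which is exactly the range where the two thresholds disagree about whether the glyphs share a line.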
It can be argued that the proper fix lies in the implementation of TextLine.computeIndexForCharacter, because it should be able to place a value on the same line at an earlier index. However, I was able to solve this by changing the TextPositionComparator threshold from 0.1 to 5.5, so that y differences of up to 5.5 are treated as the same line. This way, characters that belong on the same line are sorted correctly by their x values.
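The effect of the threshold change can be sketched with a simplified comparator. This is only an illustration modeled on the behavior described in this issue, not the actual TextPositionComparator source: the `Glyph` class, `byLineThenX`, and `order` are hypothetical names, and the overlap test is my own approximation based on the y and height columns above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ToleranceComparatorSketch {
    // Hypothetical stand-in for TextPosition: only the fields used here.
    static final class Glyph {
        final String text;
        final float x;
        final float y;      // bottom of the glyph
        final float height;

        Glyph(String text, float x, float y, float height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.height = height;
        }
    }

    // Glyphs whose y values differ by at most yTolerance, or whose vertical
    // extents overlap, are ordered by x; otherwise they are ordered by y.
    static Comparator<Glyph> byLineThenX(float yTolerance) {
        return (a, b) -> {
            float yDiff = Math.abs(a.y - b.y);
            boolean overlap = (b.y >= a.y - a.height && b.y <= a.y)
                    || (a.y >= b.y - b.height && a.y <= b.y);
            if (yDiff <= yTolerance || overlap) {
                return Float.compare(a.x, b.x);
            }
            return Float.compare(a.y, b.y);
        };
    }

    // Sorts the two glyphs from the table above and returns their order.
    static String order(float yTolerance) {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("7", 573.973f, 303.4036f, 4.4604654f));
        glyphs.add(new Glyph("6", 525.895f, 308.4053f, 4.4604654f));
        glyphs.sort(byLineThenX(yTolerance));
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("threshold 0.1: " + order(0.1f)); // 7 sorts first
        System.out.println("threshold 5.5: " + order(5.5f)); // 6 sorts first
    }
}
```

With the 0.1 threshold the 7 sorts ahead of the 6 because its y value is smaller; with 5.5 the pair is compared by x and the 6 comes first. One caveat: a tolerance-based comparison like this is not guaranteed to be transitive across many glyphs, so a large threshold could have side effects on other layouts.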
I'm not sure if this library is actively maintained, but I hope this helps anyone that runs into the same issue.