jazzpi commented May 5, 2021

Addresses #21.

I've never worked with PDFBox before, so I hope this is the right approach. It works at least for this file: without the color filter, some of the hyperlink underlines are detected as rulings, which splits those rows. However, it doesn't work for this file, where every cell is detected as separate without the color filter. With the color filter, it exports the following CSV:

A,B
4","2
5,6

Is this an issue with the color filter or is it related to the red and black lines crossing?

Other notes:

  • I'm not sure if this is a good way to pass the line color filter argument to the ObjectExtractorStreamEngine.
  • I haven't added tests yet. I should hopefully have some time next week to debug further and add them.
  • I couldn't come up with a sensible short-style command line option, so I only added a long-style one.
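The color-filter idea described above can be sketched roughly as follows. This is not the PR's code and not tabula-java's API: the `Segment` type, the `filter_by_color` and `is_black` helpers, the tolerance value, and the assumption that stroke colors are RGB components normalized to 0..1 are all hypothetical, chosen only to illustrate dropping non-matching lines before ruling detection.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

RGB = Tuple[float, float, float]  # assumed: components normalized to 0..1


@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in for a stroked line found on a PDF page."""
    x1: float
    y1: float
    x2: float
    y2: float
    stroke: RGB


def is_black(color: RGB, tolerance: float = 0.05) -> bool:
    # Treat a color as black when every component is close to zero.
    return all(component <= tolerance for component in color)


def filter_by_color(segments: List[Segment],
                    keep: Callable[[RGB], bool]) -> List[Segment]:
    # Keep only the segments whose stroke color passes the predicate,
    # so that e.g. colored hyperlink underlines never become rulings.
    return [s for s in segments if keep(s.stroke)]


segments = [
    Segment(0, 0, 100, 0, (0.0, 0.0, 0.0)),    # black table border
    Segment(10, 20, 40, 20, (0.0, 0.0, 1.0)),  # blue hyperlink underline
]
rulings = filter_by_color(segments, is_black)
print(len(rulings))  # only the black border survives the filter
```

Passing the predicate as an argument keeps the filter independent of how the color is specified on the command line, which is one possible answer to the first note above.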

jazzpi commented May 11, 2021

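... so after a couple of hours of debugging I just realized that this happens because the line breaks used by tabula-java are carriage returns instead of line feeds. When the output is displayed, each carriage return moves the cursor back to the start of the line, so the text that follows overwrites the beginning of that line. The extraction itself actually works just fine.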

I've added a test as well and think this is ready for review.
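To illustrate the carriage-return effect described above, here is a minimal sketch that approximates how a terminal displays a single line containing carriage returns. The `render_with_carriage_returns` helper and the sample string are hypothetical and not taken from the PR or from the CSV shown earlier; real terminals have more behaviors than this simulates.

```python
def render_with_carriage_returns(text: str) -> str:
    # Approximate what a terminal shows for one line of output:
    # a carriage return moves the cursor to column 0, and later
    # characters overwrite whatever was already displayed there.
    screen = []
    cursor = 0
    for ch in text:
        if ch == "\r":
            cursor = 0
            continue
        if cursor < len(screen):
            screen[cursor] = ch  # overwrite an existing character
        else:
            screen.append(ch)    # extend the line
        cursor += 1
    return "".join(screen)


raw = "abc\rX"
print(repr(raw))                          # the data actually written
print(render_with_carriage_returns(raw))  # what appears on screen
```

Inspecting the output with `repr` (or a hex dump) rather than printing it directly shows the full data, which is a quick way to tell a display artifact from a real extraction error.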
