Article scrape loses linked content #21

Open
dotdotdotpaul opened this issue Nov 21, 2016 · 1 comment
Comments

@dotdotdotpaul

The article scrape feature works remarkably well given how hard the task is, but I notice that when an article's content contains links, the text of those links isn't returned as part of the fulltext attribute. This produces some odd output: if the source says "For more information, click here." (where "click here" is linked), the fulltext comes back as "For more information, ."

This appears to extend to any marked-up text, as I'm noticing things like subheadings in articles are simply missing. I might have some time this coming weekend to look into a solution, but if anyone's got a head start on this problem, let me know.
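To make the symptom concrete, here is a minimal Python sketch of one way such output can arise. It is an illustration only, not the library's actual Elixir code: the function names and the sample markup are hypothetical. It contrasts removing a linked element together with its content against removing only the tags.

```python
import re

# Hypothetical sample markup mirroring the sentence quoted above.
SOURCE = '<p>For more information, <a href="/docs">click here</a>.</p>'


def drop_link_elements(html: str) -> str:
    """Lossy variant: delete each <a>...</a> element including its text."""
    without_links = re.sub(r"<a\b[^>]*>.*?</a>", "", html, flags=re.S)
    # Then remove whatever tags remain.
    return re.sub(r"<[^>]+>", "", without_links)


def drop_tags_only(html: str) -> str:
    """Text-preserving variant: delete the tags but keep what they wrap."""
    return re.sub(r"<[^>]+>", "", html)


print(drop_link_elements(SOURCE))  # link text is lost
print(drop_tags_only(SOURCE))      # link text is kept
```

If the scraper behaves like the first function for links and headings, that would explain both the missing "click here" and the missing subheadings.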

@Anonyfox
Owner

Yes, my very naive implementation seems to get the job done in many cases. I personally use the plaintext just for word analysis stuff, so I didn't even notice this behavior.

If you want to dig in a little, I believe the relevant line is this:

https://github.com/Anonyfox/elixir-scrape/blob/master/lib/scrape/util/text.ex#L22

where I just strip out everything that is HTML. I'd happily accept a PR!
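As a starting point for such a PR, the intended behavior could look roughly like the following sketch: walk the markup, keep every text node (including link and heading text), and skip only non-content elements. This is Python using the standard library `html.parser` for illustration; the class name, helper name, and tag sets are assumptions, and a real fix would need to be written in Elixir against the library's own HTML parser.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, keeping the content of links and headings."""

    # Assumed tag sets for this sketch; adjust to taste.
    SKIPPED = {"script", "style"}
    BLOCK = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._parts.append("\n")  # block boundaries become line breaks

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self._parts.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self) -> str:
        # Collapse runs of whitespace within lines and drop empty lines.
        lines = [" ".join(line.split()) for line in "".join(self._parts).splitlines()]
        return "\n".join(line for line in lines if line)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<h2>Background</h2>"
    '<p>For more information, <a href="/docs">click here</a>.</p>'
    "<script>var x = 1;</script>"
)
print(html_to_text(sample))
```

Inline elements such as links contribute their text in place, while block elements only add line breaks, so subheadings survive as separate lines.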
