Preserve Links and Text related to Public Dataset #53

Merged (6 commits) on Sep 19, 2024

Conversation

Precious-Macaulay (Contributor) commented Sep 14, 2024

See #51

I updated the text_extraction.xsl stylesheet to keep external links without affecting readability. I also looked through more articles to see where text about public datasets tends to appear, and updated the stylesheet to handle those cases accordingly. I have attached a copy of the text I extracted from some of the articles I gathered, including the example from the issue.
text.csv
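
For readers unfamiliar with the idea, here is a minimal sketch of what "keeping external links" during text extraction can look like. It is not the PR's stylesheet: it is a Python illustration using only the standard library, and it assumes JATS-style `ext-link` elements with an `xlink:href` attribute. The function name `text_with_links` and the sample paragraph are hypothetical.

```python
import xml.etree.ElementTree as ET

# Attribute name for xlink:href once ElementTree has expanded the namespace.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def text_with_links(element):
    """Flatten an element to plain text, appending each ext-link URL after its label."""
    parts = [element.text or ""]
    for child in element:
        parts.append(text_with_links(child))
        if child.tag == "ext-link":
            href = child.get(XLINK_HREF)
            # Skip the URL when it is already the visible link text.
            if href and href not in parts[-1]:
                parts.append(f" ({href})")
        parts.append(child.tail or "")
    return "".join(parts)


# Hypothetical sample paragraph, not taken from any article in the PR.
SAMPLE = (
    '<p xmlns:xlink="http://www.w3.org/1999/xlink">Data are available on '
    '<ext-link ext-link-type="uri" xlink:href="https://example.org/dataset">'
    "the project page</ext-link>.</p>"
)

flattened = text_with_links(ET.fromstring(SAMPLE))
print(flattened)
```

Putting the URL in parentheses after the link label keeps the sentence readable while preserving the dataset location in the extracted text. The actual stylesheet may format links differently.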

agt24 commented Sep 17, 2024

Thanks @Precious-Macaulay
I'll review this in the next day or so.

Precious-Macaulay (Contributor, Author)

Alright

agt24 commented Sep 18, 2024

I just reviewed and tested this PR on a couple of PMCIDs. Here are the output files:
17048730_text.csv
17048739_text.csv

These outputs work well for my use case and resolved the issue in #53. They also look fine in LabelBuddy.

@adelavega and @jeromedockes, is there any additional testing you want to do, or are you happy to merge this PR into main?

jeromedockes (Member) left a comment

Thank you very much, @Precious-Macaulay!
LGTM

src/pubget/_data/stylesheets/text_extraction.xsl (outdated, resolved)
jeromedockes merged commit 12d1451 into neuroquery:main on Sep 19, 2024
5 of 12 checks passed
3 participants