News #45
Conversation
Am I correct in assuming that this is related to #44? Can you list in the PR which sources you have included and which you plan to do?
Yes, it's related. I'm planning to get through all the websites mentioned. I currently have ProPublica and Democracy Now.
Can you add the (estimated) number of documents and number of tokens to the README?
What would be the best way to estimate the number of tokens? Would sampling a number of pages and calculating the tokens from there be valid?
Yup!
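For reference, a minimal sketch of that sampling approach (the tokenizer choice, sample size, and file path here are illustrative assumptions, not necessarily what the PR uses):

import gzip
import json
import random

from transformers import AutoTokenizer

# Hypothetical: estimate tokens from the dolma-formatted output of get_text.py.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
with gzip.open("data/news/news-caravanserai.jsonl.gz", "rt") as f:
    docs = [json.loads(line)["text"] for line in f]

# Tokenize a random sample and extrapolate the mean to the full set.
sample = random.sample(docs, min(100, len(docs)))
mean_tokens = sum(len(tokenizer.encode(d)) for d in sample) / len(sample)
print(f"{len(docs)} documents, ~{mean_tokens * len(docs):,.0f} tokens (estimated)")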
Hi! Sorry for the slow review, feel free to ping me in the future!
Thanks for the PR, it handles a lot!
It does feel unfortunate to need a script that enumerates all the pages and their licenses, but I don't have a great idea on how to do that better.
Right now, the conversion to the dolma format and the parsing of the HTML to raw text happen at once. Do you think it would make sense to break those up? We could have a raw dolma dataset that contains the HTML pages, with the text parsing as a secondary job that uses dolma's parallelization. I guess it depends on whether we think the HTML parsing is going to change.
Do you have some example outputs you could share?
Again, thanks for your work on this!
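For what it's worth, a rough sketch of that two-stage split (the id/text/source/metadata fields follow the dolma JSONL convention; the function names and everything else here are hypothetical):

import gzip
import json

from bs4 import BeautifulSoup

def write_raw_dolma(pages, path):
    # Stage 1 (hypothetical): store the raw HTML itself as dolma records,
    # so re-parsing never requires re-crawling.
    with gzip.open(path, "wt") as f:
        for page_id, url, html in pages:
            record = {
                "id": page_id,
                "text": html,
                "source": "news",
                "metadata": {"url": url},
            }
            f.write(json.dumps(record) + "\n")

def parse_record(record):
    # Stage 2 (hypothetical): a per-record map that dolma-style sharded
    # parallelization could run over the raw dataset.
    soup = BeautifulSoup(record["text"], "html.parser")
    record["text"] = soup.get_text(" ", strip=True)
    return record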
python news/get_page.py --url https://www.thepublicrecord.ca/ --output_dir data/news/raw/thepublicrecord/

# Public Domain
python news/get_page.py --url https://central.asia-news.com/en_GB/ --output_dir data/news/raw/caravanserai/
Can you add trailing newlines to these files? Otherwise things like this diff will keep complaining lol
python news/get_text.py --license "Public Domain" --input_dir data/news/raw/caravanserai/ --filename news-caravanserai.jsonl.gz --output_dir data/news/ --tag div --attrs '{"class": "article__content"}'

# CC NC ND
# python news/get_text.py --license "CC NC ND" --input_dir data/news/raw/projectmultatuli/ --filename news-projectmultatuli.jsonl.gz --output_dir data/news/ --tag div --attrs '{"class": "elementor-widget-container"}'
Same as above
parser = argparse.ArgumentParser(description="Download News Sites")
parser.add_argument(
    "--url", default="https://www.propublica.org/", help="Base URL"
Would it be better to have no default value and make this flag required?
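For example, a one-line sketch assuming argparse as in the diff:

parser.add_argument("--url", required=True, help="Base URL")

With required=True, argparse errors out when the flag is missing instead of silently scraping the default site.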
)
parser.add_argument(
    "--index_path",
    default=None,
You don't need to set the default to None; it is already that.
help="File that list of all pages", | ||
) | ||
parser.add_argument( | ||
"--output_dir", |
Same question as for --url
text = [soup.title.get_text() if soup.title else ""]

# Search for author
if byline := soup.find(class_=re.compile("byline")):
These all seem to do a regex search and then run the same code to process the result; can it be simplified with something like:
if byline := soup.find(class_=re.compile(r"(byline|post-author|posted-by|...)")):
    ...
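Spelled out, a sketch of that consolidation (the class list is guessed from the branches visible in this diff; there may be more variants):

# One alternation covering all the author-byline class spellings.
AUTHOR_CLASS_RE = re.compile(r"byline|post-author|posted-by")

if byline := soup.find(class_=AUTHOR_CLASS_RE):
    text.append(byline.get_text().strip())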
# Search for author
if byline := soup.find(class_=re.compile("byline")):
    text.append(byline.get_text().strip())
elif byline := soup.find(class_=re.compile("post-author")):
I confirmed it myself, but can you add a comment like
# Calling `re.compile(...)` repeatedly in this processing function (once for each document) is ok because the resulting compiled re object is cached and reused
Otherwise I'm going to forget and wonder if repeated calls are an issue in the future lol.
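A quick way to convince yourself of that caching (this relies on a CPython implementation detail: re keeps an internal cache of compiled patterns):

import re

# Repeated compiles of the same pattern return the same cached object.
assert re.compile("byline") is re.compile("byline")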
    text.append(byline.get_text().strip())

# Search for dateline
if dateline := soup.find("time", class_=re.compile("title")):
Same simplification question as above
elif child.name == "br": | ||
split_p.append("".join(text_pieces)) | ||
text_pieces = [] | ||
elif child.name == "em": |
Can the rest of this conditional be simplified to something like the following?
...
elif child.name in ("em", "a", ...):
    text_pieces.append(child.get_text())
And maybe the in check should be against a variable like FORMATTED_STRING_TAGS or something?
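A sketch of what that could look like in context (the tag tuple and the surrounding loop are reconstructed guesses, not the actual code):

# Hypothetical constant; the exact set depends on which inline
# formatting tags the parser needs to flatten into plain text.
FORMATTED_STRING_TAGS = ("em", "a", "strong", "i", "b")

def split_on_breaks(paragraph):
    # Split a <p> tag's children into text runs, breaking on <br>.
    split_p, text_pieces = [], []
    for child in paragraph.children:
        if child.name == "br":
            split_p.append("".join(text_pieces))
            text_pieces = []
        elif child.name in FORMATTED_STRING_TAGS:
            # append, not extend: extending with a string adds it one
            # character at a time (same joined output, but misleading).
            text_pieces.append(child.get_text())
        elif child.name is None:
            # Plain NavigableString between tags.
            text_pieces.append(str(child))
    if text_pieces:
        split_p.append("".join(text_pieces))
    return split_p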
text_article = [
    article_paragraph
    for s in split_p
    if is_valid(article_paragraph := s.strip()) and s.strip()
I think you can avoid the double strip here with
...
if is_valid(article_paragraph := s.strip()) and article_paragraph
...
Or you could push the empty check into the is_valid function.
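A sketch of that second option, pushing the emptiness check into is_valid (the body shown is hypothetical; the real function presumably has more checks):

def is_valid(paragraph):
    # Reject empty strings here so callers only need to strip once.
    if not paragraph:
        return False
    return True  # ...plus the existing validity checks

text_article = [
    article_paragraph
    for s in split_p
    if is_valid(article_paragraph := s.strip())
]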
Sorry for the delay. Will get to it this week.
@lintangsutawika any bandwidth to finish this up?
I made all the changes I had talked about, and I separated the scraping steps to make the code a bit simpler in PR #68. We should be able to merge that one.
Development moved to this branch for easier collaboration: #68
Sites included (source: https://opennewswire.org/feed/):
CC BY
CC BY-SA
Public Domain
Voice of America was intentionally left out so that it does not overlap with MOT.