
News #45

Closed · wants to merge 58 commits into from

Conversation

@lintangsutawika (Collaborator) commented Jan 5, 2024

Sites included. Source: https://opennewswire.org/feed/

CC BY

  • 360info
  • Africa is a Country
  • Alt News
  • Balkan Diskurs
  • Factly
  • Freedom of the Press Foundation
  • Agenzia Fides
  • Global Voices
  • Meduza
  • Mekong Eye
  • Milwaukee Neighborhood News Service
  • Minority Africa
  • New Canadian Media
  • SciDev.Net
  • The Solutions Journalism Exchange
  • Tasnim News Agency
  • ZimFact

CC BY-SA

  • Liberty TV
  • Oxpeckers
  • Propastop
  • The Public Record

Public Domain

  • Caravanserai

Voice of America is intentionally left out so that it does not overlap with MOT.

@StellaAthena (Collaborator)

Am I correct in assuming that this is related to #44? Can you list in the PR which sources you have included and which you still plan to do?

@lintangsutawika (Collaborator, Author)

Yes, it's related. I'm planning to get through all the websites mentioned. I currently have ProPublica and Democracy Now.

@StellaAthena linked an issue on Jan 9, 2024 that may be closed by this pull request
@lintangsutawika lintangsutawika marked this pull request as ready for review January 15, 2024 08:59
@StellaAthena (Collaborator)

Can you add the (estimated) number of documents and number of tokens to the README?

@lintangsutawika (Collaborator, Author)

What would be the best way to estimate number of tokens? Would sampling a number of pages and calculating the tokens from there be valid?

@StellaAthena (Collaborator)

> What would be the best way to estimate number of tokens? Would sampling a number of pages and calculating the tokens from there be valid?

Yup!
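
A minimal sketch of that sampling estimate, assuming the documents are already in gzipped JSONL with a `text` field and using the Hugging Face GPT-2 tokenizer (both the file layout and the tokenizer choice are assumptions, not something the PR prescribes):

```python
import gzip
import json
import random

from transformers import AutoTokenizer

def estimate_tokens(path, sample_size=500, seed=0):
    """Estimate total tokens by tokenizing a random sample of documents."""
    with gzip.open(path, "rt") as f:
        docs = [json.loads(line)["text"] for line in f]
    sample = random.Random(seed).sample(docs, min(sample_size, len(docs)))
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    in_sample = sum(len(tokenizer.encode(t)) for t in sample)
    # Scale the per-document average up to the whole file.
    return round(in_sample / len(sample) * len(docs)), len(docs)

# Hypothetical output file name from this PR's get_text.py step.
tokens, n_docs = estimate_tokens("data/news/news-caravanserai.jsonl.gz")
print(f"~{tokens} tokens across {n_docs} documents")
```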

@blester125 (Collaborator) left a comment

Hi! Sorry for the slow review; feel free to ping me in the future!

Thanks for the PR, it handles a lot!

It does feel unfortunate to need a script that enumerates all the pages and their licenses, but I don't have a great idea of how to do that better.

Right now, the conversion to the dolma format and the parsing of the HTML to raw text happen at once. Do you think it would make sense to break those up? Have a raw dolma dataset that contains the HTML pages, and then make the text parsing a secondary job that uses dolma parallelization? I guess it depends on whether we think the HTML parsing is going to change.

Do you have some example outputs you could share?

Again, thanks for your work on this!

```sh
python news/get_page.py --url https://www.thepublicrecord.ca/ --output_dir data/news/raw/thepublicrecord/

# Public Domain
python news/get_page.py --url https://central.asia-news.com/en_GB/ --output_dir data/news/raw/caravanserai/
```
@blester125 (Collaborator)

Can you add trailing newlines to these files? Otherwise things like this diff will keep complaining lol

```sh
python news/get_text.py --license "Public Domain" --input_dir data/news/raw/caravanserai/ --filename news-caravanserai.jsonl.gz --output_dir data/news/ --tag div --attrs '{"class": "article__content"}'

# CC NC ND
# python news/get_text.py --license "CC NC ND" --input_dir data/news/raw/projectmultatuli/ --filename news-projectmultatuli.jsonl.gz --output_dir data/news/ --tag div --attrs '{"class": "elementor-widget-container"}'
```
@blester125 (Collaborator)

Same as above


```python
parser = argparse.ArgumentParser(description="Download News Sites")
parser.add_argument(
    "--url", default="https://www.propublica.org/", help="Base URL"
```
@blester125 (Collaborator)

Would it be better to have no default value and make this flag required?

```python
)
parser.add_argument(
    "--index_path",
    default=None,
```
@blester125 (Collaborator)

You don't need to set the default to None; that's already the default.

help="File that list of all pages",
)
parser.add_argument(
"--output_dir",
@blester125 (Collaborator)

Same question as for --url

```python
text = [soup.title.get_text() if soup.title else ""]

# Search for author
if byline := soup.find(class_=re.compile("byline")):
```
@blester125 (Collaborator)

These all seem to do a re search and then have the same code to process the result; can it be simplified with something like:

```python
if byline := soup.find(class_=re.compile(r"byline|post-author|posted-by|...")):
    ...
```
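
For reference, a self-contained version of that suggestion; the class-name alternatives and the HTML snippet are made up for illustration:

```python
import re

from bs4 import BeautifulSoup

# One compiled pattern covering the byline class variants
# (the alternatives listed here are illustrative, not exhaustive).
BYLINE_RE = re.compile(r"byline|post-author|posted-by")

soup = BeautifulSoup('<span class="post-author">By Jane Doe</span>', "html.parser")

text = []
if byline := soup.find(class_=BYLINE_RE):
    text.append(byline.get_text().strip())
print(text)  # ['By Jane Doe']
```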

```python
# Search for author
if byline := soup.find(class_=re.compile("byline")):
    text.append(byline.get_text().strip())
elif byline := soup.find(class_=re.compile("post-author")):
```
@blester125 (Collaborator)

I confirmed it myself, but can you add a comment like:

```python
# Calling `re.compile(...)` repeatedly in this processing function (once per
# document) is OK because the resulting compiled re object is cached and reused.
```

Otherwise I'm going to forget and wonder whether the repeated calls are an issue in the future lol.
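
For what it's worth, the caching can be checked directly in CPython, which keeps a module-level cache of compiled patterns (512 entries by default):

```python
import re

# Identical pattern + flags hit the cache, so the same object comes back.
assert re.compile("byline") is re.compile("byline")
```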

```python
    text.append(byline.get_text().strip())

# Search for dateline
if dateline := soup.find("time", class_=re.compile("title")):
```
@blester125 (Collaborator)

Same simplification question as above

elif child.name == "br":
split_p.append("".join(text_pieces))
text_pieces = []
elif child.name == "em":
@blester125 (Collaborator)

Can the rest of this conditional be simplified to something like the following?

```python
...
elif child.name in ("em", "a", ...):
    text_pieces.append(child.get_text())
```

And maybe the in check should be against a variable like FORMATTED_STRING_TAGS or something?
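
A runnable sketch of that refactor; the tag set and the handling of bare text nodes are guesses at what the surrounding loop does:

```python
from bs4 import BeautifulSoup

# Inline formatting tags whose text stays in the current paragraph
# (an assumed set; extend it as the parser needs).
FORMATTED_STRING_TAGS = ("em", "a", "strong", "i", "b")

paragraph = BeautifulSoup(
    '<p>Hello <em>world</em><br>new <a href="#">line</a></p>', "html.parser"
).p

split_p, text_pieces = [], []
for child in paragraph.children:
    if child.name == "br":
        # A <br> closes the current paragraph.
        split_p.append("".join(text_pieces))
        text_pieces = []
    elif child.name in FORMATTED_STRING_TAGS:
        text_pieces.append(child.get_text())
    elif child.name is None:
        # Bare text nodes (NavigableString) have no tag name.
        text_pieces.append(str(child))
split_p.append("".join(text_pieces))
print(split_p)  # ['Hello world', 'new line']
```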

```python
text_article = [
    article_paragraph
    for s in split_p
    if is_valid(article_paragraph := s.strip()) and s.strip()
```
@blester125 (Collaborator)

I think you can avoid the double strip here with:

```python
...
if is_valid(article_paragraph := s.strip()) and article_paragraph
...
```

Or you could push the empty check into the is_valid function
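
For example, with the empty check folded into `is_valid` (the length rule inside it is hypothetical):

```python
split_p = ["This paragraph is long enough to keep.", "   ", "short"]

def is_valid(paragraph: str) -> bool:
    # Empty strings fail here, so the comprehension only strips once.
    return bool(paragraph) and len(paragraph.split()) > 3  # hypothetical rule

text_article = [
    article_paragraph
    for s in split_p
    if is_valid(article_paragraph := s.strip())
]
print(text_article)  # ['This paragraph is long enough to keep.']
```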

@lintangsutawika (Collaborator, Author)

Sorry for the delay. Will get to it this week.

@craffel (Collaborator) commented May 6, 2024

@lintangsutawika any bandwidth to finish this up?

@blester125 (Collaborator) commented May 7, 2024

I made all the changes I had talked about, and I separated the scraping steps to make the code a bit simpler in PR #68. We should be able to merge that one.

@blester125 (Collaborator)

Development moved to this branch for easier collaboration: #68

@blester125 closed this May 8, 2024