News #45
Conversation
Am I correct in assuming that this is related to #44? Can you list in the PR which sources you have included and which you plan to do?
Yes, it's related. I'm planning to get through all the websites mentioned. I currently have ProPublica and Democracy Now.
Can you add the (estimated) number of documents and number of tokens to the README?
What would be the best way to estimate the number of tokens? Would sampling a number of pages and calculating the tokens from there be valid?
Yup!
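For reference, a minimal sketch of that sampling approach (the tokenizer choice, sample size, and file path here are illustrative assumptions, not necessarily what the PR uses):

import gzip
import json
import random

from transformers import AutoTokenizer

# Hypothetical: estimate tokens from the dolma-formatted output of get_text.py.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
with gzip.open("data/news/news-caravanserai.jsonl.gz", "rt") as f:
    docs = [json.loads(line)["text"] for line in f]

# Tokenize a random sample and extrapolate the mean to the full set.
sample = random.sample(docs, min(100, len(docs)))
mean_tokens = sum(len(tokenizer.encode(d)) for d in sample) / len(sample)
print(f"{len(docs)} documents, ~{mean_tokens * len(docs):,.0f} tokens (estimated)")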
Hi! Sorry for the slow review, feel free to ping me in the future!
Thanks for the PR, it handles a lot!
It does feel unfortunate to need a script that enumerates all the pages and their licenses, but I don't have a great idea on how to do that better.
Right now, the conversion to the dolma format and the parsing of the HTML to raw text happen at once. Do you think it would make sense to break those up? We could have a raw dolma dataset that contains the HTML pages, with the text parsing as a secondary job that uses dolma's parallelization. I guess it depends on whether we think the HTML parsing is going to change.
Do you have some example outputs you could share?
Again, thanks for your work on this!
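For what it's worth, a rough sketch of that two-stage split (the id/text/source/metadata fields follow the dolma JSONL convention; the function names and everything else here are hypothetical):

import gzip
import json

from bs4 import BeautifulSoup

def write_raw_dolma(pages, path):
    # Stage 1 (hypothetical): store the raw HTML itself as dolma records,
    # so re-parsing never requires re-crawling.
    with gzip.open(path, "wt") as f:
        for page_id, url, html in pages:
            record = {
                "id": page_id,
                "text": html,
                "source": "news",
                "metadata": {"url": url},
            }
            f.write(json.dumps(record) + "\n")

def parse_record(record):
    # Stage 2 (hypothetical): a per-record map that dolma-style sharded
    # parallelization could run over the raw dataset.
    soup = BeautifulSoup(record["text"], "html.parser")
    record["text"] = soup.get_text(" ", strip=True)
    return record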
python news/get_page.py --url https://www.thepublicrecord.ca/ --output_dir data/news/raw/thepublicrecord/

# Public Domain
python news/get_page.py --url https://central.asia-news.com/en_GB/ --output_dir data/news/raw/caravanserai/
Can you add trailing newlines to these files? Otherwise things like this diff will keep complaining lol
python news/get_text.py --license "Public Domain" --input_dir data/news/raw/caravanserai/ --filename news-caravanserai.jsonl.gz --output_dir data/news/ --tag div --attrs '{"class": "article__content"}'

# CC NC ND
# python news/get_text.py --license "CC NC ND" --input_dir data/news/raw/projectmultatuli/ --filename news-projectmultatuli.jsonl.gz --output_dir data/news/ --tag div --attrs '{"class": "elementor-widget-container"}'
Same as above
parser = argparse.ArgumentParser(description="Download News Sites")
parser.add_argument(
    "--url", default="https://www.propublica.org/", help="Base URL"
Would it be better to have no default value and make this flag required?
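For example, a one-line sketch assuming argparse as in the diff:

parser.add_argument("--url", required=True, help="Base URL")

With required=True, argparse errors out when the flag is missing instead of silently scraping the default site.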
)
parser.add_argument(
    "--index_path",
    default=None,
You don't need to set the default to None; it is already that.
help="File that list of all pages", | ||
) | ||
parser.add_argument( | ||
"--output_dir", |
Same question as for --url
text = [soup.title.get_text() if soup.title else ""]

# Search for author
if byline := soup.find(class_=re.compile("byline")):
These all seem to do a regex search and then run the same code to process the result; can it be simplified with something like:
if byline := soup.find(class_=re.compile(r"(byline|post-author|posted-by|...)")):
    ...
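Spelled out, a sketch of that consolidation (the class list is guessed from the branches visible in this diff; there may be more variants):

# One alternation covering all the author-byline class spellings.
AUTHOR_CLASS_RE = re.compile(r"byline|post-author|posted-by")

if byline := soup.find(class_=AUTHOR_CLASS_RE):
    text.append(byline.get_text().strip())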
# Search for author
if byline := soup.find(class_=re.compile("byline")):
    text.append(byline.get_text().strip())
elif byline := soup.find(class_=re.compile("post-author")):
I confirmed it myself, but can you add a comment like
# Calling `re.compile(...)` repeatedly in this processing function (once for each document) is ok because the resulting compiled re object is cached and reused
Otherwise I'm going to forget and wonder if repeated calls are an issue in the future lol.
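A quick way to convince yourself of that caching (this relies on a CPython implementation detail: re keeps an internal cache of compiled patterns):

import re

# Repeated compiles of the same pattern return the same cached object.
assert re.compile("byline") is re.compile("byline")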
    text.append(byline.get_text().strip())

# Search for dateline
if dateline := soup.find("time", class_=re.compile("title")):
Same simplification question as above
elif child.name == "br": | ||
split_p.append("".join(text_pieces)) | ||
text_pieces = [] | ||
elif child.name == "em": |
Can the rest of this conditional be simplified to something like the following?
...
elif child.name in ("em", "a", ...):
    text_pieces.append(child.get_text())
And maybe the in check should be against a variable like FORMATTED_STRING_TAGS or something?
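A sketch of what that could look like in context (the tag tuple and the surrounding loop are reconstructed guesses, not the actual code):

# Hypothetical constant; the exact set depends on which inline
# formatting tags the parser needs to flatten into plain text.
FORMATTED_STRING_TAGS = ("em", "a", "strong", "i", "b")

def split_on_breaks(paragraph):
    # Split a <p> tag's children into text runs, breaking on <br>.
    split_p, text_pieces = [], []
    for child in paragraph.children:
        if child.name == "br":
            split_p.append("".join(text_pieces))
            text_pieces = []
        elif child.name in FORMATTED_STRING_TAGS:
            # append, not extend: extending with a string adds it one
            # character at a time (same joined output, but misleading).
            text_pieces.append(child.get_text())
        elif child.name is None:
            # Plain NavigableString between tags.
            text_pieces.append(str(child))
    if text_pieces:
        split_p.append("".join(text_pieces))
    return split_p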
text_article = [
    article_paragraph
    for s in split_p
    if is_valid(article_paragraph := s.strip()) and s.strip()
I think you can avoid the double strip here with
...
if is_valid(article_paragraph := s.strip()) and article_paragraph
...
Or you could push the empty check into the is_valid function.
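A sketch of that second option, pushing the emptiness check into is_valid (the body shown is hypothetical; the real function presumably has more checks):

def is_valid(paragraph):
    # Reject empty strings here so callers only need to strip once.
    if not paragraph:
        return False
    return True  # ...plus the existing validity checks

text_article = [
    article_paragraph
    for s in split_p
    if is_valid(article_paragraph := s.strip())
]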
Sorry for the delay. Will get to it this week.
@lintangsutawika any bandwidth to finish this up?
I made all the changes I had talked about, and I separated the scraping steps to make the code a bit simpler in PR #68. We should be able to merge that one.
Development moved to this branch for easier collaboration: #68
Sites included (source: https://opennewswire.org/feed/):
CC BY
CC BY-SA
Public Domain
Voice of America was intentionally left out so that it does not overlap with MOT.