Regulations #84
Conversation
- fix missing imports
I believe that for most of the other sources I'm aware of, there are typically about half as many tokens as there are on-disk bytes of gzipped jsonl. In your case there are about 2x as many; any idea where this discrepancy comes from?
Made some comments. I think the index metadata format should be clarified; there are some places where it seems to get processed in different ways.
I also didn't look at the usgpo files at all; I assumed they were covered by your other PR.
@@ -0,0 +1,150 @@
"""Build index of document URLs from a Regulations.gov bulk download file"""
Copy-pasted docstring.
index = defaultdict(list)
num_skipped = 0
num_parsed = 0
with open(input_file, "r") as f:
A file that is going to be used with the csv reader should be opened with `newline=''`.
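A minimal sketch of the suggested change (the filename and the `csv.reader` usage are assumptions here, since the diff only shows the `open` call):

```python
import csv

# Opening with newline="" lets the csv module handle embedded newlines
# inside quoted fields correctly, per the csv module docs.
with open("bulk_download.csv", "r", newline="") as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
```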
def parse_args():
    parser = argparse.ArgumentParser("Regulations.gov index builder")
Missing `description=...`; as is, this overwrites the program name.
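For reference, the first positional argument to `argparse.ArgumentParser` is `prog`, so a sketch of the suggested fix would be:

```python
import argparse

def parse_args():
    # Passing the string positionally sets prog (the program name shown in
    # usage output); the human-readable summary belongs in description=.
    parser = argparse.ArgumentParser(
        description="Regulations.gov index builder"
    )
    return parser.parse_args()
```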
def parse_args():
    parser = argparse.ArgumentParser("Regulations.gov index builder")
Same as above
    return parser.parse_args()


def convert_htm(input_path, output_path):
Prefer naming this `convert_html`; `htm` is super legacy from when Windows couldn't have an extension with > 3 chars lol.
with open(file_path, "r", encoding="windows-1252") as f:
    text = f.read()
except UnicodeDecodeError:
    continue
Probably add an error log noting that this file couldn't be opened.
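Something along these lines (a sketch; `file_paths` stands in for whatever list the script actually iterates over):

```python
import logging

logger = logging.getLogger(__name__)

# Stand-in for the list of downloaded files the script loops over.
file_paths = ["example_document.htm"]

for file_path in file_paths:
    try:
        with open(file_path, "r", encoding="windows-1252") as f:
            text = f.read()
    except UnicodeDecodeError:
        # Record which file failed instead of skipping it silently.
        logger.error("Could not decode %s with windows-1252, skipping", file_path)
        continue
```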
url = None
for file_metadata in metadata["Content Files"]:
    if file_metadata["File Type"] in [".htm", ".txt", ".doc", ".docx"]:
        url = file_metadata["URL"]
I'm confused about this index format again: this is always going to set `url` to the last matching file type in the `metadata["Content Files"]` field, but the downloads earlier seemed to download all the different formats?
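If the intent is to keep every format that was downloaded rather than just the last match, something like this might be clearer (a sketch; the metadata values below are made up just to make it self-contained):

```python
# Illustrative metadata record with the fields the script reads.
metadata = {
    "Content Files": [
        {"File Type": ".htm", "URL": "https://downloads.regulations.gov/doc.htm"},
        {"File Type": ".pdf", "URL": "https://downloads.regulations.gov/doc.pdf"},
        {"File Type": ".docx", "URL": "https://downloads.regulations.gov/doc.docx"},
    ]
}

# Collect every matching URL instead of overwriting `url` on each iteration,
# so the index records all formats that were downloaded.
urls = [
    f["URL"]
    for f in metadata["Content Files"]
    if f["File Type"] in (".htm", ".txt", ".doc", ".docx")
]
```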
    if file_metadata["File Type"] in [".htm", ".txt", ".doc", ".docx"]:
        url = file_metadata["URL"]

record = {
Dolma format is incorrect: `document_type`, `title`, and `agency` should all be in the `metadata` dict, and `posted_date` should be `created`.
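A sketch of the shape I'd expect (values are placeholders, and this reflects my reading of the Dolma layout rather than this repo's exact schema):

```python
from datetime import datetime, timezone

# Placeholder values; in the script these come from the parsed metadata file.
record = {
    "id": "EPA-HQ-OAR-2021-0317-0001",
    "text": "extracted plaintext goes here",
    "source": "regulations.gov",
    "added": datetime.now(timezone.utc).isoformat(),
    "created": "2021-06-01",  # posted_date from the metadata maps to created
    "metadata": {
        # Source-specific fields belong inside metadata, not at the top level.
        "document_type": "Proposed Rule",
        "title": "Example document title",
        "agency": "EPA",
    },
}
```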
for agency in args.agencies:
    os.makedirs(os.path.join(args.output_dir, year, agency), exist_ok=True)

for year, agency in itertools.product(args.years, args.agencies):
Should this downloading be parallelized at all? Or is the data small enough, or are you parallelizing by running disjoint (year, agency) pairs on multiple workers?
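If in-script parallelism does turn out to be useful, a minimal sketch with a thread pool over the (year, agency) pairs might look like this (the `download_agency_year` helper and the example years/agencies are hypothetical stand-ins for the existing download logic):

```python
import itertools
from concurrent.futures import ThreadPoolExecutor, as_completed

def download_agency_year(year, agency):
    # Placeholder for the existing per-(year, agency) download logic.
    print(f"downloading {agency} {year}")

years = ["2000", "2001"]
agencies = ["EPA", "FDA"]

# Downloads are I/O-bound, so threads are usually sufficient here.
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = {
        pool.submit(download_agency_year, year, agency): (year, agency)
        for year, agency in itertools.product(years, agencies)
    }
    for future in as_completed(futures):
        future.result()  # re-raise any download errors
```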
# Data Collection

This collection's metadata was first gathered via a series of bulk download requests to Regulations.gov, since each bulk download can only cover metadata for a single agency over a single year span. In total, we requested metadata from 14 agencies between the years 2000 and 2023. To collect the documents, run the script `get-data.sh`. Internally, this parses the metadata files to create an index of all file URLs referenced in the metadata. It then downloads all of the referenced .doc, .docx, .txt, and .htm files and converts each format to plaintext. Finally, it reads each of these converted files and stores them in a Dolma dataset. The resulting dataset is written to `data/regulations/v0`.
The mentioned `get-data.sh` script doesn't seem to be checked in.
This PR closes #32. The data from 2000-2023 for the agencies listed below has been collected and is on HF here. The total amount of data is around 2B tokens and 1.3GB on disk.
Agencies: