
Regulations #84

Open · wants to merge 9 commits into base: main
Conversation

nkandpa2 (Collaborator)

This PR closes #32. The data from 2000-2023 for the agencies listed below has been collected and is on HF here. The total amount of data is around 2B tokens and 1.3GB on disk.

Agencies:

  • BIS
  • DOT
  • EPA
  • FAA
  • FDA
  • FEMA
  • FERC
  • FMCSA
  • FRA
  • NHTSA
  • OSHA
  • PHMSA
  • SEC
  • USCG

craffel (Collaborator) commented Jun 10, 2024

I believe that for most of the other sources I'm aware of, there are typically about 1/2 as many tokens as there are on-disk bytes of gzipped jsonl. In your case there are about 2x as many; any idea where this discrepancy comes from?
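
For reference, a rough way to check that ratio on one shard of the HF upload. This is only a sketch: it assumes the data is gzipped JSONL with a "text" field and counts tokens with a GPT-2 BPE via tiktoken, neither of which is confirmed in this PR.

```python
import gzip
import json
import os

import tiktoken

# Tokenizer choice is an assumption; the PR does not say how tokens were counted.
enc = tiktoken.get_encoding("gpt2")

def tokens_and_bytes(path):
    """Return (num_tokens, on_disk_bytes) for one gzipped JSONL shard."""
    num_tokens = 0
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            # disallowed_special=() keeps encode() from raising if the text
            # happens to contain strings like "<|endoftext|>".
            num_tokens += len(enc.encode(record["text"], disallowed_special=()))
    return num_tokens, os.path.getsize(path)
```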

blester125 (Collaborator) left a comment

Made some comments. I think the index metadata format should be clarified; there are some places where it seems to get processed in different ways.

I also didn't look at the usgpo files at all; I assumed they were covered by your other PR.

@@ -0,0 +1,150 @@
"""Build index of document URLs from a Regulations.gov bulk download file"""

blester125 (Collaborator): Copy-pasted docstring.

index = defaultdict(list)
num_skipped = 0
num_parsed = 0
with open(input_file, "r") as f:

blester125 (Collaborator): A file that is going to be used with the csv reader should be opened with newline=''.
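
A minimal sketch of the suggested change, assuming the file is read with csv.reader and a UTF-8 encoding (both assumptions; `input_file` comes from the surrounding script):

```python
import csv

# Per the csv module docs, open the file with newline="" so the reader
# correctly handles newlines embedded inside quoted fields.
with open(input_file, "r", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        ...
```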



def parse_args():
    parser = argparse.ArgumentParser("Regulations.gov index builder")

blester125 (Collaborator): Missing description=...; as is, this overwrites the program name.
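
A sketch of the fix: the first positional argument to ArgumentParser is `prog`, so the string as written replaces the program name shown in --help; the intent here is presumably a description.

```python
import argparse

def parse_args():
    # Pass the human-readable text as description= rather than as the first
    # positional argument (which sets prog, the displayed program name).
    parser = argparse.ArgumentParser(description="Regulations.gov index builder")
    return parser.parse_args()
```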



def parse_args():
    parser = argparse.ArgumentParser("Regulations.gov index builder")

blester125 (Collaborator): Same as above.

    return parser.parse_args()


def convert_htm(input_path, output_path):

blester125 (Collaborator): Prefer naming this convert_html; htm is super legacy from when Windows couldn't have an extension with > 3 chars lol.

    with open(file_path, "r", encoding="windows-1252") as f:
        text = f.read()
except UnicodeDecodeError:
    continue

blester125 (Collaborator): Probably add an error message noting that this file couldn't be opened.
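
One way to surface the skipped file, assuming the standard logging module is used (the diff doesn't show the script's logging setup, and `file_paths` stands in for whatever the enclosing loop iterates over):

```python
import logging

logger = logging.getLogger(__name__)  # assumption: no logger appears in the diff

for file_path in file_paths:  # stands in for the loop implied by `continue`
    try:
        with open(file_path, "r", encoding="windows-1252") as f:
            text = f.read()
    except UnicodeDecodeError:
        # Record the failure instead of silently dropping the file.
        logger.error("Could not decode %s as windows-1252; skipping", file_path)
        continue
```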

url = None
for file_metadata in metadata["Content Files"]:
    if file_metadata["File Type"] in [".htm", ".txt", ".doc", ".docx"]:
        url = file_metadata["URL"]

blester125 (Collaborator): I'm confused about this index format again. This will always set url to the last matching file type in the metadata["Content Files"] field, but the downloads earlier seemed to download all the different formats?
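
If the intent is to keep a single preferred format per document, one possible shape of a fix is to iterate in an explicit priority order and stop at the first match. This is only a sketch of what the comment may be hinting at, not code from the PR, and the priority order shown is an assumption.

```python
# Prefer formats in a fixed order rather than taking whichever matching
# entry happens to come last in metadata["Content Files"].
PREFERRED_FORMATS = [".htm", ".txt", ".docx", ".doc"]

url = None
for file_type in PREFERRED_FORMATS:
    for file_metadata in metadata["Content Files"]:
        if file_metadata["File Type"] == file_type:
            url = file_metadata["URL"]
            break
    if url is not None:
        break
```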

    if file_metadata["File Type"] in [".htm", ".txt", ".doc", ".docx"]:
        url = file_metadata["URL"]

record = {

blester125 (Collaborator): The Dolma format is incorrect: document_type, title, and agency should all be in the metadata dict, and posted_date should be used as created.
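
A sketch of the record layout this comment describes, following the Dolma format's top-level id/text/source/added/created/metadata fields. The variable names and the source string are assumptions about the surrounding code, not values taken from the PR.

```python
import datetime

# document_id, text, posted_date, document_type, title, and agency are assumed
# to exist in the surrounding code; "regulations.gov" as source is a guess.
record = {
    "id": document_id,
    "text": text,
    "source": "regulations.gov",
    "created": posted_date,  # per the comment: posted_date belongs in `created`
    "added": datetime.datetime.utcnow().isoformat(),
    "metadata": {
        "document_type": document_type,
        "title": title,
        "agency": agency,
    },
}
```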

for agency in args.agencies:
    os.makedirs(os.path.join(args.output_dir, year, agency), exist_ok=True)

for year, agency in itertools.product(args.years, args.agencies):

blester125 (Collaborator): Should this downloading be parallelized at all? Or is the data small enough / are you parallelizing by running disjoint (year, agency) pairs on multiple workers?
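
If in-process parallelism is wanted, one possible sketch is a multiprocessing pool over the (year, agency) pairs. `download_agency_year` is a placeholder, not a function in this PR, and `parse_args()` refers to the script's own argument parser.

```python
import itertools
import multiprocessing

def download_agency_year(pair):
    # Placeholder: stands in for whatever the script does for one
    # (year, agency) pair; this function does not exist in the PR.
    year, agency = pair
    ...

if __name__ == "__main__":
    args = parse_args()  # assumes the script's own parse_args()
    pairs = list(itertools.product(args.years, args.agencies))
    with multiprocessing.Pool(processes=8) as pool:
        pool.map(download_agency_year, pairs)
```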


# Data Collection

This collection's metadata was first gathered via a number of bulk download requests to Regulations.gov, since each bulk download can only cover metadata for a single agency over a single year span. In total, we requested metadata from 14 agencies for the years 2000 through 2023. To collect the documents, run the script `get-data.sh`. Internally, this parses the metadata files to create an index of all file URLs referenced in the metadata, downloads all of the referenced .doc, .docx, .txt, and .htm files, and converts each format to plaintext. Finally, it reads each of these converted files and stores them in a Dolma dataset. The resulting dataset is written to `data/regulations/v0`.

blester125 (Collaborator): The mentioned get-data.sh script doesn't seem to be checked in.
