Regulations #84
Conversation
- fix missing imports
I believe that for most of the other sources I'm aware of, there are typically about half as many tokens as there are on-disk bytes of gzipped jsonl. In your case there are about 2x as many; any idea where this discrepancy comes from?
Made some comments. I think the index metadata format should be clarified; there are some places where it seems to get processed in different ways.
I also didn't look at the usgpo files at all; I assumed they were covered by your other PR.
@@ -0,0 +1,150 @@
"""Build index of document URLs from a Regulations.gov bulk download file"""
Copy-pasted docstring.
index = defaultdict(list)
num_skipped = 0
num_parsed = 0
with open(input_file, "r") as f:
A file that is going to be used with the csv reader should be opened with `newline=''`.
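A minimal sketch of the suggested change (the filename and the `csv.reader` usage are assumptions here, since the diff only shows the `open` call):

```python
import csv

# Opening with newline="" lets the csv module handle embedded newlines
# inside quoted fields correctly, per the csv module docs.
with open("bulk_download.csv", "r", newline="") as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
```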
def parse_args():
    parser = argparse.ArgumentParser("Regulations.gov index builder")
Missing `description=...`; as is, this overwrites the program name.
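For reference, the first positional argument to `argparse.ArgumentParser` is `prog`, so a sketch of the suggested fix would be:

```python
import argparse

def parse_args():
    # Passing the string positionally sets prog (the program name shown in
    # usage output); the human-readable summary belongs in description=.
    parser = argparse.ArgumentParser(
        description="Regulations.gov index builder"
    )
    return parser.parse_args()
```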
def parse_args():
    parser = argparse.ArgumentParser("Regulations.gov index builder")
Same as above
    return parser.parse_args()


def convert_htm(input_path, output_path):
Prefer naming this `convert_html`; `htm` is super legacy from when Windows couldn't have an extension with > 3 chars lol.
with open(file_path, "r", encoding="windows-1252") as f:
    text = f.read()
except UnicodeDecodeError:
    continue
Probably add an error log noting that this file couldn't be opened.
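Something along these lines (a sketch; `file_paths` stands in for whatever list the script actually iterates over):

```python
import logging

logger = logging.getLogger(__name__)

# Stand-in for the list of downloaded files the script loops over.
file_paths = ["example_document.htm"]

for file_path in file_paths:
    try:
        with open(file_path, "r", encoding="windows-1252") as f:
            text = f.read()
    except UnicodeDecodeError:
        # Record which file failed instead of skipping it silently.
        logger.error("Could not decode %s with windows-1252, skipping", file_path)
        continue
```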
url = None
for file_metadata in metadata["Content Files"]:
    if file_metadata["File Type"] in [".htm", ".txt", ".doc", ".docx"]:
        url = file_metadata["URL"]
I'm confused about this index format again: this is always going to set `url` to the last matching file type in the `metadata["Content Files"]` field, but the downloads earlier seemed to download all the different formats?
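If the intent is to keep every format that was downloaded rather than just the last match, something like this might be clearer (a sketch; the metadata values below are made up just to make it self-contained):

```python
# Illustrative metadata record with the fields the script reads.
metadata = {
    "Content Files": [
        {"File Type": ".htm", "URL": "https://downloads.regulations.gov/doc.htm"},
        {"File Type": ".pdf", "URL": "https://downloads.regulations.gov/doc.pdf"},
        {"File Type": ".docx", "URL": "https://downloads.regulations.gov/doc.docx"},
    ]
}

# Collect every matching URL instead of overwriting `url` on each iteration,
# so the index records all formats that were downloaded.
urls = [
    f["URL"]
    for f in metadata["Content Files"]
    if f["File Type"] in (".htm", ".txt", ".doc", ".docx")
]
```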
    if file_metadata["File Type"] in [".htm", ".txt", ".doc", ".docx"]:
        url = file_metadata["URL"]

record = {
Dolma format is incorrect: `document_type`, `title`, and `agency` should all be in the `metadata` dict, and `posted_date` should be `created`.
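A sketch of the shape I'd expect (values are placeholders, and this reflects my reading of the Dolma layout rather than this repo's exact schema):

```python
from datetime import datetime, timezone

# Placeholder values; in the script these come from the parsed metadata file.
record = {
    "id": "EPA-HQ-OAR-2021-0317-0001",
    "text": "extracted plaintext goes here",
    "source": "regulations.gov",
    "added": datetime.now(timezone.utc).isoformat(),
    "created": "2021-06-01",  # posted_date from the metadata maps to created
    "metadata": {
        # Source-specific fields belong inside metadata, not at the top level.
        "document_type": "Proposed Rule",
        "title": "Example document title",
        "agency": "EPA",
    },
}
```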
for agency in args.agencies:
    os.makedirs(os.path.join(args.output_dir, year, agency), exist_ok=True)

for year, agency in itertools.product(args.years, args.agencies):
Should this downloading be parallelized at all? Or is the data small enough, or are you parallelizing by running disjoint (year, agency) pairs on multiple workers?
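If in-script parallelism does turn out to be useful, a minimal sketch with a thread pool over the (year, agency) pairs might look like this (the `download_agency_year` helper and the example years/agencies are hypothetical stand-ins for the existing download logic):

```python
import itertools
from concurrent.futures import ThreadPoolExecutor, as_completed

def download_agency_year(year, agency):
    # Placeholder for the existing per-(year, agency) download logic.
    print(f"downloading {agency} {year}")

years = ["2000", "2001"]
agencies = ["EPA", "FDA"]

# Downloads are I/O-bound, so threads are usually sufficient here.
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = {
        pool.submit(download_agency_year, year, agency): (year, agency)
        for year, agency in itertools.product(years, agencies)
    }
    for future in as_completed(futures):
        future.result()  # re-raise any download errors
```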
# Data Collection

This collection's metadata was first gathered via a series of bulk download requests to Regulations.gov, since each bulk download can only cover metadata for a single agency over a single year span. In total, we requested metadata from 14 agencies between the years 2000 and 2023. To collect the documents, run the script `get-data.sh`. Internally, this parses the metadata files to create an index of all file URLs referenced in the metadata. It then downloads all of the referenced .doc, .docx, .txt, and .htm files and converts each format to plaintext. Finally, it reads each of these converted files and stores them in a Dolma dataset. The resulting dataset is written to `data/regulations/v0`.
The mentioned `get-data.sh` script doesn't seem to be checked in.
This PR closes #32. The data from 2000-2023 for the agencies listed below has been collected and is on HF here. The total amount of data is around 2B tokens and 1.3GB on disk.
Agencies: