ECHR decisions - scraping and extraction

This project contains a set of scripts to scrape decisions of the European Court of Human Rights (ECHR) as HTML and to extract semi-structured text as XML for downstream applications. An additional script converts these XML files to UIMA XMI for annotation in INCEpTION and similar tools.

This project was used in our paper "Mining legal arguments in court decisions" (https://link.springer.com/article/10.1007/s10506-023-09361-y); please cite that paper if you use this project.

Data and processing pipeline

We scraped all decisions published between 1950 and the end of 2022.

01_downloaded_rss_files

  • The ECHR's HUDOC database publishes RSS feeds, which makes further scraping more convenient
  • We extract decision IDs from these RSS files
  • Generated by 01_all_years_months_rss_scraper.py (a sketch of the download loop follows this list)
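
The following Python sketch illustrates a per-month RSS download loop. The RSS endpoint and query parameters shown here are assumptions for illustration, not necessarily what 01_all_years_months_rss_scraper.py actually uses.

import os
import time
import requests

RSS_URL = "https://hudoc.echr.coe.int/app/transform/rss"  # hypothetical endpoint

def download_month(year: int, month: int) -> None:
    params = {"query": f"kpdate>={year}-{month:02d}-01"}  # hypothetical parameters
    response = requests.get(RSS_URL, params=params, timeout=30)
    response.raise_for_status()
    os.makedirs("01_downloaded_rss_files", exist_ok=True)
    with open(f"01_downloaded_rss_files/{year}-{month:02d}.rss", "wb") as f:
        f.write(response.content)

for year in range(1950, 2023):  # 1950 up to and including 2022
    for month in range(1, 13):
        download_month(year, month)
        time.sleep(1)  # pause between requests to be polite to the server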

02_extracted_case_ids

  • All ECHR document IDs found in the HUDOC database via the RSS feeds, stored as a plain txt file
  • Generated by 02_extract_case_ids_from_rss.py (sketched below)
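
A minimal sketch of the ID extraction, assuming each RSS <item> links to HUDOC with the document ID embedded in an "itemid" fragment; the repository script may locate the IDs differently.

import glob
import os
import re
import xml.etree.ElementTree as ET

ids = set()
for rss_file in glob.glob("01_downloaded_rss_files/*.rss"):
    tree = ET.parse(rss_file)
    for link in tree.iter("link"):
        # hypothetical link format: ...#{"itemid":["001-57619"]}
        match = re.search(r'itemid["\':\[ ]+([\w-]+)', link.text or "")
        if match:
            ids.add(match.group(1))

os.makedirs("02_extracted_case_ids", exist_ok=True)
with open("02_extracted_case_ids/all_ids.txt", "w") as f:
    f.write("\n".join(sorted(ids)))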

03_all_cases_html (compressed to .tar.gz; decompress with $ cat 03_all_cases_html.tar.gz.* | tar xzvf -, which re-creates the folder 03_all_cases_html)

  • 172,770 court decisions as raw HTML scraped from HUDOC; each HTML document has a corresponding JSON metadata file
  • The files are spread across 100 folders named by the last two digits of each document ID, which makes manipulating over 300k files in total more convenient
  • Generated by 03_dowload_all_cases_html.py; this takes a long time, as it scrapes the HUDOC website with a pause between requests (a sketch follows this list)
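
A minimal sketch of the throttled download loop, assuming the two-digit sharding described above; the HUDOC HTML endpoint shown is an assumption and may differ from the one used in 03_dowload_all_cases_html.py.

import json
import os
import time
import requests

HTML_URL = "https://hudoc.echr.coe.int/app/conversion/docx/html/body"  # hypothetical endpoint

def download_case(item_id: str) -> None:
    folder = os.path.join("03_all_cases_html", item_id[-2:])  # shard by last two digits
    os.makedirs(folder, exist_ok=True)
    response = requests.get(HTML_URL, params={"library": "ECHR", "id": item_id}, timeout=60)
    response.raise_for_status()
    with open(os.path.join(folder, f"{item_id}.html"), "wb") as f:
        f.write(response.content)
    with open(os.path.join(folder, f"{item_id}.json"), "w") as f:
        json.dump({"itemid": item_id}, f)  # minimal metadata; the real files carry more

with open("02_extracted_case_ids/all_ids.txt") as f:
    for item_id in f.read().split():
        download_case(item_id)
        time.sleep(1)  # pause between requests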

04_structured_cases (compressed to .tar.gz; decompress with $ cat 04_structured_cases.tar.gz.* | tar xzvf -, which re-creates the folder 04_structured_cases)

  • Raw HTML files converted to simple UTF-8 XML
  • Split into paragraphs <p> corresponding to the paragraphs of the original HTML files
  • Each paragraph contains one or several spans <span text="..."/>, where the text attribute holds the extracted text
  • Converting this XML to plain text amounts to concatenating the text attribute of every span in each paragraph <p>; no additional whitespace is needed, since whitespace is preserved in the text attribute. An implementation can be found on line 16 of 05_convert_structured_case_to_xmi_with_spacy.py (function def xml_to_plain_text(input_xml_file)); an equivalent sketch follows this list
  • Generated by 04_convert_html_to_structured_cases.py
  • In total 137,716 non-empty XML files (172,572 files overall minus 34,856 empty files whose malformed HTML could not be parsed; fixing this in the HTML extraction code remains possible)
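
For reference, a minimal re-implementation of the conversion described above (the repository's own version lives in 05_convert_structured_case_to_xmi_with_spacy.py):

import xml.etree.ElementTree as ET

def xml_to_plain_text(input_xml_file: str) -> str:
    tree = ET.parse(input_xml_file)
    parts = []
    for p in tree.iter("p"):
        for span in p.iter("span"):
            parts.append(span.get("text", ""))
    # whitespace is already preserved inside the text attributes,
    # so plain concatenation reproduces the document text
    return "".join(parts)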

(Note: the archives were created with $ tar cvzf - 03_all_cases_html/ | split --bytes=95MB - 03_all_cases_html.tar.gz. due to GitHub's limit of 100 MB per file.)
