Companion code to the arXiv preprint presenting the LaCour! corpus.
If you are looking for the dataset, please visit LaCour! Corpus.
Note
This repo is still a work in progress and will be updated in the coming days!
For installation with Miniconda:
conda create -n lacour-generation python=3.9
conda activate lacour-generation
git clone https://github.com/trusthlt/lacour-generation.git
cd lacour-generation
pip install -r requirements.txt
Producing the hearing transcripts and associated documents is divided into several steps. The code for all scrapers is located in scrape.
- Download video files and video information by running scrape_webcast_videos.py, which produces all_webcasts_{date}.json
- Find associated files in HUDOC and download them by running scrape_case_html_matching_webcast.py
- Find related press releases in HUDOC and download them by running scrape_press_releases.py
Warning
Due to changes to the webcast website, the scraper for videos no longer works. You can instead skip the first step and download the last scraped file all_webcasts.json
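As a small illustration of the fallback described above, the following sketch picks a dated all_webcasts_{date}.json if one exists and otherwise falls back to all_webcasts.json. The helper name pick_webcast_file and the idea of selecting by modification time are assumptions made for this example; they are not part of the repository's scripts.

```python
from pathlib import Path
from typing import Optional


def pick_webcast_file(directory: str) -> Optional[Path]:
    """Return the webcast listing to use, or None if none is present.

    Hypothetical helper: the repository's scrapers may locate this file differently.
    """
    base = Path(directory)
    # Dated files as written by the video scraper: all_webcasts_{date}.json.
    # Sorting by modification time avoids assuming a particular date format.
    dated = sorted(base.glob("all_webcasts_*.json"), key=lambda p: p.stat().st_mtime)
    if dated:
        return dated[-1]  # most recently modified dated file
    # Otherwise use the last scraped file mentioned in the warning.
    fallback = base / "all_webcasts.json"
    return fallback if fallback.exists() else None
```

Calling pick_webcast_file(".") in the directory that holds the downloaded file returns its path, or None when neither file is present.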
Transcribing a video into a hearing transcript requires several steps, with one manual annotation step. The code for transcription is located in transcribe.
- Diarize the video by running diarize.py. This requires a Hugging Face token to access the pyannote models.
- Generate a speaker schedule, clustering the diarization output, by running generate_speaker_schedule.py. This results in one text file with a speaker schedule per hearing webcast.
- (MANUAL) Annotate the speaker schedule with the correct tags.
- Generate a transcript by passing the annotated speaker schedule with the video to transcribe_segmented_whisper.py.
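To give a rough idea of what turning diarization output into a speaker schedule can involve, here is a minimal sketch that merges consecutive segments of the same speaker into longer turns. It is not the logic of generate_speaker_schedule.py: the segment tuple layout, the max_gap threshold, the function names, and the tab-separated output format are all assumptions chosen for illustration.

```python
from typing import List, Tuple

# Assumed layout: (start_seconds, end_seconds, speaker_label)
Segment = Tuple[float, float, str]


def merge_turns(segments: List[Segment], max_gap: float = 1.0) -> List[Segment]:
    """Merge consecutive same-speaker segments separated by at most max_gap seconds."""
    turns: List[Segment] = []
    for start, end, speaker in sorted(segments):
        same_speaker = bool(turns) and turns[-1][2] == speaker
        if same_speaker and start - turns[-1][1] <= max_gap:
            # Extend the previous turn instead of starting a new one.
            prev_start, prev_end, _ = turns[-1]
            turns[-1] = (prev_start, max(prev_end, end), speaker)
        else:
            turns.append((start, end, speaker))
    return turns


def format_schedule(turns: List[Segment]) -> str:
    """Render one turn per line as start, end, speaker (hypothetical format)."""
    return "\n".join(f"{s:.1f}\t{e:.1f}\t{spk}" for s, e, spk in turns)


if __name__ == "__main__":
    demo = [
        (0.0, 2.0, "SPEAKER_00"),
        (2.5, 4.0, "SPEAKER_00"),
        (4.2, 6.0, "SPEAKER_01"),
        (10.0, 12.0, "SPEAKER_01"),
    ]
    print(format_schedule(merge_turns(demo)))
```

In the manual annotation step, the generic speaker labels in such a file would then be replaced with the correct tags before transcription.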