waitma/nlp_hm

NLP Homework Toolkit

This project implements a bilingual corpus workflow for:

  • crawling public English and Chinese webpages,
  • cleaning extracted text,
  • estimating symbol and token probabilities,
  • computing Shannon entropy,
  • validating Zipf's law on English word frequencies,
  • comparing the results across increasing sample sizes,
  • generating a Markdown technical report.
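
As a rough illustration of the entropy and Zipf steps listed above, here is a minimal standard-library sketch. It is not the toolkit's actual implementation: the function names `shannon_entropy` and `zipf_slope`, the regex tokenizer, and the sample sentence are all assumptions made for this example. Entropy is computed in bits from empirical frequencies, and the Zipf check is a plain least-squares fit of log frequency against log rank.

```python
import math
import re
from collections import Counter


def shannon_entropy(counts: Counter) -> float:
    """Entropy in bits of the empirical distribution described by `counts`."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(
        (c / total) * math.log2(c / total) for c in counts.values() if c > 0
    )


def zipf_slope(counts: Counter) -> float:
    """Least-squares slope of log(frequency) against log(rank).

    Zipf's law predicts a slope close to -1 for word frequencies.
    """
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct tokens")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical sample text; a real run would read the cleaned corpus instead.
text = "the cat sat on the mat and the dog sat on the log"
char_counts = Counter(ch for ch in text if not ch.isspace())
word_counts = Counter(re.findall(r"[a-z']+", text.lower()))

print(f"character entropy: {shannon_entropy(char_counts):.3f} bits")
print(f"word entropy:      {shannon_entropy(word_counts):.3f} bits")
print(f"Zipf slope:        {zipf_slope(word_counts):.3f}")
```

For Chinese text the natural symbol unit is the individual character, so the same `shannon_entropy` function applies once the counts are built per character.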

The crawler/extractor layer is built around the trafilatura ecosystem.

Install

python3 -m pip install -e .

Seed Sites

The default seed list lives in config/seed_sites.json. It contains a small, polite set of public English and Chinese websites and conservative crawl limits.

Usage

Run the full workflow step by step:

nlp-hm crawl --config config/seed_sites.json --output data/raw/corpus.jsonl
nlp-hm clean --input data/raw/corpus.jsonl --output data/clean/corpus.jsonl
nlp-hm analyze --input data/clean/corpus.jsonl --output-dir results/full
nlp-hm experiment --input data/clean/corpus.jsonl --sizes 10,30,60 --output-dir results/experiments
nlp-hm report --config config/seed_sites.json --clean-input data/clean/corpus.jsonl --experiment-dir results/experiments --output report/report.md
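
The experiment step compares results across increasing sample sizes. The sketch below shows one way such a comparison could work, assuming that each value in `--sizes` is a number of documents taken from the start of the corpus; that interpretation, along with the helper names `parse_sizes`, `entropy_bits`, and `entropy_by_sample_size`, is hypothetical and not taken from the toolkit's source.

```python
import math
from collections import Counter


def parse_sizes(spec: str) -> list[int]:
    """Turn a comma-separated string such as '10,30,60' into a list of ints."""
    return [int(part) for part in spec.split(",") if part.strip()]


def entropy_bits(symbols) -> float:
    """Shannon entropy in bits of an iterable of symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_by_sample_size(documents: list[str], sizes: list[int]) -> dict[int, float]:
    """Character entropy computed over the first `size` documents, per size."""
    results = {}
    for size in sizes:
        sample = documents[:size]
        chars = (ch for doc in sample for ch in doc if not ch.isspace())
        results[size] = entropy_bits(chars)
    return results


# Toy documents standing in for the cleaned corpus.
docs = ["aaaa", "abab", "abcd", "abcdefgh"]
for size, h in entropy_by_sample_size(docs, parse_sizes("1,2,4")).items():
    print(f"{size} docs: {h:.3f} bits")
```

With a real corpus, the estimate typically stabilizes as the sample grows, which is the trend this kind of comparison is meant to expose.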

Run the CLI module without installing the package:

python3 -m nlp_hm --help

Project Layout

config/            Seed sites and crawl settings
data/              Raw and cleaned corpora
report/            Generated Markdown report
results/           Analysis outputs and figures
src/nlp_hm/        Implementation
tests/             Unit tests
