This project implements a bilingual corpus workflow for:
- crawling public English and Chinese webpages,
- cleaning extracted text,
- estimating symbol and token probabilities,
- computing Shannon entropy,
- validating Zipf's law on English word frequencies,
- comparing the results across increasing sample sizes,
- generating a Markdown technical report.
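
As a rough illustration of the entropy and Zipf steps listed above, the following standard-library sketch estimates probabilities as relative frequencies, computes Shannon entropy in bits, and fits a log-log slope to rank-frequency data. The function names `shannon_entropy` and `zipf_slope` are hypothetical and are not taken from the `nlp_hm` package; the project's own tokenization and fitting method may differ.

```python
import math
import re
from collections import Counter


def shannon_entropy(symbols):
    """Return the entropy, in bits, of the empirical distribution of `symbols`."""
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = -sum(p * log2(p)), with p estimated as a relative frequency.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def zipf_slope(words):
    """Least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted(Counter(words).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct words")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


if __name__ == "__main__":
    text = "the cat sat on the mat and the dog sat on the log"
    words = re.findall(r"[a-z']+", text.lower())
    print("character entropy (bits):", shannon_entropy(text.replace(" ", "")))
    print("word entropy (bits):", shannon_entropy(words))
    print("log-log slope:", zipf_slope(words))
```

For Chinese text, the same `shannon_entropy` function can be applied to a sequence of characters. A slope close to -1 on a large English corpus is the usual reading of Zipf's law; a toy sentence like the one above is far too small to show it.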
The crawler/extractor layer is built around the trafilatura ecosystem:
- GitHub: https://github.com/adbar/trafilatura
- Documentation: https://trafilatura.readthedocs.io/
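
For orientation, a minimal sketch of fetching and extracting a page with trafilatura might look like the following. `trafilatura.fetch_url` and `trafilatura.extract` are part of the library's documented API; the helper names and the JSON Lines field names (`url`, `lang`, `text`) are assumptions made for illustration and may not match what `nlp-hm crawl` actually writes.

```python
import json


def extract_main_text(html, url=None):
    """Return the main text of an HTML document, or None if nothing is extracted."""
    # Third-party dependency (pip install trafilatura). Imported inside the
    # function so the record helper below stays usable without it.
    import trafilatura

    return trafilatura.extract(html, url=url)


def fetch_and_extract(url):
    """Download one page and return its main text, or None on failure."""
    import trafilatura

    downloaded = trafilatura.fetch_url(url)  # None if the download fails
    if downloaded is None:
        return None
    return trafilatura.extract(downloaded, url=url)


def to_jsonl_line(url, lang, text):
    """Serialize one page as a JSON Lines record (hypothetical field names)."""
    # ensure_ascii=False keeps Chinese text readable in the output file.
    return json.dumps({"url": url, "lang": lang, "text": text}, ensure_ascii=False)
```

Each line returned by `to_jsonl_line` could be appended to a file such as data/raw/corpus.jsonl, one record per page.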

Install the package in editable mode:

python3 -m pip install -e .

The default seed list lives in config/seed_sites.json. It contains a small set of public English and Chinese websites, together with conservative crawl limits intended to keep the crawler polite.

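
To make "polite" and "conservative crawl limits" concrete, here is a small standard-library sketch that filters candidate URLs through robots.txt rules and caps the number of pages per site. The `CrawlLimits` fields, their default values, and the `nlp-hm-sketch` user agent are hypothetical; the actual keys in config/seed_sites.json may be different.

```python
from dataclasses import dataclass
from urllib.robotparser import RobotFileParser


@dataclass(frozen=True)
class CrawlLimits:
    """Hypothetical limits; the real settings in seed_sites.json may differ."""

    max_pages_per_site: int = 20
    delay_seconds: float = 2.0  # pause the caller should keep between requests


def build_robots_parser(robots_txt):
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def select_urls(urls, parser, limits, user_agent="nlp-hm-sketch"):
    """Keep URLs that robots.txt allows, capped at the per-site page limit."""
    allowed = [url for url in urls if parser.can_fetch(user_agent, url)]
    return allowed[: limits.max_pages_per_site]
```

A caller would fetch the selected URLs one at a time and sleep for `delay_seconds` between requests.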
Run the full workflow step by step:
nlp-hm crawl --config config/seed_sites.json --output data/raw/corpus.jsonl
nlp-hm clean --input data/raw/corpus.jsonl --output data/clean/corpus.jsonl
nlp-hm analyze --input data/clean/corpus.jsonl --output-dir results/full
nlp-hm experiment --input data/clean/corpus.jsonl --sizes 10,30,60 --output-dir results/experiments
nlp-hm report --config config/seed_sites.json --clean-input data/clean/corpus.jsonl --experiment-dir results/experiments --output report/report.md

Run the CLI module without installing the package:
python3 -m nlp_hm --help

Project layout:
- config/: seed sites and crawl settings
- data/: raw and cleaned corpora
- report/: generated Markdown report
- results/: analysis outputs and figures
- src/nlp_hm/: implementation
- tests/: unit tests