Utility scripts or libraries for various Natural Language Processing tasks.
- charfreq.awk: calculate character frequency.
- convcat.py: cat files with different encodings together.
- csvcol.py: get specified columns of CSV files.
- csvsql.py: convert a CSV file to an SQL definition.
- dbsort.tcl: sort SQLite tables in place.
- detokenizer.py: detokenize Chinese text.
- dump2db.py: make a database from leaked password dumps.
- epubzhconv.py: Chinese variant conversion for EPUB books.
- filtermd5.py: remove MD5 hashes not in a known list.
- findbadlines.py: find encoding errors in stdin.
- gbk_pua.py: convert PUA codes in GBK to Unicode.
- getautodesk.py: get Moses format parallel text from the Autodesk corpus.
- gettxtcollection.py: merge a txt file collection into one large corpus.
- haodoo: crawl and download all books from haodoo.net.
- iconv.py: implements iconv.
- iso639.json, iso639-to-calibre.py: get ISO 639 codes from Wikipedia and convert them to a calibre po file.
- jiebazhc: tokenize Classical Chinese using jieba.
- libpinyin_bopomofo.py: decorator to use with python-pinyin, to convert Pinyin to Bopomofo (now useless).
- ngramfreq.awk: calculate n-gram character frequency.
- num2chinese.py: convert numbers to Chinese numerals.
- phrasecombine.py: combine split words into larger phrases given a dictionary.
- pwdsort.js, zxcvbn.js: print out password strength according to zxcvbn.
- pgexplaindot.py: output a GraphViz dot file for EXPLAIN (FORMAT JSON).
- pgviewdep.tcl: output a GraphViz dot file representing view dependencies in a PostgreSQL database.
- rmdup.c: remove duplicate lines without sorting (compile with make, needs libxxhash-dev).
- simpdump.py: try to find username, email, password and hash from leaked password dumps.
- splitrecutfilter.py: reads stdin, filters non-Chinese sentences, and cuts sentences and words.
- tatoeba: convert Tatoeba dumps to a SQLite3 database.
- wordfreq.awk: calculate word frequency.
- WWStarClone.py: clone of WWStar, an ancient Classical Chinese translator.
- zhutil.py: misc. utils for processing Chinese.
- modelzh.json: model to detect Classical/Modern Chinese.
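
As an illustration of the idea behind rmdup.c (removing duplicate lines without sorting), here is a minimal Python sketch. It is not the code in this repository: the function name `unique_lines` is hypothetical, and `hashlib.blake2b` stands in for xxhash purely as an assumption so that the example needs only the standard library. It keeps a set of short digests of lines already seen and emits each line the first time its digest appears, so the original order is preserved.

```python
import hashlib
from typing import Iterable, Iterator


def unique_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct line once, in first-seen order, without sorting."""
    seen = set()  # 8-byte digests of the lines emitted so far
    for line in lines:
        # Assumption: blake2b replaces xxhash here; any fast hash would do.
        digest = hashlib.blake2b(line.encode("utf-8"), digest_size=8).digest()
        if digest not in seen:
            seen.add(digest)
            yield line
```

To use it as a filter, something like `sys.stdout.writelines(unique_lines(sys.stdin))` should work. Storing digests instead of whole lines keeps memory use small for large inputs, at the cost of a very small chance that a hash collision drops a distinct line.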
Unless otherwise noted in a file, all files are licensed under the MIT License.