diff --git a/README.md b/README.md
index 377bd88..e6fd026 100644
--- a/README.md
+++ b/README.md
@@ -47,15 +47,15 @@ Fonts are merged/combined as per the regions defined in the
 [Unicode Standard (pdf)](https://www.unicode.org/versions/Unicode14.0.0/UnicodeStandard-14.0.pdf).
 Chapter numbers refer to that spec.
 
-| Regional font | Coverage |
-|----------------------------|-------------------------------------------------------------------------------------------|
-| GoNotoEuropeAmericas.ttf | "Europe" - ch. 7, 8 and "Americas" - ch 20 |
-| GoNotoAfricaMiddleEast.ttf | "Middle East" - ch. 9, 10, 11 and "Africa" - ch. 19 |
-| GoNotoSouthAsia.ttf | "South and Central Asia" - ch. 12 and 13 |
-| GoNotoAsiaHistorical.ttf | "South and Central Asia" - ch. 14 and 15 |
-| GoNotoSouthEastAsia.ttf | "Southeast Asia" - ch. 16 and "Indonesia and Ocenia" - ch 17 |
-| GoNotoEastAsia.ttf | "East Asia" - ch 18. everything other than Han (CJK) |
-| GoNotoCJKCore2003.ttf | [Unicode IICore][1] subset of CJK (~10K ideographs). See [Noto CJK][2] for full coverage |
+| Regional font | Coverage |
+|----------------------------|-----------------------------------------------------------------------------------------|
+| GoNotoEuropeAmericas.ttf | "Europe" - ch. 7, 8 and "Americas" - ch 20 |
+| GoNotoAfricaMiddleEast.ttf | "Middle East" - ch. 9, 10, 11 and "Africa" - ch. 19 |
+| GoNotoSouthAsia.ttf | "South and Central Asia" - ch. 12 and 13 |
+| GoNotoAsiaHistorical.ttf | "South and Central Asia" - ch. 14 and 15 |
+| GoNotoSouthEastAsia.ttf | "Southeast Asia" - ch. 16 and "Indonesia and Ocenia" - ch 17 |
+| GoNotoCJKCore2005.ttf | [Unihan IICore][1] subset of CJK (~10K ideographs). Use [Noto CJK][2] for full coverage |
+| GoNotoEastAsia.ttf | "East Asia" - ch 18. everything other than Han (CJK) |
 
 Each of the above fonts includes LGC (Latin-Greek-Cyrillic) as default, same coverage as `Noto Sans Regular`.
 Each one also includes Noto Sans Math, Noto Sans Symbols and Noto Sans Symbols 2 to give
@@ -67,9 +67,9 @@ Following are included: Devanagari (Hindi, Marathi, Nepali, etc), Bengali, Punja
 Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thaana, Sinhala, Newa, Tibetan, Limbu, Meetei
 Mayek, Mro, Warang Citi, Ol Chiki, Chakma, Lepcha, Saurashtra, Masaram Gondi, Gunjala Gondi, Wancho.
 
-Urdu (NotoSansNastaliq), though not written in an Indic script, is included for practical
-reasons. Mongolian is currently not included because of issue with vmtx (vertical metrics). Noto
-fonts do not exist for Toto and Tangsa.
+Urdu (NotoNastaliq), though not written in an Indic script and not part of "South Asia" chapters in
+the Unicode spec, is included for practical reasons. Mongolian is currently not included because of
+issue with vmtx (vertical metrics). Noto fonts do not exist for Toto and Tangsa.
 
 ### Go Noto Asia Historical
 
@@ -111,13 +111,18 @@ Tibetan, Lisu, Marchen, Miao, Yi, etc. excluding Han/CJK (Chinese-Japanese-Korea
 
 Mongolian, Nushu and Tangut could not be included.
 
-### Go Noto CJK Core 2003
+### Go Noto CJK Core 2005
 
-[Unicode IICore][1] is a minimal subset of CJK specified in 2003 for memory-constrained systems. It
-standardizes about 9800 codepoints. The generated font has about 20000 glyphs.
+[Unihan IICore][1] is a minimal, region-agnostic subset of Han/CJK specified in 2005 for
+memory-constrained systems. It standardizes about 9800 codepoints, covering basic use cases of
+Chinese (Traditional, Simplified), Japanese and Korean. Recently [Unihan Core
+2020](https://unicode.org/charts/unihan.html) superseded and expanded the minimal subset to about
+20000 codepoints.
+
+The generated font does _not_ contain Noto Sans Math, Noto Sans Symbols, Noto Sans Symbols 2 because
+[fonttools does not support](https://fonttools.readthedocs.io/en/latest/merge.html) merging fonts
+with CFF outlines (which is the case for .otf).
 
-Recently [Unihan Core 2020](https://unicode.org/charts/unihan.html) upgrades the minimal subset to
-about 20000 codepoints.
 
 ## Font Statistics
 
@@ -129,8 +134,9 @@ about 20000 codepoints.
 | GoNotoAsiaHistorical.ttf | 114 | 10261 | 16767 |
 | GoNotoSouthEastAsia.ttf | 107 | 10168 | 14358 |
 | GoNotoEastAsia.ttf | 96 | 10522 | 15081 |
+| GoNotoCJKCore2005.ttf | 20 | 10338 | 20099 |
 
-Note that each of the above include statistics of:
+Note that each of the above (except CJKCore2005) include statistics of:
 
 | Upstream font | Code blocks | Codepoints | Glyphs |
 |---------------------|-------------|-----------------|-------------|
diff --git a/get_codepoints.py b/get_codepoints.py
new file mode 100644
index 0000000..cc3e2d0
--- /dev/null
+++ b/get_codepoints.py
@@ -0,0 +1,11 @@
+#!/usr/bin/env python3
+
+"""Dumps all the codepoints covered in a given font, one per line"""
+
+import sys
+from fontTools.ttLib import TTFont
+
+with TTFont(sys.argv[1]) as ttf:
+    for x in ttf["cmap"].tables:
+        for code in x.cmap:
+            print("U+%08X" % code)
diff --git a/run.sh b/run.sh
index 18917df..f4f0ec6 100755
--- a/run.sh
+++ b/run.sh
@@ -37,7 +37,7 @@ main() {
             echo "Not overwriting existing font $font."
             continue
         fi
-        echo "Generating font $font. Current time: $(date).\n"
+        echo "Generating font $font. Current time: $(date)."
         mkdir -p cached_fonts
         time PYTHONPATH="nototools/nototools" python3 generate.py -o "$font" -d cached_fonts
         edit_font_info "$font"
@@ -70,31 +70,39 @@ edit_font_info() {
     echo "Editing font metadata for $fontname..."
     "$VIRTUAL_ENV"/bin/ttx -o "$xml_file" "$fontname" 2> /dev/null
     [[ $? -ne 0 ]] && echo "ERROR: Could not dump $fontname to xml." && return 1
-    sed -e "s/Noto Sans/$with_spaces/g" -e "s/NotoSans/$without_spaces/g" "$xml_file" > "$xml_file_bak"
+    sed -e "s/Noto Sans CJK SC/$with_spaces/g" \
+        -e "s/NotoSansCJKsc/$without_spaces/g" \
+        -e "s/Noto Sans/$with_spaces/g" \
+        -e "s/NotoSans/$without_spaces/g" \
+        "$xml_file" > "$xml_file_bak"
     mv "$xml_file_bak" "$xml_file"
     "$VIRTUAL_ENV"/bin/ttx -o "$fontname" "$xml_file" 2> /dev/null
     [[ $? -ne 0 ]] && echo "ERROR: Could not dump xml to $fontname." && return 2
     rm -f "$xml_file"
 }
 
-# Unicode IICore 2003 is a small subsetof CJK (~10k codepoints).
-# Recently is has been superceded by UnihanCore2020.
+# Unihan IICore 2005 is a small subset of CJK (~10k codepoints).
+# Recently it has been superseded by UnihanCore2020, which is double in size.
 create_cjk_subset() {
-    local fontname=GoNotoCJKCore2003.otf
+    local input_font=NotoSansCJKsc-Regular.otf
+    local output_font=GoNotoCJKCore2005.otf
+    local subset_codepoints=unihan_iicore.txt
 
     cd cached_fonts/
     [[ ! -e Unihan.zip ]] && wget https://www.unicode.org/Public/UCD/latest/ucd/Unihan.zip
     python3 -m zipfile -e Unihan.zip .
-    grep kIICore Unihan_IRGSources.txt | cut -f1 > unicode_points.txt
-    if [[ ! -e NotoSansCJKsc-Regular.otf ]]; then
-        wget https://github.com/googlefonts/noto-cjk/raw/main/Sans/OTF/SimplifiedChinese/NotoSansCJKsc-Regular.otf
+    grep kIICore Unihan_IRGSources.txt | cut -f1 > "$subset_codepoints"
+    python3 ../get_codepoints.py NotoSans-Regular.ttf >> "$subset_codepoints"
+    if [[ ! -e "$input_font" ]]; then
+        wget https://github.com/googlefonts/noto-cjk/raw/main/Sans/OTF/SimplifiedChinese/"$input_font"
     fi
     cd "$OLDPWD"
-    echo "Generating font $fontname."
-    "$VIRTUAL_ENV"/bin/pyftsubset cached_fonts/NotoSansCJKsc-Regular.otf \
-        --unicodes-file=cached_fonts/unicode_points.txt \
-        --output-file="$fontname"
-    edit_font_info "$fontname"
+
+    echo "Generating font $output_font. Current time: $(date)."
+    "$VIRTUAL_ENV"/bin/pyftsubset cached_fonts/"$input_font" \
+        --unicodes-file=cached_fonts/"$subset_codepoints" \
+        --output-file="$output_font"
+    edit_font_info "$output_font"
 }
 
 # execution starts here