Update GoNotoCJKCore2005 to include coverage from NotoSans-Regular
Satish B committed Dec 13, 2021
1 parent a2f281b commit aa87233
Showing 3 changed files with 56 additions and 31 deletions.
42 changes: 24 additions & 18 deletions README.md
@@ -47,15 +47,15 @@ Fonts are merged/combined as per the regions defined in the [Unicode Standard
(pdf)](https://www.unicode.org/versions/Unicode14.0.0/UnicodeStandard-14.0.pdf). Chapter numbers
refer to that spec.

| Regional font | Coverage |
|----------------------------|-------------------------------------------------------------------------------------------|
| GoNotoEuropeAmericas.ttf | "Europe" - ch. 7, 8 and "Americas" - ch 20 |
| GoNotoAfricaMiddleEast.ttf | "Middle East" - ch. 9, 10, 11 and "Africa" - ch. 19 |
| GoNotoSouthAsia.ttf | "South and Central Asia" - ch. 12 and 13 |
| GoNotoAsiaHistorical.ttf | "South and Central Asia" - ch. 14 and 15 |
| GoNotoSouthEastAsia.ttf | "Southeast Asia" - ch. 16 and "Indonesia and Ocenia" - ch 17 |
| GoNotoEastAsia.ttf | "East Asia" - ch 18. everything other than Han (CJK) |
| GoNotoCJKCore2003.ttf | [Unicode IICore][1] subset of CJK (~10K ideographs). See [Noto CJK][2] for full coverage |
| Regional font | Coverage |
|----------------------------|-----------------------------------------------------------------------------------------|
| GoNotoEuropeAmericas.ttf | "Europe" - ch. 7, 8 and "Americas" - ch 20 |
| GoNotoAfricaMiddleEast.ttf | "Middle East" - ch. 9, 10, 11 and "Africa" - ch. 19 |
| GoNotoSouthAsia.ttf | "South and Central Asia" - ch. 12 and 13 |
| GoNotoAsiaHistorical.ttf | "South and Central Asia" - ch. 14 and 15 |
| GoNotoSouthEastAsia.ttf | "Southeast Asia" - ch. 16 and "Indonesia and Ocenia" - ch 17 |
| GoNotoCJKCore2005.ttf | [Unihan IICore][1] subset of CJK (~10K ideographs). Use [Noto CJK][2] for full coverage |
| GoNotoEastAsia.ttf | "East Asia" - ch 18. everything other than Han (CJK) |

Each of the above fonts includes LGC (Latin-Greek-Cyrillic) as default, same coverage as `Noto Sans
Regular`. Each one also includes Noto Sans Math, Noto Sans Symbols and Noto Sans Symbols 2 to give
@@ -67,9 +67,9 @@ Following are included: Devanagari (Hindi, Marathi, Nepali, etc), Bengali, Punja
Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thaana, Sinhala, Newa, Tibetan, Limbu, Meetei
Mayek, Mro, Warang Citi, Ol Chiki, Chakma, Lepcha, Saurashtra, Masaram Gondi, Gunjala Gondi, Wancho.

Urdu (NotoSansNastaliq), though not written in an Indic script, is included for practical
reasons. Mongolian is currently not included because of issue with vmtx (vertical metrics). Noto
fonts do not exist for Toto and Tangsa.
Urdu (NotoNastaliq), though not written in an Indic script and not part of "South Asia" chapters in
the Unicode spec, is included for practical reasons. Mongolian is currently not included because of
issue with vmtx (vertical metrics). Noto fonts do not exist for Toto and Tangsa.

### Go Noto Asia Historical

@@ -111,13 +111,18 @@ Tibetan, Lisu, Marchen, Miao, Yi, etc. excluding Han/CJK (Chinese-Japanese-Korea

Mongolian, Nushu and Tangut could not be included.

### Go Noto CJK Core 2003
### Go Noto CJK Core 2005

[Unicode IICore][1] is a minimal subset of CJK specified in 2003 for memory-constrained systems. It
standardizes about 9800 codepoints. The generated font has about 20000 glyphs.
[Unihan IICore][1] is a minimal, region-agnostic subset of Han/CJK specified in 2005 for
memory-constrained systems. It standardizes about 9800 codepoints, covering basic use cases of
Chinese (Traditional, Simplified), Japanese and Korean. Recently [Unihan Core
2020](https://unicode.org/charts/unihan.html) superseded and expanded the minimal subset to about
20000 codepoints.

The generated font does _not_ contain Noto Sans Math, Noto Sans Symbols, Noto Sans Symbols 2 because
[fonttools does not support](https://fonttools.readthedocs.io/en/latest/merge.html) merging fonts
with CFF outlines (which is the case for .otf).

Recently [Unihan Core 2020](https://unicode.org/charts/unihan.html) upgrades the minimal subset to
about 20000 codepoints.

## Font Statistics

@@ -129,8 +134,9 @@ about 20000 codepoints.
| GoNotoAsiaHistorical.ttf | 114 | 10261 | 16767 |
| GoNotoSouthEastAsia.ttf | 107 | 10168 | 14358 |
| GoNotoEastAsia.ttf | 96 | 10522 | 15081 |
| GoNotoCJKCore2005.ttf | 20 | 10338 | 20099 |

Note that each of the above include statistics of:
Note that each of the above (except CJKCore2005) include statistics of:

| Upstream font | Code blocks | Codepoints | Glyphs |
|---------------------|-------------|-----------------|-------------|
11 changes: 11 additions & 0 deletions get_codepoints.py
@@ -0,0 +1,11 @@
#!/usr/bin/env python3

"""Dumps all the codepoints covered in a given font, one per line"""

import sys
from fontTools.ttLib import TTFont

with TTFont(sys.argv[1]) as ttf:
    for x in ttf["cmap"].tables:
        for code in x.cmap:
            print("U+%08X" % code)
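A font often carries several `cmap` subtables that overlap, so the loop above can print the same codepoint more than once. A minimal sketch of a de-duplicating variant follows, assuming that is desirable for the subset list. `unique_codepoints` is a hypothetical helper, not part of this commit, and it takes plain dicts (shaped like each subtable's `cmap` mapping of codepoint to glyph name) so that it runs without a font file.

```python
def unique_codepoints(cmaps):
    """Return sorted, de-duplicated 'U+XXXXXXXX' strings.

    `cmaps` is any iterable of mappings from codepoint (int) to glyph name,
    for example `(t.cmap for t in ttf["cmap"].tables)` with fontTools.
    """
    seen = set()
    for cmap in cmaps:
        seen.update(cmap)  # iterating a mapping yields its keys (the codepoints)
    return ["U+%08X" % code for code in sorted(seen)]


if __name__ == "__main__":
    # Two made-up subtables that overlap on U+0041 (illustrative data only).
    sample = [{0x41: "A", 0x42: "B"}, {0x41: "A", 0x4E00: "uni4E00"}]
    for line in unique_codepoints(sample):
        print(line)
```

The output format matches what `get_codepoints.py` prints, so the result could be appended to the subset file in the same way.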
34 changes: 21 additions & 13 deletions run.sh
@@ -37,7 +37,7 @@ main() {
echo "Not overwriting existing font $font."
continue
fi
echo "Generating font $font. Current time: $(date).\n"
echo "Generating font $font. Current time: $(date)."
mkdir -p cached_fonts
time PYTHONPATH="nototools/nototools" python3 generate.py -o "$font" -d cached_fonts
edit_font_info "$font"
@@ -70,31 +70,39 @@ edit_font_info() {
echo "Editing font metadata for $fontname..."
"$VIRTUAL_ENV"/bin/ttx -o "$xml_file" "$fontname" 2> /dev/null
[[ $? -ne 0 ]] && echo "ERROR: Could not dump $fontname to xml." && return 1
sed -e "s/Noto Sans/$with_spaces/g" -e "s/NotoSans/$without_spaces/g" "$xml_file" > "$xml_file_bak"
sed -e "s/Noto Sans CJK SC/$with_spaces/g" \
-e "s/NotoSansCJKsc/$without_spaces/g" \
-e "s/Noto Sans/$with_spaces/g" \
-e "s/NotoSans/$without_spaces/g" \
"$xml_file" > "$xml_file_bak"
mv "$xml_file_bak" "$xml_file"
"$VIRTUAL_ENV"/bin/ttx -o "$fontname" "$xml_file" 2> /dev/null
[[ $? -ne 0 ]] && echo "ERROR: Could not dump xml to $fontname." && return 2
rm -f "$xml_file"
}

# Unicode IICore 2003 is a small subsetof CJK (~10k codepoints).
# Recently is has been superceded by UnihanCore2020.
# Unihan IICore 2005 is a small subset of CJK (~10k codepoints).
# Recently it has been superseded by UnihanCore2020, which is double in size.
create_cjk_subset() {
local fontname=GoNotoCJKCore2003.otf
local input_font=NotoSansCJKsc-Regular.otf
local output_font=GoNotoCJKCore2005.otf
local subset_codepoints=unihan_iicore.txt

cd cached_fonts/
[[ ! -e Unihan.zip ]] && wget https://www.unicode.org/Public/UCD/latest/ucd/Unihan.zip
python3 -m zipfile -e Unihan.zip .
grep kIICore Unihan_IRGSources.txt | cut -f1 > unicode_points.txt
if [[ ! -e NotoSansCJKsc-Regular.otf ]]; then
wget https://github.com/googlefonts/noto-cjk/raw/main/Sans/OTF/SimplifiedChinese/NotoSansCJKsc-Regular.otf
grep kIICore Unihan_IRGSources.txt | cut -f1 > "$subset_codepoints"
python3 ../get_codepoints.py NotoSans-Regular.ttf >> "$subset_codepoints"
if [[ ! -e "$input_font" ]]; then
wget https://github.com/googlefonts/noto-cjk/raw/main/Sans/OTF/SimplifiedChinese/"$input_font"
fi
cd "$OLDPWD"
echo "Generating font $fontname."
"$VIRTUAL_ENV"/bin/pyftsubset cached_fonts/NotoSansCJKsc-Regular.otf \
--unicodes-file=cached_fonts/unicode_points.txt \
--output-file="$fontname"
edit_font_info "$fontname"

echo "Generating font $output_font. Current time: $(date)."
"$VIRTUAL_ENV"/bin/pyftsubset cached_fonts/"$input_font" \
--unicodes-file=cached_fonts/"$subset_codepoints" \
--output-file="$output_font"
edit_font_info "$output_font"
}

# execution starts here
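
The list-building step in `create_cjk_subset` (`grep kIICore ... | cut -f1`, followed by appending the output of `get_codepoints.py`) can be sketched in Python. This is an illustration only, not part of the commit: `iicore_codepoints` and `merge_codepoints` are hypothetical names, and the sample lines are made up to mimic the tab-separated layout of `Unihan_IRGSources.txt` (codepoint, field name, value). Unlike `grep`, which matches the string anywhere on a line, the sketch compares the field column exactly and drops duplicates.

```python
def iicore_codepoints(lines):
    """Yield the first tab-separated field of every kIICore record."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1] == "kIICore":
            yield fields[0]


def merge_codepoints(*sources):
    """Combine 'U+XXXX' strings from several sources, unique and sorted by value."""
    values = {int(cp[2:], 16) for source in sources for cp in source}
    return ["U+%04X" % v for v in sorted(values)]


if __name__ == "__main__":
    # Made-up sample records, not real Unihan data.
    unihan_sample = [
        "# comment line\n",
        "U+4E00\tkIICore\tA\n",
        "U+4E00\tkIRG_GSource\tG0-523B\n",
        "U+4E01\tkIICore\tA\n",
    ]
    # Stand-in for the codepoints dumped from NotoSans-Regular.ttf.
    latin = ["U+00000041", "U+00004E00"]
    print("\n".join(merge_codepoints(iicore_codepoints(unihan_sample), latin)))
```

Parsing the hex value before merging means `U+00004E00` and `U+4E00` count as the same codepoint, which a plain text append does not guarantee.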