Geedge Cases: Censorship Measurement Insights from the Geedge Networks Leak
Jade Sheffey, Ali Zohaib, Mingshi Wu, Amir Houmansadr
https://www.petsymposium.org/foci/2026/foci-2026-0006.php
This paper extracts domain names from the files of the 2025 Geedge Networks leak and analyzes the set of domain names in comparison with domain lists that are commonly used in censorship measurement.
The Geedge leak is over 500 GB and consists of many kinds of data, including RPM packages, Git repositories, pcaps, and documentation and data files. The authors recursively unpacked archive files (including the commit history of Git repositories) and ran OCR on image files. Then they did a big regular expression search over everything, using the regular expression `([a-zA-Z0-9-_]+\.)+(TLDs)\.?`, where `TLDs` stands for all the suffixes in the public suffix list. The regular expression is deliberately broad: it catches domain names as well as strings that are not domain names, like `libstdcpp.so`. To remove the non-domains, they did DNS queries for all the regular expression matches, which reduced 14 million matches to 7 million known resolvable domains.
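The extract-then-resolve step can be sketched roughly as follows. This is a minimal illustration, not the authors' code: `SUFFIXES` is a tiny hypothetical stand-in for the public suffix list, the function names are invented here, and `socket.getaddrinfo` (the system resolver) is only an assumed substitute for whatever DNS tooling the paper actually used.

```python
import re
import socket

# Hypothetical stand-in for the public suffix list, which has thousands of entries.
SUFFIXES = ["com", "org", "so", "co.uk"]

def build_domain_regex(suffixes):
    """Compile a broad pattern: one or more dot-terminated labels, then a known suffix."""
    # Try longer suffixes first so "co.uk" is attempted before shorter alternatives.
    alternation = "|".join(
        re.escape(s) for s in sorted(suffixes, key=len, reverse=True)
    )
    return re.compile(r"(?:[a-zA-Z0-9_-]+\.)+(?:" + alternation + r")\.?")

def extract_candidates(text, pattern):
    """Return every match, lowercased and without a trailing dot."""
    return {m.group(0).rstrip(".").lower() for m in pattern.finditer(text)}

def resolves(name):
    """Assumed resolvability check: True if the system resolver returns any address."""
    try:
        socket.getaddrinfo(name, None)
        return True
    except (OSError, UnicodeError):
        return False

def filter_resolvable(candidates, resolver=resolves):
    """Keep only the candidates that the given resolver accepts."""
    return {c for c in candidates if resolver(c)}
```

Called on a string such as `"links libstdcpp.so and mail.example.co.uk."`, `extract_candidates` returns both `libstdcpp.so` and `mail.example.co.uk`, which shows why a second filtering pass is needed. The resolver is passed in as a parameter so the filter can be exercised without network access.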
They compared this list of 7 million domains to the Tranco list (which is based on popularity) and the Citizen Lab test lists (which are curated for censorship measurement). They tested for censorship of the domain names (SNI filtering and DNS response injection) from vantage points in Myanmar, Pakistan, Algeria, and Guangzhou and Nanjing in China. There are a lot of censored domains in the Geedge-derived list that are in neither the Tranco nor the Citizen Lab lists—up to 212 k in China and 99 k in Pakistan—though the great majority of the 7 million domains were not censored at any of the vantage points.
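The "in neither list" counts amount to a set difference. A minimal sketch, assuming each list has already been reduced to bare hostnames (the Citizen Lab test lists are lists of URLs, so that reduction is a separate step not shown here); the function name and the example domains are hypothetical:

```python
def novel_censored(geedge_censored, tranco, citizen_lab):
    """Censored Geedge-derived domains that appear in neither comparison list."""
    # Normalize case so that "Example.com" and "example.com" compare equal.
    known = {d.lower() for d in tranco} | {d.lower() for d in citizen_lab}
    return {d.lower() for d in geedge_censored} - known

# Hypothetical inputs for illustration only.
censored = {"Blocked.example", "popular.example"}
novel = novel_censored(censored, {"popular.example"}, {"curated.example"})
```

Here `novel` contains only `blocked.example`, the one censored name that neither comparison list would have led a measurement study to test.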
The leak does not appear to contain any raw blocklists that might actually be deployed in any country where Geedge operates. The authors of the paper hypothesize that such blocklists must be stored in some other place that was not included in the leak. But some of the files in the leak are more dense in censored names than others. Table 3 (below) shows some of the files that contain the greatest number of censored domains. The E21-SNI-Top200w.txt and E21-SNI-Top120W-20221020.txt files (previously discussed at #519 (comment)) appear to come from network taps (recall that "E21" is code for Ethiopia). 48048462_attachments_白名单网站.txt (白名单网站 means "whitelisted websites") is an attachment to geedge_docs/TSGEN/2021-10-24.html, which is about a Geedge deployment in Quanzhou, China.
| Location | Count | Path | Description |
|---|---|---|---|
| Common | 57362 | mesalab_git/galaxy/platform/galaxy-qgw-service/benchmark/entity_dataset/E21-SNI-Top200w.txt | E21=Ethiopia |
| Common | 36467 | mesalab_git/galaxy/platform/galaxy-qgw-service/benchmark/entity_dataset/E21-SNI-Top120W-20221020.txt | E21=Ethiopia |
| Common | 24219 | mesalab_git/tsg/tsg-deploy/tsg-web/TSG_v2.01_Pro_191030v1版本_界面.zip/docker/categoryinit/clf/porn.csv | Adult websites |
| Common | 13604 | mesalab_git/galaxy/platform/galaxy-qgw-service/benchmark/entity_dataset/XJ-CUCC-SNI-Top200w.txt | XJ=Xinjiang? |
| Common | 10163 | mesalab_git/tango/maat/test/tsgrule/TSG_OBJ_FQDN.E21 | E21=Ethiopia |
| China | 7016 | mesalab_git/intelligence-learning-engine/vpn-finder-plugins/* | VPN host discovery |
| China | 4810 | geedge_docs/OM/attachments/143922253_attachments_Nord VPN server List.txt | NordVPN servers |
| China | 475 | geedge_docs/TSGEN/attachments/48056407_attachments_白名单域名20211025.txt | Quanzhou block/allowlists |
| Myanmar | 27 | geedge_docs/TSGEN/M22-VPN List.html | M22=Myanmar |
| Pakistan | 68 | geedge_docs/TSGEN/attachments/129093191_attachments_Psiphon-Apps.zip/Psiphon-CDN_20240430.json | Psiphon domains |
| Algeria | 11 | mesalab_docs/shu/attachments/27700503_attachments_starttls.zip/starttls/ssl_test/xml/mail.alakhbar.press.ma | Moroccan mail servers |
Thanks to the authors for reviewing a draft of this summary.