ingest domains, add web api, document configuration, add tests
stitch committed Oct 29, 2024
1 parent 23b4e09 commit b2f5eb7
Showing 32 changed files with 1,004 additions and 3 deletions.
6 changes: 6 additions & 0 deletions .gitignore
@@ -0,0 +1,6 @@
*.sqlite3*
.idea
.venv
*.pyc
.pytest_cache
.coverage
2 changes: 1 addition & 1 deletion LICENSE
@@ -186,7 +186,7 @@
same "printed page" as the copyright notice for easier
identification within third-party archives.

Copyright [yyyy] [name of copyright owner]
Copyright 2024 Internet.nl

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
108 changes: 106 additions & 2 deletions README.md
@@ -1,2 +1,106 @@
# Internet.nl-ct-log-subdomain-suggestions-api
Internet.nl ct-log subdomain suggestions api
# Internet.nl Certificate Transparency Log Subdomain Suggestions

## What does this do / Intended use case
The goal is to replace subdomain suggestions from crt.sh with a service that has higher uptime and faster response
times. This way it can be used in other applications, such as the internet.nl dashboard, to suggest possible
subdomains to end users.

## How does it work
This tool ingests subdomains from public certificate transparency logs over a connection to a certstream server. A
web interface allows for querying the stored data, which results in a list of known subdomains.
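
As a rough illustration of the ingestion step, the sketch below pulls hostnames out of a single certstream message. It
assumes the message layout documented by the certstream project (`message_type` and `data.leaf_cert.all_domains`). The
function name `extract_domains` is hypothetical and is not taken from this repository.

```python
import json


def extract_domains(raw_message: str) -> list[str]:
    """Return the hostnames listed in one certstream message (assumed layout)."""
    message = json.loads(raw_message)
    # Heartbeats and other message types carry no certificate data.
    if message.get("message_type") != "certificate_update":
        return []
    leaf_cert = message.get("data", {}).get("leaf_cert", {})
    domains = []
    for name in leaf_cert.get("all_domains", []):
        name = name.lower()
        # Strip a leading wildcard label such as "*.example.nl".
        if name.startswith("*."):
            name = name[2:]
        domains.append(name)
    return domains


example = json.dumps(
    {
        "message_type": "certificate_update",
        "data": {"leaf_cert": {"all_domains": ["Example.nl", "*.example.nl", "www.example.nl"]}},
    }
)
print(extract_domains(example))
```

Duplicates are left in on purpose in this sketch; deduplication is a separate concern.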

Several key optimizations reduce the number of subdomains stored in the database. The most important one is the list
of allowed TLDs: by default only domains relevant to the Kingdom of the Netherlands are stored.
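
A minimal sketch of such a TLD filter is shown here. It is an assumption about how the filter could work, not the
code from this repository: it compares only the last label of a hostname, and it treats an empty list as "no filter",
which is inferred from the warning text in `settings.py`. The name `is_accepted` is hypothetical.

```python
ACCEPTED_TLDS = ["nl", "aw", "cw", "sr", "sx", "bq", "frl", "amsterdam", "politie"]


def is_accepted(hostname: str, accepted_tlds: list[str] = ACCEPTED_TLDS) -> bool:
    """Keep a hostname only if its last label is an accepted TLD."""
    # Assumption: an empty list means "no filter", so everything is accepted.
    if not accepted_tlds:
        return True
    tld = hostname.lower().rstrip(".").rsplit(".", 1)[-1]
    return tld in accepted_tlds


hostnames = ["www.example.nl", "mail.example.com", "intranet.example.amsterdam"]
print([h for h in hostnames if is_accepted(h)])
```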

## What are the limits of this tool
The limits have not yet been discovered and no optimizations have been performed yet, aside from a few proactive
database indexes. The tool is expected to be able to store about a year's worth of data from the .nl zone. This means
about 5 million domains with an estimated 50 million subdomains, each of which will have a new certificate every 90
days. In total that is about 200 million records per year. This is comparable for most EU countries. There is no
expectation that this tool will work quickly on the combined com/net/org zones, although some partitioning and smarter
inserting might just do the trick. For the Netherlands the observed number of certificate renewals for subdomains
seems to be much lower, between 0.5 and 2 per second.
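
The estimate above can be reproduced with a few lines of arithmetic. The inputs are the figures quoted in the text;
the conversion to an average rate per second is added here for comparison only.

```python
SUBDOMAINS = 50_000_000  # estimate quoted above
RENEWAL_INTERVAL_DAYS = 90  # certificate lifetime quoted above
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

renewals_per_year = 365 / RENEWAL_INTERVAL_DAYS
records_per_year = SUBDOMAINS * renewals_per_year
average_inserts_per_second = records_per_year / SECONDS_PER_YEAR

print(f"{renewals_per_year:.2f} renewals per subdomain per year")
print(f"{records_per_year / 1e6:.0f} million records per year")
print(f"{average_inserts_per_second:.1f} records per second on average")
```

The observed rate of 0.5 to 2 per second is lower than the average this calculation gives, so the 200 million figure
is probably best read as an upper bound.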

The goal is to be able to run this on medium sized virtual machines with just a few cores and a few gigabytes of
RAM. That should be enough for the Netherlands and most EU countries. We have not tested whether this solution is
'web scale'.

## How to ingest data from certstream
Configure `CTLSSA_CERTSTREAM_SERVER_URL` to point to a certstream-server instance. The default points to a certstream
server hosted by the creator of certstream, calidog. This is great for testing and development, but don't use it for
production purposes.

Read more about setting up a certstream server here: https://github.com/CaliDog/certstream-server

After configuration, run the following commands:
```python manage.py migrate```
```python manage.py ingest```

This command should run forever. In case your certstream server is down it will patiently wait until the server is up.
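
The "wait until the server is up" behaviour can be pictured as a simple reconnect loop. This is a sketch under
assumptions, not the implementation of the `ingest` command: `connect` and `handle` are hypothetical callables
supplied by the caller, and `max_attempts` exists only so the loop can be stopped in tests.

```python
import time
from typing import Callable, Iterable, Optional


def ingest_forever(
    connect: Callable[[], Iterable[str]],
    handle: Callable[[str], None],
    retry_delay: float = 5.0,
    max_attempts: Optional[int] = None,
) -> None:
    """Read messages until stopped, reconnecting whenever the server is unavailable."""
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        try:
            # connect() is expected to return an iterable of raw messages.
            for message in connect():
                handle(message)
        except OSError as error:  # ConnectionError is a subclass of OSError
            print(f"certstream unavailable ({error}), retrying in {retry_delay}s")
        # Wait before reconnecting, whether the stream failed or simply ended.
        time.sleep(retry_delay)
```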

## How to query the results
The webserver can be started with the command:
```python manage.py runserver```

When you visit the web interface at http://localhost:8000/ you will see an empty JSON response. Use the following
parameters to retrieve data: `http://localhost:8000/?domain=example&suffix=nl&period=365`
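
For use from another application, a small client could look like the sketch below. It only relies on the three query
parameters shown above. The helper names are hypothetical, and the shape of the JSON response is not documented here,
so the result is returned as decoded without further interpretation.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url: str, domain: str, suffix: str, period: int = 365) -> str:
    """Build the query URL from the three parameters shown above."""
    query = urlencode({"domain": domain, "suffix": suffix, "period": period})
    return f"{base_url}?{query}"


def fetch_suggestions(base_url: str, domain: str, suffix: str, period: int = 365):
    """Fetch and decode the JSON response from a running server."""
    with urlopen(build_query_url(base_url, domain, suffix, period), timeout=10) as response:
        return json.load(response)


print(build_query_url("http://localhost:8000/", "example", "nl"))
```

Calling `fetch_suggestions("http://localhost:8000/", "example", "nl")` requires the development server to be running.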


## Further configuration options
Configuration is done via environment variables with fallback defaults, but values can also be hardcoded in the
settings.py file if need be. Environment variables of the app are prefixed with `CTLSSA_`, so they stand out in your
`env`.

CTLSSA_ACCEPTED_TLDS: Comma separated string with the zones you want subdomains from.
The default is set to "nl,aw,cw,sr,sx,bq,frl,amsterdam,politie". Mileage will vary with the .com, .net and .org zones;
we expect ingestion not to be fast enough for them.

CTLSSA_DEQUE_LENGTH: Set this to roughly the number of domains you ingest in a few hours to a day, but in a way that
it doesn't hit the database limit. This value is used to deduplicate certificate renewal requests. It is very common
to see certificate renewals containing the same domain for every subdomain. It is also very common to see the same
request repeated because an administrator made a configuration mistake and had to redo the process.
The default is 100,000 domains.
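
The deduplication idea can be sketched with a bounded `collections.deque`. This is an illustration under assumptions
and not the code from this repository: the class name `RecentDomains` is hypothetical, and the membership test is a
linear scan, which is simple but not the fastest option.

```python
from collections import deque


class RecentDomains:
    """Remember the last `max_length` hostnames to skip repeated renewals."""

    def __init__(self, max_length: int = 100_000) -> None:
        # The deque drops its oldest entry automatically once it is full.
        self._recent = deque(maxlen=max_length)

    def is_new(self, hostname: str) -> bool:
        """Return True the first time a hostname is seen within the window."""
        if hostname in self._recent:  # linear scan; fine for a sketch
            return False
        self._recent.append(hostname)
        return True


recent = RecentDomains(max_length=2)
print([recent.is_new(h) for h in ["a.nl", "a.nl", "b.nl", "c.nl", "a.nl"]])
```

With a window of two, the last `a.nl` counts as new again because it has already been pushed out by `b.nl` and `c.nl`.
A larger window catches more repeats at the cost of memory and lookup time.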

There are various database settings, so any Django-supported database can be used. We recommend PostgreSQL as it has
more options regarding optimization than MySQL. Either should be fine. SQLite might also work, as there is only one
process that writes to the database.

Database settings:

- CTLSSA_DB_ENGINE
- CTLSSA_DB_NAME
- CTLSSA_DB_USER
- CTLSSA_DB_PASSWORD
- CTLSSA_DB_HOST
- CTLSSA_DJANGO_DATABASE
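
How these variables interact can be summarized in a small function. It mirrors the profile selection in
`app/settings.py` from this commit (`dev`, `test` and `production`), simplified for illustration. The function name
`resolve_database` is hypothetical.

```python
def resolve_database(env: dict[str, str]) -> dict[str, str]:
    """Return the database settings selected by the CTLSSA_* variables in `env`."""
    profile = env.get("CTLSSA_DJANGO_DATABASE", "dev")
    if profile in ("dev", "test"):
        # Both local profiles use SQLite; only the file name is configurable.
        return {
            "ENGINE": "django.db.backends.sqlite3",
            "NAME": env.get("CTLSSA_DB_NAME", "db.sqlite3"),
        }
    if profile != "production":
        raise KeyError(profile)
    engine = env.get("CTLSSA_DB_ENGINE", "postgresql")
    return {
        "ENGINE": f"django.db.backends.{engine}",
        "NAME": env.get("CTLSSA_DB_NAME", "ctlssa"),
        "USER": env.get("CTLSSA_DB_USER", "ctlssa"),
        "PASSWORD": env.get("CTLSSA_DB_PASSWORD", "ctlssa"),
        "HOST": env.get("CTLSSA_DB_HOST", "postgresql"),
    }


print(resolve_database({"CTLSSA_DJANGO_DATABASE": "production", "CTLSSA_DB_HOST": "db.internal"}))
```

Note that the user, password, host and engine variables only take effect when `CTLSSA_DJANGO_DATABASE` is set to
`production`; the default profile is `dev`.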


## Expectations in database size and performance

This package assumes that the database can insert records faster than new domains are discovered. This will not hold
true for every zone, especially when combining .com, .net and .org.

Once this assumption no longer holds, optimizations are needed. There are several options that might help: bulk
inserts, parallel inserts from multiple processes, database partitioning, index ordering, reducing the number of
indexes by merging domain+suffix, and so on. Other solutions might work as well. None of these have been tried yet,
but you might need them. If you do, please get in touch with the repository owner so this project can be optimized
for everyone.
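
To give an idea of the first option, the sketch below batches rows and writes each batch in a single transaction. It
uses the standard library `sqlite3` module so it runs on its own; the table and column names are hypothetical, and
this approach has not been tried in this project. In Django the closest equivalent would be `bulk_create` with a
`batch_size`.

```python
import sqlite3


def insert_in_batches(connection, rows, batch_size=1000):
    """Insert (subdomain, domain, suffix) rows using one executemany call per batch."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            _flush(connection, batch)
            batch = []
    if batch:
        _flush(connection, batch)


def _flush(connection, batch):
    # One transaction per batch instead of one per row.
    with connection:
        connection.executemany(
            "INSERT INTO suggestion (subdomain, domain, suffix) VALUES (?, ?, ?)", batch
        )


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE suggestion (subdomain TEXT, domain TEXT, suffix TEXT)")
insert_in_batches(connection, [(f"host{i}", "example", "nl") for i in range(2500)])
print(connection.execute("SELECT COUNT(*) FROM suggestion").fetchone()[0])
```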


## Development
This project does not have a managed virtual environment yet. This might be added in the future if need be.

### Linting
Run these commands before checking in. These should all pass without error.
```
isort .
black .
pytest
```

### Dependency management
Run these commands to compile the pinned requirements files:
```
pip-compile requirements.in --output-file=requirements.txt
pip-compile requirements-dev.in --output-file=requirements-dev.txt
```
Empty file added app/__init__.py
Empty file.
16 changes: 16 additions & 0 deletions app/asgi.py
@@ -0,0 +1,16 @@
"""
ASGI config for api project.
It exposes the ASGI callable as a module-level variable named ``application``.
For more information on this file, see
https://docs.djangoproject.com/en/4.2/howto/deployment/asgi/
"""

import os

from django.core.asgi import get_asgi_application

# The settings module lives in the "app" package (see ROOT_URLCONF = "app.urls").
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "app.settings")

application = get_asgi_application()
211 changes: 211 additions & 0 deletions app/settings.py
@@ -0,0 +1,211 @@
"""
Django settings for app project.
Generated by 'django-admin startproject' using Django 4.2.2.
For more information on this file, see
https://docs.djangoproject.com/en/4.2/topics/settings/
For the full list of settings and their values, see
https://docs.djangoproject.com/en/4.2/ref/settings/
"""
import os
import time
from pathlib import Path

# Build paths inside the project like this: BASE_DIR / 'subdir'.
BASE_DIR = Path(__file__).resolve().parent.parent


# Quick-start development settings - unsuitable for production
# See https://docs.djangoproject.com/en/4.2/howto/deployment/checklist/

# SECURITY WARNING: keep the secret key used in production secret!
SECRET_KEY = "django-insecure-6b!vz+!y)9b%8mm)=$a4wc-vh!--7l%-925o7l19asa0r$2h2a"

# SECURITY WARNING: don't run with debug turned on in production!
DEBUG = True

ALLOWED_HOSTS = []


# Application definition

INSTALLED_APPS = [
    "suggestions",
    "django.contrib.admin",
    "django.contrib.auth",
    "django.contrib.contenttypes",
    "django.contrib.sessions",
    "django.contrib.messages",
    "django.contrib.staticfiles",
]

MIDDLEWARE = [
    "django.middleware.security.SecurityMiddleware",
    "django.contrib.sessions.middleware.SessionMiddleware",
    "django.middleware.common.CommonMiddleware",
    "django.middleware.csrf.CsrfViewMiddleware",
    "django.contrib.auth.middleware.AuthenticationMiddleware",
    "django.contrib.messages.middleware.MessageMiddleware",
    "django.middleware.clickjacking.XFrameOptionsMiddleware",
]

ROOT_URLCONF = "app.urls"

TEMPLATES = [
    {
        "BACKEND": "django.template.backends.django.DjangoTemplates",
        "DIRS": [],
        "APP_DIRS": True,
        "OPTIONS": {
            "context_processors": [
                "django.template.context_processors.debug",
                "django.template.context_processors.request",
                "django.contrib.auth.context_processors.auth",
                "django.contrib.messages.context_processors.messages",
            ],
        },
    },
]

WSGI_APPLICATION = "app.wsgi.application"


# Database
# https://docs.djangoproject.com/en/4.2/ref/settings/#databases
DATABASE_OPTIONS = {}

DB_ENGINE = os.environ.get("CTLSSA_DB_ENGINE", "postgresql")
DATABASE_ENGINES = {"postgresql": "django.db.backends.postgresql"}

DATABASES_SETTINGS = {
    # persist local database used during development
    "dev": {
        "ENGINE": "django.db.backends.sqlite3",
        "NAME": os.environ.get("CTLSSA_DB_NAME", "db.sqlite3"),
    },
    # sqlite memory database for running tests without storing them permanently
    "test": {
        "ENGINE": "django.db.backends.sqlite3",
        "NAME": os.environ.get("CTLSSA_DB_NAME", "db.sqlite3"),
    },
    # for production get database settings from environment (eg: docker)
    "production": {
        "ENGINE": DATABASE_ENGINES.get(DB_ENGINE, f"django.db.backends.{DB_ENGINE}"),
        "NAME": os.environ.get("CTLSSA_DB_NAME", "ctlssa"),
        "USER": os.environ.get("CTLSSA_DB_USER", "ctlssa"),
        "PASSWORD": os.environ.get("CTLSSA_DB_PASSWORD", "ctlssa"),
        "HOST": os.environ.get("CTLSSA_DB_HOST", "postgresql"),
        "OPTIONS": DATABASE_OPTIONS.get(os.environ.get("CTLSSA_DB_ENGINE", "postgresql"), {}),
    },
}
# allow database to be selected through environment variables
DATABASE = os.environ.get("CTLSSA_DJANGO_DATABASE", "dev")
DATABASES = {"default": DATABASES_SETTINGS[DATABASE]}


# Password validation
# https://docs.djangoproject.com/en/4.2/ref/settings/#auth-password-validators

AUTH_PASSWORD_VALIDATORS = [
    {
        "NAME": "django.contrib.auth.password_validation.UserAttributeSimilarityValidator",
    },
    {
        "NAME": "django.contrib.auth.password_validation.MinimumLengthValidator",
    },
    {
        "NAME": "django.contrib.auth.password_validation.CommonPasswordValidator",
    },
    {
        "NAME": "django.contrib.auth.password_validation.NumericPasswordValidator",
    },
]


# Internationalization
# https://docs.djangoproject.com/en/4.2/topics/i18n/

LANGUAGE_CODE = "en-us"

TIME_ZONE = "UTC"

USE_I18N = True

USE_TZ = True


# Static files (CSS, JavaScript, Images)
# https://docs.djangoproject.com/en/4.2/howto/static-files/

STATIC_URL = "static/"

# Default primary key field type
# https://docs.djangoproject.com/en/4.2/ref/settings/#default-auto-field

DEFAULT_AUTO_FIELD = "django.db.models.BigAutoField"

# .an has been dissolved, but this page lists the other options: https://en.wikipedia.org/wiki/.an
# .nl is managed by the SIDN and is the domain of the Netherlands.
# .aw, .cw, .sx and .bq belong to the countries and special municipalities within the Kingdom of the Netherlands.
# .sr belongs to Suriname, a former part of the Kingdom.
# .frl is for Friesland, a province with its own recognized language.
# .amsterdam is provided by the capital city of the Netherlands.
# .politie is used by the Dutch police.
ACCEPTED_TLDS = os.environ.get("CTLSSA_ACCEPTED_TLDS", "nl,aw,cw,sr,sx,bq,frl,amsterdam,politie")
# "".split(",") yields [""], so drop empty entries to let an empty setting mean "no filter".
ACCEPTED_TLDS = [tld.strip() for tld in ACCEPTED_TLDS.split(",") if tld.strip()]

if not ACCEPTED_TLDS:
    print(
        "Warning: no filter set on ACCEPTED_TLDS, will try to import all subdomains of everything to the database. "
        "This tool has not been developed for this use case and might not perform well with this amount of data. "
    )
    print("This script will continue in 10 seconds. We're excited how far this solution scaled for you. For science!")
    time.sleep(10)

# Environment variables are strings, so convert to int before using this as a deque length.
DEQUE_LENGTH = int(os.environ.get("CTLSSA_DEQUE_LENGTH", "100000"))

CERTSTREAM_SERVER_URL = os.environ.get("CTLSSA_CERTSTREAM_SERVER_URL", "wss://certstream.calidog.io/")


LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",  # sys.stdout
            "formatter": "color",
        },
    },
    "formatters": {
        "debug": {
            "format": "%(asctime)s\t%(levelname)-8s - %(filename)-20s:%(lineno)-4s - " "%(funcName)20s() - %(message)s",
        },
        "color": {
            "()": "colorlog.ColoredFormatter",
            # to get the name of the logger a message came from, add %(name)s.
            "format": "%(log_color)s%(asctime)s\t%(levelname)-8s - " "%(message)s",
            "datefmt": "%Y-%m-%d %H:%M:%S",
            "log_colors": {
                "DEBUG": "green",
                "INFO": "white",
                "WARNING": "yellow",
                "ERROR": "red",
                "CRITICAL": "bold_red",
            },
        },
    },
    "loggers": {
        "django": {
            "handlers": ["console"],
            "level": os.getenv("CTLSSA_DJANGO_LOG_LEVEL", "INFO"),
        },
        "app": {
            "handlers": ["console"],
            "level": os.getenv("CTLSSA_APP_LOG_LEVEL", "DEBUG"),
        },
        "suggestions": {
            "handlers": ["console"],
            "level": os.getenv("CTLSSA_SUGGESTIONS_LOG_LEVEL", "DEBUG"),
        },
    },
}
24 changes: 24 additions & 0 deletions app/urls.py
@@ -0,0 +1,24 @@
"""
URL configuration for api project.
The `urlpatterns` list routes URLs to views. For more information please see:
https://docs.djangoproject.com/en/4.2/topics/http/urls/
Examples:
Function views
1. Add an import: from my_app import views
2. Add a URL to urlpatterns: path('', views.home, name='home')
Class-based views
1. Add an import: from other_app.views import Home
2. Add a URL to urlpatterns: path('', Home.as_view(), name='home')
Including another URLconf
1. Import the include() function: from django.urls import include, path
2. Add a URL to urlpatterns: path('blog/', include('blog.urls'))
"""
# from django.contrib import admin
from django.urls import include, path

urlpatterns = [
    path("", include("suggestions.urls")),
    # No administration tools have been developed for now.
    # path("admin/", admin.site.urls),
]
16 changes: 16 additions & 0 deletions app/wsgi.py
@@ -0,0 +1,16 @@
"""
WSGI config for api project.
It exposes the WSGI callable as a module-level variable named ``application``.
For more information on this file, see
https://docs.djangoproject.com/en/4.2/howto/deployment/wsgi/
"""

import os

from django.core.wsgi import get_wsgi_application

# The settings module lives in the "app" package (see WSGI_APPLICATION = "app.wsgi.application").
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "app.settings")

application = get_wsgi_application()