Generates dlt pipelines from OpenAPI 3.x documents.
This generator does not support OpenAPI 2.x (formerly known as Swagger). If you need to use an older document, try upgrading it to version 3 first with one of the many available converters.
🕳️ This is only a demo.
- will generate resources for all endpoints that return lists of objects
- will use a few heuristics to find lists wrapped in responses
- will generate transformers from all endpoints that have a matching list resource (same object type returned); see the sketch after this list
- will use a few heuristics to find the right object id to pass to the transformer
- user can select endpoints using the `questionary` lib in the CLI
- endpoints that have the most central data types (tables linking to many other tables) will be listed first
- the structure of the code is not optimized!
- there's no pagination added. Use our GPT-4 playground to do that.
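To make the resource/transformer pairing concrete, here is a minimal hand-written sketch of the shape the generator aims for. It is an illustration only: the real generated code goes through the generated Python client, and the exact function names and id handling come from the heuristics above.

```python
import dlt
import requests

@dlt.resource(write_disposition="replace")
def pokemon_list():
    # list endpoint: the objects sit wrapped in the "results" key of the response
    yield from requests.get("https://pokeapi.co/api/v2/pokemon/").json()["results"]

@dlt.transformer(data_from=pokemon_list)
def pokemon_read(pokemon):
    # matching detail endpoint: fed one item at a time from the list resource.
    # Here we simply follow each item's "url"; the generator instead extracts
    # an id and fills the /api/v2/pokemon/{id}/ path. No pagination is added.
    yield requests.get(pokemon["url"]).json()
```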
Generating a Pokemon dlt pipeline from an OpenAPI spec 🚀
This is a heavily hacked fork of openapi-python-client.
- You need `poetry` to install dependencies:
```shell
poetry install
poetry shell
```
- Create a new `dlt` pipeline from the PokeAPI spec and place it in the `pokemon-pipeline` folder:
```shell
dlt-init init pokemon --url https://raw.githubusercontent.com/cliffano/pokeapi-clients/main/specification/pokeapi.yml
```
- After executing the command, you can pick the endpoints that you want to add to your source and then load with the pipeline. The endpoints are grouped by the returned data type (table) and ordered by centrality (a measure of how many other tables the given table links to):
```
? Which resources would you like to generate? (Use arrow keys to move, <space> to select, <a> to toggle, <i> to invert)
PokemonSpecies endpoints:
  ○ pokemon_species_list /api/v2/pokemon-species/
» ○ pokemon_species_read /api/v2/pokemon-species/{id}/
EvolutionChain endpoints:
  ○ evolution_chain_list /api/v2/evolution-chain/
  ○ evolution_chain_read /api/v2/evolution-chain/{id}/
MoveAilment endpoints:
  ○ move_ailment_list /api/v2/move-ailment/
  ○ move_ailment_read /api/v2/move-ailment/{id}/
Move endpoints:
  ○ move_list /api/v2/move/
  ○ move_read /api/v2/move/{id}/
Pokemon endpoints:
  ○ pokemon_list /api/v2/pokemon/
  ○ pokemon_read /api/v2/pokemon/{id}/
```
- Pick your endpoints and press ENTER to generate the pipeline. Now you are ready to load data.
- Enter the `pokemon-pipeline` folder and execute the `pipeline.py` script. This will load your endpoints to a local `duckdb` database. Below we use `enlighten` to show fancy progress bars:
```shell
cd pokemon-pipeline
PROGRESS=enlighten python pipeline.py
```
- Inspect the pipeline to see what got loaded:
```shell
$ dlt pipeline pokemon_pipeline info
Found pipeline pokemon_pipeline in /home/rudolfix/.dlt/pipelines
Synchronized state:
_state_version: 2
_state_engine_version: 2
pipeline_name: pokemon_pipeline
dataset_name: pokemon_data
default_schema_name: pokemon
schema_names: ['pokemon']
destination: dlt.destinations.duckdb
Local state:
first_run: False
_last_extracted_at: 2023-06-12T11:50:16.171872+00:00
Resources in schema: pokemon
pokemon_species_read with 8 table(s) and 0 resource state slot(s)
Working dir content:
Has 1 completed load packages with following load ids:
1686570616.17882
Pipeline has last run trace. Use 'dlt pipeline pokemon_pipeline trace' to inspect
```
- Launch the Streamlit app to preview the data (we copy a Streamlit config to make it work on Codespaces):
```shell
cp -r ../.streamlit .
pip install pandas streamlit
dlt pipeline pokemon_pipeline show
```
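If you'd rather skip Streamlit, you can query the duckdb file directly. A minimal sketch, assuming the default destination file `pokemon_pipeline.duckdb` inside the pipeline folder and the `pokemon_data` dataset shown in the state above (table names depend on the endpoints you picked):

```python
import duckdb

# open the local database file created by the duckdb destination
conn = duckdb.connect("pokemon_pipeline.duckdb")

# list the tables dlt created in the pokemon_data dataset (a duckdb schema)
print(conn.sql(
    "SELECT table_name FROM information_schema.tables "
    "WHERE table_schema = 'pokemon_data'"
))

# sample one of the loaded tables
print(conn.sql("SELECT * FROM pokemon_data.pokemon_species_read LIMIT 5"))
```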
When you run the `dlt-init` command above, the following files are generated:
- `pokemon-pipeline` - a folder with all the files
- a folder `pokemon` with the Python module containing the dlt source, resources, and the Python client
- `__init__.py` in that folder with the dlt source
- the `pipeline.py` file that loads the resources to duckdb
- a `.dlt` folder with the `config.toml`
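For orientation, the generated `pipeline.py` boils down to something like the sketch below. `pokemon_source` is a hypothetical name for the source exported from the `pokemon` module; check the generated files for the real one:

```python
import dlt

# hypothetical import: the generated __init__.py exports the dlt source
from pokemon import pokemon_source

if __name__ == "__main__":
    pipeline = dlt.pipeline(
        pipeline_name="pokemon_pipeline",
        destination="duckdb",
        dataset_name="pokemon_data",
    )
    # run the source with the endpoints you selected during generation
    load_info = pipeline.run(pokemon_source())
    print(load_info)
```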
There's still work needed to make things useful:
- We will fully restructure the underlying Python client. We'll compress all the files in the `pokemon/api` folder into a single, nice and extendable client.
- We'll make it easy to add pagination and other injections into the client, in a GPT-4 friendly way.
- Many more heuristics to extract resources and their dependencies.
- Integration with the existing `dlt init` command.
If you want to experiment, the features below still work:
- All HTTP Methods
- JSON and form bodies, path and query parameters
- File uploads with multipart/form-data bodies
- float, string, int, date, datetime, string enums, and custom schemas or lists containing any of those
- html/text or application/json responses containing any of the previous types
- Bearer token security
You can pass a YAML (or JSON) file to openapi-python-client with the `--config` option in order to change some behavior.
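For example (assuming the flag is wired through to the `dlt-init init` command exactly as in upstream openapi-python-client):

```shell
dlt-init init pokemon --url https://raw.githubusercontent.com/cliffano/pokeapi-clients/main/specification/pokeapi.yml --config config.yml
```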
The following parameters are supported:
Pass the source in the command line to create the pipeline instead!
field_prefix
When generating properties, the `name` attribute of the OpenAPI schema will be used. When the `name` is not a valid Python identifier (e.g. it begins with a number), this string will be prepended. Defaults to `field_`. It will also be used to prefix fields in the schema starting with `_` in order to avoid ambiguous semantics.
Example:
```yaml
field_prefix: attr_
```
package_version_override
Specify the package version of the generated client. If unset, the client will use the version of the OpenAPI spec.
Example:
```yaml
package_version_override: 1.2.3
```
post_hooks
In the config file, there's an easy way to tell openapi-python-client to run additional commands after generation. Here's an example showing the default commands that will run if you don't override them in the config:
```yaml
post_hooks:
  - "autoflake -i -r --remove-all-unused-imports --remove-unused-variables --ignore-init-module-imports ."
  - "isort ."
  - "black ."
```
By default, the timeout for retrieving the schema file via HTTP is 5 seconds. If there is an error when retrieving the schema, you might try increasing this setting to a higher value.
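Example (assuming this fork keeps the upstream openapi-python-client key name, `http_timeout`, for this setting):

```yaml
http_timeout: 20
```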