A CLI for running Github Actions workflows that launch and manage Autonomi networks.
These workflows are defined in the `sn-testnet-workflows` repository. This is a private repository that can only be accessed by members of the maidsafe organisation.
The workflows use actions that in turn use the `testnet-deploy` CLI. It is possible to do
everything via `testnet-deploy` directly, but there are several reasons for using this tool:

- Running through workflows gives us traceability and logs that everyone can see.
- The `testnet-deploy` CLI requires configuration and setup. It is best used directly for development work, when the deployment needs to be extended in some way.
- There are a myriad of options for launching networks, which makes the CLI cumbersome to use.
- Due to all these options, running the workflows directly is also cumbersome and error prone.
- There are small extensions in the workflows that perform additional steps after deploys.
- Python 3.6 or higher
- Github personal access token with permission to run workflows
Follow these steps using the Github web UI:
- Click on your profile icon on the top right, then click on `Settings`.
- From the list on the left, click on `Developer settings`.
- Expand the `Personal access tokens` section and click on `Fine-grained tokens`.
- Click on the `Generate new token` button.
- Give the token any name you want.
- Optionally provide a description. It can be useful to say what repository the token relates to and what permissions have been applied.
- Use the `Resource owner` drop down to select the `maidsafe` organisation.
- For `Expiration`, a year should be reasonable.
- In the `Repository access` section, click on `Only select repositories`, then select the `sn-testnet-workflows` repository.
- Under `Permissions`, expand `Repository permissions`, scroll to `Workflows` at the bottom, then select `Read and write` from the `Access` drop down.
- Finally, click on the `Generate token` button.
Now keep a copy of your token somewhere.
There are a few tools you will want to set up for launching and working with environments.
This is the tool provided by this repository. Follow the steps below.
- Create and activate a virtual environment:
  ```
  python -m venv venv
  source venv/bin/activate
  ```
- Install the package in development mode:
  ```
  pip install -e .
  ```
- Provide your personal access token:
  ```
  export WORKFLOW_RUNNER_PAT=your_github_token_here
  ```
Now try using `runner --help` to confirm the runner is available.
The `doctl` CLI is recommended to help quickly smoke test environments. Refer to its documentation to configure the required authentication.
If you need access to our account, speak to one of the team.
For smoke testing and other things, it is necessary to have access to our monitoring and logging solutions.
Speak to a member of the team to obtain access.
We will walk through an example that provides a network for a test.
Very often, the test will be related to a PR, so that is the scenario we will consider.
The command for launching the network is as follows:
```
runner workflows launch-network --path <inputs-file-path>
```
The inputs file provides the configuration options for the network. To see all the available
options, the `example-inputs` directory has an example for each available workflow. It also
documents each of the options fairly comprehensively, so I will avoid reiterating that here.
Here is a somewhat typical example for launching a network to test a PR:
```yaml
network-id: 21
network-name: DEV-02
environment-type: staging
rewards-address: 0x03B770D9cD32077cC0bF330c13C114a87643B124
description: "Test further adjustments on #2820 for redialing peers to see if this resolves 'not enough quotes' errors"
related-pr: 2820
branch: req_resp_fix
repo-owner: RolandSherwin
chunk-size: 4194304 # We are currently using 4MB chunks.
# 3 ETH and 100 tokens
initial-gas: "3000000000000000000"
initial-tokens: "100000000000000000000"
client-vars: ANT_LOG=v,libp2p=trace,libp2p-core=trace,libp2p-kad=trace,libp2p-identify=trace,libp2p-relay=trace,libp2p-swarm=trace
interval: 5000
max-archived-log-files: 1
max-log-files: 1
evm-network-type: custom
evm-data-payments-address: 0x7f0842a78f7d4085d975ba91d630d680f91b1295
evm-payment-token-address: 0x4bc1ace0e66170375462cb4e6af42ad4d5ec689c
evm-rpc-url: https://sepolia-rollup.arbitrum.io/rpc
```
This example will launch a development network, but with a staging configuration. This means it will
be the size of a typical staging environment, where size is the number of VMs/nodes. It is also
using the Arbitrum Sepolia infrastructure for EVM payments. The `custom` type is used to explicitly
provide payment and token contract addresses. We are supplying 3 ETH and 100 ANT tokens to the
uploaders that will execute against the network. There is an Ethereum wallet that provides these
funds, whose secret key is in the workflow definition. The uploaders use the `ant` binary to
continually upload files, thus providing some activity for the network we are testing. The
description and related PR are optional, but it is useful to provide these; as will be seen later,
they are used in the Slack post for the environment. Finally, the branch you want to test is
provided by the repo owner and branch name, since it will typically exist on a fork of the main
`autonomi` repository. The binaries from the branch will be built as part of the deployment process.
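The `initial-gas` and `initial-tokens` values are raw 18-decimal base units, which is why they look
so large. As a quick sanity check on the arithmetic (assuming both ETH and the ANT token use 18
decimals):

```python
# Convert human-readable funding amounts into the raw 18-decimal units used
# in the inputs file. Assumes both ETH and the ANT token use 18 decimals.
DECIMALS = 10**18

initial_gas = 3 * DECIMALS       # 3 ETH
initial_tokens = 100 * DECIMALS  # 100 ANT

print(initial_gas)     # 3000000000000000000
print(initial_tokens)  # 100000000000000000000
```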
Save this file anywhere you like, though I typically use an `inputs` directory at the root of
this repository. That directory is in the `.gitignore` file. You can then run the command as
specified above, pointing to your inputs file.
When the command executes, the runner will return the URL for the workflow run that was dispatched.
We now need to wait for that run to complete. If you wish, the `workflows` subcommand has a `--wait`
argument that will keep the runner in a loop that waits for the run to either complete or fail.
There is also a `#deployment-alerts` channel in our Slack that you can join. A notification will be
fired in there when the workflow completes. I find this useful because it allows me to go and do
something else until I get a notification.
You can see all the workflow runs using `runner workflows ls`. The tool uses a SQLite database
to keep track of things.
Once the deployment is complete, the next step is to perform a smoke test for the environment.
A smoke test is necessary because there is no guarantee the feature branch doesn't have issues.
First, run the following command:
```
runner deployments ls
```
The output will be something like this:
```
=============================================================
                  D E P L O Y M E N T S
=============================================================
ID    Name     Deployed              PR#     Smoke Test
----------------------------------------------------------------------
1     DEV-01   2024-12-30 16:00:13   #2577   ✓
      https://github.com/maidsafe/sn-testnet-workflows/actions/runs/12548887014
<other output snipped>
312   DEV-02   2025-03-21 17:21:37   #2764   -
      https://github.com/maidsafe/sn-testnet-workflows/actions/runs/13997330555
```
The most recent deployment will be at the bottom of the list. An optional `--details` flag will show
the full specification.
Use the following command to guide yourself through a little smoke test:
```
runner deployments smoke-test --id 312
```
There will be a series of questions:
```
Smoke test for DEV-02
----------------------------------------
Branch: RolandSherwin/req_resp_fix
? Are all nodes running? (Use arrow keys)
 » Yes
   No
   N/A
```
We will go through each of these questions to demonstrate how to obtain the answer and verify the deployment. This may seem a tedious exercise, but all of the questions have been included because the deploy can fail at any of these steps; more questions have been added as more failure points have been discovered.
There are several places in the test where we could potentially automate the verification. I will point these out with an Optimisation note. The only reason I haven't done this so far is just due to being busy.
This can be determined using the workflow run. Use the URL from the deployments list.
Click on the `post-deploy tasks` job, then expand the `network status` section, and scroll right to
the bottom of the log output. Typically you will see something like this (don't worry about the
specific numbers here; your own deploy could be different):
```
-------
Summary
-------
Total peer cache nodes (3x5): 15
Total generic nodes (19x25): 475
Total symmetric private nodes (0x0): 0
Total full cone private nodes (0x0): 0
Total nodes: 491
Running nodes: 488
Stopped nodes: 0
Added nodes: 3
Removed nodes: 0
```
Most of the time, there will be 1 to 3 nodes that have not started correctly. In this example, it is
indicated by the fact that 3 nodes are in an `Added` state. If the `Running nodes` count matches the
total, you can simply answer `Yes`. If not, we perform a manual step to start these nodes.
Use the `Search logs` textbox to locate the 3 nodes with `ADDED` status:
```
DEV-02-node-7:
  antnode1: 0.3.7 RUNNING
  antnode2: 0.3.7 ADDED
```
In this case, `DEV-02-node-7` is the hostname of the VM. You can use `doctl` to quickly obtain the
VM list:
```
doctl compute droplet list --format "ID,Name,Public IPv4,Status,Memory,VCPUs" | grep "DEV-02"
```
Once you have this, SSH to the VM and run `antctl start --service-name antnode2`. Repeat for each of
the nodes not running. After doing so, you can answer `Yes` to this question and move on.
If there were lots of nodes that were not started here, you could answer `No`. Whenever `No` is
used, you will be given an option to abandon the test, and a failure result will be registered. If
lots of nodes failed to start, there is probably something wrong. You will need to abandon this
deploy and investigate with the developer.
Optimisation: we could add a `start nodes` step to the job that would use `testnet-deploy` to
start all the nodes. Nodes that have already been started would be skipped. Without further
optimisations, this would add a few minutes to the deployment, but it would probably be worth it.
Next, you will be asked if the bootstrap cache files are accessible:
```
? Are the bootstrap cache files available? (Use arrow keys)
 » Yes
   No
   N/A
```
Using the same workflow run as above, click on the `launch to genesis` job. Expand the `launch
DEV-XX to genesis` section, then scroll down to the bottom of the logs, where you will see an
inventory report. You will see something like this:
```
=====================
Peer Cache Webservers
=====================
DEV-02-peer-cache-node-1: http://165.22.114.235/bootstrap_cache.json
DEV-02-peer-cache-node-2: http://209.38.165.32/bootstrap_cache.json
DEV-02-peer-cache-node-3: http://165.22.124.225/bootstrap_cache.json
```
Each of those URLs should be accessible. You can simply click on them to verify. If they are
available, answer `Yes` and move on. If not, you should abandon the test and try to determine why.
Optimisation: it would be possible to automate this step. You could query DO to get the IP addresses of the peer cache hosts then just use an HTTP request to make sure they are accessible. A command could be added to the runner for this.
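As a rough illustration of what that automation might look like, here is a sketch that checks each
peer cache URL with a plain HTTP request and confirms the response parses as JSON. The URLs are
hard-coded from the inventory report above; an actual runner command would look them up from
Digital Ocean instead:

```python
import json
import urllib.request

# Peer cache URLs taken from the inventory report; an automated version would
# query Digital Ocean for the peer cache host IPs rather than hard-coding them.
urls = [
    "http://165.22.114.235/bootstrap_cache.json",
    "http://209.38.165.32/bootstrap_cache.json",
    "http://165.22.124.225/bootstrap_cache.json",
]

for url in urls:
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            json.loads(response.read())  # confirm the cache file is valid JSON
            print(f"OK   {url}")
    except Exception as err:
        print(f"FAIL {url}: {err}")
```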
```
? Is the main dashboard receiving data? (Use arrow keys)
 » Yes
   No
   N/A
```
The term "main dashboard" refers to a dashboard within Grafana in our monitoring solution.
When you have obtained access to Grafana, open the `ANT Dashboard Default V3` dashboard. Use the
`TestNet Name` drop down to select the `DEV-02` environment. Then on the right-hand side, just above
the `TestNet Name` selector, you will see a time range option. It will probably have the value
`Last 3 hours` by default. For the smoke test, change this to `Last 15 minutes`, just to make sure
we are getting data for our current network. If you are seeing the panels populated with data, such
as the `# NODES RUNNING` panel with a non-zero value, you can answer `Yes` and move on.
```
? Do nodes on generic hosts have open connections and connected peers? (Use arrow keys)
 » Yes
   No
   N/A
```
This is just a simple indicator that the generic nodes within the deployment are functioning OK.
Using the same dashboard as above, you should see that the `Host Role` selector has the value
`GENERIC_NODE` by default, so we don't need to change it for this test. With `Last 15 minutes` still
selected, check the `Avg. # Open Connections` and `Avg. # Connected Peers Per Node` panels and make
sure they have non-zero values. If there are no connected peers, something is generally wrong. In
that case, you should answer `No` and abandon the test, then report to the developer and try to
figure out what is wrong. Otherwise, answer `Yes` and move on.
```
? Do nodes on peer cache hosts have open connections and connected peers? (Use arrow keys)
 » Yes
   No
   N/A
```
This is the same as the test above, except this time you should use the `Host Role` selector to
select `PEER_CACHE_NODE`. Again, inspect the panels to make sure there are connected peers, and
answer appropriately. If there are not, again there is some kind of problem that needs to be
investigated.
```
? Do symmetric NAT private nodes have open connections and connected peers? (Use arrow keys)
 » Yes
   No
   N/A
```
This time you should select `NAT_RANDOMIZED_NODE` in the `Host Role` selector. Again, you want to
observe connected peers.

Important note: there are some tests where we want to launch an environment without any private
nodes. In this case, you would see 0 connected peers, but you should answer `N/A`, because private
nodes do not apply.
```
? Do full cone NAT private nodes have open connections and connected peers? (Use arrow keys)
 » Yes
   No
   N/A
```
This time you should select `NAT_FULL_CONE_NODE` in the `Host Role` selector. If relevant, again you
want to observe connected peers. If not, use `N/A`.
```
? Is ELK receiving logs? (Use arrow keys)
 » Yes
   No
   N/A
```
We want to quickly make sure logs for the environment are being forwarded to ELK. When you access
ELK, in the top left, just to the left of the `Search` textbox, there is a little `Saved Queries`
drop down. You should see a saved query for your dev environment. Click on it, and it will apply a
filter that gets the last 15 minutes of logs. You want to see a non-zero value in the
`ANT - ANTNODES - # Log Entries V1` panel. If there are an excessive number of logs, e.g., 500,000
or more, you may want to flag this, but not necessarily fail the smoke test.
The next two questions will verify that the correct versions of these binaries have been used.
```
? Is `antctl` on the correct version? (Use arrow keys)
 » Yes
   No
   N/A
? Is `antnode` on the correct version? (Use arrow keys)
 » Yes
   No
   N/A
```
There is a little script included in the repository that can perform this check for you:
```
./resources/verify-node-versions.sh "DEV-02" "v0.12.0" "v0.3.7" "req_resp_fix"
```
Vary these argument values as appropriate to your environment. If you are using a versioned build,
the branch name will be something like `stable` or `rc-*`. In the RC case, the version numbers will
also have RC-based suffixes, e.g., `-rc.1`. It may seem silly to verify these, but it is definitely
possible that you could have specified incorrect values for the workflow inputs. It's happened to me
many times.
Optimisation: we could automate this by providing a Python implementation of the script as a command in the runner tool.
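As a very rough sketch of what such a runner command could look like, assuming SSH access to the
node VMs and that `antnode` reports its version via `--version` (the IPs, user, and expected
version below are placeholders; the shell script remains the source of truth):

```python
import subprocess

# Placeholders: the VM IPs would come from `doctl`, and the expected version
# is whatever was passed to the workflow. Assumes `antnode --version` prints
# a string containing the version number.
node_ips = ["203.0.113.10", "203.0.113.11"]
expected_version = "0.3.7"

for ip in node_ips:
    result = subprocess.run(
        ["ssh", f"root@{ip}", "antnode --version"],
        capture_output=True, text=True, check=True,
    )
    status = "OK" if expected_version in result.stdout else "MISMATCH"
    print(f"{ip}: {status} ({result.stdout.strip()})")
```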
```
? Are the correct reserved IPs allocated? (Use arrow keys)
 » Yes
   No
   N/A
```
This question only applies to the `STG-01` and `STG-02` environments. In this case, you can use the
Digital Ocean GUI to verify: go to `MANAGE` -> `Networking` -> `Reserved IPs`. You should then see 3
IPs allocated to either of those environments. If you're not using a staging environment, use `N/A`.
The next few questions relate to the uploaders.
```
? Is the uploader dashboard receiving data? (Use arrow keys)
 » Yes
   No
   N/A
? Do uploader wallets have funds? (Use arrow keys)
 » Yes
   No
   N/A
```
Both these questions can be answered by accessing the `ANT Uploaders V1` dashboard. As with the main
dashboard, use the `TestNet Name` selector to select your environment and change the time period to
`Last 15 minutes`. You should see the panels populated with some data. If you see this, you can
answer `Yes` to the first question. The first panel can provide the answer to the second question.
The values of `Last Gas Balance` and `Last Token Balance` should roughly correspond to the funding
values provided in the workflow for launching the network. If uploads have been running
successfully, a little bit of the funds will have been used.
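If you want to double-check a gas balance outside of Grafana, one option is to query the EVM RPC
endpoint directly. A minimal sketch using `web3.py`, with the RPC URL from the inputs file and a
placeholder uploader address (checking the ANT token balance would additionally need an ERC-20
`balanceOf` call against the payment token contract):

```python
from web3 import Web3

# RPC URL from the inputs file; the uploader address is a placeholder and
# would be taken from the uploader VM or the deployment inventory.
w3 = Web3(Web3.HTTPProvider("https://sepolia-rollup.arbitrum.io/rpc"))
uploader = Web3.to_checksum_address("0x0000000000000000000000000000000000000000")

balance_wei = w3.eth.get_balance(uploader)
print(f"Gas balance: {balance_wei / 10**18} ETH")
```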
```
? Is `ant` on the correct version? (Use arrow keys)
 » Yes
   No
   N/A
```
There is another little script that can verify this:
```
./resources/verify-ant-version.sh "DEV-02" "v0.3.8" "req_resp_fix"
```
It will check that all uploaders have the correct version of `ant`.
```
? Do the uploaders have no errors? (Use arrow keys)
 » Yes
   No
   N/A
```
Another helper script can assist with this:
```
./get-uploader-logs.sh "DEV-02" 1
```
This just uses SSH to run the `journalctl` command and get you the logs. The integer argument
indicates how many uploaders are running per VM. Vary this as required. Since the smoke test
generally takes place not too long after the environment has been launched, there won't be that many
uploads yet. So you can eyeball the logs to check there are no errors so far. If there are errors,
you may want to fail the test and investigate. Sometimes there could be expected errors that are not
related to the branch being tested, and those would be acceptable.
Optimisation: the checks using the scripts could be subsumed as commands within the runner.
With the test concluded, whether it passed or failed, there is now an opportunity to post it to Slack. In most cases you will want to do this, because you will either want to investigate and discuss a failure, or indicate to the owner of the PR that the environment is ready and that the test is running. The post can then serve to document any results or discuss them. Very often, changes will need to be made to the PR, and that will be discussed within the thread.
To post the environment details to Slack:
```
runner deployments post --id 312
```