Fix route definition
fasmat committed Jan 7, 2025
1 parent abfb3ed commit e8ce770
Showing 3 changed files with 58 additions and 29 deletions.
63 changes: 40 additions & 23 deletions k2pow-service/README.md
@@ -1,57 +1,74 @@
# Remote K2PoW service worker

This binary performs the k2pow calculations required for PoST in the Spacemesh protocol.

K2pow is expensive and only required during specific phases of the protocol. It is therefore a good candidate for being
pulled out into an ephemeral environment that can predictably be spun up, used, and then turned off (i.e. rented).

The workers are generally meant to be used behind a pseudo-smart load balancer that can try different workers for a
single task.

On the post service side, one can throttle how many workers are tried simultaneously, so that k2pow can be done in
parallel on different machines. This is controlled by the `parallelism` setting in the post service. E.g. if there are
`10` workers and one post service, set the parallelism setting to `10`. If there are multiple workers and multiple post
services, use the ratio (`workers/post services`).

The number of cores, the randomx mode, and the randomx large pages settings are CPU- and setup-dependent.

Every worker supports having only _one_ job executing at a time. Queuing of future tasks is not possible at the moment;
requests are therefore served in a first-come-first-served manner.

## API

The service uses a simple HTTP API with the following endpoints:

### Health endpoint

`GET /` - health endpoint, returns an `HTTP 200 OK` with a basic response

### Job endpoint

`GET "/job/{miner}/{nonce_group}/{challenge}/{difficulty}"` - the main endpoint that provides the functionality in question, where
`GET "/job/{miner}/{nonce_group}/{challenge}/{difficulty}"` - the main endpoint that provides the functionality in
question, where

- `miner` is the miner id, `32` bytes encoded in hex (no preceding `0x` needed).
- `nonce_group` is the nonce group `uint8` as a regular string.
- `challenge` is the challenge, `8` bytes encoded in hex (no preceding `0x` needed).
- `difficulty` is the difficulty, `32` bytes encoded in hex (no preceding `0x` needed).
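
For example, a request might look like the following sketch (the worker address and all parameter values are made up;
the two 32-byte fields are 64 hex characters each and the 8-byte challenge is 16):

```bash
# Made-up values, sized to match the endpoint's expectations.
MINER=$(printf 'ab%.0s' {1..32})       # miner id: 32 bytes = 64 hex chars
NONCE_GROUP=7                          # uint8 as a plain string
CHALLENGE=deadbeefdeadbeef             # challenge: 8 bytes = 16 hex chars
DIFFICULTY=$(printf '0f%.0s' {1..32})  # difficulty: 32 bytes = 64 hex chars

curl -i "http://localhost:3000/job/$MINER/$NONCE_GROUP/$CHALLENGE/$DIFFICULTY"
```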

This endpoint may yield different responses depending on the state of the node:

- `HTTP 201 CREATED` - the job has been created and is processing (the first call and subsequent calls will yield the
  same status code)
- `HTTP 200 OK` - the job has been completed and the result is encoded in the body as a `uint64` in string form.
- `HTTP 500 INTERNAL SERVER ERROR` - the job encountered an error. The error is written to the response as a string.
- `HTTP 429 TOO MANY REQUESTS` - the worker is busy and cannot accept the job at the moment. The client should back off
  and retry later. This is returned when the worker is processing a job with params _other_ than the requested ones (if
  the params match and the job is still being processed, it returns `201` as described above).
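
A client (or the post service) simply keeps polling the same URL until it gets a terminal answer. A minimal sketch of
that loop, reusing the variables from the example above:

```bash
URL="http://localhost:3000/job/$MINER/$NONCE_GROUP/$CHALLENGE/$DIFFICULTY"

while true; do
  STATUS=$(curl -s -o /tmp/k2pow_body -w '%{http_code}' "$URL")
  case "$STATUS" in
    200) echo "k2pow result: $(cat /tmp/k2pow_body)"; break ;;      # done; body is a uint64 as a string
    201) sleep 10 ;;                                                # our job is queued/running, poll again
    429) sleep 60 ;;                                                # worker busy with a different job, back off
    500) echo "job failed: $(cat /tmp/k2pow_body)" >&2; exit 1 ;;   # error message is in the body
    *)   echo "unexpected status: $STATUS" >&2; exit 1 ;;
  esac
done
```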

Note: the `miner` prefix is first in order to allow for flexibility in how to route requests within the load-balancer.

## Setup

While a single post service can use a single k2pow service as a processing backend, this is a rather specific use case
where one high-performance k2pow service machine serves multiple post services that are significantly less powerful.
It's also worth noting that k2pow requires RandomX-optimized hardware, while the post service requires AES-NI-optimized
hardware.

A more advanced setup would interact through a load balancer. The load balancer should sequentially try to send the job
to the different workers, ideally sweeping through them until a vacant one is found. Once it has swept through all of
them, it should propagate the error back to the post service. The post service knows to back off and wait before trying
to send the job again; the backoff period is also configurable. The load balancer needs to remember which node was
queried so that the same request can later scrape the result (instead of sending the job to a new node). One way to
"remember" it is sharding based on the `miner` part of the URI. The post service will always keep requesting the same
endpoint; this means that even in the case of a worker restart, the job can eventually get through. The one caveat here
is that if a worker is restarted, the load balancer behavior may be affected (it keeps forwarding the same GET request
to the same worker, which is _not necessarily_ executing that job).
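
As a toy illustration of the sharding idea (not how any particular load balancer implements it), one could derive the
worker deterministically from the `miner` part of the URI, so repeated requests for the same job always land on the
same worker:

```bash
# Toy sharding sketch; worker addresses are placeholders.
WORKERS=(10.0.0.1:3000 10.0.0.2:3000 10.0.0.3:3000)
MINER=$(printf 'ab%.0s' {1..32})  # the miner id from the request URI

# Hash the miner id and map it onto the worker list.
IDX=$(( 0x$(printf '%s' "$MINER" | sha256sum | cut -c1-8) % ${#WORKERS[@]} ))
echo "route /job/$MINER/... to ${WORKERS[$IDX]}"
```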

The individual k2pow workers have no persistence enabled.
Individual k2pow results are remembered for the duration of a session, but not across sessions. This means that if a
worker crashes or restarts, no previous results are remembered.

There is an example configuration for the HAProxy load balancer in [haproxy.cfg](./examples/haproxy/haproxy.cfg),
together with a [README.md](./examples/haproxy/README.md).
22 changes: 17 additions & 5 deletions k2pow-service/examples/haproxy/README.md
@@ -1,24 +1,36 @@
# Example Load Balancer Configuration for k2pow-service

This README demonstrates an example load balancer configuration for the k2pow service.

Using this or a similar approach, one can:

* Use multiple workers for a k2pow-service to process multiple k2pow operations simultaneously.
* Utilize multiple machines to do so and hide them behind a single address.

This example uses HAProxy as a load balancer, but any other load balancer can be used.

The example configuration file is in `haproxy.cfg`.

The configuration file is set up to use 3 workers, but this can be adjusted by changing the `server` lines in the
`backend k2pow` section.

Theoretically, you can run that HAProxy config in a Docker container with the following command:

```bash
docker run --net host -d --name my-running-haproxy -v `pwd`/hpx:/usr/local/etc/haproxy haproxy:3.0
```

However, you will NOT be able to hot reload the HAProxy process while keeping the sticky information. Therefore, it is
recommended NOT to run it as a Docker container in production (or at least not as the simple container from above).

## Config explanation

The main aspect that requires explanation is the sticky part.
Given that the jobs are sent to the k2pow service as `GET "/job/{miner}/{nonce_group}/{challenge}/{difficulty}"`, we
configure `acl uri_parts path_reg ^/job/([^/]+)/([^/]+)` so we can set the sticky session to the miner and nonce_group
parts of the URL.

Then, thanks to `retry-on 429 503 response-timeout conn-failure`, we keep retrying on normal errors AND on 429, which
the k2pow service sends by default when it's already calculating a proof. This way, HAProxy will retry the request on
another server, up to `retries 4` times. In the end, because of `hash-type consistent`, HAProxy will remember which
server was used for a given miner and nonce_group and will always send the request to the same server.
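
Putting those directives together, a simplified sketch of the backend section could look like the following (server
names, addresses, and ports are placeholders; the actual, complete configuration is in `haproxy.cfg`):

```
backend k2pow
    balance uri                  # hash on the request URI...
    hash-type consistent         # ...consistently, so a miner/nonce_group keeps hitting the same server
    acl uri_parts path_reg ^/job/([^/]+)/([^/]+)
    retries 4
    retry-on 429 503 response-timeout conn-failure
    option redispatch            # allow a retry to be dispatched to a different server
    # Placeholder workers; adjust to your setup.
    server worker1 10.0.0.1:3000 check
    server worker2 10.0.0.2:3000 check
    server worker3 10.0.0.3:3000 check
```
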
2 changes: 1 addition & 1 deletion k2pow-service/src/main.rs
@@ -96,7 +96,7 @@ fn router<T: GetOrCreate + Send + Sync + 'static>(job_manager: Arc<T>) -> Router
     Router::new()
         .route("/", get(root))
         .route(
-            "/job/:miner/:nonce_group/:challenge/:difficulty",
+            "/job/{miner}/{nonce_group}/{challenge}/{difficulty}",
             get(get_job),
         )
         .with_state(job_manager)
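
For context: newer versions of axum use the `{param}` syntax for path parameters (axum 0.8 replaced the older `:param`
form), which is presumably what this route definition fix tracks.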
