1. Code Structure
2. Core Design Patterns
3. Task Assignment
4. Graceful Shutdown
5. Autoscaler Design
6. Provider Integration
7. Background Jobs
8. Testing
```
waverless/
├── app/                      # Application layer
│   ├── handler/              # HTTP handlers (thin)
│   ├── middleware/           # Auth, logging, recovery
│   └── router/               # Route definitions
│
├── cmd/                      # Entry points
│   ├── main.go               # Application entry
│   ├── initializers.go       # Dependency injection
│   └── jobs.go               # Background jobs
│
├── internal/                 # Internal packages
│   ├── service/              # Business logic
│   │   ├── endpoint/         # Endpoint CRUD + deployment
│   │   ├── lifecycle/        # Worker lifecycle
│   │   ├── task_service.go
│   │   └── worker_service.go
│   ├── model/                # Domain models
│   └── vo/                   # Value objects (DTOs)
│
├── pkg/                      # Shared packages
│   ├── autoscaler/           # Autoscaling engine
│   │   ├── manager.go        # Control loop
│   │   ├── decision_engine.go
│   │   ├── executor.go
│   │   └── metrics_collector.go
│   ├── provider/             # Deployment providers
│   │   ├── k8s/              # Kubernetes
│   │   ├── novita/           # Novita Serverless
│   │   └── docker/           # Docker (dev)
│   ├── interfaces/           # Shared interfaces
│   └── store/                # Data access
│       ├── mysql/            # MySQL repositories
│       └── redis/            # Redis operations
│
└── config/                   # Config files
```
| Layer | Responsibility |
|---|---|
| Handler | HTTP request/response, validation |
| Service | Business logic, orchestration |
| Provider | Infrastructure abstraction |
| Store | Data persistence |
All dependencies are wired in `cmd/initializers.go`:
```mermaid
flowchart TB
    Config --> Stores
    Stores --> Providers
    Providers --> Services
    Services --> Handlers
    Handlers --> Router
```
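The wiring order above can be sketched as plain constructor calls. The type and field names here are hypothetical; the real constructors live in `cmd/initializers.go`:

```go
package main

import "fmt"

// Hypothetical layer types, built strictly in dependency order:
// Config -> Stores -> Providers -> Services -> Handlers -> Router.
type Config struct{ DSN string }
type Store struct{ cfg Config }
type Provider struct{ store *Store }
type Service struct {
	store    *Store
	provider *Provider
}
type Handler struct{ svc *Service }
type Router struct{ handlers []*Handler }

// initialize wires every layer explicitly; no globals, no reflection.
func initialize(cfg Config) *Router {
	store := &Store{cfg: cfg}
	provider := &Provider{store: store}
	svc := &Service{store: store, provider: provider}
	handler := &Handler{svc: svc}
	return &Router{handlers: []*Handler{handler}}
}

func main() {
	r := initialize(Config{DSN: "user:pass@tcp(localhost:3306)/waverless"})
	fmt.Println(len(r.handlers)) // 1
}
```

Explicit wiring keeps the construction order visible and makes each layer trivially replaceable in tests.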
```mermaid
flowchart LR
    Handler --> Service
    Service --> Store[(MySQL/Redis)]
    Service --> Provider
    Service --> OtherService
```
Database operations that need atomicity are wrapped in transactions:
- Multiple operations execute as a single unit
- Automatic rollback on any failure
- Commit only when all operations succeed
Non-critical operations (like statistics updates) run asynchronously to avoid blocking the main request flow.
The job-pull flow prevents race conditions when workers are draining:
```mermaid
sequenceDiagram
    Worker->>Service: PullJobs(worker_id)
    Service->>MySQL: Check worker status
    alt DRAINING
        Service-->>Worker: Empty
    else ONLINE
        Service->>Redis: RPOP task
        Service->>MySQL: Update IN_PROGRESS
        Service->>MySQL: Re-check status
        alt Became DRAINING
            Service->>MySQL: Revert PENDING
            Service->>Redis: LPUSH back
            Service-->>Worker: Empty
        else Still ONLINE
            Service-->>Worker: Task
        end
    end
```
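The same check/claim/re-check dance can be sketched in memory, with a map and a slice standing in for the MySQL status table and the Redis list (all names hypothetical):

```go
package main

import "fmt"

const (
	online   = "ONLINE"
	draining = "DRAINING"
)

type broker struct {
	status map[string]string // worker_id -> status (stands in for MySQL)
	queue  []string          // pending task IDs (stands in for the Redis list)
}

// pull hands one task to the worker, re-checking its status after the
// claim so a worker that started draining mid-pull gets nothing and
// the task is returned to the queue.
func (b *broker) pull(workerID string) (string, bool) {
	if b.status[workerID] != online {
		return "", false // DRAINING (or unknown): hand out nothing
	}
	if len(b.queue) == 0 {
		return "", false
	}
	task := b.queue[len(b.queue)-1] // RPOP from the tail
	b.queue = b.queue[:len(b.queue)-1]
	// The task would be marked IN_PROGRESS here; re-check the worker.
	if b.status[workerID] != online {
		b.queue = append([]string{task}, b.queue...) // revert: LPUSH back
		return "", false
	}
	return task, true
}

func main() {
	b := &broker{status: map[string]string{"w1": online}, queue: []string{"t1"}}
	task, ok := b.pull("w1")
	fmt.Println(task, ok) // t1 true
}
```

In the real service the two status checks hit MySQL, which is why the second one can observe a transition that happened between them.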
Task selection uses database-level locking to prevent race conditions:
- Select pending tasks with row-level lock
- Update status atomically
- Ensures no duplicate task assignment
sequenceDiagram
K8s->>Informer: SIGTERM
Informer->>Lifecycle: OnWorkerDraining
Lifecycle->>MySQL: Set DRAINING
loop Until done
Worker->>Worker: Complete tasks
end
K8s->>Informer: Pod Deleted
Informer->>Lifecycle: OnWorkerDelete
Lifecycle->>MySQL: Set OFFLINE
When the deployment changes:

```mermaid
flowchart LR
    A[Deployment Change] --> B[OnDeploymentChange]
    B --> C{Worker Status}
    C -->|Idle| D[Cost = -1000<br/>Delete First]
    C -->|Busy| E[Cost = 1000<br/>Delete Last]
```
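This maps onto Kubernetes' `controller.kubernetes.io/pod-deletion-cost` annotation: when a ReplicaSet scales down, pods with lower cost are deleted first. The mapping itself is trivial (the constant values come from the diagram above):

```go
package main

import "fmt"

// podDeletionCostAnnotation is the standard annotation a ReplicaSet
// consults when choosing which pods to remove on scale-down.
const podDeletionCostAnnotation = "controller.kubernetes.io/pod-deletion-cost"

// deletionCost maps worker state to a cost: idle workers are cheapest
// to delete (removed first), busy workers most expensive (removed last).
func deletionCost(busy bool) int {
	if busy {
		return 1000
	}
	return -1000
}

func main() {
	fmt.Println(deletionCost(false), deletionCost(true)) // -1000 1000
}
```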
- Informer detects pod termination signal
- Worker marked DRAINING immediately
- No new tasks assigned to DRAINING workers
- Worker completes current tasks
- Pod deleted after tasks complete
- Worker marked OFFLINE
```mermaid
flowchart TB
    subgraph "Every 30s"
        A[Acquire Lock] --> B[Collect Metrics]
        B --> C[Calculate Resources]
        C --> D[Make Decisions]
        D --> E[Execute]
        E --> F[Release Lock]
    end
```
```mermaid
flowchart TB
    subgraph "Per Endpoint"
        M1[Pending Tasks]
        M2[Replicas]
        M3[Idle Time]
        M4[Priority]
    end
    M1 & M2 & M3 & M4 --> D{Decision}
    D -->|pending >= threshold| Up[Scale Up]
    D -->|idle >= timeout| Down[Scale Down]
    D -->|otherwise| None[No Action]
```
Uses Redis distributed lock to ensure only one instance makes scaling decisions at a time, preventing conflicting operations in multi-replica deployments.
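The lock itself reduces to `SET key value NX PX ttl` plus a compare-and-delete release. An in-memory stand-in for those two Redis operations (TTL handling omitted for brevity; in real Redis the release must be a Lua script to stay atomic):

```go
package main

import "fmt"

// kv stands in for the two Redis operations the lock needs.
type kv struct{ data map[string]string }

// setNX mimics SET key value NX: succeeds only if the key is absent.
func (s *kv) setNX(key, val string) bool {
	if _, ok := s.data[key]; ok {
		return false
	}
	s.data[key] = val
	return true
}

// release deletes the key only if the caller still owns it, so an
// instance whose lock expired cannot free a lock someone else holds.
func (s *kv) release(key, token string) bool {
	if s.data[key] != token {
		return false
	}
	delete(s.data, key)
	return true
}

func main() {
	s := &kv{data: map[string]string{}}
	fmt.Println(s.setNX("autoscaler:lock", "instance-a"))   // true: acquired
	fmt.Println(s.setNX("autoscaler:lock", "instance-b"))   // false: held
	fmt.Println(s.release("autoscaler:lock", "instance-b")) // false: not owner
	fmt.Println(s.release("autoscaler:lock", "instance-a")) // true: freed
}
```

With a real client, `setNX` maps onto go-redis's `SetNX(ctx, key, token, ttl)` and the release onto an `Eval` of the usual compare-and-delete script.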
- Sort scale-up requests by priority
- Minimum guarantee: 1 replica per endpoint with tasks
- Remaining resources by priority
- High priority can preempt low priority idle workers
If an endpoint waits more than 5 minutes without resources:
- Temporarily elevate priority
- Ensure eventual resource allocation
The provider interface defines standard operations:
- `Deploy`: Create new worker deployment
- `GetApp` / `ListApps`: Query deployment status
- `DeleteApp`: Remove deployment
- `ScaleApp`: Adjust replica count
- `UpdateDeployment`: Update deployment configuration
- `ListSpecs` / `GetSpec`: Query available resource specs
- `WatchReplicas`: Monitor replica changes
- `IsPodTerminating`: Check termination status
1. Create package: `pkg/provider/myprovider/`
2. Implement the `DeploymentProvider` interface with all required methods
3. Implement lifecycle callbacks (recommended):
   - `OnWorkerStatusChange`: Handle status transitions
   - `OnWorkerDelete`: Cleanup on worker removal
   - `OnWorkerDraining`: Stop task assignment
   - `OnWorkerFailure`: Record failure information
   - `OnEndpointStatusChange`: Track deployment health
4. Register in provider factory: Add initialization logic
5. Register in lifecycle manager: Connect lifecycle callbacks
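The factory-registration step usually amounts to a name-to-constructor registry; a minimal sketch with hypothetical names:

```go
package main

import "fmt"

// Provider is a stand-in for the real DeploymentProvider interface.
type Provider interface{ Name() string }

type factory func(config map[string]string) (Provider, error)

var registry = map[string]factory{}

// Register is called once per provider during initialization.
func Register(name string, f factory) { registry[name] = f }

// New looks up and runs the registered constructor.
func New(name string, cfg map[string]string) (Provider, error) {
	f, ok := registry[name]
	if !ok {
		return nil, fmt.Errorf("unknown provider %q", name)
	}
	return f(cfg)
}

type myProvider struct{}

func (myProvider) Name() string { return "myprovider" }

func main() {
	Register("myprovider", func(map[string]string) (Provider, error) {
		return myProvider{}, nil
	})
	p, _ := New("myprovider", nil)
	fmt.Println(p.Name()) // myprovider
}
```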
| Provider | Use Case | Lifecycle |
|---|---|---|
| K8s | Production | Full (Informers) |
| Novita | Serverless GPU | Full (Polling) |
| Docker | Development | Basic |
| Job | Interval | Purpose |
|---|---|---|
| CleanupOfflineWorkers | 30s | Mark stale workers offline |
| CleanupOrphanedTasks | 60s | Re-queue tasks from dead workers |
| CleanupTimedOutTasks | 60s | Fail tasks exceeding timeout |
```mermaid
flowchart TB
    A[Find IN_PROGRESS tasks] --> B{Worker exists?}
    B -->|No| C[Re-queue]
    B -->|Yes| D{Worker OFFLINE?}
    D -->|Yes| C
    D -->|No| E[Keep]
```
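The decision in the flow above is a pure predicate over the task's worker (statuses modeled as strings for the sketch):

```go
package main

import "fmt"

// workerStates maps worker_id -> status; a missing entry means the
// worker row no longer exists.
type workerStates map[string]string

// shouldRequeue re-queues an IN_PROGRESS task when its worker is gone
// or marked OFFLINE; otherwise the task is left alone.
func shouldRequeue(workers workerStates, workerID string) bool {
	status, exists := workers[workerID]
	if !exists {
		return true // worker row gone: re-queue
	}
	return status == "OFFLINE"
}

func main() {
	ws := workerStates{"w1": "ONLINE", "w2": "OFFLINE"}
	fmt.Println(shouldRequeue(ws, "w3"), shouldRequeue(ws, "w2"), shouldRequeue(ws, "w1")) // true true false
}
```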
```bash
# All tests
go test ./...

# With coverage
go test -cover ./...

# A single package
go test ./pkg/autoscaler/...

# Integration tests
go test -tags=integration ./...
```

Property-based tests live in `*_property_test.go` files; they verify system behavior across randomly generated inputs to catch edge cases. Covered behaviors include:
- Task submission events
- Job pull operations
- Worker draining notifications
- Scaling decisions
- Fork repository
- Create feature branch
- Write tests
- Submit pull request
- Follow Go conventions
- Use `gofmt` and `golint`
- Add comments for exported functions
- Meaningful commit messages
Document Version: v3.0
Last Updated: 2026-02