Commit e5d7574

Merge pull request #22 from DistributedScience/fix18
monitor prints spot fleet metrics
2 parents 7919446 + 0d5e4a5 commit e5d7574

2 files changed: 26 additions, 1 deletion

documentation/DS-documentation/step_4_monitor.md

Lines changed: 21 additions & 1 deletion
@@ -12,12 +12,32 @@ Distributed-Something will keep an eye on a few things for you at this point wit
 * Each individual job processed will create a log of the CellProfiler output, and each Docker container will create a log showing CPU, memory, and disk usage.
 
 If you choose to run the monitor script, Distributed-Something can be even more helpful.
-The monitor can be run by entering `python run.py monitor files/APP_NAMESpotFleetRequestId.json`; the JSON file containing all the information Distributed-Something needs will have been automatically created when you sent the instructions to start your cluster in the previous step.
+The monitor can be run by entering `python run.py monitor files/APP_NAMESpotFleetRequestId.json`.
 
 (**Note:** You should run the monitor inside [Screen](https://www.gnu.org/software/screen/), [tmux](https://tmux.github.io/), or another comparable service to keep a network disconnection from killing your monitor; this is particularly critical the longer your run takes.)
 
 ***
 
+## Monitor file
+
+The JSON monitor file containing all the information Distributed-Something needs will have been automatically created when you sent the instructions to start your cluster in the [previous step](step_3_start_cluster).
+The file itself is quite simple and contains the following information:
+
+```
+{"MONITOR_FLEET_ID" : "sfr-9999ef99-99fc-9d9d-9999-9999999e99ab",
+"MONITOR_APP_NAME" : "2021_12_13_Project_Analysis",
+"MONITOR_ECS_CLUSTER" : "default",
+"MONITOR_QUEUE_NAME" : "2021_12_13_Project_AnalysisQueue",
+"MONITOR_BUCKET_NAME" : "bucket-name",
+"MONITOR_LOG_GROUP_NAME" : "2021_12_13_Project_Analysis",
+"MONITOR_START_TIME" : "1649187798951"}
+```
+
+For any DS run where you have run [`startCluster`](step_3_start_cluster) more than once, the most recent values will overwrite the older values in the monitor file.
+Therefore, if you have started multiple spot fleets (which you might do in different subnets if you are having trouble getting enough capacity in your spot fleet, for example), monitor will only clean up the latest request unless you manually edit the `MONITOR_FLEET_ID` to match the spot fleet you have kept.
+
+***
+
 ## Monitor functions
 
 ### While your analysis is running
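
The documentation change above says that `MONITOR_FLEET_ID` may need to be edited by hand when several spot fleets were started for one run. A minimal sketch of doing that edit programmatically follows. The helper name `set_monitor_fleet_id` is hypothetical and not part of Distributed-Something; the only thing taken from the commit is the `MONITOR_FLEET_ID` key of the JSON monitor file.

```python
import json
from pathlib import Path


def set_monitor_fleet_id(monitor_file, fleet_id):
    """Point an existing monitor file at a different spot fleet request.

    Hypothetical helper (not part of Distributed-Something): it rewrites only
    the MONITOR_FLEET_ID key and leaves every other key untouched.
    """
    path = Path(monitor_file)
    config = json.loads(path.read_text())

    # Refuse to touch files that do not look like a monitor file.
    if "MONITOR_FLEET_ID" not in config:
        raise KeyError(f"{monitor_file} has no MONITOR_FLEET_ID key")

    config["MONITOR_FLEET_ID"] = fleet_id
    path.write_text(json.dumps(config, indent=4))
    return config
```

Assuming the file layout shown in the documentation, a call such as `set_monitor_fleet_id("files/APP_NAMESpotFleetRequestId.json", "sfr-...")` with the ID of the fleet you kept would make a later `python run.py monitor ...` clean up that fleet rather than the most recently requested one.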

run.py

Lines changed: 5 additions & 0 deletions
@@ -696,6 +696,11 @@ def monitor(cheapest=False):
 #1-10 jobs with errors are keeping it rattling around for hours.
 if curtime[-1:]=='9':
     downscaleSpotFleet(queue, fleetId, ec2)
+# Print spot fleet metrics.
+spot_fleet_info = ec2.describe_spot_fleet_requests(SpotFleetRequestIds=[fleetId])
+target = spot_fleet_info['SpotFleetRequestConfigs'][0]['SpotFleetRequestConfig']['TargetCapacity']
+fulfilled = spot_fleet_info['SpotFleetRequestConfigs'][0]['SpotFleetRequestConfig']['FulfilledCapacity']
+print(f'Spot fleet has {target} requested instances. {fulfilled} are currently fulfilled.')
 time.sleep(MONITOR_TIME)
 
 # Step 2: When no messages are pending, stop service
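
The added lines index into the `describe_spot_fleet_requests` response twice to read `TargetCapacity` and `FulfilledCapacity`. As a rough sketch of the same lookup in isolation, the function below takes an already-fetched response dict, so it can be tried without AWS credentials. The name `summarize_spot_fleet` and the sample response are illustrative assumptions, not code from this commit; the message text mirrors the `print` call in the diff.

```python
def summarize_spot_fleet(response):
    """Build the monitor's status line from a describe_spot_fleet_requests response.

    Hypothetical helper, not part of run.py. `response` is the dict returned by
    ec2.describe_spot_fleet_requests(SpotFleetRequestIds=[fleet_id]).
    """
    configs = response.get("SpotFleetRequestConfigs", [])
    if not configs:
        raise ValueError("response contains no spot fleet request")

    # Same nested keys the commit reads, looked up once instead of twice.
    config = configs[0]["SpotFleetRequestConfig"]
    target = config["TargetCapacity"]
    fulfilled = config["FulfilledCapacity"]
    return (
        f"Spot fleet has {target} requested instances. "
        f"{fulfilled} are currently fulfilled."
    )


# Hand-written stand-in for a real response; the values are made up.
sample_response = {
    "SpotFleetRequestConfigs": [
        {"SpotFleetRequestConfig": {"TargetCapacity": 4, "FulfilledCapacity": 2.0}}
    ]
}
print(summarize_spot_fleet(sample_response))
```

Keeping the lookup in a function that accepts the response, rather than the client, is a design choice made here only so the formatting can be checked offline; in the monitor loop the response would still come from the `ec2` client as in the diff.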
