This is an intelligent, autonomous PTZ camera application that uses advanced vision-language models (YOLO or Florence-2) to detect, frame, and analyze objects of interest in real-time. It is designed for deployment on edge computing nodes within the Sage project.
-
Florence-2 Enhancements
- Automatic scene context generation at the start of scans.
- Support for manual prompt injection with
--prompt_prefixto guide detections.
-
Enhanced Data Logging
- Publishes scene captions (Florence-2) and raw detection data (label, confidence, position) with each scan.
- Improved debug-level logging.
-
Pipeline Updates
- Added Scene Analysis step (Florence-2 only).
- Data Publishing step now explicitly includes metadata publishing.
The algorithm performs the following steps:
-
Initialization
Sets up the object detection model (YOLOorFlorence-2) based on user parameters. -
Scene Analysis (Florence-2 Only)
At the start of a scan, Florence-2 can automatically generate a detailed text caption of the current scene to use as a dynamic, contextual prompt. -
Contextual Area Scanning
Systematically scans the environment by rotating the PTZ camera in pan steps (default: 15°) through a full 360° rotation at the specified tilt and zoom level.
When using Florence-2, the generated (or user-specified) context is incorporated to improve detection relevance. -
Object Detection
At each camera position, captures an image and runs object detection to identify specified objects (e.g., person, car, dog). -
Filtering
Filters detections based on confidence threshold (default: 0.1). -
Object Tracking
When an object of interest is detected with sufficient confidence, the algorithm:- Centers the camera on the detected object
- Adjusts zoom to maximize the object in the frame
-
Data Publishing
Saves and publishes the optimized images of detected objects.
Publishes rich metadata—including scene captions (Florence-2), raw detection data (labels, confidence, position), and logging outputs—to the Sage data portal. -
Iteration
Repeats the process for the specified number of iterations with configurable delay between scans.
sudo docker buildx build --platform=linux/amd64,linux/arm64/v8 -t your_docker_hub_user_name/ptzapp -f Dockerfile --push .Then pull the container from dockerhub in the node:
sudo docker image pull your_docker_hub_user_name/ptzappsudo docker run --gpus all -it --rm your_docker_hub_user_name/ptzapp:latest -ki -it 5 -un camera_user_name -pw camera_password -ip camera_ip_address -obj person,carsudo docker run -it --rm your_docker_hub_user_name/ptzapp:latest -ki -it 5 -un camera_user_name -pw camera_password -ip camera_ip_address -obj person,carsudo docker run --gpus all -it --rm your_docker_hub_user_name/ptzapp:latest --model Florence-base --iterations 5 --username username --password 'password' --cameraip 130.202.23.92 --objects 'person,car'When using Florence-2 without a manual prompt, the application will automatically analyze the scene to generate its own context before searching for objects:
sudo docker run --gpus all -it --rm your_docker_hub_user_name/ptzapp:latest --model Florence-base --objects 'animal,bird,deer' --username <user> --password '<pass>' --cameraip <ip>You can provide your own context to the model using the --prompt_prefix argument to guide detections:
sudo docker run --gpus all -it --rm your_docker_hub_user_name/ptzapp:latest --model Florence-base --objects 'animal,bird' --prompt_prefix 'A photo from a trail camera in a wilderness environment' --username <user> --password '<pass>' --cameraip <ip>By default, the application uses the YOLO model (yolo11n) for object detection. Specify objects by name:
sudo docker run --gpus all -it --rm your_docker_hub_user_name/ptzapp:latest --objects 'person,car,dog'When using Florence models, you have more powerful detection capabilities:
sudo docker run --gpus all -it --rm your_docker_hub_user_name/ptzapp:latest --model Florence-base --objects 'person,car'To detect all objects using Florence models, use the asterisk:
sudo docker run --gpus all -it --rm your_docker_hub_user_name/ptzapp:latest --model Florence-base --objects '*'Note: When using '*' with Florence models, the application runs in the <OD> task mode, which enables general object detection without filtering for specific classes. This can be useful for inventorying all objects in a scene but may produce more diverse results than when targeting specific objects.
| Argument | Short | Description | Default |
|---|---|---|---|
--model |
-m |
Model to use (e.g., 'yolo11n', 'Florence-base') | yolo11n |
--iterations |
-it |
Number of iterations (PTZ rounds) to run | 5 |
--username |
-un |
PTZ camera username | "" |
--password |
-pw |
PTZ camera password | "" |
--cameraip |
-ip |
PTZ camera IP address | "" |
--objects |
-obj |
Objects to detect (comma-separated or '*' for everything) | "person" |
--keepimages |
-ki |
Keep collected images in persistent folder | False |
--panstep |
-ps |
Step of pan in degrees | 15 |
--tilt |
-tv |
Tilt value in degrees | 0 |
--zoom |
-zm |
Zoom value | 1 |
--confidence |
-conf |
Confidence threshold (0-1) | 0.1 |
--iterdelay |
-id |
Minimum delay in seconds between iterations | 60.0 |
--prompt_prefix |
Manual text prompt for Florence-2 context | "" | |
--debug |
Enable debug level logging | False |
PLANTNET_API_KEY— for PlantNet API callsBLUR_MIN— Laplacian variance threshold to trigger a focus retry (default ~120)SPECIES_MIN_SCORE— minimum PlantNet confidence to treat as “confident” (e.g., 0.25)
When you launch the app (either with python main.py … or via the Docker command shown above), you’ll see three kinds of outputs:
- Console logs
- Saved images under /imgs (mount this to a host folder to persist)
- Published messages (via Waggle plugin) containing detections & species results
You should see a sequence like:
- Scene caption (Florence only, if enabled)
Generating dynamic context caption for the scene... Scene Context: "The image shows a red fence with multiple rows of small holes..."
- PTZ sweep & detection
Trying PTZ: 0 0 1 Published detection: ptz.detection.p0t0z1 Plant detected (trees). Starting species identification workflow...
- Centering & zoom math (in degrees) and a best-of-N capture with blur score
CAMERA MOVEMNET zoom_level: current_h_fov: current_v_fov: Move the camera to center the object Pan: Tilt: Taking final snapshot(s) for PlantNet... (Example output) [PLANTNET] using image -> /imgs/50.19,-13.11,11.38_plantnet_try_2025-09-10_23:19:17.327619.jpg (blur=9242.4)
- PlantNet result (success)
Species: Quercus garryana Common Names: ['Garry oak', 'Oregon oak', 'Oregon white oak'] Score: 0.3217
- PlantNet error example (no match / 404) (Does not publish misclassification)
PlantNet identification failed: PlantNet API request failed with status 404: {"statusCode":404,"error":"Not Found","message":"Species not found"}
All captured frames land in /imgs inside the container. Mount it to your host to persist. Filename format:
<pan>,<tilt>,<zoom>_<action>_<YYYY-MM-DD_HH:MM:SS.ffffff>.jpgWithout --keepimages, interim candidates may be cleaned up; the final selection is kept when you mount /imgs
- Scene caption (Florence): ptz.scene.caption — free-text description
- Per-position detection: ptz.detection.p{pan}t{tilt}z{zoom}
- Blur/sharpness telemetry: ptz.image.blur
- PlantNet species (if any): ptz.plantnet.species
- Plain score: ptz.plantnet.score
- Alerts (optional, via alert_system.py): ptz.alert.<ALERT_TYPE> with the species JSON
- Multiple
Trying PTZ: …lines per iteration - At least one
ptz.detection.p...with confidence ≥ your--confidence - For plant labels: centering/zoom logs, blur telemetry, PlantNet success block or a clear error
- Images written to
imgs/(mount or--keepimages)
- No species shown: PlantNet may return 404/no match. Ensure
PLANTNET_API_KEYis set; improve view (more leaves/flowers, less backlight), increase--species_zoom, or adjust framing. - Detections but no centering/zoom: Detection didn’t pass
--confidence. Lower it slightly or ensure your--objectsinclude plant terms (plant,tree,flower,bush,wildflower…). - No images on host: Mount
/imgs(-v "$(pwd)/imgs":/imgs) or use--keepimages. - Soft images (low blur): Increase settle delays, try a larger
--species_zoom, or lowerBLUR_MINto reduce retries.
- Detect objects with YOLO or Florence-2.
- Route plants via a keyword map (tree, bush, flower, plant, …).
- Center & maximize the bbox using FOV-based pan/tilt and relative zoom.
- Best-of-N capture with Laplacian variance; pick the sharpest (optionally focus-jiggle retry).
- PlantNet identify and publish results + blur telemetry + optional alerts.