This document explains the integration of the custom-trained Indian Sign Language (ISL) ONNX model into the Vite/React application.
- File: `public/isl_model.onnx` (6.5 MB)
- Input Shape: `[1, 42]` (21 hand landmarks × 2 coordinates: x, y)
- Output: Probability distribution over 42 ISL gesture classes
- Framework: ONNX Runtime Web
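The 42-element input is a single flat vector with x and y interleaved per landmark. A minimal sketch of the layout (the `packLandmarks` helper is illustrative, not part of the app):

```typescript
// The model input is one flat vector: [x0, y0, x1, y1, ..., x20, y20].
const NUM_LANDMARKS = 21;

// Hypothetical packing helper: interleave x/y pairs into a Float32Array.
const packLandmarks = (points: { x: number; y: number }[]): Float32Array => {
  const out = new Float32Array(NUM_LANDMARKS * 2); // length 42
  points.forEach((p, i) => {
    out[2 * i] = p.x;     // even slots hold x
    out[2 * i + 1] = p.y; // odd slots hold y
  });
  return out;
};
```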
For each video frame, the model expects normalized hand landmark data:
- Extract Coordinates: Get x and y from all 21 MediaPipe hand landmarks
- Calculate Minimums: Find `min_x` and `min_y` for the current frame
- Normalize: Subtract the minimums from each coordinate: `normalized_x_i = x_i - min_x`, `normalized_y_i = y_i - min_y`
- Flatten: Convert to a Float32Array of length 42: `[x₀, y₀, x₁, y₁, ..., x₂₀, y₂₀]`
```typescript
const processLandmarks = (landmarks: any[]): Float32Array => {
  // Extract x and y coordinates
  const coords = landmarks.map(lm => ({ x: lm.x, y: lm.y }));

  // Calculate min_x and min_y
  const minX = Math.min(...coords.map(c => c.x));
  const minY = Math.min(...coords.map(c => c.y));

  // Normalize: subtract minimums and flatten
  const normalized: number[] = [];
  for (const coord of coords) {
    normalized.push(coord.x - minX);
    normalized.push(coord.y - minY);
  }

  return new Float32Array(normalized);
};
```

Install the dependencies:

```bash
npm install onnxruntime-web @mediapipe/hands @mediapipe/camera_utils
```

- onnxruntime-web: ONNX Runtime for browser-based inference
- @mediapipe/hands: Google MediaPipe Hands for hand landmark detection
- @mediapipe/camera_utils: Camera utilities for MediaPipe
```
User clicks Start
        ↓
Load ONNX Model (once)
        ↓
Initialize MediaPipe Hands
        ↓
Start Camera (1280×720)
        ↓
Process frames at ~30 FPS:
  1. Capture video frame
  2. MediaPipe detects 21 hand landmarks
  3. Normalize landmarks (min-subtraction)
  4. Run ONNX inference
  5. Get prediction + confidence
  6. Update UI
        ↓
Display results in real-time
```
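The loop above can be sketched as a small framework-agnostic function, with the MediaPipe and ONNX steps injected as callbacks. The names below are illustrative, not the app's actual API:

```typescript
type Landmark = { x: number; y: number };

// Hypothetical per-frame pipeline skeleton. detect/normalize/infer stand in
// for MediaPipe detection, min-subtraction normalization, and ONNX inference.
const processFrame = (
  detect: () => Landmark[] | null,
  normalize: (lms: Landmark[]) => Float32Array,
  infer: (input: Float32Array) => { label: string; confidence: number },
): { label: string; confidence: number } | null => {
  const landmarks = detect();          // steps 1-2: capture frame + detect hand
  if (!landmarks) return null;         // no hand in frame, skip inference
  const input = normalize(landmarks);  // step 3: min-subtraction normalize
  return infer(input);                 // steps 4-5: inference + confidence
};
```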
- ONNX model loads on component mount
- Camera starts only after model is ready
- Uses WebAssembly execution provider for performance
- 30 FPS hand tracking
- Live landmark visualization on canvas
- Inference latency tracking (~50-150ms typical)
- Large, clear prediction display
- Confidence score visualization
- FPS and latency metrics
- Conversation history with auto-transcript
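For the FPS metric, counting frames in a rolling one-second window is a common approach. A minimal sketch (not the component's actual implementation):

```typescript
// Hypothetical rolling FPS counter: keeps frame timestamps from the last
// second and reports how many frames landed in that window.
const makeFpsCounter = () => {
  const stamps: number[] = [];
  return (nowMs: number): number => {
    stamps.push(nowMs);
    // Drop timestamps older than one second.
    while (stamps.length && stamps[0] <= nowMs - 1000) stamps.shift();
    return stamps.length;
  };
};
```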
- Green landmarks and connections drawn on canvas
- 21 hand keypoints tracked
- Palm and finger connections rendered
The model recognizes the following gestures:
```typescript
const ISL_CLASSES = [
  'Hello', 'Thank you', 'Please', 'Help', 'Yes', 'No',
  'Good morning', 'How are you', 'Sorry', 'Welcome',
  'Goodbye', 'I', 'You', 'We', 'They', 'What', 'When',
  'Where', 'Why', 'How', 'Good', 'Bad', 'Happy', 'Sad',
  'Eat', 'Drink', 'Sleep', 'Work', 'Study', 'Play',
  'Family', 'Friend', 'Mother', 'Father', 'Brother', 'Sister',
  'Love', 'Like', 'Want', 'Need', 'Have', 'Go'
];
```

Note: Update this array to match your actual trained classes.
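Mapping the model's argmax index back to a label can fail silently if the index falls outside the class list. A small guarded lookup, assuming the same `ISL_CLASSES` array (shortened here for brevity):

```typescript
// Shortened class list for illustration; use the full 42-entry array.
const ISL_CLASSES = ['Hello', 'Thank you', 'Please', 'Help', 'Yes', 'No'];

// Guarded lookup: returns 'Unknown' for any out-of-range or non-integer index.
const labelFor = (index: number): string =>
  Number.isInteger(index) && index >= 0 && index < ISL_CLASSES.length
    ? ISL_CLASSES[index]
    : 'Unknown';
```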
- Single hand tracking (maxNumHands: 1)
- Model complexity: 1 (balanced)
- Confidence thresholds: 0.5 (detection), 0.5 (tracking)
- Direct canvas manipulation for hand landmarks
- No unnecessary re-renders
- Optimized drawing with requestAnimationFrame
- Proper cleanup on component unmount
- Camera stream stopped when session ends
- MediaPipe resources released
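Stopping the camera means stopping every track on the MediaStream. A structural sketch (typed loosely so it stands alone; the browser's `MediaStream` matches this shape):

```typescript
// Structural stand-ins for the browser's MediaStreamTrack / MediaStream.
interface TrackLike { stop(): void }
interface StreamLike { getTracks(): TrackLike[] }

// Stopping each track releases the camera and turns off its indicator light.
const stopCamera = (stream: StreamLike): void => {
  stream.getTracks().forEach(track => track.stop());
};
```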
- Click the green Play button
- Wait for "Position hands in frame" message
- Show ISL gestures to the camera
- View real-time predictions in the right panel
- Click the red Pause button
- Camera and processing stop immediately
- All resources are cleaned up
- Check browser console for errors
- Ensure `public/isl_model.onnx` exists
- Verify the file is not corrupted (should be ~6.5 MB)
- Grant camera permissions in browser
- Check if camera is already in use
- Try refreshing the page
- Ensure good lighting conditions
- Position hand clearly in frame
- Check if gestures match training data
- Verify confidence threshold (default: 0.5)
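One way to suppress low-confidence noise is to gate predictions on the threshold. A hypothetical sketch, using the default of 0.5 mentioned above:

```typescript
const CONFIDENCE_THRESHOLD = 0.5; // default mentioned above

// Returns the label only when confidence clears the threshold, else null.
const gatePrediction = (
  label: string,
  confidence: number,
): string | null => (confidence >= CONFIDENCE_THRESHOLD ? label : null);
```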
- Close other browser tabs
- Check CPU usage
- Consider reducing video resolution
- Ensure WebAssembly is enabled
- 0: Wrist
- 1-4: Thumb (CMC, MCP, IP, Tip)
- 5-8: Index finger (MCP, PIP, DIP, Tip)
- 9-12: Middle finger (MCP, PIP, DIP, Tip)
- 13-16: Ring finger (MCP, PIP, DIP, Tip)
- 17-20: Pinky (MCP, PIP, DIP, Tip)
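Given this indexing, the fingertip indices are the last point of each range. A small helper built on that (the names are illustrative):

```typescript
// Fingertip indices derived from MediaPipe's 21-point hand topology above.
const FINGERTIPS = { thumb: 4, index: 8, middle: 12, ring: 16, pinky: 20 };

// Pick just the five fingertip points out of a full 21-landmark array.
const fingertipsOf = <T>(landmarks: T[]): T[] =>
  Object.values(FINGERTIPS).map(i => landmarks[i]);
```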
1. Create Float32Array[42] from normalized landmarks
2. Create ONNX Tensor with shape [1, 42]
3. Run session.run() with input tensor
4. Extract output probabilities
5. Apply softmax for confidence scores
6. Return argmax as prediction

- Add support for two-handed gestures
- Implement gesture sequence recognition
- Add custom gesture training interface
- Optimize for mobile devices
- Add offline mode support
- Implement gesture smoothing/filtering
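The softmax and argmax steps from the inference flow above can be written in plain TypeScript. A sketch, independent of the ONNX Runtime tensor API:

```typescript
// Numerically stable softmax: subtract the max logit before exponentiating.
const softmax = (logits: number[]): number[] => {
  const max = Math.max(...logits);
  const exps = logits.map(v => Math.exp(v - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
};

// Argmax: index of the highest probability, i.e. the predicted class.
const argmax = (probs: number[]): number =>
  probs.reduce((best, v, i) => (v > probs[best] ? i : best), 0);
```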
- ONNX Runtime: Microsoft
- MediaPipe: Google
- Model Training: Custom ISL dataset
Last Updated: January 31, 2026 Version: 1.0.0