ML-Powered Gesture Recognition for Apple Vision Pro
June 2026 · Trifork
Houdini · PyTorch · Core ML · visionOS · Swift
When you build virtual training applications and user testing tools for Apple Vision Pro, you quickly run into a problem: the interactions people perform in XR are highly specific. A technician grabbing a virtual valve. A trainee pinching a component. A user giving a thumbs up to confirm a selection. Recognising these gestures reliably matters — and the rule-based approaches we'd been using, measuring distances between finger joints and checking angle thresholds, only get you so far.
The goal was a machine learning classifier that could recognise a defined set of hand gestures directly from Apple Vision Pro's hand tracking data, running on-device via Core ML.
The pipeline
Data collection. The first challenge with any ML project is getting labelled training data. For gesture recognition on Vision Pro, that means someone has to actually perform the gestures while the system records what the hand joints are doing. I built a dedicated visionOS data collection app for this — more on that below. The app captures a snapshot of all 27 hand joint positions from Apple's HandSkeleton at the moment a gesture is performed, and logs them as JSON with a label. The result is a structured dataset of real hand poses, collected directly on the device the model will eventually run on.
Preprocessing. Raw joint positions are in world space — they depend on where in the room the user is standing and where their hand is. That's not useful for a classifier that needs to recognise shape, not location. Each sample is normalised by subtracting the wrist position from all joints, so the wrist becomes the origin and everything else is relative to it. The model learns what the hand looks like, not where it is.
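The wrist normalisation above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual code: the function names are hypothetical, and it assumes each sample is a list of 27 (x, y, z) tuples with the wrist stored first.

```python
from typing import List, Tuple

Joint = Tuple[float, float, float]

# Assumption: the wrist is the first joint in each recorded sample.
WRIST_INDEX = 0


def normalise_to_wrist(joints: List[Joint], wrist_index: int = WRIST_INDEX) -> List[Joint]:
    """Express every joint relative to the wrist, so the wrist becomes the origin."""
    wx, wy, wz = joints[wrist_index]
    return [(x - wx, y - wy, z - wz) for x, y, z in joints]


def flatten(joints: List[Joint]) -> List[float]:
    """Flatten 27 (x, y, z) joints into a single 81-float feature vector."""
    return [coordinate for joint in joints for coordinate in joint]
```

After this step two samples of the same hand shape recorded in different parts of the room produce the same feature vector, provided the hand is oriented the same way.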
Subtracting the wrist removes translation but not rotation: the joint offsets are still expressed in world axes, so the data is orientation-dependent, and gestures recorded facing one direction fail when the user faces another. The fix is Y-axis rotation augmentation: each sample is duplicated with eight random rotations around the vertical axis, multiplying the dataset size ninefold without collecting a single additional real sample.
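A rough sketch of the rotation augmentation, again in plain Python. It assumes the sample has already been made wrist-relative (so the rotation pivots on the wrist) and that angles are drawn uniformly from a full turn; the text only says the rotations are random, so the distribution and the function names are assumptions.

```python
import math
import random
from typing import List, Optional, Tuple

Joint = Tuple[float, float, float]


def rotate_about_y(joints: List[Joint], angle: float) -> List[Joint]:
    """Rotate joints by `angle` radians around the vertical (Y) axis."""
    c, s = math.cos(angle), math.sin(angle)
    # Standard Y-axis rotation: y is unchanged, x and z mix.
    return [(c * x + s * z, y, -s * x + c * z) for x, y, z in joints]


def augment(
    joints: List[Joint],
    copies: int = 8,
    rng: Optional[random.Random] = None,
) -> List[List[Joint]]:
    """Return the original sample followed by `copies` randomly rotated duplicates."""
    rng = rng or random.Random()
    rotated = [
        rotate_about_y(joints, rng.uniform(0.0, 2.0 * math.pi))
        for _ in range(copies)
    ]
    return [joints] + rotated
```

With the default of eight copies, each recorded sample yields nine training samples, matching the ninefold increase described above. Every copy keeps the label of the sample it was derived from.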
Model. Gesture recognition from a static hand pose is a classification problem, not a sequence problem: unlike my 3D handwriting project, there's no temporal dimension to model. The input is a flat array of 81 floats (27 joints × 3 coordinates), and the output is one of six gesture classes to begin with: Neutral, Pinch, Grab, Claw, Thumbs Up, Thumbs Down. A small MLP with two hidden layers handles this cleanly. Training takes seconds.
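A classifier of this shape might look roughly like the following in PyTorch. Only the input size (81), the number of classes (six) and the two hidden layers come from the description above; the hidden width of 64, the ReLU activations, the cross-entropy loss and the names `GestureMLP` and `train_step` are illustrative assumptions.

```python
import torch
from torch import nn

GESTURES = ["Neutral", "Pinch", "Grab", "Claw", "Thumbs Up", "Thumbs Down"]


class GestureMLP(nn.Module):
    """Two-hidden-layer MLP mapping 81 joint coordinates to gesture logits."""

    def __init__(self, n_inputs=81, hidden=64, n_classes=len(GESTURES)):
        super().__init__()
        # hidden=64 is an assumed width, not a value from the article.
        self.net = nn.Sequential(
            nn.Linear(n_inputs, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),  # raw logits, no softmax
        )

    def forward(self, x):
        return self.net(x)


def train_step(model, optimiser, features, labels):
    """Run one optimisation step on a batch and return the loss value."""
    model.train()
    optimiser.zero_grad()
    loss = nn.functional.cross_entropy(model(features), labels)
    loss.backward()
    optimiser.step()
    return loss.item()
```

The model returns raw logits rather than probabilities, which keeps it compatible with a logit-margin check at inference time.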
Deployment. The trained model exports directly from PyTorch to a .mlpackage via torch.jit.trace and Core ML Tools. In Swift, inference is a matter of normalising the current joint positions the same way as in training, packing them into an MLMultiArray, and calling model.prediction(). The model is cached at module level so it doesn't reload on every call. Predictions run every 200 ms in a background task, which is fast enough to feel responsive and light enough not to compete with hand tracking.
A confidence check filters out uncertain predictions: if the margin between the top logit and the second-highest falls below a threshold, the result defaults to Neutral. This prevents the model from committing to a gesture when the hand is in an ambiguous position.
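The margin rule is easy to express on its own. The sketch below is written in Python for consistency with the training code, although the app applies the rule in Swift. The threshold value of 2.0 and the function name are hypothetical; the article does not state the value used.

```python
from typing import Sequence

GESTURES = ["Neutral", "Pinch", "Grab", "Claw", "Thumbs Up", "Thumbs Down"]


def classify_with_margin(
    logits: Sequence[float],
    min_margin: float = 2.0,  # assumed value, tune on real data
    labels: Sequence[str] = GESTURES,
) -> str:
    """Return the top class, or "Neutral" if it does not clearly beat the runner-up."""
    # Indices of the classes, ordered from highest to lowest logit.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    top, runner_up = ranked[0], ranked[1]
    margin = logits[top] - logits[runner_up]
    return labels[top] if margin >= min_margin else "Neutral"
```

For example, logits of `[0.1, 5.0, 0.2, 0.0, -1.0, 0.3]` give a margin of 4.7 and return "Pinch", while `[0.1, 5.0, 4.5, 0.0, -1.0, 0.3]` give a margin of 0.5 and fall back to "Neutral".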
The data collection app
Building the training pipeline required a purpose-built tool. The visionOS app has two modes: Training and Inference, each in its own window.
In Training mode, a segmented picker selects the active hand — left or right — which determines what gets tracked, visualised, and logged. Small spheres render on all 27 joints of the active hand in the immersive space, replacing the passthrough hand with a joint skeleton so you can verify tracking is working before collecting data. Six buttons correspond to the six gesture classes. Tapping one captures a snapshot and appends it to a JSON file on device, with per-hand sample counters so you know when each class has enough coverage.
One practical detail worth noting: the capture buttons are filtered by chirality. The hand being tracked cannot trigger its own capture buttons — only the opposite hand can. This prevents two specific problems: accidental logging when the tracked hand drifts near the UI, and — more critically — training gestures like Pinch being picked up by visionOS as a native pinch interaction, accidentally activating a button the user happens to be looking at.
An Undo button removes the last logged entry in case of a mistap. A Share button exports the JSON file via the system share sheet for AirDrop transfer to a Mac for training.
Inference mode runs the Core ML model continuously, displaying the current prediction and logit margin in a separate window with its own hand picker. This lets you verify the quality of the model in practice by testing it on different people, different hand sizes, and different ways of performing the same gesture. Loss and validation accuracy tell you how well the model fits the data you collected. Live testing tells you whether it actually works.
Results
Accuracy on the six gesture classes is strong with relatively little training data, and improves further as more samples are collected. The wrist normalisation and rotation augmentation together mean the classifier handles different users, different hand sizes, and different orientations without retraining — which matters in a tool that needs to work reliably across a variety of users and environments.
The same pipeline — data collection app, preprocessing, MLP classifier, Core ML deployment — can be extended to new gesture classes by adding a button, collecting samples, and retraining. For virtual training and user testing applications where the set of meaningful interactions is known in advance, that's a practical and maintainable approach.