3D Handwriting Recognition for Apple Vision Pro

May 2026 · Personal project

Houdini · PyTorch · Core ML · visionOS · Swift

This project started with a simple observation: I already have an app for drawing in 3D space on Apple Vision Pro. When you pinch, you draw. When you release, the stroke ends. Each pinch-to-release gives you a precise sequence of 3D coordinates — a trajectory through space that represents whatever the user intended to draw.

The question is: can a machine learn to read it?

Back in primary school, during handwriting lessons, I would demonstratively start my letters from the wrong end. I thought it was absurd that we all had to do it the same way — my small act of rebellion. Stick it to the man. Think outside the box. I was successful, in the sense that I now have completely illegible handwriting. So I am either entirely the wrong person, or entirely the right person, for this task.

The idea

The goal is to recognise hand-drawn letters in 3D space and substitute them with proper 3D geometry — so you draw an M in the air, and an M appears. It sounds simple. It isn't.

Classifying handwritten characters from image data is one of the oldest problems in machine learning — MNIST, a dataset of handwritten digits, was a standard exercise when I first studied the field at university. But 2D is so last decade. We live in a spatial computing world now, and that world needs a classifier with an extra dimension.

No pretrained model exists for this specific problem — classifying a letter from the 3D trajectory a fingertip traces through the air. Gesture recognition models work on continuous joint movement over time, not on the accumulated shape of a stroke. Training from scratch wasn't just the more interesting choice; it was the only practical one.

See it in action

Watch until the end. Justice soundtrack included. You're welcome.

Or scroll down to read about the process.

The pipeline

The pipeline covers the full ML lifecycle: synthetic data generation, preprocessing, model training, and on-device deployment.

Rather than collecting thousands of real hand-drawn strokes, training data is generated procedurally in Houdini. Each letter is defined as a curve and run through a process that produces thousands of plausible variations — different sizes, orientations, and levels of noise — all correctly labelled. Adding a new letter to the training set takes minutes rather than hours.
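The variation step can be illustrated outside Houdini. The sketch below is a plain-Python stand-in, not the Houdini setup used in the project: it takes one template curve per letter and produces labelled copies with random scale, a random rotation about the vertical axis, and per-point noise. The function names, the parameter ranges and the two toy templates are all assumptions made for illustration.

```python
import math
import random


def make_variation(template, rng, scale_range=(0.5, 1.5),
                   max_yaw_deg=30.0, noise_std=0.01):
    """Return one randomised copy of a template stroke (list of (x, y, z)).

    All ranges are illustrative defaults, not values from the project.
    """
    scale = rng.uniform(*scale_range)
    yaw = math.radians(rng.uniform(-max_yaw_deg, max_yaw_deg))
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    varied = []
    for x, y, z in template:
        # Rotate about the vertical (Y) axis; y itself is unchanged.
        rx = cos_y * x + sin_y * z
        rz = -sin_y * x + cos_y * z
        # Scale uniformly, then add independent Gaussian jitter per axis.
        varied.append(tuple(
            scale * v + rng.gauss(0.0, noise_std) for v in (rx, y, rz)
        ))
    return varied


def make_dataset(templates, per_class, seed=0):
    """Build a balanced list of (stroke, label) pairs from template curves."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    return [
        (make_variation(curve, rng), label)
        for label, curve in templates.items()
        for _ in range(per_class)
    ]


# Toy templates, purely for demonstration.
templates = {
    "I": [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    "L": [(0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
}
dataset = make_dataset(templates, per_class=5)
```

Because every sample is derived from a labelled template, the labels are correct by construction and every class gets the same number of samples.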

The classifier is trained in PyTorch and deployed via Core ML, running entirely on-device on Apple Vision Pro. Particular attention was paid to making the model generalise beyond the training data — handling letters drawn at different scales, positions and angles, by different people.

Before deploying to the Vision Pro app, the full inference loop was first verified inside Houdini itself: feed a raw stroke in, get a predicted letter out. That tight iteration loop — Houdini for data generation, PyTorch for training, Houdini for testing — made it possible to move quickly.

The result

Draw a letter in the air. The stroke disappears and is replaced by proper 3D geometry — the same size and orientation as what you drew, with ambient occlusion, facing you. An unknown class handles nonsense input gracefully: if the model isn't confident, the stroke stays as a stroke rather than misfiring.

The underlying approach should carry over to any platform that exposes hand tracking data. Vision Pro today. Android XR tomorrow.

Technical details

Recognizing handwriting in 3D is a different problem from image-based OCR. The input is a sequence of spatial coordinates over time, not a grid of pixels, so classifiers built for 2D images don't transfer directly. I built the full pipeline from scratch: synthetic data generation, model training, and on-device deployment.

Data

Real 3D handwriting data is scarce. Rather than collect it manually, I generated synthetic training data procedurally in Houdini — each letter represented as a curve resampled to exactly 64 points in 3D space, with controlled variation in stroke style and scale. This gave me a clean, balanced dataset of 2,000 samples per class without the noise and class imbalance typical of collected data.
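Resampling to a fixed point count is what lets strokes of any length share one input shape. In the project this happens in Houdini; the sketch below shows one common way to do it in plain Python, assuming even spacing by arc length with linear interpolation between the original points. The function name is hypothetical, and the actual Houdini resampling may differ in detail.

```python
import math


def resample_stroke(points, n=64):
    """Resample a 3D polyline to exactly n points, evenly spaced by arc length."""
    if len(points) < 2 or n < 2:
        raise ValueError("need at least two input points and n >= 2")
    # Cumulative arc length at each input vertex.
    cum = [0.0]
    for a, b in zip(points, points[1:]):
        cum.append(cum[-1] + math.dist(a, b))
    total = cum[-1]
    if total == 0.0:
        # Degenerate stroke: every point coincides.
        return [tuple(points[0])] * n
    out = []
    seg = 0
    for i in range(n):
        target = total * i / (n - 1)
        # Advance to the segment that contains the target arc length.
        while seg < len(points) - 2 and cum[seg + 1] < target:
            seg += 1
        span = cum[seg + 1] - cum[seg]
        t = 0.0 if span == 0.0 else (target - cum[seg]) / span
        t = min(max(t, 0.0), 1.0)  # guard against floating-point overshoot
        a, b = points[seg], points[seg + 1]
        out.append(tuple(pa + t * (pb - pa) for pa, pb in zip(a, b)))
    return out


# A three-point straight line becomes 64 evenly spaced points.
resampled = resample_stroke([(0, 0, 0), (0.5, 0, 0), (1, 0, 0)])
```

The same function can be applied to a raw stroke at inference time, so that the model sees the same 64-point format it was trained on.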

Architecture

The input is a sequence of (x, y, z) coordinates, so I chose an LSTM over an MLP — letter shape is encoded in the order of points, not just their distribution. The model is a 3-layer LSTM (hidden size 64, dropout 0.2) with a linear classifier on the final hidden state. Local space normalization keeps the representation view-independent. For augmentation I used Y-axis rotation rather than PCA-based alignment — PCA introduces sign ambiguity across samples, which would create inconsistent orientations in the training data.
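The model described above maps onto PyTorch fairly directly. The following is a minimal sketch, not the project's actual code: the class name StrokeLSTM is hypothetical, while the layer count, hidden size, dropout, input width and class count are taken from the text. Normalization and augmentation are left out.

```python
import torch
from torch import nn


class StrokeLSTM(nn.Module):
    """3-layer LSTM over (x, y, z) points with a linear classification head."""

    def __init__(self, num_classes=8, hidden_size=64, num_layers=3, dropout=0.2):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=3,          # one (x, y, z) coordinate per time step
            hidden_size=hidden_size,
            num_layers=num_layers,
            dropout=dropout,       # applied between stacked LSTM layers
            batch_first=True,      # input shape: (batch, points, 3)
        )
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, strokes):
        # h_n has shape (num_layers, batch, hidden_size); take the top layer's
        # final hidden state as a summary of the whole stroke.
        _, (h_n, _) = self.lstm(strokes)
        return self.head(h_n[-1])  # raw logits, shape (batch, num_classes)


model = StrokeLSTM().eval()
with torch.no_grad():
    logits = model(torch.randn(4, 64, 3))  # 4 strokes of 64 points each
```

Because the LSTM consumes points in order, feeding the same stroke reversed will in general give different logits, which is the property an MLP over an unordered point set would lack.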

The classifier covers 7 letters plus an explicit "unknown" class for noise rejection. Confidence thresholding via logit margin means the system declines to classify ambiguous input rather than guessing wrong silently.
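One way to read "logit margin" is the gap between the highest and second-highest logit. The sketch below rejects a prediction when that gap is small or when the winning class is the unknown class. The function name, the threshold of 2.0 and the placeholder labels are assumptions for illustration; the project's actual threshold and letter set are not stated here.

```python
def classify_with_margin(logits, labels, min_margin=2.0, unknown_label="unknown"):
    """Return the predicted label, or None when the input should be rejected.

    min_margin is a hypothetical threshold; in practice it would be tuned
    on validation data.
    """
    if len(logits) != len(labels) or len(logits) < 2:
        raise ValueError("need one logit per label and at least two classes")
    # Indices sorted from highest to lowest logit.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    margin = logits[best] - logits[runner_up]
    # Reject if the model picked the noise class or the top two are too close.
    if labels[best] == unknown_label or margin < min_margin:
        return None
    return labels[best]


labels = ["A", "B", "M", "unknown"]  # placeholder classes
confident = classify_with_margin([0.1, 0.3, 4.0, 0.2], labels)  # clear winner
ambiguous = classify_with_margin([1.0, 1.2, 1.5, 0.9], labels)  # top two too close
```

A None result corresponds to the behaviour described earlier: the stroke stays a stroke instead of being replaced by a letter.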

Results

Peak validation accuracy: 97.75% across 8 classes (80/20 train/val split). The model stabilizes around 96.5–97% over later epochs. No separate held-out test set was used, so these are validation figures only and may be slightly optimistic. This is a known limitation, and it will matter more if the project is extended to the full alphabet.

Deployment

For validation, the model is exported to ONNX and tested inside Houdini: a raw stroke goes in and the predicted letter comes back out. For deployment to Apple Vision Pro, the model is exported separately, directly from PyTorch to Core ML via torch.jit.trace, and shipped as an .mlpackage. Inference runs on-device and is near-instant, with no server round-trip.

Status

Personal research project. The approach is currently being evaluated for integration into a production visionOS application at Trifork.