Multimodal tokenization for real-time social perception

How do you give a robot the ability to read a room? Not just detect faces or track movement, but understand the social dynamics of the space it occupies?

The problem

Current robotic perception systems treat modalities in isolation. Vision processes frames. Audio processes waveforms. Proprioception handles balance. But social perception requires fusing these streams into a single coherent representation, and doing it fast enough to respond naturally.

The approach

I'm exploring a tokenization scheme that converts multimodal sensor data into a unified token sequence. The key insight is that not all parts of a scene need the same token resolution. A raised eyebrow matters more than a static wall. By dynamically allocating tokens based on social salience, so that salient regions get more tokens and background gets fewer, we can keep the token count manageable for edge deployment.
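To make the allocation idea concrete, here is a minimal sketch of splitting a fixed token budget across scene regions in proportion to a salience score. This is not the prototype's code: the `Region` type, the `allocate_tokens` function, the salience values, and the budget of 64 are all hypothetical, and the sketch assumes a salience score per region already exists.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str        # hypothetical label, e.g. "face" or "wall"
    salience: float  # assumed social-salience score, higher means more important


def allocate_tokens(regions, budget, floor=1):
    """Split `budget` tokens across regions in proportion to salience.

    Every region gets at least `floor` tokens. The rest is shared
    proportionally, with largest-remainder rounding for the leftovers.
    """
    if not regions:
        return {}
    if budget < floor * len(regions):
        raise ValueError("budget too small for the per-region floor")

    spare = budget - floor * len(regions)
    total = sum(r.salience for r in regions)
    if total <= 0:
        # No salience signal at all: fall back to an even split.
        shares = [spare / len(regions)] * len(regions)
    else:
        shares = [spare * r.salience / total for r in regions]

    counts = [floor + int(s) for s in shares]
    leftover = max(0, budget - sum(counts))

    # Give leftover tokens to the regions with the largest fractional share.
    order = sorted(
        range(len(regions)),
        key=lambda i: shares[i] - int(shares[i]),
        reverse=True,
    )
    for i in order[:leftover]:
        counts[i] += 1
    return {r.name: c for r, c in zip(regions, counts)}


# Illustrative values only; none of these are measured.
scene = [Region("face", 0.6), Region("hand", 0.3), Region("wall", 0.1)]
print(allocate_tokens(scene, budget=64))
```

The floor keeps low-salience regions from disappearing entirely, which seems useful if something in the background suddenly becomes relevant. A real system would presumably recompute the scores every frame.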

Edge constraints

The target hardware is an NVIDIA Jetson Orin. The top AGX Orin module is rated at up to 275 TOPS of INT8 inference, and that is a peak figure rather than sustained throughput. It sounds like a lot until you're running vision, audio processing, motor control, and a transformer backbone simultaneously. Every unnecessary token is latency we can't afford.
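A rough way to see why token count matters so much: the attention part of a transformer layer grows with the square of the sequence length. The sketch below uses the standard back-of-envelope operation count for one layer. The function name `layer_ops` and the model dimensions are assumptions for illustration, not the backbone used here, and the count ignores softmax, normalization, and memory traffic.

```python
def layer_ops(n_tokens, d_model, mlp_ratio=4):
    """Approximate op count for one transformer layer (2 ops per multiply-add)."""
    # Q, K, V, and output projections: four (n x d) by (d x d) matmuls.
    proj = 8 * n_tokens * d_model ** 2
    # Attention scores (Q K^T) plus the weighted sum over V: quadratic in n.
    attn = 4 * n_tokens ** 2 * d_model
    # Two MLP matmuls, d -> mlp_ratio*d -> d.
    mlp = 4 * mlp_ratio * n_tokens * d_model ** 2
    return proj + attn + mlp


# Hypothetical dimensions, chosen only to show the scaling.
d_model = 512
base = layer_ops(1024, d_model)
doubled = layer_ops(2048, d_model)
print(f"ops ratio when doubling tokens: {doubled / base:.2f}")
```

With these assumed dimensions, doubling the sequence from 1024 to 2048 tokens raises the estimate by 2.5x rather than 2x, and the gap widens as sequences get longer relative to the model width.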

What's next

The current prototype handles vision-only tokenization at 30 fps on the Jetson. The next step is fusing audio embeddings without blowing the latency budget. The goal is sub-100 ms end-to-end latency for a socially aware response.
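For a sense of scale, 30 fps means a new frame every 33 ms or so, so a 100 ms end-to-end target is about three frame periods. The sketch below is a simple budget check. The stage names and every per-stage latency are placeholders I made up for illustration, not measurements from the prototype.

```python
BUDGET_MS = 100.0  # end-to-end target stated above
FRAME_RATE_FPS = 30


def frame_period_ms(fps):
    """Time between consecutive frames, in milliseconds."""
    return 1000.0 / fps


def headroom_ms(stage_latencies_ms, budget_ms=BUDGET_MS):
    """Remaining budget after summing sequential stage latencies."""
    return budget_ms - sum(stage_latencies_ms.values())


# Placeholder numbers, NOT measured on the Jetson.
stages = {
    "vision_tokenize": 30.0,
    "audio_embed": 15.0,
    "backbone": 35.0,
    "response": 10.0,
}

print(f"frame period: {frame_period_ms(FRAME_RATE_FPS):.1f} ms")
print(f"headroom: {headroom_ms(stages):.1f} ms")
```

Note that sustaining 30 fps is a throughput figure, while the 100 ms goal is a latency figure. A pipelined system can hit the first and still miss the second, so the sum of the stages on the critical path is what has to fit.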