Multimodal Perception

// overview

An active research project exploring how autonomous robots can perceive and respond to human social cues in real time. The core innovation is a tokenization scheme that fuses vision, audio, and proprioceptive data streams into a single token sequence. Attention is allocated dynamically according to social salience, which keeps the token count small enough for edge deployment on NVIDIA Jetson hardware.
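
As a rough illustration of salience-driven allocation, the sketch below splits a fixed token budget across the three streams in proportion to a per-stream salience score. It is not the project's actual method: the function name `allocate_tokens`, the `floor` parameter, the example scores, and the proportional rule itself are all assumptions made for this example.

```python
from typing import Dict


def allocate_tokens(salience: Dict[str, float], budget: int, floor: int = 4) -> Dict[str, int]:
    """Split a fixed token budget across streams in proportion to salience.

    Hypothetical helper: the proportional rule and the per-stream floor are
    illustrative assumptions, not the project's published scheme.
    """
    if any(score < 0 for score in salience.values()):
        raise ValueError("salience scores must be non-negative")
    if budget < floor * len(salience):
        raise ValueError("budget is too small for the per-stream floor")

    # Every stream keeps a small floor so no modality is dropped entirely.
    alloc = {name: floor for name in salience}
    remaining = budget - floor * len(salience)

    total = sum(salience.values())
    if total > 0:
        shares = {name: remaining * score / total for name, score in salience.items()}
    else:
        # No salience signal at all: fall back to an even split.
        shares = {name: remaining / len(salience) for name in salience}

    # Largest-remainder rounding so the allocation sums exactly to the budget.
    for name, share in shares.items():
        alloc[name] += int(share)
    leftover = max(budget - sum(alloc.values()), 0)
    by_fraction = sorted(shares, key=lambda n: shares[n] - int(shares[n]), reverse=True)
    for name in by_fraction[:leftover]:
        alloc[name] += 1
    return alloc


# Example scores are made up for illustration.
scores = {"vision": 0.6, "audio": 0.3, "proprioception": 0.1}
allocation = allocate_tokens(scores, budget=64)
```

Fixing the total budget keeps the sequence length, and so the attention cost, bounded regardless of how the salience scores shift between streams, which is the property that matters on an edge device.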

// highlights

Unified multimodal tokenization scheme for vision + audio + proprioception

Dynamic attention allocation based on social salience scoring

Target: sub-100 ms end-to-end latency on Jetson Orin (275 TOPS INT8)

Vision-only prototype running at 30 fps on edge hardware
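
For context on the two numbers above: at 30 fps a new frame arrives roughly every 33.3 ms, so a 100 ms end-to-end budget spans about three frame periods. Throughput and latency are separate measurements, though, and a pipeline can sustain 30 fps while still exceeding the latency target. The sketch below shows one way a latency check against that budget might look. The helper names and the choice of the 95th percentile are assumptions for illustration, not part of the project.

```python
import time
from statistics import quantiles
from typing import Callable, List

LATENCY_BUDGET_MS = 100.0        # target stated in the highlights
FRAME_PERIOD_MS = 1000.0 / 30.0  # about 33.3 ms between frames at 30 fps


def measure_latency_ms(step: Callable[[], None], runs: int = 50) -> List[float]:
    """Time repeated calls to `step` and return per-call latency in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        step()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def p95(samples: List[float]) -> float:
    """95th-percentile latency (an assumed metric; needs at least two samples)."""
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return quantiles(samples, n=20)[-1]


def within_budget(samples: List[float], budget_ms: float = LATENCY_BUDGET_MS) -> bool:
    """True when the 95th-percentile latency is under the budget."""
    return p95(samples) < budget_ms


# Stand-in workload: replace the lambda with one full perception step.
samples = measure_latency_ms(lambda: None, runs=20)
ok = within_budget(samples)
```

Checking a tail percentile instead of the mean is a deliberate choice here: occasional slow frames are what a person interacting with the robot would notice.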