Multimodal Perception
// overview
An active research project exploring how autonomous robots can perceive and respond to human social cues in real time. The core innovation is a tokenization scheme that fuses vision, audio, and proprioceptive data streams into a single token sequence. Attention is allocated dynamically according to social salience, which keeps the token count small enough for edge deployment on NVIDIA Jetson hardware.
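One way to picture salience-driven allocation is as a fixed token budget split across modalities. The sketch below is illustrative only and is not the project's implementation: the `Token` class, the `allocate_budget` and `fuse` functions, the per-modality floor, and the use of mean salience as the weighting signal are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Token:
    modality: str    # "vision", "audio", or "proprio"
    index: int       # position within its own stream
    salience: float  # hypothetical social-salience score in [0, 1]


def allocate_budget(scores, total_budget, floor=1):
    """Split total_budget across modalities in proportion to mean salience.

    Every modality keeps at least `floor` tokens so no stream goes dark.
    """
    names = list(scores)
    spare = total_budget - floor * len(names)
    if spare < 0:
        raise ValueError("budget too small for the per-modality floor")
    means = {m: (sum(s) / len(s) if s else 0.0) for m, s in scores.items()}
    total = sum(means.values())
    # Fall back to an even split when nothing is salient.
    weights = {m: (means[m] / total if total > 0 else 1 / len(names))
               for m in names}
    exact = {m: spare * weights[m] for m in names}
    budget = {m: floor + int(exact[m]) for m in names}
    # Hand out tokens lost to rounding, largest fractional part first.
    leftover = max(total_budget - sum(budget.values()), 0)
    by_fraction = sorted(names, key=lambda m: exact[m] - int(exact[m]),
                         reverse=True)
    for m in by_fraction[:leftover]:
        budget[m] += 1
    return budget


def fuse(scores, total_budget):
    """Keep the most salient tokens per modality and emit one sequence."""
    budget = allocate_budget(scores, total_budget)
    sequence = []
    for m, s in scores.items():
        top = sorted(range(len(s)), key=lambda i: s[i], reverse=True)[:budget[m]]
        # Restore temporal order inside each modality after selection.
        sequence.extend(Token(m, i, s[i]) for i in sorted(top))
    return sequence


# Made-up salience scores, purely for illustration.
example_scores = {
    "vision": [0.9, 0.1, 0.8, 0.7],
    "audio": [0.2, 0.4],
    "proprio": [0.1],
}
fused = fuse(example_scores, total_budget=5)
```

With these made-up scores, vision and audio each keep two tokens and proprioception keeps one. A real system would presumably score salience with a learned model and interleave tokens by timestamp rather than concatenating by modality.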
// highlights
Unified multimodal tokenization scheme for vision + audio + proprioception
Dynamic attention allocation based on social salience scoring
Target: sub-100 ms end-to-end latency on Jetson AGX Orin (up to 275 TOPS INT8)
Vision-only prototype running at 30fps on edge hardware
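The latency target and the frame rate are related but distinct constraints: at 30 fps a new frame arrives roughly every 33.3 ms, while the 100 ms target bounds the time from capture to response. The sketch below checks both, assuming the stages run concurrently as a pipeline. The stage names and timings are placeholders for illustration, not measurements from this project.

```python
FRAME_RATE_HZ = 30
LATENCY_TARGET_MS = 100.0


def frame_period_ms(fps):
    """Time between consecutive frames at the given frame rate."""
    return 1000.0 / fps


def check_pipeline(stage_ms, fps=FRAME_RATE_HZ, target_ms=LATENCY_TARGET_MS):
    """Check a pipelined design against the latency and throughput targets.

    End-to-end latency is the sum of the stage times. Throughput is limited
    by the slowest stage, assuming each stage runs on its own worker.
    """
    end_to_end = sum(stage_ms.values())
    bottleneck = max(stage_ms.values())
    return {
        "end_to_end_ms": end_to_end,
        "meets_latency": end_to_end < target_ms,
        "sustains_fps": bottleneck <= frame_period_ms(fps),
    }


# Placeholder stage timings in milliseconds (hypothetical, not measured).
placeholder_stages = {
    "capture": 10.0,
    "tokenize": 25.0,
    "attend": 30.0,
    "act": 20.0,
}
report = check_pipeline(placeholder_stages)
```

Under these placeholder numbers the pipeline totals 85 ms and its slowest stage fits inside one frame period, so both checks pass. A single stage above about 33.3 ms would break the 30 fps rate even if the total stayed under 100 ms.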
