
Robotics

Multimodal Perception

Research
Transformers, ROS2, Python, CUDA, Edge AI

An active research project exploring how autonomous robots can perceive and respond to human social cues in real time. The core innovation is a tokenization scheme that fuses vision, audio, and proprioceptive data streams into a unified sequence, using dynamic attention allocation based on social salience to keep the token count manageable for edge deployment on NVIDIA Jetson hardware.

Unified multimodal tokenization scheme for vision + audio + proprioception

Dynamic attention allocation based on social salience scoring

Target: sub-100ms end-to-end latency on Jetson Orin (275 TOPS INT8)

Vision-only prototype running at 30fps on edge hardware
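One way to picture salience-driven allocation is as a fixed token budget split across the three streams. The sketch below is illustrative only and is not the project's implementation: the function name `allocate_token_budget`, the softmax weighting, the per-stream floor, and the example salience scores are all assumptions made for this example.

```python
import math


def allocate_token_budget(salience, total_tokens, min_tokens=4, temperature=1.0):
    """Split a fixed token budget across streams in proportion to softmax(salience).

    Hypothetical helper: `salience` maps a stream name to a scalar score
    produced by some upstream salience model (not shown here).
    """
    names = list(salience)
    if total_tokens < min_tokens * len(names):
        raise ValueError("budget too small for the per-stream floor")

    # Numerically stable softmax over the salience scores.
    peak = max(salience.values())
    weights = {n: math.exp((salience[n] - peak) / temperature) for n in names}
    norm = sum(weights.values())

    # Reserve the floor for every stream, then share the remainder by weight.
    spare = total_tokens - min_tokens * len(names)
    shares = {n: spare * weights[n] / norm for n in names}
    budget = {n: min_tokens + int(shares[n]) for n in names}

    # Hand out tokens lost to truncation, largest fractional part first.
    leftover = total_tokens - sum(budget.values())
    by_fraction = sorted(names, key=lambda n: shares[n] - int(shares[n]), reverse=True)
    for n in by_fraction[:leftover]:
        budget[n] += 1
    return budget


# Example: a person is visible and speaking softly while the robot is stationary.
# The scores are made up for illustration.
budget = allocate_token_budget(
    {"vision": 2.0, "audio": 0.5, "proprioception": 0.1},
    total_tokens=64,
)
print(budget)
```

The floor keeps every modality represented even when its salience is low, while the overall sequence length stays constant, which is the property that matters for a fixed latency target on edge hardware.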


Joey Schnepel — Phoenix, AZ — 2026