EvMic: Event-based Non-contact Sound Recovery from Effective Spatial-temporal Modeling

¹Shanghai AI Laboratory, ²Dalian University of Technology,
³The Chinese University of Hong Kong, ⁴Beihang University

*Indicates Equal Contribution, Indicates Corresponding Author

We propose a novel non-contact sound recovery system based on event cameras.

Abstract

When sound waves hit an object, they induce vibrations that produce subtle, high-frequency visual changes, which can be used to recover the sound. Earlier approaches face trade-offs among sampling rate, bandwidth, field of view, and simplicity of the optical path. Recent advances in event camera hardware show good potential for visual sound recovery, owing to event cameras' superior ability to capture high-frequency signals. However, existing event-based vibration recovery methods remain sub-optimal for sound recovery. In this work, we propose a novel pipeline for non-contact sound recovery that fully utilizes spatial-temporal information from the event stream. We first generate a large training set using a novel simulation pipeline. We then design a network that leverages the sparsity of events to capture spatial information and uses Mamba to model long-term temporal information. Lastly, we train a spatial aggregation block to aggregate information from different locations and further improve signal quality. To capture event signals caused by sound waves, we also design an imaging system that uses a laser matrix to enhance the gradient, and we collect multiple data sequences for testing. Experimental results on synthetic and real-world data demonstrate the effectiveness of our method. Our code and data will be publicly available upon acceptance.

Method

(a) Overview of our proposed network architecture. The event stream is processed into voxel grids, from which patches centered around the speckles are selected. First, the patches are input into a lightweight backbone based on sparse convolution to extract visual features. Next, a spatial attention block (SAB) aggregates the information from the different patches. Finally, Mamba is employed to model long-term temporal information and reconstruct the audio that caused the object's vibration. (b) and (c) illustrate the detailed structure of the SAB and the SSM, respectively. In (c), at time t, g_t is the input feature, o_t is the output, and h_t denotes the hidden state. A, B, and C are the gating weights optimized by Mamba. Δ is used to discretize the continuous parameters A and B.
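The two ends of the pipeline in the caption, converting events into a voxel grid and running the state-space recurrence over per-step features, can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `events_to_voxel_grid` and `ssm_scan` are hypothetical, the linear temporal binning is a common convention that this page does not confirm, and the recurrence uses a diagonal A with the simplified discretization (exp(Δ·A) for A, Δ·B for B) commonly used with Mamba-style models. The sparse-convolution backbone and the spatial attention block are omitted.

```python
import numpy as np


def events_to_voxel_grid(t, x, y, p, num_bins, height, width):
    """Accumulate events into a (num_bins, height, width) voxel grid.

    Assumption: each event's polarity is split linearly between its two
    nearest temporal bins. The binning actually used by EvMic is not
    stated on this page.
    """
    grid = np.zeros((num_bins, height, width), dtype=np.float64)
    t = np.asarray(t, dtype=np.float64)
    if t.size == 0:
        return grid
    span = max(float(t.max() - t.min()), 1e-12)
    # Map timestamps onto the continuous bin axis [0, num_bins - 1].
    tn = (t - t.min()) / span * (num_bins - 1)
    lo = np.floor(tn).astype(np.int64)
    hi = np.minimum(lo + 1, num_bins - 1)
    w_hi = tn - lo
    # np.add.at accumulates correctly when several events hit one cell.
    np.add.at(grid, (lo, y, x), p * (1.0 - w_hi))
    np.add.at(grid, (hi, y, x), p * w_hi)
    return grid


def ssm_scan(g, A, B, C, delta):
    """Run a discretized diagonal state-space recurrence.

    g:     (T, D) input features g_t
    A:     (D, N) continuous diagonal state parameters (negative = stable)
    B, C:  (T, N) per-step input and output projections
    delta: (T, D) positive step sizes (the caption's Δ)
    Returns o with shape (T, D), where o_t = C_t · h_t.
    """
    T, D = g.shape
    h = np.zeros((D, A.shape[1]))  # hidden state h_t, one row per channel
    o = np.empty((T, D))
    for step in range(T):
        d = delta[step][:, None]
        A_bar = np.exp(d * A)              # discretized A
        B_bar = d * B[step][None, :]       # simplified discretized B (assumption)
        h = A_bar * h + B_bar * g[step][:, None]
        o[step] = h @ C[step]
    return o
```

In this sketch, `g` would hold one feature vector per time step after spatial aggregation, and `o` would be passed to whatever output head produces the audio waveform. A real Mamba layer additionally computes B, C, and Δ from the input and uses a parallel scan rather than a Python loop.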

Video Presentation

Experiment Results

[Figure: experiment setup and visualized events for each scene]
Capturing a chip bag while playing MIDI/speech audio of "Mary had a little lamb..."
Capturing a speaker while playing MIDI/speech audio of "Mary had a little lamb..."
[Audio and spectrograms for each scene: microphone recording vs. recovered sound]

BibTeX

@misc{yin2025evmiceventbasednoncontactsound,
        title={EvMic: Event-based Non-contact sound recovery from effective spatial-temporal modeling}, 
        author={Hao Yin and Shi Guo and Xu Jia and Xudong XU and Lu Zhang and Si Liu and Dong Wang and Huchuan Lu and Tianfan Xue},
        year={2025},
        eprint={2504.02402},
        archivePrefix={arXiv},
        primaryClass={cs.SD},
        url={https://arxiv.org/abs/2504.02402}, 
      }