Original link: https://www.codewoody.com/posts/34553/
This article summarizes the paper "Adaptive Multi-scale Detection of Acoustic Events" (AdaMD), published in 2019 by Prof. He Liang's team at the Department of Electronics, Tsinghua University.
The goal of acoustic event detection (AED, also called sound event detection, SED) is to predict the temporal location of target events in a given audio segment. This task plays an important role in security monitoring, acoustic early warning, and other scenarios. However, the scarcity of data and the diversity of acoustic event sources make AED a daunting problem, especially for the prevailing data-driven approaches. The paper starts from an analysis of the time-frequency characteristics of acoustic events and shows that different acoustic events exhibit different time-frequency scales. Inspired by this analysis, the authors propose an Adaptive Multi-scale Detection (AdaMD) method. Using an hourglass neural network and a gated recurrent unit (GRU) module, AdaMD produces multiple predictions at different temporal and frequency resolutions, and an adaptive training algorithm then combines these multi-scale predictions to enhance overall performance. Experimental results on DCASE 2017 task 2, DCASE 2016 task 3, and DCASE 2017 task 3 show that AdaMD outperforms the published state-of-the-art competitors on both event error rate (ER) and F1 score. Validation experiments on the authors' own factory machinery dataset further demonstrate the noise robustness of AdaMD, supporting its practical applicability.
1 Introduction
The main challenges of the AED problem:
- The data is extremely unbalanced;
- Events have diverse characteristics;
- Diverse time-frequency scales: different event types span different durations in the time domain and different ranges in the frequency domain, which undermines the validity of detection models built on fixed-length, fixed-resolution audio input.
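The scale issue above can be made concrete with the classic STFT trade-off: a short analysis window gives fine time resolution (suited to brief impulsive events), while a long window gives fine frequency resolution (suited to sustained tonal events). The sketch below, with hypothetical window/hop sizes not taken from the paper, computes magnitude spectrograms of the same signal at two resolutions:

```python
import numpy as np

def stft_mag(x, win, hop):
    """Magnitude STFT with a Hann window; frames of length `win`, stride `hop`."""
    n_frames = 1 + (len(x) - win) // hop
    w = np.hanning(win)
    frames = np.stack([x[i * hop : i * hop + win] * w for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # shape: (n_frames, win // 2 + 1)

sr = 16000
x = np.random.randn(sr)  # 1 s of noise as a stand-in signal

# Short window: fine time resolution (good for impacts such as glass breaking)
fine = stft_mag(x, win=256, hop=128)
# Long window: fine frequency resolution (good for tonal events such as sirens)
coarse = stft_mag(x, win=2048, hop=512)
print(fine.shape, coarse.shape)  # (124, 129) (28, 1025)
```

A detector fixed to one of these resolutions is biased toward one family of events; AdaMD's multi-scale design is motivated by exactly this mismatch.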
The network architecture of AdaMD is shown in the figure above. It consists of a CNN part and an RNN part. The CNN part uses the hourglass architecture, which is widely used for keypoint detection in computer vision; its advantage is that it extracts features at multiple time-frequency resolutions. In the RNN part, the authors adopt a Gated Recurrent Unit (GRU) module to model the temporal information in each channel output by the CNN. The GRU outputs are then upsampled so that the predictions from all channels have the same size.
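To make the CNN-plus-GRU pipeline concrete, here is a minimal PyTorch sketch of the idea. All layer sizes are hypothetical, the hourglass's upsampling path and skip connections are omitted, and the multi-scale predictions are simply averaged rather than combined by the paper's adaptive training algorithm; only the overall flow (multi-resolution features, per-scale GRU, upsampling to a common time length) follows the description above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniAdaMD(nn.Module):
    """Sketch: CNN features at two resolutions -> per-scale GRU ->
    upsample to a common time length -> frame-level event probabilities."""

    def __init__(self, n_events=3, ch=16, hidden=32):
        super().__init__()
        self.down1 = nn.Conv2d(1, ch, 3, stride=2, padding=1)   # 1/2 resolution
        self.down2 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)  # 1/4 resolution
        self.gru1 = nn.GRU(ch, hidden, batch_first=True)
        self.gru2 = nn.GRU(ch, hidden, batch_first=True)
        self.head1 = nn.Linear(hidden, n_events)
        self.head2 = nn.Linear(hidden, n_events)

    def _branch(self, feat, gru, head, t_out):
        # feat: (B, C, T, F) -> pool over frequency -> sequence (B, T, C)
        seq = feat.mean(dim=3).transpose(1, 2)
        out, _ = gru(seq)
        logits = head(out)  # (B, T, n_events)
        # Upsample this branch's prediction to the common time length t_out
        logits = F.interpolate(logits.transpose(1, 2), size=t_out,
                               mode="linear", align_corners=False)
        return logits.transpose(1, 2)

    def forward(self, spec):  # spec: (B, 1, T, F) log-mel spectrogram
        t_out = spec.shape[2]
        f1 = torch.relu(self.down1(spec))
        f2 = torch.relu(self.down2(f1))
        p1 = self._branch(f1, self.gru1, self.head1, t_out)
        p2 = self._branch(f2, self.gru2, self.head2, t_out)
        # Plain average here; the paper instead learns adaptive weights
        return torch.sigmoid((p1 + p2) / 2)

model = MiniAdaMD()
probs = model(torch.randn(2, 1, 64, 40))  # batch of 2 spectrograms
print(probs.shape)  # torch.Size([2, 64, 3])
```

Each branch sees the input at a different time-frequency resolution, so brief and sustained events can each be captured at the scale that suits them.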
2 Categories of Sound Event Detection Tasks
- Multi-event detection: multiple events may occur within the same time period, so the detection module must not only detect whether an event occurs but also identify its category.
- Weakly supervised event detection: ideally, the labeled data would contain both the category and the onset/offset times of each event, but such annotation is very labor-intensive. Letting the model learn event onset and offset times when only category labels are available is a challenge that has not yet been solved well.
- Anomaly detection: how can we detect anomalous events that were never seen during training?
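In the multi-event setting described above, a model like the one the paper describes outputs per-frame, per-class probabilities, and a post-processing step turns them into event intervals. The helper below is a hypothetical illustration (not from the paper) of the common thresholding approach, returning frame-index intervals per class:

```python
import numpy as np

def frames_to_events(probs, threshold=0.5):
    """Turn per-frame, per-class probabilities of shape (T, n_events) into a
    list of (class, onset_frame, offset_frame) intervals by thresholding."""
    events = []
    for c in range(probs.shape[1]):
        active = probs[:, c] >= threshold
        # Rising/falling edges of the padded activity mask mark onsets/offsets
        edges = np.diff(np.concatenate(([0], active.astype(int), [0])))
        onsets = np.where(edges == 1)[0]
        offsets = np.where(edges == -1)[0]
        for on, off in zip(onsets, offsets):
            events.append((c, int(on), int(off)))
    return events

probs = np.zeros((10, 2))
probs[2:5, 0] = 0.9   # class 0 active in frames 2..4
probs[6:9, 1] = 0.8   # class 1 active in frames 6..8
print(frames_to_events(probs))  # [(0, 2, 5), (1, 6, 9)]
```

Multiplying the frame indices by the hop duration converts these intervals to seconds; note that overlapping events of different classes are handled naturally, since each class track is thresholded independently.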