YU News

Researchers Develop Powerful AI Method to Filter Out Noise from Bird Song

Sahil Kumar, a master's student in artificial intelligence, is lead author of the paper that will be presented in Greece at the InterSpeech 2024 conference in September.

By Dave DeFusco

 

Researchers have developed a method using a powerful technology to remove unwanted noise from the audio recordings of bird sounds.

 

The method, called ViTVS, adapts a vision transformer, an image-processing architecture, to divide audio signals into distinct parts, or segments, isolating clean bird sounds from a noisy background. The approach, explained in the paper "Vision Transformer Segmentation for Visual Bird Sound Denoising" by researchers from the Katz School's Department of Computer Science and Engineering and Cornell University's School of Public Policy, has been accepted for presentation at InterSpeech 2024, a conference on the science and technology of spoken language processing.

 

Youshan Zhang, assistant professor of artificial intelligence and computer science, is co-author of the paper and Sahil Kumar's faculty mentor.

“The vision transformer architecture is a powerful tool that can look at small parts of a whole, like pieces of a puzzle, and understand how they fit together, which helps in identifying and separating sounds from noise,” said Kumar, a student in the Katz School’s M.S. in Artificial Intelligence program.

 

The vision transformer backbone lets ViTVS represent the audio comprehensively and in detail, capturing patterns and features both small and large, as well as those that occur over short and long periods of time. The model can also capture fine details in the audio, which helps it distinguish subtle differences between sounds, such as the nuances in bird calls.
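The patch-based view described above can be illustrated with a short sketch. This is not the authors' implementation; it only shows, with NumPy, how a 2-D spectrogram is cut into the small square "puzzle pieces" that a vision transformer processes as tokens (the patch size of 16 is an illustrative assumption):

```python
import numpy as np

def split_into_patches(spectrogram, patch_size=16):
    """Divide a 2-D spectrogram into non-overlapping square patches,
    the way a vision transformer tokenizes an image."""
    h, w = spectrogram.shape
    # Trim so both dimensions divide evenly by the patch size.
    h, w = h - h % patch_size, w - w % patch_size
    trimmed = spectrogram[:h, :w]
    patches = (trimmed
               .reshape(h // patch_size, patch_size, w // patch_size, patch_size)
               .swapaxes(1, 2)
               .reshape(-1, patch_size, patch_size))
    return patches  # one "token" per patch

# A 128x256 spectrogram yields (128/16) * (256/16) = 128 patches.
spec = np.random.rand(128, 256)
patches = split_into_patches(spec)
print(patches.shape)  # (128, 16, 16)
```

A transformer then learns how these pieces relate to one another, which is what lets it separate patches dominated by bird song from patches dominated by noise.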

 

“This is important for understanding sounds that change slowly or have a broad context,” said Zhang. “This method enhances the model’s ability to process and understand audio by capturing detailed, extensive and varied patterns, which is crucial for tasks like separating clean bird sounds from noisy backgrounds.”

 

The team used sophisticated algorithms, specifically fully convolutional neural networks, to automatically learn how to distinguish between noise and the actual bird sounds, leading to more effective noise removal. Additionally, techniques such as the Short-Time Fourier Transform (STFT) and the Inverse Short-Time Fourier Transform (ISTFT) were employed to convert audio into a visual format and back.

 

The STFT converted the audio signal into a spectrogram, a visual representation similar to an image that shows how the frequency content of the signal changes over time. After the noise was identified and removed in this visual format, the ISTFT converted the cleaned representation back into audio.
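The STFT-denoise-ISTFT round trip can be sketched in a few lines with SciPy. This is only an illustration of the pipeline's shape, not the paper's method: the crude magnitude-threshold mask below stands in for ViTVS's learned segmentation network, and the synthetic "bird call" (a 3 kHz tone in broadband noise) is an invented example:

```python
import numpy as np
from scipy.signal import stft, istft

fs = 22050  # sample rate, Hz (illustrative)
rng = np.random.default_rng(0)
t = np.arange(fs) / fs

# Synthetic "bird call": a 3 kHz tone buried in broadband noise.
clean = np.sin(2 * np.pi * 3000 * t)
noisy = clean + 0.5 * rng.standard_normal(fs)

# 1. STFT: audio -> complex time-frequency "image".
freqs, times, Z = stft(noisy, fs=fs, nperseg=512)

# 2. Denoise in the visual domain. A simple threshold mask keeps
#    only strong time-frequency bins (the learned model's job in ViTVS).
mask = np.abs(Z) > 3 * np.median(np.abs(Z))
Z_clean = Z * mask

# 3. ISTFT: cleaned spectrogram -> audio.
_, denoised = istft(Z_clean, fs=fs)
denoised = denoised[:len(clean)]
```

Even this naive mask reduces the error against the clean signal, because the tone concentrates its energy in a few bright spectrogram bins while the noise spreads thinly across all of them.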

 

“This makes it easier to see and identify patterns in the noise and the actual bird sounds,” said Kumar.

 

These techniques made the denoising process more manageable by transforming the audio into a format where the patterns and differences between noise and actual bird sounds were more apparent.

 

“Traditional and deep-learning methods often struggle with certain types of noise, especially artificial and low-frequency noises,” said Zhang. “Extensive testing shows that ViTVS outperforms existing methods. It sets a new standard for cleaning up bird sounds, making it a benchmark solution for real-world applications.”