Cameron MacLeod wrote a nice piece on how Shazam actually works. I’ve been curious since I’ve used Shazam to gather timestamps when I listen to music at the HiFi shows. Sometimes Shazam’s able to recognize a song within a second. It’s impressive. Check out MacLeod’s analysis here.

Here’s a bird’s eye view of how it works:

The Challenge of Song Recognition

At first glance, one might question why identifying a song is considered a challenging problem. To comprehend the complexities involved, consider a graphical representation of a song’s audio waveform. Each song is essentially a collection of sound waves, and when visualized, these waves can appear intricate and irregular.

For instance, take a brief section of a song’s waveform. To determine if this audio snippet matches a particular song, a brute-force approach would involve sliding this section along the entire song, checking for a match at every point. This method would be computationally intensive and time-consuming, particularly when dealing with vast music libraries.
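To make the brute-force idea concrete, here is a minimal sketch in Python. The function name `find_snippet` and the toy sample values are made up for illustration; this is the naive approach described above, not anything Shazam does.

```python
def find_snippet(song, snippet, tolerance=1e-9):
    """Return the first offset where snippet matches song sample-for-sample, or -1."""
    n, m = len(song), len(snippet)
    for offset in range(n - m + 1):
        # Compare every sample of the snippet against the song at this offset.
        if all(abs(song[offset + i] - snippet[i]) <= tolerance for i in range(m)):
            return offset
    return -1


# Toy "waveform": a handful of made-up sample values.
song = [0.0, 0.2, 0.5, 0.1, -0.3, -0.6, -0.2, 0.4]
snippet = song[3:6]

print(find_snippet(song, snippet))   # 3

# The same snippet played at half volume no longer matches anywhere.
quieter = [0.5 * s for s in snippet]
print(find_snippet(song, quieter))   # -1
```

The cost grows with the song length times the snippet length, for every song in the library, and a simple change in volume is already enough to break the comparison.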

Furthermore, the challenge intensifies when dealing with real-world audio recordings, which are often affected by background noise, changes in amplitude, frequency variations, and other distortions. Under these conditions the raw samples of a recording no longer line up with those of the original track, so the simplistic sliding approach becomes inadequate.

The Shazam Solution: Spectrograms and Fingerprinting

Shazam employs a more sophisticated approach to tackle these challenges. Here’s an overview of how it works:

  1. Calculating a Spectrogram: Shazam starts by converting the audio signal into a spectrogram. A spectrogram is a graphical representation that displays how the frequencies in the audio signal change over time. This provides a detailed snapshot of the song’s audio characteristics.
  2. Fingerprinting Peaks: Rather than analyzing the entire spectrogram, Shazam focuses on identifying significant peaks in the spectrogram. These peaks represent the most pronounced frequencies at specific moments in the song. Peaks are valuable because they are less susceptible to noise and distortions.
  3. Hashing Peaks: To create a fingerprint for a song, Shazam pairs nearby peaks and hashes each pair into a compact value. Each hash combines the frequencies of the two peaks with the time difference between them, so a song’s fingerprint is a set of such hashes, each tagged with the time at which it occurs.
  4. Database Matching: When a user requests a song identification, Shazam records a short snippet of audio and runs the same process to fingerprint it. It then searches its extensive database of precomputed fingerprints, and the song whose fingerprint matches best is returned as the identification.
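The first three steps can be sketched in a few dozen lines of Python. This is a simplified illustration, not Shazam’s implementation: the function names, the frame size, the "one strongest bin per frame" peak picker, and the `fan_out` pairing rule are all assumptions made for this example, and a naive DFT stands in for the FFT a real system would use.

```python
import cmath
import math


def spectrogram(samples, frame_size=64, hop=32):
    """Magnitude spectrogram: one row per time frame, one column per frequency bin."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / (frame_size - 1))
              for i in range(frame_size)]  # Hann window
    spec = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [s * w for s, w in zip(samples[start:start + frame_size], window)]
        # Naive DFT of the windowed frame (slow, but dependency-free).
        spec.append([
            abs(sum(x * cmath.exp(-2j * math.pi * k * n / frame_size)
                    for n, x in enumerate(frame)))
            for k in range(frame_size // 2 + 1)
        ])
    return spec


def find_peaks(spec):
    """Simplified peak picking: keep the strongest bin of each frame as (time, bin)."""
    return [(t, max(range(len(mags)), key=mags.__getitem__))
            for t, mags in enumerate(spec)]


def hash_peaks(peaks, fan_out=3):
    """Pair each peak with the next few peaks: ((f1, f2, time_delta), anchor_time)."""
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:i + 1 + fan_out]:
            hashes.append(((f1, f2, t2 - t1), t1))
    return hashes


# Toy signal: a tone centred on bin 8, followed by a tone centred on bin 16.
samples = ([math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
           + [math.sin(2 * math.pi * 16 * n / 64) for n in range(256)])

spec = spectrogram(samples)
peaks = find_peaks(spec)
hashes = hash_peaks(peaks)

print(peaks[:3])   # [(0, 8), (1, 8), (2, 8)]
print(hashes[0])   # ((8, 8, 1), 0)
```

Each hash stores only two frequencies and a time difference, so it does not depend on how loud the recording is or on where in the song the snippet starts. The anchor time is kept alongside the hash because it is needed later, when matches are checked for time alignment.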

Why Spectrogram Peaks?

The choice of using spectrogram peaks as the foundation of Shazam’s fingerprinting technique is deliberate. Spectrogram peaks are less susceptible to noise and can withstand various audio distortions. Moreover, they provide a more concise representation of the audio, reducing the computational load and storage requirements.
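A small experiment illustrates the robustness claim. The sketch below (the helper `dominant_bin` and the noise levels are assumptions chosen for this example) takes a pure tone, halves its volume, adds random noise, and checks where the strongest frequency lands in each case.

```python
import cmath
import math
import random


def dominant_bin(frame):
    """Index of the strongest frequency bin in a frame, via a naive DFT."""
    n = len(frame)
    mags = [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame)))
            for k in range(n // 2 + 1)]
    return max(range(len(mags)), key=mags.__getitem__)


random.seed(0)

# A clean tone centred on bin 8 of a 64-sample frame.
clean = [math.sin(2 * math.pi * 8 * i / 64) for i in range(64)]

# The same tone at half the volume with uniform noise added to every sample.
noisy = [0.5 * x + random.uniform(-0.1, 0.1) for x in clean]

print(clean == noisy)                            # False
print(dominant_bin(clean), dominant_bin(noisy))  # 8 8
```

The raw samples differ everywhere, so a direct waveform comparison would fail, yet the peak stays in the same frequency bin.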

Matching and Scoring

The final step involves matching the audio sample’s fingerprint with those in the database. Shazam groups matching hashes by song and calculates a score for each candidate. The score takes the time alignment of peaks into account: in a true match, the matching hashes occur at a consistent time offset between the sample and the song, whereas chance matches are scattered. The song with the highest score is likely the correct identification.
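One common way to score time alignment is to count, for each song, how many matching hashes agree on the same offset between song time and sample time. The sketch below assumes fingerprints are lists of `(hash, time)` pairs; the song names, hash values, and function names are made up for illustration and are not Shazam’s actual data model.

```python
from collections import Counter, defaultdict


def build_index(songs):
    """Map each hash to the list of (song_id, time_in_song) where it occurs."""
    index = defaultdict(list)
    for song_id, hashes in songs.items():
        for h, t in hashes:
            index[h].append((song_id, t))
    return index


def score_matches(index, sample_hashes):
    """Score each song by its largest group of matches sharing one time offset."""
    offsets = defaultdict(Counter)
    for h, t_sample in sample_hashes:
        for song_id, t_song in index.get(h, []):
            offsets[song_id][t_song - t_sample] += 1
    return {song_id: counts.most_common(1)[0][1]
            for song_id, counts in offsets.items()}


# Hypothetical precomputed fingerprints: ((f1, f2, time_delta), time) pairs.
songs = {
    "song_a": [((10, 20, 1), 0), ((20, 15, 2), 1), ((15, 30, 1), 3), ((30, 10, 2), 4)],
    "song_b": [((10, 20, 1), 5), ((12, 18, 1), 0), ((15, 30, 1), 1)],
}
index = build_index(songs)

# A snippet recorded starting one time step into song_a.
sample = [((20, 15, 2), 0), ((15, 30, 1), 2), ((30, 10, 2), 3)]

scores = score_matches(index, sample)
best = max(scores, key=scores.get)

print(scores)  # {'song_a': 3, 'song_b': 1}
print(best)    # song_a
```

All three matches against `song_a` agree on an offset of 1, while `song_b` shares only a single hash with the sample, which is the kind of coincidence the offset check is meant to filter out.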

In essence, Shazam’s technology is akin to a musical detective. It listens to a song, extracts unique audio features, and then hunts for the song’s identity within a vast music library. The result is a seamless user experience that transforms the magic of song recognition into a technological reality.

For more in-depth information, check out MacLeod’s awesome article.