Tutorial: Detecting Near Duplicates with AudioDB

AudioDB Tutorial 02 - Detecting Near Duplicates

The Question

Say we suspect that a particular recording has been duplicated, digitally altered, and marketed as a 'new' recording. We want to use audioDB to search our collection of recordings for near-duplicate tracks and identify the suspect recordings. (Alternatively, the same search can be used to identify distinct masterings of a single source recording.)

Feature Extraction

We use fftextract to generate chromagram features for our audio collection, as described in the previous tutorial. In this example we use 36-bin chromagram features calculated once per second.

Creating The Database

See accompanying document

Formulating the Query

Let us say we suspect that the performance corresponding to track_337.wav has been duplicated and that the duplicate is somewhere in our collection. Instead of searching for matching segments, we want to return entire tracks that share much of the same content. To do so, we count the sequences in the query track that have at least one close-enough match within a particular database track, and return the tracks with the highest counts. Here “close-enough” is defined by the overall statistics of the feature space (see tutorial XXX and/or IEEE TASLP paper for how to determine this threshold automatically). We provide a threshold radius with the -R flag:

audioDB -d piano.adb -Q sequence -e -n 1 -l 3 -r 5 -R 0.03 -f track_337.chr36
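
The counting idea described above can be illustrated with a small Python sketch. This is not audioDB's implementation: the function names, the use of plain Euclidean distance, and the toy two-dimensional features are all assumptions made for illustration. The `radius` argument plays the role of the -R threshold, and `length` is a hypothetical parameter for the number of frames per sequence.

```python
import math


def windows(frames, length):
    """Return every run of `length` consecutive frames, flattened into one vector."""
    return [
        [value for frame in frames[start:start + length] for value in frame]
        for start in range(len(frames) - length + 1)
    ]


def count_matching_sequences(query_frames, track_frames, length, radius):
    """Count query sequences that have at least one track sequence within `radius`.

    Assumption: distance is plain Euclidean distance between flattened
    sequences; audioDB's actual distance and normalisation may differ.
    """
    track_windows = windows(track_frames, length)
    count = 0
    for query_window in windows(query_frames, length):
        if any(math.dist(query_window, t) <= radius for t in track_windows):
            count += 1
    return count


# Toy data (hypothetical): 4 frames of 2-dimensional features per track.
query = [[0, 0], [1, 0], [0, 1], [1, 1]]
near_copy = [[0.01, 0], [1, 0.01], [0, 1], [1, 1]]  # slightly altered copy
different = [[5, 5], [6, 6], [7, 7], [8, 8]]        # unrelated content

print(count_matching_sequences(query, near_copy, length=3, radius=0.05))  # 2
print(count_matching_sequences(query, different, length=3, radius=0.05))  # 0
```

Ranking the database tracks by this count, highest first, gives the kind of list that the query produces.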

A list of tracks with more than one matching sequence will be returned:


track_337 228
track_228 224
track_16 10
track_142 9
track_255 8

In our example above we see that track_337 has 228 matching sequences with itself (not surprising), but we also see that track_228 has 224 matching sequences, many more than the next closest matches, which have 10 or fewer. We can infer that track_228 is probably a near-duplicate of track_337, and that the other returned tracks probably are not.
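
The inference above can also be made mechanically. The following Python sketch parses a result list in the "track count" form shown above and flags tracks whose count comes close to the query's self-match count. The function names and the 0.9 cut-off are hypothetical choices for this example, not part of audioDB. A suitable cut-off depends on the collection.

```python
def parse_results(text):
    """Parse lines of the form '<track> <count>' into (track, count) pairs."""
    results = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            results.append((parts[0], int(parts[1])))
    return results


def likely_duplicates(results, query_id, min_fraction=0.9):
    """Return tracks whose count is at least `min_fraction` of the query's
    self-match count. The 0.9 default is an illustrative assumption."""
    counts = dict(results)
    self_count = counts.get(query_id)
    if not self_count:
        return []
    return [
        (track, count)
        for track, count in results
        if track != query_id and count >= min_fraction * self_count
    ]


# The result list from the example query above.
output = """\
track_337 228
track_228 224
track_16 10
track_142 9
track_255 8
"""

results = parse_results(output)
print(likely_duplicates(results, "track_337"))  # [('track_228', 224)]
```

Comparing against the self-match count rather than a fixed number keeps the test independent of track length, since a longer query track contains more sequences to match.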