In the last article we saw how to reduce a complex spectrum that evolves over time to a simple one dimensional function called the spectral flux. On the way we did some improvements to the spectral flux function like rectifying it and enabling Hamming smoothing of the audio samples. In this article we will talk about a so called threshold function which we derive from the spectral flux function.

That’s a rectified, Hamming smoothed spectral flux function. It’s for an excerpt of the song “Judith” by “A Perfect Circle” and already shows quiet some peaks which we could directly interpret as onsets. However, there’s still some noise in that function, especially when the crashs are hit for example. We can’t do any better filtering, like for example smoothing out the function a bit as this would get rid of a lot of the small peaks. Instead we do something totally simple: we calculate a threshold function which is derived from the spectral flux function. This threshold function is then later used to discard spectral flux values that are noise.

The threshold function is actually pretty simple. Given our spectral flux function we calculate the mean or average value of a window around each spectral flux value. Say we have 3000 spectral flux values, each specifying the spectral flux of a 1024 sample window. 1024 samples at a sampling rate of 44100Hz equal a time span of around 43ms. The window size we use for the threshold function should be derived from some time value, say we want the average spectral flux of a time span of 0.5 seconds. That’s 0.5 / 0.043 = 11 sample windows or 11 spectral flux values. For each spectral flux value we take the 5 samples before it, 5 samples after it and the current value and calculate their average. We thus get for each spectral flux value a single value we call the threshold (for reasons that will become apearant shortly). Here’s some code that does this:

public class Threshold 
{
   public static final String FILE = "samples/explosivo.mp3";	
   public static final int THRESHOLD_WINDOW_SIZE = 10;
   public static final float MULTIPLIER = 1.5f;
 
   public static void main( String[] argv ) throws Exception
   {
      MP3Decoder decoder = new MP3Decoder( new FileInputStream( FILE  ) );							
      FFT fft = new FFT( 1024, 44100 );
      fft.window( FFT.HAMMING );
      float[] samples = new float[1024];
      float[] spectrum = new float[1024 / 2 + 1];
      float[] lastSpectrum = new float[1024 / 2 + 1];
      List<Float> spectralFlux = new ArrayList<Float>( );
      List<Float> threshold = new ArrayList<Float>( );
 
      while( decoder.readSamples( samples ) > 0 )
      {			
         fft.forward( samples );
         System.arraycopy( spectrum, 0, lastSpectrum, 0, spectrum.length ); 
         System.arraycopy( fft.getSpectrum(), 0, spectrum, 0, spectrum.length );
 
         float flux = 0;
         for( int i = 0; i < spectrum.length; i++ )	
         {
            float value = (spectrum[i] - lastSpectrum[i]);
            flux += value < 0? 0: value;
         }
         spectralFlux.add( flux );					
      }		
 
      for( int i = 0; i < spectralFlux.size(); i++ )
      {
         int start = Math.max( 0, i - THRESHOLD_WINDOW_SIZE );
         int end = Math.min( spectralFlux.size() - 1, i + THRESHOLD_WINDOW_SIZE );
         float mean = 0;
         for( int j = start; j <= end; j++ )
            mean += spectralFlux.get(j);
         mean /= (end - start);
         threshold.add( mean * MULTIPLIER );
      }
 
      Plot plot = new Plot( "Spectral Flux", 1024, 512 );
      plot.plot( spectralFlux, 1, Color.red );		
      plot.plot( threshold, 1, Color.green ) ;
      new PlaybackVisualizer( plot, 1024, new MP3Decoder( new FileInputStream( FILE ) ) );
   }
}

Not a lot of new things in there. First we calculate the spectral flux function as we did in the last article. Based on this spectral flux function we then calculate the threshold function. For each spectral flux value in the ArrayList spectralFlux we take the THRESHOLD_WINDOW_SIZE spectral flux values before and after it and calculate the average. The resulting mean value is then stored in an ArrayList called threshold. Note that we also multiply each threshold value by MULTIPLIER which is set to 1.5 in this example. After we finished calculating all our functions we plot them. The result looks like this:

What we just did is calculating a running average of the spectral flux function. With this we can detect so called outliers. Anything above the threshold function is an outlier and marks an onset of some kind! It should also be clear why the threshold function values are multiplied by a constant > 1. Outliers must be bigger than the average, in this case 1.5 times bigger than the average. This multiplier is an important parameter of our onset detector as it governs the sensitivity. However, when i apply the detector to songs i try to come up with a single value that works for all of them. I actually did that and arrived at the magic 1.5 multiplier used above. It works well for all the samples that you can find in the svn as well as a lot of other songs i tested.

Now let us combine the spectral flux function and the threshold function. Basically we want a prunned spectral flux function that only contains values that are bigger or equal to the threshold function. We extend the above program by the following lines:

for( int i = 0; i < threshold.size(); i++ )
{
   if( threshold.get(i) <= spectralFlux.get(i) )
      prunnedSpectralFlux.add( spectralFlux.get(i) - threshold.get(i) );
   else
      prunnedSpectralFlux.add( (float)0 );
}

The variable prunnedSpectralFlux is just another ArrayList. The loop is pretty simple, we add zero to the prunnedSpectralFlux list of the current spectral flux is less than the corresponding threshold function value or we add the spectrul flux value minus the threshold value at position i. The resulting prunned spectral flux function looks like this:

Awesome, we are nearly finished! All that is left is to search for peaks in this prunned spectral flux. A Peak is a value that is bigger then the next value. That’s all there is to peak detection. We are nearly done. Let’s write the small code for the peak detection producing a peak ArrayList:

for( int i = 0; i < prunnedSpectralFlux.size() - 1; i++ )
{
   if( prunnedSpectralFlux.get(i) > prunnedSpectralFlux.get(i+1) )
      peaks.add( prunnedSpectralFlux.get(i) );
   else
      peaks.add( (float)0 );				
}

And that’s it. Any value > 0 in the ArrayList peaks is an onset or beat now. To calculate the point in time for each peak in peaks we simply take its index and multiply it by the time span that the original sample window takes up. Say we used a sample window of 1024 samples at a sampling rate of 44100Hz then we have the simple forumula time = index * (1024 / 44100). That’s it. Here’s the output:

We are done. Kind off. That’s the most basic onset detector we can write. Let’s review some of it’s properties. As with many systems we have a few parametes we can tweak. The first one is the sample window size. I almost always use 1024. Lowering this will result in a more fine grained spectrum but might not gain you much. The next parameter is wheter to use Hamming smoothing on the samples before doing the FFT on them. I did see a bit of improvement turning that on but it will also cost you some cycles calculating it. This might be of importance if you plan on doing this on a mobile device with not FPU. Every cycle saved is a good cycle there. Next is the threshold window size. This can have a huge impact on the outcome of the detection stage. I stick to a window of 20 spectal flux values in total as in the example above. This is motivated by calculating how big of a time span those 20 values represent. In the case of a sample window 1024 samples at a sampling rate of 44100Hz that’s roughly have a second worth of data. The value worked well on all the genres i tested the simple detector on but your mileage may vary. The last and probably most important parameter is the multiplier for the threshold function. To low and the detector is to sensitive, to high and nothing get’s cought. In my day job i sometimes have to do outlier analysis which is pretty similar to what we do here. Sometimes i use simple threshold based methods and a value between 1.3-1.6 seems to be some sort of magic number when doing statistics based outlier detection relative to some means. I’d say that the threshold multiplier and the threshold window size should be the only parameters you tweak. Forget about the rest and use the defaults of 1024 for sample window size and Hamming smoothing on. Doing a parameter seach for to many parameters is no fun so boiling it down to “only” two makes sense in this case. The defaults i gave you above work reasonably well with the detector we developed in this series. If you do multi-band analysis you might have to tweak them individually for each band.

What’s left is some improvements like doing this complete process not for the whole spectrum but sub-bands so that we can do onset detection for various instruments (sort of, as we know they overlap in frequency range). Another improvement is to use overlapping sample windows, say by 50%. This smooths out the Fourier transform a bit more and gives better results. One could also try to do dynamic range compression on the whole audio signal before passing it to the FFT and so on and so forth. The code in the framework allows you to experiment easily with various additional steps. You will also want to maybe clean up the resulting peak list and remove any peaks that are to close together, say < 10ms. Learning by doing is the big mantra. Now go out and write awesome music games!

Here's a couple of papers and links i found very useful in the process of getting this to work:

http://www.dspguide.com/ An awesome online book (also available as hard cover) called “The Scientist and Engineer’s Guide to Digital Signal Processing” featuring anything you’ll ever want to know about digital signal processing (as an engineer :) ).
https://ccrma.stanford.edu/~jos/dft/ Another online book called “Mathematics of the discrete Fourier Transform with audio applications” coping with anything related to the discrete Fourier Transform. Pretty heavy on math but a very good read.
http://old.lam.jussieu.fr/src/Membres/Daudet/Publications_files/Onset_Tutorial.pdf A very good scientific paper called “A tutorial on onset detection for musical signals”. Features various onset detection functions (we used the un-banded spectral flux here) as well as some data on the performance of various functions on a standardized data set (Mirex 2006 Onset Detection Challenge). Look out for any other papers by Bello, he’s a big player in the field it seems.
http://www.dafx.ca/proceedings/papers/p_133.pdf “Onset detection revisited” by Dixon, another big name in the field (at least it seems so from the citation counts, but we all know how much they really tell you…). Very good paper, must read.

If you check out the citation graph of the two papers mentioned on google schoolar you will many more papers talking about onset detection that might interest you. Google *is* your friend in this case. Always read your fine papers!

  • Share/Bookmark
 

In an earlier post i talked about the scientific view of onset/beat detection. There’s a lot of different schemes out there that do the job more or less well. However, one approach was so intriguingly simple and performed so well compared to other more elaborate algorithms that i chose to use it for my purposes. It’s called spectral flux or spectral difference analysis and i’ll try to give you an insight of it workings in this article.

One of the best papers on the matter of onset detection is by Bello at al. which you can view here. I’ll review some of the concepts mentioned in there here and will steal a couple of images.

First of what is an onset? Looky:

At the top we see the sample amplitudes or wave form of a single note. The onset is the first thing to happen when playing a note, followed by an attack until it reaches its maximum amplitude followed by a decay phase. I refrain from defining the transient as it is of no concern for us.

The audio signals we want to process are of course a tiny bit more complex than a single note. We want to detect onsets in polyphonic music containing pitched instruments like guitars or violins, vocals and non pitched drums. As we have seen in the frequency chart a couple of articles back most of the pitched instruments overlap in their frequency range (mostly between 70Hz and 4000Hz). Perfect onset detection is thus impossible, however we can try to get an approximate estimation.

From what i could gather from the papers i read on the topic onset detection almost always follows the same schema. First you read in your audio signal, next you transform it to a onset detection function (just another array of floats corresponding to your original audio samples in some way) and finally you pick the peaks in this detection function as onsets/beats. The first stage we already talked about and it turned out to be pretty easy. The second stage of the process is often pretty elaborate doing some magic to transform the audio signal into another simpler, compressed one dimensional function. The last stage is not covered in detail in most of the literature. However, while i was working on a prototype it turned out that this stage is probably the most important one, even more important than the onset detection function calculation. In a future article we’ll concentrate on peak detection, for now let’s stick to onset detection functions.

Here’s a nice little chart for the general process of onset detection, again taken from Bello et al.

The pre-processing stage could for example do dynamic range compression, that is making everything equally loud, a shitty habit used in almost all mixes of modern day music. However shitty DRC might sound it is benefial for some onset detection functions. We won’t use one that benefits from it so we don’t care for pre-processing.

The reduction phase is equal to constructing the onset detection function. The goal of this stage is it to get a compressed view of the structure of the musical signal. The onset detection function should be able to catch musical events and make it easier to detect onsets as compared to trying that on the plain wave form of the audio signal. The output of the onset detection function are values indicating the presence or absence of a musical event for successive time spans. Remember how we read in samples for playback. We read in 1024 samples in each iteration. For such a sample window an onset detection function would output a value. For one second of music sampled at 44100Hz and using 1024 samples as the time span for the onset detection function we’d get roughly 43 values (44100/1024) that model the structure of the audio signal. Coincidentially the sample window size of 1024 can be processed by our FFT class. To summarize: generating the values of the onset detection function means calculating a value for each successive sample window which gives us a hopefully nice curve we perform peak detection on.

Now, i introduced the FFT in the last article so we are going to use it for generating the onset detection function. What’s the motivation for this? Listen to the following excerpt from “Judith” by “A Perfect Circle” and have a look at the image below:

Audio clip: Adobe Flash Player (version 9 or above) is required to play this audio clip. Download the latest version here. You also need to have JavaScript enabled in your browser.

The image depicts the spectra of successive 1024 sample windows of the audio signal. Each vertical pixel colum corresponds to one spectrum. Each pixel in a column stands for a frequency bin. The color tells us how much this frequency bin contributes to the audio signal in the sample window the spectrum was taken from. The values range from blue to red to white, increasing in that order. For reference the y-axis is labeled with the frequencies, the x-axis is labeled with the time. And holy shit batman we can see beats! I took the freedom to mark them with awesome arrows in Paint. The beats are visible only in the high frequency range from a bit about 5000Hz to 16000Hz. They correspond to the hi-hat being played. When you hit a hi-hat it produces something called white noise, a sound that spans the compelte frequency spectrum. Compared to other instruments it has a very high energy when hit. And we can clearly see that in the spectrum plot.

Now, processing this spectrum plot wouldn’t be a lot easier than going for the wave form. For comparison here’s the waveform of the same song excerpt:

We can see the hi-hat hits in the waveform too. So what’s the benefit of going to Fourier transform route? Well, it allows us to focus on specific frequencies. For example, if we are only interested in detecting the beats of the hi-hat or the crash or ride we’d go for the very high frequency range only as this is were those instruments will leave a note of their presence. We can’t do that with the amplitudes alone. The same is true for other instruments like the kick drum or the bass which we’d look for in the very low frequencies only. If we inspect multiple frequency ranges of the spectrum seperately we call it multi-band onset detection. Just to give you some keywords for Mr. Google.

So we know why we do our onset detection in the frequency domain but we are not any wiser on how to compress all the information of the spectrum plot to a managable one dimensional array of floats we can easily do peak detection on. Let me introduce you to our new best friend the spectral difference vulgo spectral flux. The concept is incredibly simple: for each spectrum we calculate the difference to the spectrum of the sample window before the current spectrum. That’s it. What we get is a single number that tells us the absolute difference between the bin values of the current spectrum and the bin values of the last spectrum. Here’s a nice little formula (so i get to use the latex wordpress plugin):

SF(k) =\displaystyle \sum_{i=0}^{n-1} s(k,i) - s(k-1,i)

SF(k) gives us the spectral flux of the kth spectrum. s(k,i) gives us the value of the ith bin in the kth spectrum, s(k-1,i) is analogous for the spectrum before k. We substract the values of each bin of the previous spectrum from the values of the corresponding bin in the current spectrum and sum those differences up to arrive at a final value, the spectral flux of spectrum k!

Let’s write a small program that calculates and visualizes the spectral flux of an mp3:

public class SpectralFlux 
{
   public static final String FILE = "samples/judith.mp3";	
 
   public static void main( String[] argv ) throws Exception
   {
      MP3Decoder decoder = new MP3Decoder( new FileInputStream( FILE  ) );							
      FFT fft = new FFT( 1024, 44100 );
      float[] samples = new float[1024];
      float[] spectrum = new float[1024 / 2 + 1];
      float[] lastSpectrum = new float[1024 / 2 + 1];
      List<Float> spectralFlux = new ArrayList<Float>( );
 
      while( decoder.readSamples( samples ) > 0 )
      {			
         fft.forward( samples );
         System.arraycopy( spectrum, 0, lastSpectrum, 0, spectrum.length ); 
         System.arraycopy( fft.getSpectrum(), 0, spectrum, 0, spectrum.length );
 
         float flux = 0;
         for( int i = 0; i < spectrum.length; i++ )			
            flux += (spectrum[i] - lastSpectrum[i]);			
         spectralFlux.add( flux );					
      }		
 
      Plot plot = new Plot( "Spectral Flux", 1024, 512 );
      plot.plot( spectralFlux, 1, Color.red );		
      new PlaybackVisualizer( plot, 1024, new MP3Decoder( new FileInputStream( FILE ) ) );
   }
}

We start of by instantiating a decoder that will read in the samples from an mp3 file. We also instantiate an FFT object telling it to expect sample windows of 1024 samples and a sampling rate of 44100Hz. Next we create a float array for the samples we read from the decoder. We use a window size of 1024 samples. The next two lines instantiate auxiliary float arrays that will contain the current spectrum and the last spectrum. Finally we also need a place to write the spectral flux for each sample window to, an ArrayList will do the job here.

The following loop does the nasty job of calculating the spectral flux’. First we read in the next 1024 samples from the decoder. If that succeeded we calculate the Fourier transform and spectrum of this sample window via a call to fft.forward(samples). We then copy the last current spectrum to the array lastSpectrum and the spectrum we just calculated to the array spectrum. What follows is the implementation of the formula above. For each bin we substract the value of the last spectrum’s bin from the current spectrum’s bin and add it to a sum variable called flux. When the loop has finished flux contains… the spectral flux of the current spectrum. This value is added to the spectralFlux ArrayList and we continue with decoding the next sample window. We repeat this for each sample window we can read in from the decoder. After the loop is done spectralFlux contains a spectral flux for each sample window we read in. The last three lines just create a plot and visualize the result. Note: i created a small helper class called PlaybackVisualizer which takes a plot, the number of samples per pixel and a decoder that will playback the audio fromt he decoder while setting the marker in the plot correspondingly. Nothing to complex, we saw how to do that in one of the earlier articles. Here’s the output for the song “Judith” we used above already:

Hurray! we can again see peaks. This time they not only correspond to the hi-hat but also to the snare and the kick drum. The reason for this is that use the complete spectrum so the spectral flux values are governed by the instruments that have the most energy, in this case hi-hat, snare and kick drum. Now let’s review what we just did. Take the original spectrum image from above were i marked the hits of the hi-hat. What the code does is essentially creating a sum over the differences of the bins of two successive spectra. When the current spectrum has more overall energy than the previous spectrum the spectral flux function will rise. If the current spectrum has less energy than the previous spectrum the spectral flux function will fall. So we successfully mapped the 2D spectrum plot to a 1D function we can perform peak detection on. Well, sort of…

Before we do any peak detection we have to clean up the spectral flux function a little bit. Let’s start with ignoring negative spectral flux values. We are not interested in falling spectral flux but only in rising spectral flux. The process of ignoring the spectral flux values that are negative is called rectifying. Here’s the modified loop for calculating the spectral flux:

   float flux = 0;
   for( int i = 0; i < spectrum.length; i++ )			
   {
      float value = (spectrum[i] - lastSpectrum[i]);			
      flux += value < 0? 0: value;
   }
   spectralFlux.add( flux );

And the corresponding output:

A bit nicer. Rectifying the spectral flux will aid us when we calculate the a threshold function for peak detection.

Another clean up process is using a so called Hamming window. This Hamming window is applied to the samples before calculating the Fourier transform. It’s basically a smoothing function that will make our samples look a bit nicer. The FFT class has a function by which we can enable Hamming window smoothing:

fft.window(FFT.HAMMING);

That’s all there is to do, here’s the result:

Not a big difference but we can see how it equalizes the first 4 hits a little bit while removing one at the end of the sample. In other songs the hamming window has a bigger effect on the resulting spectral flux but it’s really up to you wheter to use it or not. It doesn’t cost a lot of performance.

With this i want to conclude this article. We have seen how to go from samples to spectrums to calculating a 1D function that we’ll later use to detect the peaks and therefore onsets in an audio signal. From the images above we can already see a lot of peaks which seem easy to detect. In the next article we’ll go a step further and calculate the spectral flux of different frequency bands. By this we hope to be able to extract various instruments as oposed to focusing on the loudest thing in the complete spectrum. We’ll also employ a technique called hopping where we calculate the spectral flux not on overlapping successive sample windows instead of non-overlapping sample windows. This will further smooth the resulting spectral flux function a little bit. We are then ready to create a simple but effective peak detector that gives us the final onset times.

As always you can find all the source code at http://code.google.com/p/audio-analysis/. Over and out.

  • Share/Bookmark
 

Warning: this is how i understood the discrete Fourier transform which might be wrong in a couple of places. For a correct explanation of the fourier transform i suggest reading http://www.dspguide.com/ch8.htm

So finally we’ve come to the dreaded Fourier transform article. Let’s review some of the things we already know. The first thing to remember is that our audio data was fetched from some kind of recording device or synthesized digitally. In any case what we have is measurements of amplitudes for discrete points in time which we call samples. We can either have mono or stereo samples, in the later case we just have two samples per point in time instead of only one. The sample frequency is the number of samples we measured per second (for stereo we count both samples for the left and right channel as a single one in this case). Here’s 1024 samples of the note A (440Hz) at a sampling rate of 44100Hz :

Note: the vertical line to the left is not part of the actual image. It’s wordpress fucking up

1024 samples at a sampling rate of 44100Hz equals a time span of 43,07ms. From this image we can see that the samples form something that resembles a sine wave pretty much. And in fact the samples were generated using Math.sin() as we did in the first article of this series, confirming that sound is nothing but a mixture of sine waves of different frequency and amplitude.

A smart mathemagicion called Fourier once proposed the theorem that all possible functions could be approximated by a series of so called sinusoids, that is sine waves with different frequencies and amplitudes (actually Gauss already knew this but Fourier got the fame).  He did all this magic for so called continuous functions which you will remember from your analysis course in school (you do remember analysis, right?). This was extended to discrete functions and guess what, our samples are just that, a discrete function in time. Now, a Fourier transform decomposes a signal into a set of sinusoids with different frequencies and amplitudes. If we apply a Fourier transform to a discrete audio signal we get the frequency domain representation of our time domain audio signal. This transform is invertable, so we can go one or the other way. From now on we only consider discrete Fourier transforms on audio signals to not overcomplicate matters.

What does the frequency domain of our time domain signal tell us? Well, put simply it tells us what frequencies contribute how much to the time domain signal. A frequency domain signal can thus tell us wheter there’s the note a at 440Hz in the audio signal and how loud this note is in respect to the complete tune. This is pretty awesome as it allows us to investigate specific frequencies for our beat detection purposes. From the frequency chart of the last article we know that various instruments have different frequencies they can produce. If we want to focus on say singing we’d have a look at the frequency range 80Hz to 1000Hz for example. Of course instruments overlap in their frequency ranges so that might lead to some problems. But we’ll care for that later.

Now, what we’ll use for our frequency analysis is called the discrete Fourier transform. Given a discrete time signal, e.g. samples measured at a specific sampling rate we get a discrete frequency domain out. As an intermediate stage we get the so called real and imaginary part of the frequency domain. There is some scary terms in the last sentence so let’s just go over that quickly. The fourier transform decomposes a time signal into coefficients for sinusoids. The real part holds the Fourier coefficients for cosines of specific frequencies (a cosine is a sinusoid) as well as for the sines of the same frequencies. For our use case this representation is not all that important. What we will use is the so called spectrum which is calculated from the real and imaginary parts of the original fourier transform.  The spectrum tells us for each frequency how much the frequency contributes to the original time domain audio signal. From now on we assume that we already have calculated this spectrum, which in the framework is done by a class called FFT (taken from Minim) which relieves some of the pain.  If you want to understand how to go from the real and imaginary parts of the original Fourier transform to the spectrum read the link above.

Not all possible frequencies will be present in the spectrum due to the discrete nature of the transform. Frequencies are binned, that is multiple frequencies are merged together into a single value. Also, there’s a maximum frequency the discrete Fourier transform can give us which is called the Nyquist frequency. It is equal to half the sampling rate of our time domain signal. Say we have an audio signal sampled at 44100Hz the Nyquist frequency is 22050Hz. When we transform the signal to the frequency domain we thus get frequency bins up to 22050Hz. How many of those bins are there? Half the number of samples plus one bin. When we transform 1024 samples we get 513 frequency bins each having a bandwidth (range of frequencies) of the Nyquist frequency divided by the number of bins except for the first and last bin which have half that bandwidth. Say we have 1024 samples we transform, sampled at 44100Hz. The first bin will represent frequencies 0 to 22050 / 513 / 2~ 21.5Hz (remember, the first bin has half the normal bandwidth). The next bin will represent the frequencies from 21.5Hz to 21.5Hz plus 22050 / 513 ~ 43Hz == 64.5 Hz (one more time, 21.5Hz to 64.5Hz, bandwidth is 43Hz). The next bin ranges from 64.5Hz to 107.5Hz and so on and so forth.

Whenever we do a discrete Fourier transform on our audio data we do it on a window of samples, e.g. 1024 samples is a common window size. The size of the window dictates the resolution of the resulting spectrum, that is the number of frequency bins in the spectrum will increase linearly with the number of samples we transform. The bigger the sample window the more fine grained the bins will be, that is the smaller their bandwidth. To calculate the spectrum we use a specific algorithm called the Fast Fourier Transform (FFT) which runs in O(n log n), pretty fast compared to a naive fourier transform implementatin which takes O(n*n). Almost all of the FFT implementations demand sample windows with a size being a power of two. So 256, 512, 1024, 2048 is fine, 273 is not. The FFT implementation we use in the framework also has this “limitation”.

So without further ado i present you the spectrum of the samples depicted in the image above

Note: the vertical lines to the left and right are not part of the actual image. It’s wordpress fucking up

Yes, that’s it. The x-axis corresponds to the frequency bins while the y-axis tells us the amplitude of a frequency bin. We clearly see a peek at the left side, the rest of the spectrum is zero. This peak corresponds to the frequency bin which contains the frequency 440Hz, the frequency of the note A we generated 1024 samples for. What we also see is that it’s not a clean peak, bins to the left and right also receive some amplitude. This is called leakage and is something we can’t solve. In our case it’s not a big concern really but in other application scenarios it might be. Here’s the program that generated the above two images:

public class FourierTransformPlot 
{
   public static void main( String[] argv )
   {
      final float frequency = 440; // Note A		
      float increment = (float)(2*Math.PI) * frequency / 44100;		
      float angle = 0;		
      float samples[] = new float[1024];      
 
      for( int i = 0; i < samples.length; i++ )
      {
         samples[i] = ((float)Math.sin( angle ));
         angle += increment;			
      }
 
      FFT fft = new FFT( 1024, 44100 );
      fft.forward( samples );
 
      Plot plotSamples = new Plot( "Samples", 512, 512 );
      plotSamples.plot( samples, 2, Color.white );
 
      Plot plotSpectrum = new Plot( "Spectrum", 512, 512);
      plotSpectrum.plot(fft.getSpectrum(), 1, Color.white );
   }
}

The first couple of lines should be familiar, we simply generate 1024 samples at a sampling rate of 44000Hz. There’s actually only 2 interesting lines in this program. The first one is the one where we instantiate the FFT object. As parameters for it’s constructor it wants the sample window size we’ll use to generate the spectrum as well as the sampling rate. The next line performs the Fourier transform as well as the spectrum calculation. All we have to do is pass in a float array of samples having the size we specified in the FFT constructor. That’s all we need to do to get our precious frequency bins. What’s left is plotting the samples as well as the spectrum. Note the call to fft.getSpectrum(). This method will return the spectrum of the last fourier transform we did by calling FFT.forward().

Pfuh, that was pretty nasty. I really suggest reading up on the discrete Fourier transform. It’s an essential mathematical tool for all audio engineers. And while most audio engineers will use some sort of toolbox as we did it doesn’t hurt to know your way around.

We now are equipped with all we need to get our onset/beat detector going. Stay tuned for the next article.

  • Share/Bookmark
 

As i said in part one of this series sound can be broken down into various frequencies that take part in the overall sound of song for example. In the next article we are going to dive into this topic in detail but for this article there’s something you need to know about frequencies first: Every instrument has a frequency range, the same is true for vocals. This a pitched instruments/voices. Drums and other percussive instruments are often non-pitched and spread over the whole frequency spectrum. Here’s a very interesting char taken from Wikipedia:

Before we dive into fourier transforms and similar topics i want to give you some impression of what to expect from the onset detector we develop here. For this i prepared a couple of videos for different genres that show what our onset detector will produce. Without explaining the details: in each video you will see 3 plots, the first for low frequencies, the next for mid frequencies and the last one at the bottom for very high frequencies. The red lines gives us an idea where an onset might be located. The green lines is a running average. Any red peak above a green line is a potential onset/beat. With that said let’s see how the detector works on various genres. Let’s start with Pop.

A pretty well known song i guess (yack…). In all three ranges the onsets are clearly visible. This comes from the fact that the song is largely composed of synthesized instruments. Especially the bass and drums leave their impression on the detector and it seems to work pretty well for this genre. On to the next genre: Rock

A song form the all-star band “A perfect circle” called weak and powerless. The drums are again pretty influencial on the overall onset/beat landscape. The detector does it’s job pretty good in this case too. Rock is a bit more difficult for any onset detection schema due its higher noise level. Distorted guitars and basses polute all frequencies pretty heavily and make the detection task a bit harder. However, if the sound engineer responsible for mixing the final song did his job well the frequencies of those instruments that dictate the rythm should stand out and be detectable by our algorithm. Next we have a look at acoustic rock:

A song by “Tenacious D” called Explosivo which starts of with an acoustic guitar and then goes into a more heavier part. The detector picks up the strokes of the guitar pretty well in the beginning. When the heavy part kicks in the guitar get’s into the background and the drums become the most influential rythmic section. In the top plot we also can see the drummer playing the kick pretty fast which i normally didn’t hear all that clearly without having the plot. Also note how the hi-hat fades in in the bottom plot before the heavy part starts. All drum parts like hi-hat, cymbals and so on will be recogniseable in that plot.

A pure acoustic song, performed by yours truely. The detector picks up the guitar extemely well. You can see the 3/4 rythm easily in all three plots. The voice does not have a big impact on the detector.

Jazz is said to be extremely hard for onset detectors. A lot of notes a player simultaniously or near simultaniously. Also, a lot of the frequency range is used up. In this example the instrument governing the rythm is again the drums as well as a little bit of bass which are picked up by the detector. However, in the beginning it’s the piano which does the rythm section. Again the detector does its job pretty well. Note how the clapping at the beginning makes the detector go a bit crazy. The clapping is basically just random noise as far as the detector is concerned, so he tries his best to pick what might be rythmical patterns.

Classic is the nightmare of every onset detector, especially when the composition is composed of string insturments only as in this play by Mozart. The beat detector picks up quiet some of the heavy attack strokes but anything with vibrato is a mess. It still performs better than Audiosurf in that regard though.

Finally some Metal by In-Flames. What’s interesting is that the snare has the most energy. Not so for the kick drum which is actually pretty silent compared to the rest of the song. If you’d look at the complete frequency spectrum of this song you’d see one big messy heatmap (at least i use a heatmap for frequency spectrums). The hi-hat is not really picked up by the detector that well as it is played open which produces a quiet noisy sound without a lot of energy.

The next video shows all the excerpts form above mixed down to a single mp3 and played in Audiosurf. The outcome might be different when feeding each excerpt seperately to Audiosurf but that would have been to tedious :)

For “Hit me baby one more time” Audiosurf generates blocks for the snare and sometimes for the kick. “Weak and Powerless” also produces blocks mostly for the hi-hat/cymbals/ride. “Explosivo” is a bit strange. At first sight it does not have any resamblance to the song. But if you look closer it becomes clear that the blocks are based on the singing! I guess Audiosurf takes the peaks which are the loudest, in “Explosivos” case this is the singing most of the time. The next song is a mix between singing and guitar strokes. For the jazz song the piano is not picked up in the beginning it seems but the clapping is. The rest is governed by the ride. The classic piece is complete fail in the beginning and gets better towards the end. This seems to coincide with the volume of the violins going up. The last piece picks up the crash most of the time with some snare sounds mixed in. All in all it seems that Audiosurf does a frequency band analysis for various bands and then picks the loudest band to generate the blocks. This works out pretty well for popular music but does not perform so well for classic music it seems. What’s interesting is that for cloud connected i really wonder how the crash got picked up as i played around in audacity high pass filtering it quiet a lot and still couldn’t see peaks. Strange indeed :)

The onset detector we develop in this series of articles seems to be able to pick up all the onsets audiosurf detects (if we adjust it a bit for singing if we want that which should be rather easy). Maybe we can even come up with something more robust that also handles classical pieces a bit better than Audiosurf does.

  • Share/Bookmark
 

In the first article of this series i told you to find out how to decode MP3 files yourself. Well, being the good sport i am i extended the framework with a nice little MP3 decoder based on JLayer. Here’s the class which is analogous to WaveDecoder:

public class MP3Decoder
{
   public MP3Decoder( InputStream in );
   public void readFrames( float[] samples );
}

It works exactly the same as the WaveDecoder. Proper object oriented programming would mean both decoders share the same interface, but i was kind of lazy. Feel free to add that to the source yourself if you feel fuzzier inside with proper interfacing.

Another thing i did to the framework today is giving us a possibility to set a marker line so that we can see where the playback is currently in a plot. Here’s the new method in the Plot class:

public class Plot
{
   ...
   public void setMarker( int position, Color color );
}

The first parameter specifies the x coordinate in the plot where the marker should be located. The second one is the color of the marker. I also changed the internal workings of the Plot class. The plot is now redrawn constantly and reacts to changes almost immediatly. To illustrate the use of the Plot.setMarker() method i wrote a simple example located in RealTimePlot. I only reproduce the novel parts here:

public class RealTimePlot 
{
   private static final int SAMPLE_WINDOW_SIZE = 1024;	
   private static final String FILE = "samples/sample.mp3";
 
   public static void main( String[] argv ) throws FileNotFoundException, Exception
   {
      float[] samples = readInAllSamples( FILE );
 
      Plot plot = new Plot( "Wave Plot", 1024, 512 );
      plot.plot( samples, SAMPLE_WINDOW_SIZE, Color.red );		
 
      MP3Decoder decoder = new MP3Decoder( new FileInputStream( FILE ) );
      AudioDevice device = new AudioDevice( );
      samples = new float[SAMPLE_WINDOW_SIZE];
 
      long startTime = 0;
      while( decoder.readSamples( samples ) > 0 )
      {
         device.writeSamples( samples );
         if( startTime == 0 )
            startTime = System.nanoTime();
         float elapsedTime = (System.nanoTime()-startTime)/1000000000.0f;
         int position = (int)(elapsedTime * (44100/SAMPLE_WINDOW_SIZE)); 
         plot.setMarker( position, Color.white );			
         Thread.sleep(15); // this is needed or else swing has no chance repainting the plot!
      }
   }
...
}

First we load all samples from the MP3 file as we’ve done previously. The samples are then plotted, using 1024 samples per pixel ( this is the SAMPLE_WINDOW_SIZE, we have a window of 1024 samples…). Now we want to playback the file and see in the plot where the playback currently is by observing a moving marker, a line from the top to the bottom of the plot that updates with the elapsed time.

Setting up the MP3Decoder and the AudioDevice should be quiet familiar by now. We also need a float array for the samples we read in from the MP3Decoder for playback. Then we enter the decoding/playback loop. We read in the current sample window and write it to the audio device as we’ve done before. Now comes the magic part. After we have written the first window to the device we start measuring the elapsed time. For this we have to know when the playback started, this is done in the body of the condition if( startTime == 0 ). The current time minus the start time divided by a billion gives us the elapsed time in seconds since the playback start. We calculate that in the next line. We record the start time after we decoded and wrote the first sample window as that synchs the audio output with our plotting. The AudioDevice.writeSamples() method blocks till all the samples are written to the sound device. The cursor will thus lag behind a pixel at most which is totally ok for our purposes.

All that is left is calculating the pixel position of the marker in the plot based on the elapsed time and set the marker accordingly. The magic formula for the pixel position of the marker is the elapsed time times the sample frequency divided by the number of samples per pixel in the plot. That’s it. If you think about it for a minute you should see how i arrived at this formula. If not, think harder! What’s left is setting the marker position of the plot which we do in the next line. Finally we put the decoding thread to sleep for 15 milliseconds to give the Swing thread that does all the drawing some time to do so. Otherwise the marker will not move smoothly.

Here’s an image of the resulting output:

Hurray! Real-time plotting! To see this example in action download the source code from http://audio-analysis.googlecode.com/svn/trunk/ via your SVN client of choice, import the project to eclipse and start the RealTimePlot.java example.

Next time we are going to finally get to do some mathmagic to finally detect some onsets. The real-time plotting we used here will be of help to visually see where onsets are located with respect to the audio playback.

  • Share/Bookmark
 

Processing and analysing audio can be quiet tedious without visualizing what’s going on. We could want to see the amplitudes of our samples for example. To facilitate this need i wrote a very simple Swing based class called Plot. It will allow us to easily plot one or more arrays of floats to a window, scaling the values along the way so we are guaranteed to see something. Here’s the signatures:

public class Plot
{
   public Plot( String title, int width, int height );
   public void clear( );
   public void plot( float[] samples, float samplesPerPixel, Color color );
}

As always a pretty tiny class with just enough methods to get things done. At construction time of the Plot you pass in its title and its width and height in pixels. To plot an array of values you use the Plot.plot() method. The first argument is the array of samples we want to plot. Note that samples here does not necessarily mean PCM samples. You could pass in anything you want really. The next parameter defines how many samples of the array should be used per pixel in the plot. This allows you to scale the plot on the x-Axis. The last parameter is the color of the plotted samples. If you want to erase what you have plotted so far you simply call Plot.clear(). Note that if you plot multiple arrays they all should be on the same scale value wise. It is for example ok to plot two arrays that have values somewhere in the range of [-1, 1]. It is not such a wise idea to plot arrays where one is say in the range [-1,1] and the other is in the range [0, 200]. Keep that in mind. In this series we will only plot arrays that are on the same scale. If you want to plot arrays of different scale you should use different Plot instances for each plot. Also, the Plot class uses Swing only and is totally slow. You don’t want to use it to visualize real-time data, it will not be in synch.

Let’s do something interesting and plot all the samples of the song stored in the “samples/” folder (PlotExample):

public class PlotExample 
{
   public static void main( String[] argv ) throws FileNotFoundException, Exception
   {
      WaveDecoder decoder = new WaveDecoder( new FileInputStream( "samples/sample.wav" ) );
      ArrayList<Float> allSamples = new ArrayList<Float>( );
      float[] samples = new float[1024];
 
      while( decoder.readSamples( samples ) > 0 )
      {
         for( int i = 0; i < samples.length; i++ )
            allSamples.add( samples[i] );
      }
 
      samples = new float[allSamples.size()];
      for( int i = 0; i < samples.length; i++ )
         samples[i] = allSamples.get(i);
 
      Plot plot = new Plot( "Wave Plot", 512, 512 );
      plot.plot( samples, 44100 / 100, Color.red );
   }
}

First we open a WaveDecoder that will give us the PCM data of the Wave file. The ArrayList allSamples will be used to (slowly) read in all the samples from the decoder. It will handle resizing the internal array for the samples for us, we are lazy again. Note that it is not a good idea to do it like this as we have to convert from float to its cousin Float which is an object instance. This is called auto-boxing and it’s nasty. After we read in all the samples we transfer them to a float[] array so we can plot them. The last two lines instantiate the Plot and draw the samples we just read in. Take a look at the second parameter of the invocation of Plot.plot(). It tells the Plot to use one Pixel for 44100 / 100 = 441 samples in the final output. So each pixel shows us the amplitude of a sound sequence of 0.001 seconds. You can play around with this parameter to get a finer grained insight into the waveform (warning: you will need to give more heap memory to the vm in case you set the samplePerPixel parameter very low as the produced image will be extremely large!). Here’s the output of this program:

This is an amplitude plot of all samples in the audio file. It doesn’t give us a hole lot of information concerning onsets. It looks like one big mess. Let’s see what we get when we set the samplesPerPixel to a lower value (thus giving us more resolution in the plot (samplesPerPixel = 44100/1000):

Now it seems that there is structure in the sound! As a first guess the peaks in this seem to indicate sudden bursts of sound, probably from the drums. However, it would be pretty hard to detect peaks in this form. In the next article we’ll see how to arrive at a form of the samples that makes our live a little bit easier.

  • Share/Bookmark
 

Well, i just put together a simple framework for all our onset detection tutorial needs. It is located at http://code.google.com/p/audio-analysis/. To get the code you will need an SVN client, on Windows Tortoise SVN is pretty good, you linux people should know your magic :) . Simply check out the project from the svn URL http://audio-analysis.googlecode.com/svn/trunk/ and import it into Eclipse. Now that you are ready to go let’s see what this “framework” has in store.

The first class is called AudioDevice. It looks like this

public class AudioDevice
{
   public AudioDevice( );
   public void writeSamples( float[] samples );
}

Pretty simple, eh? This will setup a connection to your primary audio card in 44100Hz mono mode. Just invoke the constructor. It will throw an exception if it couldn’t setup the hardware but “that should never happen”(TM). To output PCM data to the audio device you invoke AudioDevice.writeSamples and pass in a float array of your 44100Hz mono PCM samples.

I wrote a small example which will generate a sine wave at 440Hz and output it to the audio device. It looks like this:

public class NoteGenerator 
{
   public static void main( String[] argv ) throws Exception
   {
      final float frequency = 880; // 440Hz for note A
      float increment = (float)(2*Math.PI) * frequency / 44100; // angular increment for each sample
      float angle = 0;
      AudioDevice device = new AudioDevice( );
      float samples[] = new float[1024];
 
      while( true )
      {
         for( int i = 0; i < samples.length; i++ )
         {
            samples[i] = (float)Math.sin( angle );
            angle += increment;
         }
 
         device.writeSamples( samples );
      }
   }
}

It is basically a rewritten version of the code snippet in part 1 of this series. In each loop iteration we generate the next 1024 samples of our sine wave and feed it to the audio device. Play around with this code a little, e.g. change the frequency of the sine wave. For example, if you double the frequency you get the same note in the next octave. The inverse is also true. To stop the program you have to kill it in Eclipse by clicking on the stop button. A proper GUI would mean to much code for such simple examples. You can find the source code in the package com.badlogic.audio.samples.

The second class i provide you is a Wave file decoder. With this class you can read in Wave files in 16-bit mono/stereo with a sampling rate of 44100. Wave supports a couple of other formats/sampling rates but for the sake of simplicity i only support the mentioned configurations. The class that does all this magic is called WaveDecoder. It will convert the PCM data from the Wave file to 32-bit float samples on the flie. Here’s the signatures of that class:

public class WaveDecoder
{ 
   public WaveDecoder( InputStream stream ) throws Exception;
   public int readSamples( float[] samples );
}

Again pretty slick. In the constructor you pass in the InputStream from where the Wave file should be read. If an error occured an exception will be thrown. After you instantiated the class properly you can read the samples from the Wave file by a call to WaveDecoder.readSamples(). The method will try to read as many samples from the Wave file as there are elements in the samples array you pass in. It will return you the actual number of samples read in. If the Wave file is stereo the stereo samples get converted to a single mono sample as explained in part 1 of this series. If the method returns 0 you know that the end of the stream has been reached or an error occured. Here’s a simple example how to use the WaveDecoder in combination with the AudioDevice to output sound read in from a Wave file:

public class WaveOutput 
{
   public static void main( String[] argv ) throws Exception
   {
      AudioDevice device = new AudioDevice( );
      WaveDecoder decoder = new WaveDecoder( new FileInputStream( "samples/sample.wav" ) );
      float[] samples = new float[1024];
 
      while( decoder.readSamples( samples ) > 0 )
         device.writeSamples( samples );
   }
}

We first construct the AudioDevice and the WaveDecoder as well as a float array that holds the samples read in from the Wave file. In each loop iteration we try to read in 1024 samples (the length of the samples array) and write those samples to the AudioDevice. That’s it. Sound programming is really that easy. The code for the example can be found in the package com.badlogic.audio.samples. Play around with it. You could for example alter the amplitude of the samples, e.g. half the loudness by multiplying each sample with 0.5 before writting them to the AudioDevice.

I’ll add some helper classes to this project to visualize the data we process/generate over the course of the series. We’ll also extend the project by some nifty analysis classes that will allow us to do proper beat detection (can you say Fast Fourier Transform?). You will be surprised at how easy it is to do simple beat detection :)

Over and Out.

  • Share/Bookmark
 

With this article i want to start a small step by step series which will allow you to build your own onset detector for your music game needs. We’ll start with the basics. Some thing like reading in various music file formats i’ll leave to you as there is a lot of material out there for this already. What i will present are some of the steps i used to get my onset detector going. Here’s a couple of videos i took with fraps today which show you 90% of the beat detector for various genres:

King Crimson – Three of a perfect pair

Some Mozart

Audioslave - Cochise

Make sure to watch the videos in HD as you won’t be able to see the details otherwise. Also, Fraps fucked up a lot of times so there’s a lot of hangs in there. What this videos show you are 3 detection functions on 3 frequency subbands. We’ll see what that means later in this series. For now all you need to know that each red peak above a green line is essentially an onset, be it a drum onset, a guitar onset or a violing onset. The system is adaptive to the genre of music automatically and will pick the most representative rythmic onsets it can find. I checked around 20 different songs with this detection scheme and it worked out very well. Additionally it’s extremely simple to implement. Let’s start with the basics though.

Sound can be thought of as waves within a medium like air, water and so on. When we record sound we record the pressure of the medium on the microphone, more precisely the waves amplitude. Digital recording goes one step further as it has to discretize this recording process. The microphone is checked at a very high frequency, the higher the frequency the better the quality of the recording. This frequency is also known as sampling rate. Standard CD audio has a sampling rate of 44100Hz (Herz). This means that the we get a measurement of amplitude of our microphone 44100 times per second. Measuring in this context is usually called sampling, each sampling step gives us one so called sample. The sample represents the amplitude measured by the microphone at the given time. The value of a sample has to be stored somehow. This is were the so called bit-depth and encoding comes into play. The bit-depth tells us how many bits are used to represent the amplitude value of a sample. This could be a single byte, a short (2 bytes) or a word (4 byte). Anything above is usually not used. We of course also have to define what value range the amplitude can have. It can be signed or unsigned, most often signed formats are used. A sample can also be stored either as an integer or a normalized floating point value. Integer samples have a range dependant on the number of bytes used for a sample. A 16-bit signed sample would therefor have a value range of −32,768 to 32,767. Float samples are almost always normalized to a range between -1 and 1 which makes working with them a bit easier than with their integer counter parts. We can easily convert between 16-bit signed integer samples and 32-bit float samples by division and multiplication (i leave the details to you :) ).

To get a feeling for how much storage is needed for pure PCM data (== sample in some bit-depth with some encoding, e.g. integer, float etc.) Let’s assume a song in mono (only one PCM channel) with a sampling rate of 44100Hz. The samples are stored as 16-bit signed integers. The song has a total runtime of 180 seconds. For each second of music we have 44100 samples due to the sampling rate. Each sample takes up 16-bit or 2 bytes. That’s 88200 bytes for a second of music or roughly 88Kb. The song is 180 seconds in length so we have 88100*180 = 15876000 bytes or roughly 15Mb. Wow, quiet a lot of data for such a punny song. If we store the samples as 32-bit floats we double the needed memory, so that’s 30Mb.

Most popular formats therefor use some kind of compression for the PCM data to get that size down to a managable and transferable size. Formats like MP3 or Ogg Vorbis do this job quiet well at the cost of losing some of the original quality. The audio compression algorithms of these formats are lossy, meaning the original data can not be recovered a 100% after decompression. The compression algorithms are constructed so that this loss is not audible to most listeners.

When we want to analyse music a compressed format is of no use to us. We have to get the straight PCM data somehow. That’s what decoders are for. There’s a gazillion of libraries out there in Java, C, C#, what have you that do this job for you. I won’t go into details on how to use a specific decoder here as there’s a lot of material on that matter out there. For MP3s there’s libmad and libmpg123 written in C out there, in Java you have JLayer. For Ogg Vorbis there’s libvorbis for C and Jorbis for Java. I urge you to chose your poison and read the documentation and example codes that come with all of those libraries. Most of them are really rather easy to use.

Let’s assume we have the PCM data in memory, fetched from a decoder. Most songs will be in stereo so the first thing we do is merging the two channels by averaging each left sample with it’s corresponding right sample. The PCM data will also probably be in 16-bit signed integer format so we’ll also convert this to a float array. The result of our function will be a float array which holds the n successive PCM samples. Here’s a bit of code

public float[] convertPCM( short[] leftChannel, short[] rightChannel )
{
   float[] result = new float[leftChannel.length];
   for( int i = 0; i < leftChannel.length; i++ )
      result[i] = (leftChannel[i] + rightChannel[i]) / 2 / 32768.0f;
   return result;
}

We put in two short arrays, one holding the samples for the left channel the other for the right channel, average the two channels and divivde the averaged sample by 32768.0f to get the amplitude in the range [-1,1] as a float (technically we should divide negative samples by 32768 and positive samples by 32767 due to the signed range of a short but we are lazy).

So what can we do with this now? Let’s first see how we can get the sample for a specific point in time, say 10 seconds into the song. To do this we need to know the sampling rate. The decoder library will happily provide us with this information, usually we’ll get a value of 44100Hz for almost all MP3s and Oggs. From now on we’ll assume a fixed sampling rate of 44100Hz. We remember that the sampling rate tells us how many samples have been taken per second. At 44100Hz each sample encodes 1 / 44100th of a second or ~0.00002 seconds. If we have our desired sample position given in seconds calculating the index of the sample in the float array is a piece of cake:

float time = 10; // 10 seconds into the song
float samplingRate = 44100.0f;
int index = (int)(time / samplingRate);

We simply divide our time given in seconds by the sampling rate to get the index for the sample that encodes the amplitude of that point in time. And that my dear friends is almost everything you need to know if you only care for outputting PCM data to the sound card (which a library like Java Sound or the mediaframework on Android does internally for you). On Android you could send PCM data directly to the sound card via the AudioTrack class, a similar mechanism also exists for Java Sound.

Just for fun let us create the PCM data for a mono sound at 440Hz. This frequency is equivalent to the note A in music. Sound is nothing more than waves which we usually describe as a so called sinusoid. In reallife a sound signal is the sum of many many such sinusoids which, added together, form the overall audio of a sound signal. When we want to create a a sound for the note A we simply have to generate a sinusoid that has a period of 1/440 seconds.

The x-Axis in this graph is time, the y-Axis the amplitude of our sine wave. The period is 0.002 seconds. By period we mean the time needed from one peek (either on the positive y-Axis or the negative y-Axis) to the next one.

Let’s write a function which can generate PCM data for any note in mono given the note’s frequency, the sampling rate and the duration of the note in seconds:

public float[] generateSound( float frequency, float samplingRate, float duration )
{
   float[] pcm = new float[(int)(samplingRate*duration)];
   float increment = 2 * (float)Math.PI * frequency / samlpingRate;
   float angle = 0;
 
   for( int i = 0; i < pcm.length; i++ )
   {
      pcm[i] = Math.sin( angle );
      angle += increment;
   }
 
   return pcm;
}

Now that is not a lot of code but the few expressions in it seem somewhat mysterious. Let’s break it down. We first create a float array which will hold our PCM samples. The length of the array is given by the duration times the samplingRate, that is the numbers of samples per second. If we have a duration of 5 seconds we’d neet 44100 samples times 5. Not to hard. The next line calculates an increment. An increment of what? Of an angle we’ll later on pass to Math.sin(). As i said earlier we are working with sinusoids, or sine waves. To generate those we use the sine function. The sine function in Java takes an angle in radians and outputs the sine for this angle. A complete revolution and therefor period is 2 * PI radians long, 360° in angle notation. If we wanted a 1Hz sound we’d go one full circle for 44100 samples. For 440Hz we need to go 440 full circles and so on. Thus we first multiply 2 * PI times the frequency. The division by the sampling rate will give us the final increment we add to the current angle for the current sample. In the loop we set the current sample to the sine of the current angle and increment the angle by the value we calculated before. And tada we now know everything we need to write a synthesizer. A synthesizer does nothing more than generating PCM samples with methods very similar to the above. By adding together a couple of functions a synthesizer can produce quiet interesting and complex sounds. The base of it all is the sine though (or a saw function, or a rectangular function). What this function lacks is a way to manipulate the loudness or volume of the generated sine wave. The Math.sin() function gives us the sine of the unit circle, so we get the loudest possible amplitude in the range [-1,1]. If we wanted to make the generated sound half as loud we’d simply multiply the sine by 0.5. To get a quater of the full loudness we multiply by 0.25 and so on. Pretty simple huh?

That concludes the first part of this series. Next time we’ll see what all this fourier transform business is about. Don’t worry, we will not use any advanced math, not even sines :) . Now go out, research how the Java sound library works and make your ears blead with self generated sine waves!

  • Share/Bookmark
 

Being employed at the Know-Center in Graz i’m trainied on plowing through scientific literature to find solutions to the problems i encounter. For beat detection i first went the usual route of asking Mr. Google what he thinks is the most relevant info on the topic. Looking at how Audiosurf performs on various tracks and following some discussions on the Audiosurf forums i somehow got the impression that Audiosurf does nothing to sophisticated. The blocks seem to be generated based on beats solely in various frequency bands, <500hz for kick drum and at higher frequencies for cymbals and similar percussion instruments.

My search keywords thus consisted of the abvious "beat detection FFT spectogram rythm analysis" which turned ob this pretty badly written article on gamedev.net. Several people on the net reported that the approaches described in the article do not work well. This suspicion was verified by trying out the BeatDetector class of Minim, an excellent audio library for Processing. The beat detector is a direct implementation of the approaches described in the gamedev net article. I stripped all Processing related stuff from the Minim source so i can use it for initial analysis of the beat detection problem which worked out exceptionally well. However, the included beat detector does not perform all that well (which i don’t blame on the author of Minim, it’s really an excellent library, check it out!). I tested it on a variety of genres and the results were more than mixed. So the approach described is not suitable for my needs.

So i turned to the scientific literature instead. The term beat detection is not used there, instead onset detection is the term to search for. Onset detection is a more general notion of beat detection which tries to find any beginning of a note, be it for non-pitched instruments like drums or pitched instruments like flutes and so on. There’s a couple of interesting papers i found. I just want to mention to most promising.

First of there was a scientific competition a couple of years ago that focused on audio beat tracking, the Mirex Audio Beat Tracking Challenge 2006. A couple of keyplayers from the field of onset detection and beat tracking took part in this challenge which was executed on a standardized test set that is available on the site. Most noteable are the approaches by Simon Dixon, formerly employed at the Technical University of Vienna in the Artificial Intelligence Department. He basically uses a short time fourier transform (STFT) and computes a quantity he calls the spectral flux which allows to detect onsets. More information on this and the other approaches can be found in this paper. Follow the references in there for more interesting approaches.

Another very nice paper, or rather more a survey is called “A Comparison of Sound Onset Detection Algorithms with Emphasis on Psychoacoustically Motivated Detection Functions” by Collins. It summarizes and evaluates 13 different detection functions along with a peak detector (which is applied to the output of the detection functions to detect onsets). The most interesting approach in there for me is the one by Kapulari which again uses an STFT and does some magic on the result along with peak detection. The approach seems easy to implement and comparse very well to other more sophisticated approaches evaluated in the survey. Follow the references in this survey to get a good grasp of the state of the art in onset detection.

The last noteable paper i’d like to mention uses a pretty neat approach based on self-similarity. Basically an STFT is computed for overlapping sample windows and a vector is derived for each window. Next a similarity matrix is constructed (using cosine similarity) which gives insight on the structure of the song analysed. This reminds me a lot of a topic detection approach in natural language processing i once implemented at work. The beauty of such approaches is that you can use image processing algorithms to get the structure information out of the similarity matrix which is treated as a float grey scale image. You can find the original paper called “The Beat Spectrum: A New Approach To Rythm Analysis” here. Of course the approach is way to computationally complex for use on mobile devices but never the less worth mentioning.

One last thing i just discovered is BeatRoot, a java application for beat tracking/onset detection that can be found on Dixon’s site (see link above). It comes with source code which i will hack my way through tonight. The results i achieved with the gui frontend seem very promising and if the source contains the approach described in his paper (spectral flux) i have a nice template to base my own code on.

That’s it for now. There’s a lot more papers but the mentioned one should get anyone interested started on the state of the art.

  • Share/Bookmark