Rokon with Box2D support

Rokon, a nice 2D game library that’s been around for quiet some time now has Box2D support. They took the JNI wrapper from libgdx and integrated it with their lib. Way to go! Here’s a demo video

I was hoping for some sort of collaboration, e.g. porting rokon to libgdx so you can also run on the desktop for rapid development but for now this seems to be not an option. Oh well, at least something in libgdx was useful to someone.

  • Share/Bookmark
 

FFT & the Nose

Just found this gem:http://scott.wolchok.org/ctf2010/t500.html. It’s the tale of a guy who heard some highfrequency noise in an audio file and tracked it down. The outcome was pretty amazing. What would you do if you saw this spectrum?

  • Share/Bookmark
 

libgdx 0.2 release

Yay, i’ve finally finished the first production code release of libgdx. The architecture is now frozen, the only thing that will happen in the future is the inclusion of new classes and maybe additional methods to the base interfaces. Everything else will stay as it is and you can savely base your code on it now. All the tutorials i put up so far can now be tested out by you. I’ll of course continue writting posts on libgdx and how to use it’s various parts.

You can check out the source code from http://libgdx.googlecode.com/svn/trunk/ or browse it at http://code.google.com/p/libgdx/source/browse/#svn/trunk/gdx/src/com/badlogic/gdx. You can download the distribution package containing all jars and shared libraries from http://libgdx.googlecode.com/files/gdx-0.2.zip. A hello world project with a desktop Eclipse project and an Android Eclipse project can be downloaded from http://libgdx.googlecode.com/files/gdx-helloworld.zip

Every public class and interface is is fully documented so make sure you check out the Java docs!

  • Share/Bookmark
 

On my quest to write a nice soundeffect class for libgdx to be used on the desktop i tried a lot of things. Using Clips, a class from the Java Sound API didn’t work. First, you can’t tell a Clip to play itself in an overlapping manner twice. Second, generating new Clips everytime you want to play the same sound effect takes to much time even if you decode the audio data to PCM beforehand and let the Clip use the raw audio data instead of having to decode it first. So i had to write my own software sound mixer to play sound effects, much like what SoundPool does on Android. You can check the source at http://code.google.com/p/libgdx/source/browse/trunk/gdx/src/com/badlogic/gdx/backends/desktop/JoglSound.java which houses my own version of Clip so to speak and at http://code.google.com/p/libgdx/source/browse/trunk/gdx/src/com/badlogic/gdx/backends/desktop/JoglAudio.java which does all the mixing in a seperate daemon thread. It’s not rocket science and using floats as intermediate raw audio is not the most performant way to do this. However, it is fast enough even on my punny Netbook with 32 concurrent sounds. I also had to write a simple linear resampler so that sound files having a sampling rate other than 44100Khz can be played too. It’s not rocket science but the Java Sound API does not support resampling. The linear interpolation resampling is quick and dirty and good enough for game sound effects. Don’t use the resampling code for other purposes! It sucks!

Now that i have implemented the desktop Side of libgdx’s audio interfaces i did the same for Android. Easy i thought, being already familiar with the awesome MediaPlayer and SoundPool classes. Well, not so. In case MediaPlayer should play from an asset file i did the following at first:

AssetFileDescriptor descriptor = aHandle.getAssetManager().openFd( aHandle.getFileName() );
mediaPlayer.setDataSource( descriptor.getFileDescriptor() );
descriptor.close();
mediaPlayer.prepare();

Can you spot the error? MediaPlayer.setDataSource() has actually two methods that take a FileDescriptor. One with and the other without additional offset and length parameters. The audio i head was pretty creepy, so to avoid that on your side use this code:

AssetFileDescriptor descriptor = aHandle.getAssetManager().openFd( aHandle.getFileName() )
mediaPlayer.setDataSource( descriptor.getFileDescriptor(), descriptor.getStartOffset(), descriptor.getLength() );
descriptor.close();
mediaPlayer.prepare();

Maybe it is documented somewhere but not in the MediaPlayer docs.

I’m nearly done with the first proper release of libgdx. What’s left now is some cleaning up of the Files interface which should get a couple more methods as well as unifying keycode handling. That’s something i might be able to throw together tonight. After that the fun starts. Writting classes for sprite rendering, tile maps, model loaders etc. etc. A few things i can copy from old projects, others i have to implement anew. If you have any suggestions on what to include write a comment below.

  • Share/Bookmark
 

The Future…

… of this blog is something i’ve been thinking a bit about lately. As i stated earlier i intend to open source the game programming framework i’ve written. This framework allows me to quickly prototype things on my desktop PC without using the emulator or device. To run a project on an Android device or an emulator i need only a single Activity. Here’s the complete code of Newton on Android:

public class NewtonGame extends OpenGLESApplication {
    /** Called when the activity is first created. */
    @Override
    public void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);        
        addGraphicListener( new Game( this ) );
    }
}

That’s really it. The rest is implemented in a normal Java project that references the framework project. In the Java project of newton i have another class similar to the above which allows starting Newton on the Desktop:

public class Newton
{
	public static void main( String[] argv )
	{
		JoglApplication app = new JoglApplication( "Newton", 480, 320 );
		app.addGraphicListener( new Game( app ) );
	}
}

And that’s it again. This is possible by abstracting input, sound and graphics for both plattforms. On Android i hook up listeners to a GLSurfaceView, setup callbacks for the sensors like the accelerometer and do all audio via MediaPlayer and AudioTrack. On the PC/Mac i use Jogl and Java Sound. This application specifics are encapsulated via an interface for which 2 implementations exist. This interface provides methods to create Meshes, Textures, Sound, Music as well as resource managment functions that allow me to transparently work either with assets or files on a file system.

The same is true for native code! The framework project is also a C++ project and produces a single shared library for all the native code that is used in the framework. At the moment only the MP3 decoder is done in native code but i plan on including other things like billboarding of massive arrays of particles and some audio analysis related things as native code.

This is the second iteration of the framework and i started reworking the GUI module. The old one didn’t support animated gui elements so i try to incorporate that in the new one. This will include transitions like fade ins and fade outs. I also rework the skinning a bit to allow 9-patch buttons. User input is limited to non-text input at the moment. If you want to get text from the user a plattform specific dialog box is presented. I’m not sure wheter i will change that to something that fits better with the overall look of a game.

So that’s my plan for the future. I will upload the current state of the framework to google code under the LGPL license later this day. All the upcoming articles will use this framework to reduce code bloat in the postings much like i did with the audio analysis tutorial earlier. The native code shared libraries will be prebuild so you don’t have to have cdt and mingw installed on Winblows (on lunix you usually will have gcc installed). I can only provide 32-bit shared libraries on Winblows as there’s only an experimental 64-bit mingw version which i don’t dare to use. It’s not the intention of the framework to be used for creating desktop games anyway :)

  • Share/Bookmark
 

Libmad on Android with the NDK

So i was porting all the decoders i had build for the onset detection tutorial to C++, using libmad as the mp3 decoder of choice. After getting that to work on the desktop properly i had to make it work on Android too. Now, there’s no build of libmad in the NDK for obvious reasons, so i had to build that myself. As the autotools configure script of libmad is not useable with the NDK toolchain i used the config.h file from http://gitorious.org/rowboat/external-libmad/blobs/raw/master/android/config.h, which has all the settings needed for building libmad on Android. Compiling libmad is then a simple matter of creating a proper Android.mk and Application.mk file. The Android.mk file looks like this:

LOCAL_PATH := $(call my-dir)
 
include $(CLEAR_VARS)
 
LOCAL_MODULE    := audio-tools
LOCAL_ARM_MODE := arm
LOCAL_SRC_FILES := NativeWaveDecoder.cpp NativeMP3Decoder.cpp mad/bit.c mad/decoder.c mad/fixed.c mad/frame.c mad/huffman.c mad/layer12.c mad/layer3.c mad/stream.c mad/synth.c mad/timer.c mad/version.c
LOCAL_CFLAGS := -DHAVE_CONFIG_H -DFPM_ARM -ffast-math -O3
 
include $(BUILD_SHARED_LIBRARY)

Now there’s a couple of things that initially bogged down the performance of this. I tested it with the song “Schism” by tool which is a 6:47min long song, encoded at 192kbps. The file weights in at 9.31mb, pretty big for an mp3 imo. NativeMP3Decoder is just a libmad based implementation of the MP3Decoder in the onset detection tutorial framework. So it has a simple NativeMP3Decoder.readSamples method which will fill a float array with as many samples as there are elements in the float array. If the input file is in stereo the channels get mixed down to mono by averaging. The NativeMP3Decoder.readSamples() method internally calls a native method with a similar signature. Instead of a float array i pass in a direct ByteBuffer that has enough storage to hold all the samples requested. The native wrapper then writes the samples to this direct buffer which in turn then gets copied to the float array passed into the NativeMP3Decoder.readSamples() method. It looks something like this:

public int readSamples(float[] samples) 
{	
   if( buffer == null || buffer.capacity() != samples.length )
   {
      ByteBuffer byteBuffer = ByteBuffer.allocateDirect( samples.length * Float.SIZE / 8 );
      byteBuffer.order(ByteOrder.nativeOrder());
      buffer = byteBuffer.asFloatBuffer();
   }
 
   int readSamples = readSamples( handle, buffer, samples.length );
   if( readSamples == 0 )
   {
      closeFile( handle );
      return 0;
   }
 
   buffer.position(0);
   buffer.get( samples );
 
   return samples.length;
}

The call to buffer.get( samples ) kills it all. Without any optimizations (thumb code, -O0, -DFPM_DEFAULT == standard fixed point math in libmad, no arm assembler optimized fp math) decoding the complete files takes 184 seconds on my Milestone. Holy shit, batman! If i eliminate the buffer.get( samples ) call that gets down to 44 seconds! Incredible. Now i still thought that is way to slow so i started adding optimizations. The first thing i did was compiling to straight arm instead of thumb code. You can tell the NDK toolchain to do so by placing this in the Android.mk file:

LOCAL_ARM_MODE := arm

With this enabled decoding takes 36 seconds. The next thing i did was agressive optimization via -O3 as a CFLAG. That shaved off only 2 more seconds, so nothing to write home about. The last optimization is libmad specific. The config.h file i linked to above does not define the fixed point math mode libmad should use. Now, when you have a look at fixed.h of libmad you can see quiet some options for fixed point math there. There’s also a dedicated option for arm processors that uses some nice little arm assembler code to do the heavy lifting. You can enable this by passing -DFPM_ARM as a CFLAG. Now that did wonders! i’m now down to 20 seconds for decoding 407 seconds of mp3 encoded audio. That’s roughly 20x real-time which is totally ok with me. The song i chose is at the extreme end of the song length spectrum i will have to handle in my next audio game project. A song a user uses will be processed once and waiting for that 20 seconds is ok in my book.

I’m afraid i won’t release the source of the ported audio framework as it’s a bit of a mess and would need some work to clean up. What i can give you is the plain source for the native side of the NativeMP3Decoder class if you guarantee me not to laugh. My C days are long over so there’s probably a shitload of don’ts in there. The “handle” system is also kind of creative but good enough for my needs. I learned how to use the low level libmad api by looking at the code here. I actually like doing it this way more than with the shitty callback high-level API. Your mileage may vary. So here it is, be afraid:

#include "NativeMP3Decoder.h"
#include "mad/mad.h"
#include <stdio.h>
#include <string.h>
 
#define SHRT_MAX (32767)
#define INPUT_BUFFER_SIZE	(5*8192)
#define OUTPUT_BUFFER_SIZE	8192 /* Must be an integer multiple of 4. */
 
/**
 * Struct holding the pointer to a wave file.
 */
struct MP3FileHandle
{
	int size;
	FILE* file;
	mad_stream stream;
	mad_frame frame;
	mad_synth synth;
	mad_timer_t timer;
	int leftSamples;
	int offset;
	unsigned char inputBuffer[INPUT_BUFFER_SIZE];
};
 
/** static WaveFileHandle array **/
static MP3FileHandle* handles[100];
 
/**
 * Seeks a free handle in the handles array and returns its index or -1 if no handle could be found
 */
static int findFreeHandle( )
{
	for( int i = 0; i < 100; i++ )
	{
		if( handles[i] == 0 )
			return i;
	}
 
	return -1;
}
 
static inline void closeHandle( MP3FileHandle* handle )
{
	fclose( handle->file );
	mad_synth_finish(&handle->synth);
	mad_frame_finish(&handle->frame);
	mad_stream_finish(&handle->stream);
	delete handle;
}
 
static inline signed short fixedToShort(mad_fixed_t Fixed)
{
	if(Fixed>=MAD_F_ONE)
		return(SHRT_MAX);
	if(Fixed<=-MAD_F_ONE)
		return(-SHRT_MAX);
 
	Fixed=Fixed>>(MAD_F_FRACBITS-15);
	return((signed short)Fixed);
}
 
 
JNIEXPORT jint JNICALL Java_com_badlogic_audio_io_NativeMP3Decoder_openFile(JNIEnv *env, jobject obj, jstring file)
{
	int index = findFreeHandle( );
 
	if( index == -1 )
		return -1;
 
	const char* fileString = env->GetStringUTFChars(file, NULL);
	FILE* fileHandle = fopen( fileString, "rb" );
	env->ReleaseStringUTFChars(file, fileString);
	if( fileHandle == 0 )
		return -1;
 
	MP3FileHandle* mp3Handle = new MP3FileHandle( );
	mp3Handle->file = fileHandle;
	fseek( fileHandle, 0, SEEK_END);
	mp3Handle->size = ftell( fileHandle );
	rewind( fileHandle );
 
	mad_stream_init(&mp3Handle->stream);
	mad_frame_init(&mp3Handle->frame);
	mad_synth_init(&mp3Handle->synth);
	mad_timer_reset(&mp3Handle->timer);
 
	handles[index] = mp3Handle;
	return index;
}
 
static inline int readNextFrame( MP3FileHandle* mp3 )
{
	do
	{
		if( mp3->stream.buffer == 0 || mp3->stream.error == MAD_ERROR_BUFLEN )
		{
			int inputBufferSize = 0;
			if( mp3->stream.next_frame != 0 )
			{
				int leftOver = mp3->stream.bufend - mp3->stream.next_frame;
				for( int i = 0; i < leftOver; i++ )
					mp3->inputBuffer[i] = mp3->stream.next_frame[i];
				int readBytes = fread( mp3->inputBuffer + leftOver, 1, INPUT_BUFFER_SIZE - leftOver, mp3->file );
				if( readBytes == 0 )
					return 0;
				inputBufferSize = leftOver + readBytes;
			}
			else
			{
				int readBytes = fread( mp3->inputBuffer, 1, INPUT_BUFFER_SIZE, mp3->file );
				if( readBytes == 0 )
					return 0;
				inputBufferSize = readBytes;
			}
 
			mad_stream_buffer( &mp3->stream, mp3->inputBuffer, inputBufferSize );
			mp3->stream.error = MAD_ERROR_NONE;
		}
 
		if( mad_frame_decode( &mp3->frame, &mp3->stream ) )
		{
			if( mp3->stream.error == MAD_ERROR_BUFLEN ||(MAD_RECOVERABLE(mp3->stream.error)))
				continue;
			else
				return 0;
		}
		else
			break;
	} while( true );
 
	mad_timer_add( &mp3->timer, mp3->frame.header.duration );
	mad_synth_frame( &mp3->synth, &mp3->frame );
	mp3->leftSamples = mp3->synth.pcm.length;
	mp3->offset = 0;
 
	return -1;
}
 
JNIEXPORT jint JNICALL Java_com_badlogic_audio_io_NativeMP3Decoder_readSamples__ILjava_nio_FloatBuffer_2I(JNIEnv *env, jobject obj, jint handle, jobject buffer, jint size)
{
	MP3FileHandle* mp3 = handles[handle];
	float* target = (float*)env->GetDirectBufferAddress(buffer);
 
	int idx = 0;
	while( idx != size )
	{
		if( mp3->leftSamples > 0 )
		{
			for( ; idx < size && mp3->offset < mp3->synth.pcm.length; mp3->leftSamples--, mp3->offset++ )
			{
				int value = fixedToShort(mp3->synth.pcm.samples[0][mp3->offset]);
 
				if( MAD_NCHANNELS(&mp3->frame.header) == 2 )
				{
					value += fixedToShort(mp3->synth.pcm.samples[1][mp3->offset]);
					value /= 2;
				}
 
				target[idx++] = value / (float)SHRT_MAX;
			}
		}
		else
		{
			int result = readNextFrame( mp3 );
			if( result == 0 )
				return 0;
		}
 
	}
	if( idx > size )
		return 0;
 
	return size;
}
 
JNIEXPORT jint JNICALL Java_com_badlogic_audio_io_NativeMP3Decoder_readSamples__ILjava_nio_ShortBuffer_2I(JNIEnv *env, jobject obj, jint handle, jobject buffer, jint size)
{
	MP3FileHandle* mp3 = handles[handle];
	short* target = (short*)env->GetDirectBufferAddress(buffer);
 
	int idx = 0;
	while( idx != size )
	{
		if( mp3->leftSamples > 0 )
		{
			for( ; idx < size && mp3->offset < mp3->synth.pcm.length; mp3->leftSamples--, mp3->offset++ )
			{
				int value = fixedToShort(mp3->synth.pcm.samples[0][mp3->offset]);
 
				if( MAD_NCHANNELS(&mp3->frame.header) == 2 )
				{
					value += fixedToShort(mp3->synth.pcm.samples[1][mp3->offset]);
					value /= 2;
				}
 
				target[idx++] = value;
			}
		}
		else
		{
			int result = readNextFrame( mp3 );
			if( result == 0 )
				return 0;
		}
 
	}
	if( idx > size )
		return 0;
 
	return size;
}
 
JNIEXPORT void JNICALL Java_com_badlogic_audio_io_NativeMP3Decoder_closeFile(JNIEnv *env, jobject obj, jint handle)
{
	if( handles[handle] != 0 )
	{
		closeHandle( handles[handle] );
		handles[handle] = 0;
	}
}

To compile that for Android all you have to do is download libmad and put the source files into your Android project’s jni folder along with the code above. Then use the Android.mk from above et voila you got yourself a native mp3 decoder for Android. You can use it in combination with the AndroidAudioDevice class of the last post. If you feel adventureous you could even extend it to return stereo data.

  • Share/Bookmark
 

Fun with AudioTrack

I mentioned the class AudioTrack a couple of times already. Let’s see how it works.

AudioTrack is available since Android 1.5 (API level 3) and offers an extremely simple way to send PCM data directly to the device’s audio hardware. You can use it in two modes: static and streaming. I will only look at the streaming mode here. Streaming mode means that you permanentley write new PCM data to the hardware, the framework will queue it in a buffer and play it back for you. AudioTrack supports various sampling rates and 2 PCM encodings, 8-bit and signed 16-bit PCM. An AudioTrack instance can either be in mono or stereo mode. Here’s a small class that can be used similar to the AudioDevice class of the onset detection tutorial:

public class AndroidAudioDevice
{
   AudioTrack track;
   short[] buffer = new short[1024];
 
   public AndroidAudioDevice( )
   {
      int minSize =AudioTrack.getMinBufferSize( 44100, AudioFormat.CHANNEL_CONFIGURATION_MONO, AudioFormat.ENCODING_PCM_16BIT );        
      track = new AudioTrack( AudioManager.STREAM_MUSIC, 44100, 
                                        AudioFormat.CHANNEL_CONFIGURATION_MONO, AudioFormat.ENCODING_PCM_16BIT, 
                                        minSize, AudioTrack.MODE_STREAM);
      track.play();        
   }	   
 
   public void writeSamples(float[] samples) 
   {	
      fillBuffer( samples );
      track.write( buffer, 0, samples.length );
   }
 
   private void fillBuffer( float[] samples )
   {
      if( buffer.length < samples.length )
         buffer = new short[samples.length];
 
      for( int i = 0; i < samples.length; i++ )
         buffer[i] = (short)(samples[i] * Short.MAX_VALUE);;
   }		
}

Pretty simple eh? Now let’s start with the constructor. The first thing we do is to get the minimum buffer size for the AudioTrack instance we are going to create. This is achieved by a call to AudioTrack.getMinBufferSize(), passing in the sampling rate, wheter we are mono or stereo and the PCM encoding, in this case 16-bit signed. This buffer size is used by the AudioTrack internally for a buffer it stores all the samples in we’ll write to it. If the buffer is full it is flushed to the audio hardware. Now, in the next line we instantiate an AudioTrack. The first parameter dictates which audio stream our samples are going to be written to. There’s a couple of audio streams in the Android system, we’ll almost always want to use AudioManager.STREAM_MUSIC here. For the other stream types refer to the documentation. The next three parameters say what sampling rate we want to have, wheter we want the track to be mono or stereo and which PCM encoding we want to use. As with the AudioDevice class from the onset detection tutorial we use 44100Hz, mono, 16-bit signed PCM. The last parameter says wheter this AudioTrack is static or a streaming one, we want it to be a streaming one. All that’s left is to call AudioTrack.play() and we are ready to write samples to the audio hardware.

All the code in the onset detection tutorial worked with PCM data encoded as floats in the range [-1,1]. We want to emulate this here so i wrote a little method called AndroidAudioDevice.writeSamples( ) which takes a float array of mono float PCM samples and writes it to the hardware. For this the float samples have to be converted to 16-bit signed PCM which is done in the AudioDevice.fillBuffer() method, no rocket science here. Once we have the converted PCM we simply write it to the AudioTrack via AudioTrack.write() which takes a short array (our PCM samples), an offset into the array and the number of samples to use from the offset on. Extremely simple, even more than the equivalent Java Sound class SourceDataLine.

Now there’s a couple of interesting things about AudioTrack. First off, if you don’t write to it constantly it will pause itself to not hog any further system resources. Upon the next write it will start playback again introducing a wake up lag. Another not so nice thing is that due to the internal buffer of AudioTrack which has to be filled up completely before it is send to the hardware there’s noticeable lag between the time you write your first samples and when you hear them being played back. The minimum buffer size i get on my Milestone for the configuration is 8192, sadly the documentation doesn’t tell wheter that’s bytes or samples. If it’s bytes we divide that by two and then by the sampling frequency to get the total lag introduced by the internal buffer: 8192 / 2 / 44100 = 0,092 seconds, so nearly 100ms. That’s noticeable. In case the minimum buffer size is in samples it get’s even worse. The lag will be 200ms in that case. So that’S what you have to expect when using this class. Writting a software mixer based on AudioTrack is possible as long as you don’t need low latency. Synthesizing sounds each time the screen is touched for example is a bad idea as the lag is more than noticeable. Another property of AudioTrack is that the AudioTrack.write() method blocks. If you want to use it in a game you should do all your audio mixing in a seperate thread.

Still, the class is pretty niffty and it makes it easy to port all the examples from the onset detection tutorial to Android. I tested it with the WaveDecoder class and it worked like a charm, not eating up to much system resources while doing its thing. Here’s the sine wave generator sample ported to android:

public class AudioTest extends Activity
{	    
   @Override
   public void onCreate(Bundle savedInstanceState) 
   {
      super.onCreate(savedInstanceState);                      
 
      new Thread( new Runnable( ) 
      {
         public void run( )
         {        		
            final float frequency = 440;
            float increment = (float)(2*Math.PI) * frequency / 44100; // angular increment for each sample
            float angle = 0;
            AudioDevice device = new AndroidAudioDevice( );
            float samples[] = new float[1024];
 
            while( true )
            {
               for( int i = 0; i < samples.length; i++ )
               {
                  samples[i] = (float)Math.sin( angle );
                  angle += increment;
               }
 
               device.writeSamples( samples );
            }        	
         }
      } ).start();
   }
}

You can also try to use the decoders included in the tutorial framework, however, the pure Java MP3 decoder will be to slow. Only the WaveDecoder works acceptably. When i’m done porting all decoders to C++ i might put out a small Android audio library so you can benefit from that a bit. Now go out and play :)

  • Share/Bookmark
 

In the last article we saw how to reduce a complex spectrum that evolves over time to a simple one dimensional function called the spectral flux. On the way we did some improvements to the spectral flux function like rectifying it and enabling Hamming smoothing of the audio samples. In this article we will talk about a so called threshold function which we derive from the spectral flux function.

That’s a rectified, Hamming smoothed spectral flux function. It’s for an excerpt of the song “Judith” by “A Perfect Circle” and already shows quiet some peaks which we could directly interpret as onsets. However, there’s still some noise in that function, especially when the crashs are hit for example. We can’t do any better filtering, like for example smoothing out the function a bit as this would get rid of a lot of the small peaks. Instead we do something totally simple: we calculate a threshold function which is derived from the spectral flux function. This threshold function is then later used to discard spectral flux values that are noise.

The threshold function is actually pretty simple. Given our spectral flux function we calculate the mean or average value of a window around each spectral flux value. Say we have 3000 spectral flux values, each specifying the spectral flux of a 1024 sample window. 1024 samples at a sampling rate of 44100Hz equal a time span of around 43ms. The window size we use for the threshold function should be derived from some time value, say we want the average spectral flux of a time span of 0.5 seconds. That’s 0.5 / 0.043 = 11 sample windows or 11 spectral flux values. For each spectral flux value we take the 5 samples before it, 5 samples after it and the current value and calculate their average. We thus get for each spectral flux value a single value we call the threshold (for reasons that will become apearant shortly). Here’s some code that does this:

public class Threshold 
{
   public static final String FILE = "samples/explosivo.mp3";	
   public static final int THRESHOLD_WINDOW_SIZE = 10;
   public static final float MULTIPLIER = 1.5f;
 
   public static void main( String[] argv ) throws Exception
   {
      MP3Decoder decoder = new MP3Decoder( new FileInputStream( FILE  ) );							
      FFT fft = new FFT( 1024, 44100 );
      fft.window( FFT.HAMMING );
      float[] samples = new float[1024];
      float[] spectrum = new float[1024 / 2 + 1];
      float[] lastSpectrum = new float[1024 / 2 + 1];
      List<Float> spectralFlux = new ArrayList<Float>( );
      List<Float> threshold = new ArrayList<Float>( );
 
      while( decoder.readSamples( samples ) > 0 )
      {			
         fft.forward( samples );
         System.arraycopy( spectrum, 0, lastSpectrum, 0, spectrum.length ); 
         System.arraycopy( fft.getSpectrum(), 0, spectrum, 0, spectrum.length );
 
         float flux = 0;
         for( int i = 0; i < spectrum.length; i++ )	
         {
            float value = (spectrum[i] - lastSpectrum[i]);
            flux += value < 0? 0: value;
         }
         spectralFlux.add( flux );					
      }		
 
      for( int i = 0; i < spectralFlux.size(); i++ )
      {
         int start = Math.max( 0, i - THRESHOLD_WINDOW_SIZE );
         int end = Math.min( spectralFlux.size() - 1, i + THRESHOLD_WINDOW_SIZE );
         float mean = 0;
         for( int j = start; j <= end; j++ )
            mean += spectralFlux.get(j);
         mean /= (end - start);
         threshold.add( mean * MULTIPLIER );
      }
 
      Plot plot = new Plot( "Spectral Flux", 1024, 512 );
      plot.plot( spectralFlux, 1, Color.red );		
      plot.plot( threshold, 1, Color.green ) ;
      new PlaybackVisualizer( plot, 1024, new MP3Decoder( new FileInputStream( FILE ) ) );
   }
}

Not a lot of new things in there. First we calculate the spectral flux function as we did in the last article. Based on this spectral flux function we then calculate the threshold function. For each spectral flux value in the ArrayList spectralFlux we take the THRESHOLD_WINDOW_SIZE spectral flux values before and after it and calculate the average. The resulting mean value is then stored in an ArrayList called threshold. Note that we also multiply each threshold value by MULTIPLIER which is set to 1.5 in this example. After we finished calculating all our functions we plot them. The result looks like this:

What we just did is calculating a running average of the spectral flux function. With this we can detect so called outliers. Anything above the threshold function is an outlier and marks an onset of some kind! It should also be clear why the threshold function values are multiplied by a constant > 1. Outliers must be bigger than the average, in this case 1.5 times bigger than the average. This multiplier is an important parameter of our onset detector as it governs the sensitivity. However, when i apply the detector to songs i try to come up with a single value that works for all of them. I actually did that and arrived at the magic 1.5 multiplier used above. It works well for all the samples that you can find in the svn as well as a lot of other songs i tested.

Now let us combine the spectral flux function and the threshold function. Basically we want a prunned spectral flux function that only contains values that are bigger or equal to the threshold function. We extend the above program by the following lines:

for( int i = 0; i < threshold.size(); i++ )
{
   if( threshold.get(i) <= spectralFlux.get(i) )
      prunnedSpectralFlux.add( spectralFlux.get(i) - threshold.get(i) );
   else
      prunnedSpectralFlux.add( (float)0 );
}

The variable prunnedSpectralFlux is just another ArrayList. The loop is pretty simple, we add zero to the prunnedSpectralFlux list of the current spectral flux is less than the corresponding threshold function value or we add the spectrul flux value minus the threshold value at position i. The resulting prunned spectral flux function looks like this:

Awesome, we are nearly finished! All that is left is to search for peaks in this prunned spectral flux. A Peak is a value that is bigger then the next value. That’s all there is to peak detection. We are nearly done. Let’s write the small code for the peak detection producing a peak ArrayList:

for( int i = 0; i < prunnedSpectralFlux.size() - 1; i++ )
{
   if( prunnedSpectralFlux.get(i) > prunnedSpectralFlux.get(i+1) )
      peaks.add( prunnedSpectralFlux.get(i) );
   else
      peaks.add( (float)0 );				
}

And that’s it. Any value > 0 in the ArrayList peaks is an onset or beat now. To calculate the point in time for each peak in peaks we simply take its index and multiply it by the time span that the original sample window takes up. Say we used a sample window of 1024 samples at a sampling rate of 44100Hz then we have the simple forumula time = index * (1024 / 44100). That’s it. Here’s the output:

We are done. Kind off. That’s the most basic onset detector we can write. Let’s review some of it’s properties. As with many systems we have a few parametes we can tweak. The first one is the sample window size. I almost always use 1024. Lowering this will result in a more fine grained spectrum but might not gain you much. The next parameter is wheter to use Hamming smoothing on the samples before doing the FFT on them. I did see a bit of improvement turning that on but it will also cost you some cycles calculating it. This might be of importance if you plan on doing this on a mobile device with not FPU. Every cycle saved is a good cycle there. Next is the threshold window size. This can have a huge impact on the outcome of the detection stage. I stick to a window of 20 spectal flux values in total as in the example above. This is motivated by calculating how big of a time span those 20 values represent. In the case of a sample window 1024 samples at a sampling rate of 44100Hz that’s roughly have a second worth of data. The value worked well on all the genres i tested the simple detector on but your mileage may vary. The last and probably most important parameter is the multiplier for the threshold function. To low and the detector is to sensitive, to high and nothing get’s cought. In my day job i sometimes have to do outlier analysis which is pretty similar to what we do here. Sometimes i use simple threshold based methods and a value between 1.3-1.6 seems to be some sort of magic number when doing statistics based outlier detection relative to some means. I’d say that the threshold multiplier and the threshold window size should be the only parameters you tweak. Forget about the rest and use the defaults of 1024 for sample window size and Hamming smoothing on. Doing a parameter seach for to many parameters is no fun so boiling it down to “only” two makes sense in this case. The defaults i gave you above work reasonably well with the detector we developed in this series. If you do multi-band analysis you might have to tweak them individually for each band.

What’s left is some improvements like doing this complete process not for the whole spectrum but sub-bands so that we can do onset detection for various instruments (sort of, as we know they overlap in frequency range). Another improvement is to use overlapping sample windows, say by 50%. This smooths out the Fourier transform a bit more and gives better results. One could also try to do dynamic range compression on the whole audio signal before passing it to the FFT and so on and so forth. The code in the framework allows you to experiment easily with various additional steps. You will also want to maybe clean up the resulting peak list and remove any peaks that are to close together, say < 10ms. Learning by doing is the big mantra. Now go out and write awesome music games!

Here's a couple of papers and links i found very useful in the process of getting this to work:

http://www.dspguide.com/ An awesome online book (also available as hard cover) called “The Scientist and Engineer’s Guide to Digital Signal Processing” featuring anything you’ll ever want to know about digital signal processing (as an engineer :) ).
https://ccrma.stanford.edu/~jos/dft/ Another online book called “Mathematics of the discrete Fourier Transform with audio applications” coping with anything related to the discrete Fourier Transform. Pretty heavy on math but a very good read.
http://old.lam.jussieu.fr/src/Membres/Daudet/Publications_files/Onset_Tutorial.pdf A very good scientific paper called “A tutorial on onset detection for musical signals”. Features various onset detection functions (we used the un-banded spectral flux here) as well as some data on the performance of various functions on a standardized data set (Mirex 2006 Onset Detection Challenge). Look out for any other papers by Bello, he’s a big player in the field it seems.
http://www.dafx.ca/proceedings/papers/p_133.pdf “Onset detection revisited” by Dixon, another big name in the field (at least it seems so from the citation counts, but we all know how much they really tell you…). Very good paper, must read.

If you check out the citation graph of the two papers mentioned on google schoolar you will many more papers talking about onset detection that might interest you. Google *is* your friend in this case. Always read your fine papers!

  • Share/Bookmark
 

In an earlier post i talked about the scientific view of onset/beat detection. There’s a lot of different schemes out there that do the job more or less well. However, one approach was so intriguingly simple and performed so well compared to other more elaborate algorithms that i chose to use it for my purposes. It’s called spectral flux or spectral difference analysis and i’ll try to give you an insight of it workings in this article.

One of the best papers on the matter of onset detection is by Bello at al. which you can view here. I’ll review some of the concepts mentioned in there here and will steal a couple of images.

First of what is an onset? Looky:

At the top we see the sample amplitudes or wave form of a single note. The onset is the first thing to happen when playing a note, followed by an attack until it reaches its maximum amplitude followed by a decay phase. I refrain from defining the transient as it is of no concern for us.

The audio signals we want to process are of course a tiny bit more complex than a single note. We want to detect onsets in polyphonic music containing pitched instruments like guitars or violins, vocals and non pitched drums. As we have seen in the frequency chart a couple of articles back most of the pitched instruments overlap in their frequency range (mostly between 70Hz and 4000Hz). Perfect onset detection is thus impossible, however we can try to get an approximate estimation.

From what i could gather from the papers i read on the topic onset detection almost always follows the same schema. First you read in your audio signal, next you transform it to a onset detection function (just another array of floats corresponding to your original audio samples in some way) and finally you pick the peaks in this detection function as onsets/beats. The first stage we already talked about and it turned out to be pretty easy. The second stage of the process is often pretty elaborate doing some magic to transform the audio signal into another simpler, compressed one dimensional function. The last stage is not covered in detail in most of the literature. However, while i was working on a prototype it turned out that this stage is probably the most important one, even more important than the onset detection function calculation. In a future article we’ll concentrate on peak detection, for now let’s stick to onset detection functions.

Here’s a nice little chart for the general process of onset detection, again taken from Bello et al.

The pre-processing stage could for example do dynamic range compression, that is making everything equally loud, a shitty habit used in almost all mixes of modern day music. However shitty DRC might sound it is benefial for some onset detection functions. We won’t use one that benefits from it so we don’t care for pre-processing.

The reduction phase is equal to constructing the onset detection function. The goal of this stage is it to get a compressed view of the structure of the musical signal. The onset detection function should be able to catch musical events and make it easier to detect onsets as compared to trying that on the plain wave form of the audio signal. The output of the onset detection function are values indicating the presence or absence of a musical event for successive time spans. Remember how we read in samples for playback. We read in 1024 samples in each iteration. For such a sample window an onset detection function would output a value. For one second of music sampled at 44100Hz and using 1024 samples as the time span for the onset detection function we’d get roughly 43 values (44100/1024) that model the structure of the audio signal. Coincidentially the sample window size of 1024 can be processed by our FFT class. To summarize: generating the values of the onset detection function means calculating a value for each successive sample window which gives us a hopefully nice curve we perform peak detection on.

Now, i introduced the FFT in the last article so we are going to use it for generating the onset detection function. What’s the motivation for this? Listen to the following excerpt from “Judith” by “A Perfect Circle” and have a look at the image below:

Audio clip: Adobe Flash Player (version 9 or above) is required to play this audio clip. Download the latest version here. You also need to have JavaScript enabled in your browser.

The image depicts the spectra of successive 1024 sample windows of the audio signal. Each vertical pixel colum corresponds to one spectrum. Each pixel in a column stands for a frequency bin. The color tells us how much this frequency bin contributes to the audio signal in the sample window the spectrum was taken from. The values range from blue to red to white, increasing in that order. For reference the y-axis is labeled with the frequencies, the x-axis is labeled with the time. And holy shit batman we can see beats! I took the freedom to mark them with awesome arrows in Paint. The beats are visible only in the high frequency range from a bit about 5000Hz to 16000Hz. They correspond to the hi-hat being played. When you hit a hi-hat it produces something called white noise, a sound that spans the compelte frequency spectrum. Compared to other instruments it has a very high energy when hit. And we can clearly see that in the spectrum plot.

Now, processing this spectrum plot wouldn’t be a lot easier than going for the wave form. For comparison here’s the waveform of the same song excerpt:

We can see the hi-hat hits in the waveform too. So what’s the benefit of going to Fourier transform route? Well, it allows us to focus on specific frequencies. For example, if we are only interested in detecting the beats of the hi-hat or the crash or ride we’d go for the very high frequency range only as this is were those instruments will leave a note of their presence. We can’t do that with the amplitudes alone. The same is true for other instruments like the kick drum or the bass which we’d look for in the very low frequencies only. If we inspect multiple frequency ranges of the spectrum seperately we call it multi-band onset detection. Just to give you some keywords for Mr. Google.

So we know why we do our onset detection in the frequency domain but we are not any wiser on how to compress all the information of the spectrum plot to a managable one dimensional array of floats we can easily do peak detection on. Let me introduce you to our new best friend the spectral difference vulgo spectral flux. The concept is incredibly simple: for each spectrum we calculate the difference to the spectrum of the sample window before the current spectrum. That’s it. What we get is a single number that tells us the absolute difference between the bin values of the current spectrum and the bin values of the last spectrum. Here’s a nice little formula (so i get to use the latex wordpress plugin):

SF(k) =\displaystyle \sum_{i=0}^{n-1} s(k,i) - s(k-1,i)

SF(k) gives us the spectral flux of the kth spectrum. s(k,i) gives us the value of the ith bin in the kth spectrum, s(k-1,i) is analogous for the spectrum before k. We substract the values of each bin of the previous spectrum from the values of the corresponding bin in the current spectrum and sum those differences up to arrive at a final value, the spectral flux of spectrum k!

Let’s write a small program that calculates and visualizes the spectral flux of an mp3:

public class SpectralFlux 
{
   public static final String FILE = "samples/judith.mp3";	
 
   public static void main( String[] argv ) throws Exception
   {
      MP3Decoder decoder = new MP3Decoder( new FileInputStream( FILE  ) );							
      FFT fft = new FFT( 1024, 44100 );
      float[] samples = new float[1024];
      float[] spectrum = new float[1024 / 2 + 1];
      float[] lastSpectrum = new float[1024 / 2 + 1];
      List<Float> spectralFlux = new ArrayList<Float>( );
 
      while( decoder.readSamples( samples ) > 0 )
      {			
         fft.forward( samples );
         System.arraycopy( spectrum, 0, lastSpectrum, 0, spectrum.length ); 
         System.arraycopy( fft.getSpectrum(), 0, spectrum, 0, spectrum.length );
 
         float flux = 0;
         for( int i = 0; i < spectrum.length; i++ )			
            flux += (spectrum[i] - lastSpectrum[i]);			
         spectralFlux.add( flux );					
      }		
 
      Plot plot = new Plot( "Spectral Flux", 1024, 512 );
      plot.plot( spectralFlux, 1, Color.red );		
      new PlaybackVisualizer( plot, 1024, new MP3Decoder( new FileInputStream( FILE ) ) );
   }
}

We start of by instantiating a decoder that will read in the samples from an mp3 file. We also instantiate an FFT object telling it to expect sample windows of 1024 samples and a sampling rate of 44100Hz. Next we create a float array for the samples we read from the decoder. We use a window size of 1024 samples. The next two lines instantiate auxiliary float arrays that will contain the current spectrum and the last spectrum. Finally we also need a place to write the spectral flux for each sample window to, an ArrayList will do the job here.

The following loop does the nasty job of calculating the spectral flux’. First we read in the next 1024 samples from the decoder. If that succeeded we calculate the Fourier transform and spectrum of this sample window via a call to fft.forward(samples). We then copy the last current spectrum to the array lastSpectrum and the spectrum we just calculated to the array spectrum. What follows is the implementation of the formula above. For each bin we substract the value of the last spectrum’s bin from the current spectrum’s bin and add it to a sum variable called flux. When the loop has finished flux contains… the spectral flux of the current spectrum. This value is added to the spectralFlux ArrayList and we continue with decoding the next sample window. We repeat this for each sample window we can read in from the decoder. After the loop is done spectralFlux contains a spectral flux for each sample window we read in. The last three lines just create a plot and visualize the result. Note: i created a small helper class called PlaybackVisualizer which takes a plot, the number of samples per pixel and a decoder that will playback the audio fromt he decoder while setting the marker in the plot correspondingly. Nothing to complex, we saw how to do that in one of the earlier articles. Here’s the output for the song “Judith” we used above already:

Hurray! we can again see peaks. This time they not only correspond to the hi-hat but also to the snare and the kick drum. The reason for this is that use the complete spectrum so the spectral flux values are governed by the instruments that have the most energy, in this case hi-hat, snare and kick drum. Now let’s review what we just did. Take the original spectrum image from above were i marked the hits of the hi-hat. What the code does is essentially creating a sum over the differences of the bins of two successive spectra. When the current spectrum has more overall energy than the previous spectrum the spectral flux function will rise. If the current spectrum has less energy than the previous spectrum the spectral flux function will fall. So we successfully mapped the 2D spectrum plot to a 1D function we can perform peak detection on. Well, sort of…

Before we do any peak detection we have to clean up the spectral flux function a little bit. Let’s start with ignoring negative spectral flux values. We are not interested in falling spectral flux but only in rising spectral flux. The process of ignoring the spectral flux values that are negative is called rectifying. Here’s the modified loop for calculating the spectral flux:

   float flux = 0;
   for( int i = 0; i < spectrum.length; i++ )			
   {
      float value = (spectrum[i] - lastSpectrum[i]);			
      flux += value < 0? 0: value;
   }
   spectralFlux.add( flux );

And the corresponding output:

A bit nicer. Rectifying the spectral flux will aid us when we calculate the a threshold function for peak detection.

Another clean up process is using a so called Hamming window. This Hamming window is applied to the samples before calculating the Fourier transform. It’s basically a smoothing function that will make our samples look a bit nicer. The FFT class has a function by which we can enable Hamming window smoothing:

fft.window(FFT.HAMMING);

That’s all there is to do, here’s the result:

Not a big difference but we can see how it equalizes the first 4 hits a little bit while removing one at the end of the sample. In other songs the hamming window has a bigger effect on the resulting spectral flux but it’s really up to you wheter to use it or not. It doesn’t cost a lot of performance.

With this i want to conclude this article. We have seen how to go from samples to spectrums to calculating a 1D function that we’ll later use to detect the peaks and therefore onsets in an audio signal. From the images above we can already see a lot of peaks which seem easy to detect. In the next article we’ll go a step further and calculate the spectral flux of different frequency bands. By this we hope to be able to extract various instruments as oposed to focusing on the loudest thing in the complete spectrum. We’ll also employ a technique called hopping where we calculate the spectral flux not on overlapping successive sample windows instead of non-overlapping sample windows. This will further smooth the resulting spectral flux function a little bit. We are then ready to create a simple but effective peak detector that gives us the final onset times.

As always you can find all the source code at http://code.google.com/p/audio-analysis/. Over and out.

  • Share/Bookmark
 

Warning: this is how i understood the discrete Fourier transform which might be wrong in a couple of places. For a correct explanation of the fourier transform i suggest reading http://www.dspguide.com/ch8.htm

So finally we’ve come to the dreaded Fourier transform article. Let’s review some of the things we already know. The first thing to remember is that our audio data was fetched from some kind of recording device or synthesized digitally. In any case what we have is measurements of amplitudes for discrete points in time which we call samples. We can either have mono or stereo samples, in the later case we just have two samples per point in time instead of only one. The sample frequency is the number of samples we measured per second (for stereo we count both samples for the left and right channel as a single one in this case). Here’s 1024 samples of the note A (440Hz) at a sampling rate of 44100Hz :

Note: the vertical line to the left is not part of the actual image. It’s wordpress fucking up

1024 samples at a sampling rate of 44100Hz equals a time span of 43,07ms. From this image we can see that the samples form something that resembles a sine wave pretty much. And in fact the samples were generated using Math.sin() as we did in the first article of this series, confirming that sound is nothing but a mixture of sine waves of different frequency and amplitude.

A smart mathemagicion called Fourier once proposed the theorem that all possible functions could be approximated by a series of so called sinusoids, that is sine waves with different frequencies and amplitudes (actually Gauss already knew this but Fourier got the fame).  He did all this magic for so called continuous functions which you will remember from your analysis course in school (you do remember analysis, right?). This was extended to discrete functions and guess what, our samples are just that, a discrete function in time. Now, a Fourier transform decomposes a signal into a set of sinusoids with different frequencies and amplitudes. If we apply a Fourier transform to a discrete audio signal we get the frequency domain representation of our time domain audio signal. This transform is invertable, so we can go one or the other way. From now on we only consider discrete Fourier transforms on audio signals to not overcomplicate matters.

What does the frequency domain of our time domain signal tell us? Well, put simply it tells us what frequencies contribute how much to the time domain signal. A frequency domain signal can thus tell us wheter there’s the note a at 440Hz in the audio signal and how loud this note is in respect to the complete tune. This is pretty awesome as it allows us to investigate specific frequencies for our beat detection purposes. From the frequency chart of the last article we know that various instruments have different frequencies they can produce. If we want to focus on say singing we’d have a look at the frequency range 80Hz to 1000Hz for example. Of course instruments overlap in their frequency ranges so that might lead to some problems. But we’ll care for that later.

Now, what we’ll use for our frequency analysis is called the discrete Fourier transform. Given a discrete time signal, e.g. samples measured at a specific sampling rate we get a discrete frequency domain out. As an intermediate stage we get the so called real and imaginary part of the frequency domain. There is some scary terms in the last sentence so let’s just go over that quickly. The fourier transform decomposes a time signal into coefficients for sinusoids. The real part holds the Fourier coefficients for cosines of specific frequencies (a cosine is a sinusoid) as well as for the sines of the same frequencies. For our use case this representation is not all that important. What we will use is the so called spectrum which is calculated from the real and imaginary parts of the original fourier transform.  The spectrum tells us for each frequency how much the frequency contributes to the original time domain audio signal. From now on we assume that we already have calculated this spectrum, which in the framework is done by a class called FFT (taken from Minim) which relieves some of the pain.  If you want to understand how to go from the real and imaginary parts of the original Fourier transform to the spectrum read the link above.

Not all possible frequencies will be present in the spectrum due to the discrete nature of the transform. Frequencies are binned, that is multiple frequencies are merged together into a single value. Also, there’s a maximum frequency the discrete Fourier transform can give us which is called the Nyquist frequency. It is equal to half the sampling rate of our time domain signal. Say we have an audio signal sampled at 44100Hz the Nyquist frequency is 22050Hz. When we transform the signal to the frequency domain we thus get frequency bins up to 22050Hz. How many of those bins are there? Half the number of samples plus one bin. When we transform 1024 samples we get 513 frequency bins each having a bandwidth (range of frequencies) of the Nyquist frequency divided by the number of bins except for the first and last bin which have half that bandwidth. Say we have 1024 samples we transform, sampled at 44100Hz. The first bin will represent frequencies 0 to 22050 / 513 / 2~ 21.5Hz (remember, the first bin has half the normal bandwidth). The next bin will represent the frequencies from 21.5Hz to 21.5Hz plus 22050 / 513 ~ 43Hz == 64.5 Hz (one more time, 21.5Hz to 64.5Hz, bandwidth is 43Hz). The next bin ranges from 64.5Hz to 107.5Hz and so on and so forth.

Whenever we do a discrete Fourier transform on our audio data we do it on a window of samples, e.g. 1024 samples is a common window size. The size of the window dictates the resolution of the resulting spectrum, that is the number of frequency bins in the spectrum will increase linearly with the number of samples we transform. The bigger the sample window the more fine grained the bins will be, that is the smaller their bandwidth. To calculate the spectrum we use a specific algorithm called the Fast Fourier Transform (FFT) which runs in O(n log n), pretty fast compared to a naive fourier transform implementatin which takes O(n*n). Almost all of the FFT implementations demand sample windows with a size being a power of two. So 256, 512, 1024, 2048 is fine, 273 is not. The FFT implementation we use in the framework also has this “limitation”.

So without further ado i present you the spectrum of the samples depicted in the image above

Note: the vertical lines to the left and right are not part of the actual image. It’s wordpress fucking up

Yes, that’s it. The x-axis corresponds to the frequency bins while the y-axis tells us the amplitude of a frequency bin. We clearly see a peek at the left side, the rest of the spectrum is zero. This peak corresponds to the frequency bin which contains the frequency 440Hz, the frequency of the note A we generated 1024 samples for. What we also see is that it’s not a clean peak, bins to the left and right also receive some amplitude. This is called leakage and is something we can’t solve. In our case it’s not a big concern really but in other application scenarios it might be. Here’s the program that generated the above two images:

public class FourierTransformPlot 
{
   public static void main( String[] argv )
   {
      final float frequency = 440; // Note A		
      float increment = (float)(2*Math.PI) * frequency / 44100;		
      float angle = 0;		
      float samples[] = new float[1024];      
 
      for( int i = 0; i < samples.length; i++ )
      {
         samples[i] = ((float)Math.sin( angle ));
         angle += increment;			
      }
 
      FFT fft = new FFT( 1024, 44100 );
      fft.forward( samples );
 
      Plot plotSamples = new Plot( "Samples", 512, 512 );
      plotSamples.plot( samples, 2, Color.white );
 
      Plot plotSpectrum = new Plot( "Spectrum", 512, 512);
      plotSpectrum.plot(fft.getSpectrum(), 1, Color.white );
   }
}

The first couple of lines should be familiar, we simply generate 1024 samples at a sampling rate of 44000Hz. There’s actually only 2 interesting lines in this program. The first one is the one where we instantiate the FFT object. As parameters for it’s constructor it wants the sample window size we’ll use to generate the spectrum as well as the sampling rate. The next line performs the Fourier transform as well as the spectrum calculation. All we have to do is pass in a float array of samples having the size we specified in the FFT constructor. That’s all we need to do to get our precious frequency bins. What’s left is plotting the samples as well as the spectrum. Note the call to fft.getSpectrum(). This method will return the spectrum of the last fourier transform we did by calling FFT.forward().

Pfuh, that was pretty nasty. I really suggest reading up on the discrete Fourier transform. It’s an essential mathematical tool for all audio engineers. And while most audio engineers will use some sort of toolbox as we did it doesn’t hurt to know your way around.

We now are equipped with all we need to get our onset/beat detector going. Stay tuned for the next article.

  • Share/Bookmark