Music Analysis - Spectrogram

Warning! Some information on this page is older than 6 years now. I keep it for reference, but it probably doesn't reflect my current knowledge and beliefs.

Jul 2010

I've started learning about sound analysis. I have some deficiencies in education when it comes to digital signal processing (greetings for the professor who taught this subject at our university ;) but Wikipedia comes to the rescue. As a starting point, here is a spectrogram I've made from one of my recent favourite songs: Sarge Devant feat. Emma Hewitt - Take Me With You.

Now I'm going to exaplain in details how I've done this by showing some C++ code. First I had to figure out how to decode an MP3, OGG or other compressed sound format. FMOD is my favourite sound library and I knew it can play many file formats. It took me some time though to find functions for fast decoding uncompressed PCM data from a song without actually playing it for all 3 minutes. I've found on the FMOD forum that Sound::seekData and Sound::readData can do the job. Finally I've finished with this code (all code shown here is stripped from error checking which I actually do everywhere):

#include <fmod.hpp>
#pragma comment(lib, "fmodex_vc.lib")

// Initialize FMOD.
// Set output device type to dummy/null.
FMOD::System *g_System;
g_System->init(100, FMOD_INIT_NORMAL, 0);

// Open the music file as stream, software mode.
FMOD::Sound *sound;
g_System->createStream("C:\\...mp3", FMOD_SOFTWARE, 0, &sound);

// Fetch song parameters
/* In my case it is:
freq=44100, Format=FMOD_SOUND_FORMAT_PCM16, Channels=2, Bits=16
Length=221570 ms, PCM=9771264, Bytes=39085056 */
float freq; FMOD_SOUND_FORMAT format; int channels, bits;
uint lenMs, lenPcm, lenPcmBytes;
sound->getDefaults(&freq, 0, 0, 0);
sound->getFormat(0, &format, &channels, &bits);
sound->getLength(&lenMs,       FMOD_TIMEUNIT_MS);
sound->getLength(&lenPcm,      FMOD_TIMEUNIT_PCM);
sound->getLength(&lenPcmBytes, FMOD_TIMEUNIT_PCMBYTES);
StartMusic(freq, format, channels, bits, lenMs, lenPcm, lenPcmBytes);

// Decode song data
static const uint BUF_SIZE = 0x10000;
uint8 buf[BUF_SIZE];
bool processing = true;
  uint bytesRead;
  res = sound->readData(buf, BUF_SIZE, &bytesRead);
  if (res == FMOD_ERR_FILE_EOF)
    processing = false;
  ProcessMusicBuf(buf, bytesRead);
while (processing);


Bytes entering ProcessMusicBuf function (always 0x10000 = 65536 bytes except the last call) are PCM samples arranged like this: each sample is a little-endian signed short number for left channel followed by similar a number for right channel. So each sample has 4 bytes. There are 44100 samples per second (this is the sampling frequency in Hz).

To make a spectrum, one need to transform data from time domain to freqency domain, that is to perform the Fourier Transform. I used FFTW library for this. Here is how I do it:

#include <fftw3.h>
#pragma comment(lib, "libfftw3f-3.lib")

static const uint FFT_INPUT_SAMPLES = 4096;
static const uint FFT_OUTPUT_SAMPLES = (FFT_INPUT_SAMPLES / 2) + 1;
static float g_InBuf[FFT_INPUT_SAMPLES], g_WindowedBuf[FFT_INPUT_SAMPLES];
static uint g_InBufPos = FFT_INPUT_SAMPLES/2;
static fftwf_complex g_OutBuf[FFT_OUTPUT_SAMPLES];
static fftwf_plan g_Plan;

// Initialization
g_Plan = fftwf_plan_dft_r2c_1d(FFT_INPUT_SAMPLES, g_WindowedBuf, g_OutBuf, FFTW_ESTIMATE);

for (uint i = 0; i < FFT_INPUT_SAMPLES/2; ++i)
  g_InBuf[i] = 0.f;

// Finalization

// Processing function
void ProcessMusicBuf(const void *buf, uint bytes)
  const uint8 *bufBytes = (const uint8*)buf;
  uint samples = bytes / 4;
  while (samples)
    uint samplesToCopy = std::min(samples, FFT_INPUT_SAMPLES - g_InBufPos);
    PostSamples(g_InBuf + g_InBufPos, bufBytes, samplesToCopy);
    g_InBufPos += samplesToCopy;
    if (g_InBufPos == FFT_INPUT_SAMPLES)
      memcpy(g_InBuf, g_InBuf + FFT_INPUT_SAMPLES/2, sizeof(float) * FFT_INPUT_SAMPLES / 2);
      g_InBufPos = FFT_INPUT_SAMPLES/2;

    samples -= samplesToCopy;
    bufBytes += samplesToCopy * 4;

// Helper functions

void PostSamples(float *outSamples, const uint8 *inPcmData, uint sampleCount)
  for (uint i = 0; i < sampleCount; ++i, inPcmData += 4, ++outSamples)
    short lSample = *(const short*)inPcmData;
    short rSample = *(const short*)(inPcmData + 2);
    *outSamples = ((float)lSample + (float)rSample) / 65536.f;

static void PostInBuf()
  for (uint i = 0; i < FFT_INPUT_SAMPLES; ++i)
    g_WindowedBuf[i] = g_InBuf[i];
    // Hamming window
    g_WindowedBuf[i] *= 0.54f - 0.46f * cosf( (PI * 2.f * i) / (FFT_INPUT_SAMPLES - 1) );


  // I rewrite to g_OutBuf[i][0] squared absolute value of a complex number g_OutBuf[i].
  for (uint i = 0; i < FFT_OUTPUT_SAMPLES; ++i)
    g_OutBuf[i][0] = g_OutBuf[i][0]*g_OutBuf[i][0] + g_OutBuf[i][1]*g_OutBuf[i][1];
  // see below...

Here I have a window for 4096 samples sliding along the data. PCM values (2 signed shorts) are averaged and rescaled to -1..1 for the g_InBuf by PostSamples function. g_InBuf is filled with zeros to the half at the beginning and then each time 2048 new samples is read, the window is advanced bo 2048 samples. So each time I process the buffer with PostInBuf, I have 2048 new samples on the right and 2048 old samples on the left. In other words, windows overlap by a half. I process buffer every 2048 samples which gives 0.04644 s.

I process g_InBuf into g_WindowedBuf using Hamming window. There exist many Window functions with different properties and my choice so far is rather arbitrary :)

Next I execute the FFT transform. 4096 real numbers on input processed by FFTW dft_r2c_1d transform gives 4096/2+1 = 2049 complex numbers at output, which I suppose can be interpreted as frequencies from 0 to 22050 Hz with 10.7666 Hz step. I don't know yet how could I use the phase of these complex numbers so I only measure squared absolute values (real*real + imag*imag), as they probably give the amplitude, energy or whatever at particular frequency and in particular point in time.

Finally to visualize my spectrum I use FreeImage library. It can allocate, fill and then save a bitmap in many different image file formats (like BMP, PNG, JPEG). Here is the code:

#include <FreeImage.h>
#pragma comment(lib, "FreeImage.lib")

static const uint imgSizeX = 4759;
static const uint imgSizeY = 256;
static FIBITMAP *bitmap;
static uint g_ImageX;

// Initialization
bitmap = FreeImage_Allocate((int)imgSizeX, (int)imgSizeY, 24);

// Finalization
FreeImage_Save(FIF_BMP, bitmap, "G:\\tmp\\Spectrogram.bmp");

void PostInBuf()
  // see above...
  if (g_ImageX < imgSizeX)
    uint samplesPerPixel = FFT_OUTPUT_SAMPLES / imgSizeY;
    float val;
    RGBQUAD pixel;
    for (uint y = 0, sample = 0; y < imgSizeY; ++y)
      // Average values over samplesPerPixel numbers.
      val = 0.f;
      for (uint localSample = 0; localSample < samplesPerPixel; ++localSample, ++sample)
        val += g_OutBuf[sample][0];
      val /= (float)samplesPerPixel;
      val = log10f(val + 1.f); // Logarithm.
      val *= 0.4f / 3.f; // Manually selected scaling.
      val = common::minmax(0.f, val, 1.f); // Clamp.
      val = powf(val, 1.f/2.2f); // Gamma (a computer graphics thing :)
      pixel.rgbRed = pixel.rgbGreen = pixel.rgbBlue = (uint8)(val * 255.f + 0.5f);
      FreeImage_SetPixelColor(bitmap, g_ImageX, y, &pixel);


I know these algorithms could be optimized, parallelized and so on, but now I'm more concerned about what to do with this further. What information can be extracted from music to be used for generating gameplay like in AudioSurf? What are good articles about this on the Internet?

The spectrogram I made already reveals some regular features, so I think it could be now processed as image :) But 2D image recognition and processing is another broad topic...

Another of my questions for those who managed to read up to this point is: wouldn't be faster and more convenient to do such math research in Scilab instead of C++? Is it possible to do such things in Scilab anyway?

Comments | #math #libraries #music #rendering #dsp #algorithms Share


[Download] [Dropbox] [pub] [Mirror] [Privacy policy]
Copyright © 2004-2023