Spectrograms in C++ with libav (FFmpeg), libpng and S/PHI/nX

Written by vasek on . Posted in All Posts, S/PHI/nX

SxFFT It this C++ example we will draw spectrograms with libav (FFmpeg), S/PHI/nX, and libpng. Libav is used to read audio files. From S/PHI/nX we will use mostly SxMatrix to store audio sample data and vector of complex numbers as the input for SxFFT1d. Libpng writes images in PNG file format.

The spectrogram is an important representation of audio data because human hearing is based on a kind of real-time spectrogram encoded by the cochlea of the inner ear (more here). The spectrogram can be defined as an intensity plot of Short-Time Fourier Transform (STFT) magnitude. STFT is a sequence of Fast Fourier Transform (FFT) on input audio samples. The intensities are usually plotted in decibel logarithmic scale.

Drawing spectrograms

Decibel sound pressure level from FFT output is \( 10 \log_{10} \left| 2 \frac{f_{\mathrm{out}}}{f_{\mathrm{size}}} \right|^2. \)

We start with FFT Post scaling to normalise FFT output by \( 2 \frac{f_{\mathrm{out}}}{f_{\mathrm{size}}} \). We multiply by two because we will use only one half of the output (the other half is symmetric) and we scale by FFT size. After this the values no longer depend on FFT size we used.

The next part is to get intermediate amplitude with \( \left| 2 \frac{f_{\mathrm{out}}}{f_{\mathrm{size}}} \right|^2 \) which is real part ^ 2 + imaginary part ^ 2. Because we are using intermediate amplitude (without square root), we will multiply by 10 instead of 20. (The logarithm is multiplied by 20 in a common form of spectrogram).

Two natural logarithms are used to get the logarithm to base 10 as \( \frac{\log (A + \epsilon)}{\log(10)} \). The amplitude A is increased by small value epsilon=1e-6 because the logarithm does not like zero much. These logarithms are quite picky. Another result of such epsilon in logarithm is that it sets dB threshold to -60. This cuts the noise below the threshold from the spectrogram.

To draw the spectrogram we repeat the equation for each column in the image with some advancement in the sample data which is specified by step. This step can be smaller than FFT size and the default value is half of FFT size. Much more thorough description is for example in Handling FFT Output and Input.

Spectrogram code which draws image of size width * height pixels:

   SxFFT1d fft1d (SxFFT::Forward, fftSize);
   SxVector<Complex16> in(fftSize), out(fftSize);
   // --- draw width x height pixels
   for (ssize_t x=0; x < width; x++)  {
      // --- fill new samples from channel 0 to complex numbers
      in = samples.colRef (0, x * step, (x * step) + fftSize - 1);
      fft1d.fftForward (fftSize, in.elements, out.elements);

      // --- get Decibels and put them to range <0, 1> by /-60;
      out = 10. * (log ((out * (2. / fftSize)).absSqr() + 1e-6) / log(10)) /-60.;
      // --- putpixel one image column
      for (ssize_t i=0; i < height; i++)
         image(x, height - (i + 1)) = 255 - uint8_t(out(i).re * 255.);

Audio files and libav (FFmpeg)

All we need is to read a sequence of samples but audio data can be stored in many different audio file formats such as well known mp3. Audio file formats are read and written with audio codecs which encode or decode digital data. The codecs have different properties: uncompressed audio, lossless or lossy compression, various support of sampling rates, bit depth or number of channels. A list of technical details shows the limits of several audio file formats.

We will use 3rd party library libav which can deal with many of these formats. Audio/Video library libav (originally from FFmpeg) and mainly its part libavcodec provides excellent support for many audio formats and multimedia formats in general. More about libav and its features.

Some formats as for example MP3 are patented. The distribution and/or selling of something with MP3 coders require payments. Here is more detailed page with patent status of other audio formats. Since MP3 was mentioned, reading of mp3 files is supported by libav and to write mp3 files we can extend libav with LAME MP3 Encoder.

readAudio function

The function reads audio stream from a file and returns it as a audio data in matrix. The first parameter specifies the input file. The second parameter optionally (if not NULL) returns the sample rate. The decoder is automatically detected from the file extension. The most of this function is taken from api-example.c which is distributed together with libav.

SxMatrix<Float> readAudio (const SxString &url, int *sampleRate)
   av_register_all ();

   AVFormatContext *formatContext = NULL;
   AVCodecContext *codecContext = NULL;
   AVCodec *codec = NULL;
   AVPacket packet;
   AVFrame frame;
   SxMatrix<Float> samples;
   SxList<SxPtr<SxArray<uint8_t> > > bufferList;
   ssize_t nSamples = 0;
   // --- open the file and autodetect the format
   if (avformat_open_input (&formatContext, url.ascii (), NULL, NULL) < 0)
      SX_THROW ("Can not open input file " + url);

   if (avformat_find_stream_info (formatContext, NULL) < 0)
      SX_THROW ("Can not find stream information");
   // --- find audio stream (between video/audio/subtitles/.. streams)
   int audioStreamId = av_find_best_stream (formatContext, AVMEDIA_TYPE_AUDIO,
                                            -1, -1, &codec, 0);
   if (audioStreamId < 0)
      SX_THROW ("Can not find audio stream in the input file");

   codecContext = formatContext->streams[audioStreamId]->codec;

   // --- init the audio decoder
   if (avcodec_open2 (codecContext, codec, NULL) < 0)
      SX_THROW ("Can not open audio decoder");
   // --- read all packets
   while (av_read_frame (formatContext, &packet) == 0)  {
      if (packet.stream_index == audioStreamId)  {
         avcodec_get_frame_defaults (&frame);
         int gotFrame = 0;
         if (avcodec_decode_audio4 (codecContext,&frame,&gotFrame,&packet) < 0){
            SX_THROW ("Error decoding audio");
         if (gotFrame)  {
            // --- audio frame has been decoded
            int size = av_samples_get_buffer_size (NULL,
            if (size < 0)  {
               SX_THROW ("av_samples_get_buffer_size invalid value");
            bufferList << SxPtr<SxArray<uint8_t> >::create (frame.data[0],size);
            nSamples += frame.nb_samples;
      av_free_packet (&packet);
   if (nSamples < 1)
      SX_THROW ("Decoded audio data is empty");
   int sampleSize = av_get_bytes_per_sample (codecContext->sample_fmt);
   if (sampleSize < 1)
      SX_THROW ("Invalid sample format");

   // --- create one huge matrix from all buffers
   ssize_t n = 0;
   ssize_t nCols = codecContext->channels;
   samples.reformat (nSamples, nCols);

   SxList<SxPtr<SxArray<uint8_t> > >::ConstIterator it;
   for (it = bufferList.begin (); it != bufferList.end(); ++it)  {
      ssize_t nRows = (*it)->getSize () / (sampleSize * nCols);

      // --- fill each column in matrix as one channel, de-interleave channels
      for (ssize_t c=0; c < nCols; c++)  {
         // --- convert sample values to float from sample format in input file
         if (codecContext->sample_fmt == AV_SAMPLE_FMT_U8)  {
            uint8_t *ptr = (*it)->elements;
            for (ssize_t r=0; r < nRows; r++)  {
               samples(r + n, c) = ((float)(ptr[r * nCols + c] - 128) / 128.f);
         } else if (codecContext->sample_fmt == AV_SAMPLE_FMT_S16)  {
            int16_t *ptr = (int16_t*)(*it)->elements;
            for (ssize_t r=0; r < nRows; r++)  {
               samples(r + n, c) = (float)ptr[r * nCols + c] * (1.f / (1<<15));
         } else if (codecContext->sample_fmt == AV_SAMPLE_FMT_S32)  {
            int32_t *ptr = (int32_t*)(*it)->elements;
            for (ssize_t r=0; r < nRows; r++)  {
               samples(r + n, c) = (float)ptr[r * nCols + c] * (1.f / (1U<<31));
         } else if (codecContext->sample_fmt == AV_SAMPLE_FMT_FLT)  {
            float *ptr = (float*)(*it)->elements;
            for (ssize_t r=0; r < nRows; r++)  {
               samples(r + n, c) = ptr[r * nCols + c];
         } else if (codecContext->sample_fmt == AV_SAMPLE_FMT_DBL)  {
            double *ptr = (double*)(*it)->elements;
            for (ssize_t r=0; r < nRows; r++)  {
               samples(r + n, c) = (float)ptr[r * nCols + c];
         } else {
            SxString fmt = av_get_sample_fmt_name (codecContext->sample_fmt);
            SX_THROW ("Sample format '" + fmt + "' is not supported");
      n += nRows;

   // --- optional, return value with sample rate
   if (sampleRate) *sampleRate = codecContext->sample_rate;
   if (codecContext) avcodec_close (codecContext);
   avformat_close_input (&formatContext);
   return samples;

Audio data in SxMatrix

We use one SxMatrix to store all samples and all channels. The rows in the matrix are the amplitudes of an audio signal in floats (converted to range -1 to 1). The columns represent audio channels. For example 1 column is for Mono sound, 2 columns are for Stereo sound, 6 columns for 5.1 surround and so on. A sequence of samples in one channel is likely to hit the cache because the storage of elements in the matrix is in column-major order. The capital Float instead of float is caused by historical reasons when templates had less powerful support between various compilers.

An example of sound wave in SxMatrix:

   SxMatrix<Float> samples;

   // --- create 100 samples of Stereo sound
   //     that is 100 samples in left and 100 samples in right channel
   samples.reformat (100, 2);

   // --- set the first sample (0) in left channel (0) to minimal amplitude (-1.)
   samples(0, 0) = -1.;

   // --- set the first sample (0) in right channel (1) to maximal amplitude (1.)
   samples(0, 1) = 1.;

   // --- write samples to sound.wav with 8 kHz sampling rate
   //     = 100 samples will make 12.5 ms short sound
   writeAudio ("sound.wav", samples, 8000);

   cout << samples.nRows () << " samples, "
        << samples.nCols () << " channels" << endl;

Main program

The program reads a sound, computes spectrogram and writes it to png image file. In the beginning we parse the program arguments with SxCLI. This will give us the names for the input and output files and the parameters for spectrogram. These are fftSize and step. ReadAudio function uses libav to read audio samples from the input file and returns them in SxMatrix. The next part creates spectrogram image and fills the pixels from FFT outputs converted to decibel sound pressure levels (dbSPL). Then we create image palette with 256 colors and write the image with palette to the output PNG file:

int main (int argc, char **argv)
   // --- parse input arguments
   SxCLI cli (argc, argv);
   SxString filenameIn  = cli.option ("-i|--input", "string",
                                      "Audio input file").toString ();
   SxString filenameOut = cli.option ("-o|--output", "string",
                                      "PNG output file").toString ("out.png");
   int fftSize          = cli.option ("-n", "integer",
                                      "FFT size").toInt (2048, 8);
   int step             = cli.option ("-s", "integer",
                                      "Pixels in width per second").toInt (0,0);
   cli.finalize ();
   try  {
      cout << "Reading audio..." << endl;
      int sampleRate = 0;
      SxMatrix<Float> samples = readAudio (filenameIn, &sampleRate);
      cout << samples.nRows () << " samples, "
           << samples.nCols () << " channels, "
           << sampleRate << " Hz" << endl;
      // --- pixels in image width each second
      if (step > 0)  {
         step = sampleRate / step; 
      }  else  {
         step = fftSize / 2; // default 50% FFT overlap
      // --- create image width * height pixels
      SxNArray<uint8_t, 2> image;
      ssize_t width  = (samples.nRows () - fftSize) / step;
      ssize_t height = fftSize / 2;
      image.reformat (width, height);
      SxFFT1d fft1d (SxFFT::Forward, fftSize);
      SxVector<Complex16> in(fftSize), out(fftSize);
      cout << "Computing FFT..." << endl;
      for (ssize_t x=0; x < width; x++)  {
         // --- fill new samples from channel 0 to complex numbers
         in = samples.colRef (0, x * step, (x * step) + fftSize - 1);
         fft1d.fftForward (fftSize, in.elements, out.elements);

         // --- get Decibels and put them to range <0, 1> by /-60;
         out = 10. * (log ((out * (2./fftSize)).absSqr() + 1e-6)/log(10)) /-60.;
         // --- putpixel one image column
         for (ssize_t i=0; i < height; i++)
            image(x, height - (i + 1)) = 255 - uint8_t(out(i).re * 255.);
      // --- create rainbow palette with 256 colors
      SxNArray<uint8_t, 2> palette;
      palette.reformat (3, 256);
      for (int i=0; i < 32; i++)  {
         palette(0, i) = 0;
         palette(1, i) = 0;
         palette(2, i) = i * 4;
         palette(0, i+32) = 0;
         palette(1, i+32) = 0;
         palette(2, i+32) = 128 + i * 4;
         palette(0, i+64) = 0;
         palette(1, i+64) = i * 4;
         palette(2, i+64) = 255;
         palette(0, i+96) = 0;
         palette(1, i+96) = 128 + i * 4;
         palette(2, i+96) = 255;
         palette(0, i+128) = 0;
         palette(1, i+128) = 255;
         palette(2, i+128) = 255 - i * 8;
         palette(0, i+160) = i * 8;
         palette(1, i+160) = 255;
         palette(2, i+160) = 0;
         palette(0, i+192) = 255;
         palette(1, i+192) = 255 - i * 8;
         palette(2, i+192) = 0;
         palette(0, i+224) = 255;
         palette(1, i+224) = 0;
         palette(2, i+224) = 0;
      cout << "Writing image..." << endl;
      writeImage (filenameOut, image, palette);
   } catch (SxException e)  {
      cout << "ERROR: " << e.getMessage () << endl;
      exit (1);
   return 0;

Building the project

To run this example we will continue as follows: the first step is to create some work directory, then we download all libraries and the source code example into this folder and install them there locally. This way we do not need root access and it is also easier to try, because if something goes wrong, we can just delete the work directory and start again. The final file structure in our work directory should look like this:


Installing SxAccelerate

Convenient way is to install SxAccelerate from a binary distribution (requires only small modification of include and library paths in compiler flags for this example).
Without binary distribution, the installation of SxAccelerate from source is the same as in the previous example and we can do the same steps as before. In the end we should have sphinx/sxaccelerate/include directory with header files and sphinx/sxaccelerate/lib directory with libraries we can link to.

curl -O http://sxlib.mpie.de/downloads/sxaccelerate-1.0.2.tar.bz2
curl -O http://sxlib.mpie.de/downloads/sx-3rdparty-unix.tar

tar xvf sxaccelerate-1.0.2.tar.bz2
mv sx-3rdparty-unix.tar sphinx/sxaccelerate/3rd-party/packages/
cd sphinx/sxaccelerate/3rd-party/packages
tar xvf sx-3rdparty-unix.tar
cd ../../../..

cd sphinx/sxaccelerate
./configure --prefix=`pwd`
make install

Installing libav

Please download libav-0.8.6.tar.gz. Extract the downloaded file as libav directory in your work directory (it is easier to have the directory without the version number). In the configure step we are using prefix with the current directory to install everything locally to libav. If it is working there will be libav/include directory with header files and libav/lib directory with the library we can link to.

curl -O http://libav.org/releases/libav-0.8.6.tar.gz
tar xvf libav-0.8.6.tar.gz
mv libav-0.8.6 libav
cd libav
./configure --prefix=`pwd` --disable-yasm
make install

We could also download and install yasm to have faster encoding and decoding.

Installing libpng

On some systems, libpng is already installed but probably without the header files which we need in order to compile the example. Please download libpng-1.5.11.tar.gz from http://www.libpng.org/pub/png/libpng.html. Extract the downloaded file as libpng directory in your work directory (it is easier to have the directory without the version number). In the configure step we are using prefix with the current directory to install everything locally to libpng. If it is working there will be libpng/include directory with header files and libpng/lib directory with the library we can link to.

tar xvf libpng-1.5.11.tar.gz
mv libpng-1.5.11 libpng
cd libpng
./configure --prefix=`pwd`
make install

Compile and run

Now we can download the complete example audio.cpp and compile it.

Because we have several external libraries, we need to provide the paths to include and lib directories so that gcc can find the header files and link to the libraries. Additionally we specify to which libraries we need to link. These are lib fftw3, png, avutil, avformat, avcodec, z, bz2, sxutil, and sxmath.
On MacOS X we additionally link to frameworks: CoreFoundation, VideoDecodeAcceleration and QuartzCore.
S/PHI/nX is compiled with several extended diagnostic messages to prevent some risky C++ code (the flags are in sphinx/sxaccelerate/src/system/m4/sxcompflags.m4). We will use at least -Wall and -ansi.

g++ audio.cpp -Ilibav/include -Llibav/lib \
 -Ilibpng/include -Llibpng/lib \
 -Isphinx/sxaccelerate/include -Lsphinx/sxaccelerate/lib \
 -Isphinx/sxaccelerate/3rd-party/include \
 -Lsphinx/sxaccelerate/3rd-party/lib \
 -lfftw3 -lpng -lavutil -lavformat -lavcodec -lz -lbz2 -lsxutil -lsxmath \
 -framework CoreFoundation -framework VideoDecodeAcceleration \
 -framework QuartzCore -Wall -ansi -o audio

If everything worked and the executable was generated, we can specify some input and run the program. Following command reads the sound from hmpback1.wav and draws spectrogram with FFT size 1024 to the default output image out.png.

./audio -i hmpback1.wav -n 1024

Spectrogram of Whale Song #1. note As we can see, the image is mostly black because these whales do not say much. The input audio file has 1 channel with 330 thousand samples at sample rate 22050 Hz. The image has 643 x 512 pixels and shows 15 seconds in horizontal axis (643 FFTs) and 512 frequency bins in linear vertical axis from 0 Hz (bottom) to 11025 Hz (top) which makes 21.5 Hz per pixel. If we use greater FFT size, we would get better resolution in frequency. For 2048 FFT points we get 10.76 Hz per pixel, the image could look nicer but it would be also half as thin. Just as a note, our ears are sensitive to frequencies from 20 Hz to 20 kHz.

Many thanks to Fabrice Bellard, Michael Niedermayer and all from FFmpeg/libav project!

Please follow the author on Twitter @vbubnik.

Tags: ,

© 2013 Gemmantics