How is audio represented with numbers in computers?

Physically, as you probably know, audio is a vibration. Typically, we're talking about vibrations of air between approximitely 20Hz and 20,000Hz. That means the air is moving back and forth 20 to 20,000 times per second.

If you measure that vibration and convert it to an electrical signal (say, using a microphone), you'll get an electrical signal with the voltage varying in the same waveform as the sound. In our pure-tone hypothetical, that waveform will match that of the sine function.

Now, we have an analogue signal, the voltage. Still not digital. But, we know this voltage varies between (for example) -1V and +1V. We can, of course, attach a volt meter to the wires and read the voltage.

Arbitrarily, we'll change the scale on our volt meter. We'll multiple the volts by 32767. It now calls -1V -32767 and +1V 32767. Oh, and it'll round to the nearest integer.

Now, we hook our volt meter to a computer, and instruct the computer to read the meter 44,100 times per second. Add a second volt meter (for the other stereo channel), and we now have the data that goes on an audio CD.

This format is called stereo 44,100 Hz, 16-bit linear PCM. And it really is just a bunch of voltage measurements.


Audio can represented by digital samples. Essentially, a sampler (also called an Analog to digital converter) grabs a value of an audio signal every 1/fs, where fs is the sampling frequency. The ADC, then quantizes the signal, which is a rounding operation. So if your signal ranges from 0 to 3 Volts (Full Scale Range) then a sample will be rounded to, for example a 16-bit number. In this example, a 16-bit number is recorded once every 1/fs/

So for example, most WAV/MP3s are sampled an audio signal at 44 kHz. I don't know how detail you want, but there's this thing called the "Nyquist Sampling Rate" the says that the sampling frequency must be at least twice the desired frequency. So on your WAV/MP3 file you are at best going to be able to hear up tp 22 kHz frequencies.

There is a lot of detail you can go into in this area. The simplest form would certainly be the WAV format. It is uncompressed audio. Formats like mp3 and ogg are have to be decompressed before you can work with them.


Minimal C audio generation example

The example below generates a pure 1000k Hz sinus in raw format. At the common 44.1kHz sampling rate, it will last about 4 seconds.

main.c:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main(void) {
    FILE *f;
    const double PI2 = 2 * acos(-1.0);
    const double SAMPLE_FREQ = 44100;
    const unsigned int NSAMPLES = 4 * SAMPLE_FREQ;
    uint16_t ampl;
    uint8_t bytes[2];
    unsigned int t;

    f = fopen("out.raw", "wb");
    for (t = 0; t < NSAMPLES; ++t) {
        ampl = UINT16_MAX * 0.5 * (1.0 + sin(PI2 * t * 1000.0 / SAMPLE_FREQ));
        bytes[0] = ampl >> 8;
        bytes[1] = ampl & 0xFF;
        fwrite(bytes, 2, sizeof(uint8_t), f);
    }
    fclose(f);
    return EXIT_SUCCESS;
}

GitHub upstream.

Generate out.raw:

gcc -std=c99 -o main main.c -lm
./main

Play out.raw directly:

sudo apt-get install ffmpeg
ffplay -autoexit -f u16be -ar 44100 -ac 1 out.raw

or convert to a more common audio format and then play with a more common audio player:

ffmpeg -f u16be -ar 44100 -ac 1 -i out.raw out.flac
vlc out.flac

Generated FLAC file: https://github.com/cirosantilli/media/blob/master/canon.flac

Parameters explained at: https://superuser.com/questions/76665/how-to-play-a-pcm-file-on-an-unix-system/1063230#1063230

Related: What ffmpeg command to use to convert a list of unsigned integers into an audio file?

Tested on Ubuntu 18.04.

Canon in D in C

Here is a more interesting synthesis example.

Outcome: https://www.youtube.com/watch?v=JISozfHATms

main.c

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef uint16_t point_type_t;

double PI2;

void write_ampl(FILE *f, point_type_t ampl) {
    uint8_t bytes[2];
    bytes[0] = ampl >> 8;
    bytes[1] = ampl & 0xFF;
    fwrite(bytes, 2, sizeof(uint8_t), f);
}

/* https://en.wikipedia.org/wiki/Piano_key_frequencies */
double piano_freq(unsigned int i) {
    return 440.0 * pow(2, (i - 49.0) / 12.0);
}

/* Chord formed by the nth note of the piano. */
point_type_t piano_sum(unsigned int max_ampl, unsigned int time,
        double sample_freq, unsigned int nargs, unsigned int *notes) {
    unsigned int i;
    double sum = 0;
    for (i = 0 ; i < nargs; ++i)
        sum += sin(PI2 * time * piano_freq(notes[i]) / sample_freq);
    return max_ampl * 0.5 * (nargs + sum) / nargs;
}

enum notes {
    A0 = 1, AS0, B0,
    C1, C1S, D1, D1S, E1, F1, F1S, G1, G1S, A1, A1S, B1,
    C2, C2S, D2, D2S, E2, F2, F2S, G2, G2S, A2, A2S, B2,
    C3, C3S, D3, D3S, E3, F3, F3S, G3, G3S, A3, A3S, B3,
    C4, C4S, D4, D4S, E4, F4, F4S, G4, G4S, A4, A4S, B4,
    C5, C5S, D5, D5S, E5, F5, F5S, G5, G5S, A5, A5S, B5,
    C6, C6S, D6, D6S, E6, F6, F6S, G6, G6S, A6, A6S, B6,
    C7, C7S, D7, D7S, E7, F7, F7S, G7, G7S, A7, A7S, B7,
    C8,
};

int main(void) {
    FILE *f;
    PI2 = 2 * acos(-1.0);
    const double SAMPLE_FREQ = 44100;
    point_type_t ampl;
    point_type_t max_ampl = UINT16_MAX;
    unsigned int t, i;
    unsigned int samples_per_unit = SAMPLE_FREQ * 0.375;
    unsigned int *ip[] = {
        (unsigned int[]){4, 2, C3, E4},
        (unsigned int[]){4, 2, G3, D4},
        (unsigned int[]){4, 2, A3, C4},
        (unsigned int[]){4, 2, E3, B3},

        (unsigned int[]){4, 2, F3, A3},
        (unsigned int[]){4, 2, C3, G3},
        (unsigned int[]){4, 2, F3, A3},
        (unsigned int[]){4, 2, G3, B3},

        (unsigned int[]){4, 3, C3, G4, E5},
        (unsigned int[]){4, 3, G3, B4, D5},
        (unsigned int[]){4, 2, A3,     C5},
        (unsigned int[]){4, 3, E3, G4, B4},

        (unsigned int[]){4, 3, F3, C4, A4},
        (unsigned int[]){4, 3, C3, G4, G4},
        (unsigned int[]){4, 3, F3, F4, A4},
        (unsigned int[]){4, 3, G3, D4, B4},

        (unsigned int[]){2, 3, C4, E4, C5},
        (unsigned int[]){2, 3, C4, E4, C5},
        (unsigned int[]){2, 3, G3, D4, D5},
        (unsigned int[]){2, 3, G3, D4, B4},

        (unsigned int[]){2, 3, A3, C4, C5},
        (unsigned int[]){2, 3, A3, C4, E5},
        (unsigned int[]){2, 2, E3,     G5},
        (unsigned int[]){2, 2, E3,     G4},

        (unsigned int[]){2, 3, F3, A3, A4},
        (unsigned int[]){2, 3, F3, A3, F4},
        (unsigned int[]){2, 3, C3,     E4},
        (unsigned int[]){2, 3, C3,     G4},

        (unsigned int[]){2, 3, F3, A3, F4},
        (unsigned int[]){2, 3, F3, A3, C5},
        (unsigned int[]){2, 3, G3, B3, B4},
        (unsigned int[]){2, 3, G3, B3, G4},

        (unsigned int[]){2, 3, C4, E4, C5},
        (unsigned int[]){1, 3, C4, E4, E5},
        (unsigned int[]){1, 3, C4, E4, G5},
        (unsigned int[]){1, 2, G3,     G5},
        (unsigned int[]){1, 2, G3,     A5},
        (unsigned int[]){1, 2, G3,     G5},
        (unsigned int[]){1, 2, G3,     F5},

        (unsigned int[]){3, 3, A3, C4, E5},
        (unsigned int[]){1, 3, A3, C4, E5},
        (unsigned int[]){1, 3, E3, G3, E5},
        (unsigned int[]){1, 3, E3, G3, F5},
        (unsigned int[]){1, 3, E3, G3, E5},
        (unsigned int[]){1, 3, E3, G3, D5},
    };
    f = fopen("canon.raw", "wb");
    for (i = 0; i < sizeof(ip) / sizeof(int*); ++i) {
        unsigned int *cur = ip[i];
        unsigned int total = samples_per_unit * cur[0];
        for (t = 0; t < total; ++t) {
            ampl = piano_sum(max_ampl, t, SAMPLE_FREQ, cur[1], &cur[2]);
            write_ampl(f, ampl);
        }
    }
    fclose(f);
    return EXIT_SUCCESS;
}

GitHub upstream.

For YouTube, I prepared it as:

wget -O canon.png https://upload.wikimedia.org/wikipedia/commons/thumb/3/35/The_C_Programming_Language_logo.svg/564px-The_C_Programming_Language_logo.svg.png
ffmpeg -loop 1 -y -i canon.png -i canon.flac -shortest -acodec copy -vcodec vp9 canon.mkv

as explained at: https://superuser.com/questions/700419/how-to-convert-mp3-to-youtube-allowed-video-format/1472572#1472572

Tested on Ubuntu 18.04.

Physics

Audio is encoded as a single number for every moment in time. Compare that to a video, which needs WIDTH * HEIGHT numbers per moment in time.

This number is then converted to the linear displacement of the diaphragm of your speaker:

|   /
|  /
|-/
| | A   I   R
|-\
|  \
|   \
<-> displacement

|     /
|    /
|---/
|   | A I   R
|---\
|    \
|     \
<---> displacement

|       /
|      /
|-----/
|     | A I R
|-----\
|      \
|       \
<-----> displacement

The displacement pushes air backwards and forwards, creating pressure differences, which travel through air as P-waves.

Only displacement matters: a constant signal, even if maximal, produces no sound: the diaphragm just stays at a fixed position.

The sampling frequency determines how fast the displacements should be done.

44,1kHz is a common sampling frequency because humans can hear up to 20kHz and because of the Nyquist–Shannon sampling theorem.

The sampling frequency is analogous to the FPS for video, although it has a much higher value compared to the 25 (cinema) - 144 (hardcore gaming monitors) range we commonly see for video.

Formats

Uncompressed:

  • .raw is an underspecified format that contains just the amplitude bytes, and no metadata.

    We have to pass a few meta-data parameters on the command line like the sampling frequency because the format does not contain that data.

  • .wav is another popular uncompressed format which contain all needed metadata: WAV File Synthesis From Scratch - C

  • MIDI (.mid): https://en.wikipedia.org/wiki/MIDI

    This format represents keystrokes of an instrument. File sizes can be very small as a result, but it can't necessarily represent "arbitrary sounds", more like notes. It is what a basic digital keyboard will output to a computer. You then need a synthesizer software to transform those notes into actual sound amplitudes. Also it is harder to make it portable, and the receiver would need the exact same synthesis software/database.

    Conversion to MP3: https://softwarerecs.stackexchange.com/questions/10915/automatically-turn-midi-files-into-wav-or-mp3/76955#76955

In practice, most people deal exclusively with compressed formats, which make files streaming much smaller. Some of those formats take into account characteristics of the human ear to further compress the audio in a lossy way that people won't notice. The most popular royalty free formats as of 2019 appear to be:

  • lossless: FLAC
  • lossy: Vorbis

Biology

Humans perceive sound mostly by their frequency decomposition (AKA Fourier transform).

I think this is because the inner ear has parts which resonate to different frequencies (TODO confirm).

Therefore, when synthesizing music, we think more in terms of adding up frequencies instead of points in time. This is illustrated in this example.

This leads to thinking in terms of a 1D vector between 20Hz and 20kHz for each point in time.

The mathematical Fourier transform loses the notion of time, so what we do when synthesizing is to take groups of points, and sum up frequencies for that group, and take the Fourier transform there.

Luckily, the Fourier transform is linear, so we can just add up and normalize displacements directly.

The size of each group of points leads to a time - frequency precision tradeoff, mediated by the same mathematics as Heisenberg's uncertainty principle.

Wavelets may be a more precise mathematical description of this intermediary time - frequency description.

Quick ways to generate common tones out of the box

The amazing FFmpeg library covers several of them: Linux sine wave audio generator

sudo apt-get install ffmpeg
ffmpeg -f lavfi -i "sine=frequency=1000:duration=5" out.wav

LMMS

This is so incredibly easy to use, it should be your first try if you just want to generate some MIDI tracks with lots of synthesis options out of the box.

Example: https://askubuntu.com/questions/709673/save-as-midi-when-playing-from-vmpk-qsynth/1298231#1298231

MuseScore

https://github.com/musescore/MuseScore

The best FOSS scorewriter GUI I've seen so far. You can really compose for an orchestra with this.

Csound

https://en.wikipedia.org/wiki/Csound

https://github.com/csound/csound

Program that reads a custom XML format that allows you to create some very funky arbitrary synthesized sounds and tunes.

sudo apt install csound

Here's a really cool and advanced demo: https://github.com/csound/csound/blob/b319c336d31d942af2d279b636339df83dc9f9f9/examples/xanadu.csd rendered at: https://www.youtube.com/watch?v=7fXhVMDCfaA

abcmidi

Nice project that converts MIDI to the ABC notation and vice versa, allowing you to edit a MIDI file in your text editor: https://sound.stackexchange.com/questions/39457/how-to-open-midi-file-in-text-editor/50058#50058

Python pyo

https://github.com/belangeo/pyo

Python sound library.

Got it to work after a bit of frustration: Pyo server.boot() fails with pyolib._core.PyoServerStateException on Ubuntu 14.04

MusicXML

https://en.wikipedia.org/wiki/MusicXML

An attempt to standardize music sheet representation.

I can't find easily how to convert it to an audio format from the command line however... Convert musicxml to wav?

Other high level out-of-box open source synthesizers for Linux

If you are going down this road, you might as well have a look at the big boys to learn about common synthesis techniques:

  • https://www.youtube.com/watch?v=cXCwH9n3M-c Zyn-Fusion + Ardour synthesis music composition tutorial "from scratch" by unfa, big kudos to him
  • how to get ZynAddSubFX running on Ubuntu: https://askubuntu.com/questions/340204/zynaddsubfx-fails-to-initialize-with-error-message-default-i-o-did-not-initiali/1238546#1238546
  • http://linuxsynths.com/ quick review of ALL of them with generated audio samples

The simplest way to represent sound as numbers is PCM (Pulse Code Modulation). This means that the amplitude of the sound is recorded at a set frequency (each amplitude value is called a sample). CD quality sound for example is 16 bit samples (in stereo) at the frequency 44100 Hz.

A sample can be represented as an integer number (usually 8, 12, 16, 24 or 32 bits) or a floating point number (16 bit float or 32 bit double). The number can either be signed or unsigned.

For 16 bit signed samples the value 0 would be in the middle, and -32768 and 32767 would be the maximum amplitues. For 16 bit unsigned samples the value 32768 would be in the middle, and 0 and 65535 would be the maximum amplitudes.

For floating point samples the usual format is that 0 is in the middle, and -1.0 and 1.0 are the maximum amplitudes.

The PCM data can then be compressed, for example using MP3.

Tags:

Audio