Sound Synthesis IV: Next Generation Sound Synthesis

Last time we looked at (and listened to!) various methods of digital sound synthesis, beginning with the very primitive systems used by early computers, and ending with the sample-based methods in widespread use today. This time I’m going to talk about a new and very promising method currently in development.

What’s wrong with sample-based synthesis?

Our glockenspiel test sound already sounded pretty good using the sample-based method… do we really need a more advanced method? The answer is that, although sample-based synthesis works very well for certain instruments under certain conditions, it doesn’t work well all the time.

Although it’s based on samples of real instruments, it’s still not fully realistic. Often the same sample will be used for different notes and different volumes, with the synth altering the frequency and amplitude of the sample as needed. But on a real piano (for example), the notes will all sound subtly different. A high C won’t sound exactly the same as a low C with its frequency increased, and pressing a key hard will result in a very different sound from pressing the same key softly – it won’t just be louder. Some of the better synths will use a larger number of samples in an attempt to capture these nuances, but the amount of data can become unmanageable. And that’s just considering one note at a time. In a real piano, when multiple notes are being played at the same time, the vibrations in all the different strings will influence each other in quite complex ways to create the overall sound.

It gets even worse for string and brass instruments. For example, changing from one note to another on a trumpet can sound totally different depending on how fast the player opens and closes the valves, and a sample-based system is unlikely to reproduce all the possibilities properly without recording an unrealistically large number of samples. In some genres of music, the player may do things with the instrument that were never intended, such as playing it with a valve only part way open. A sample-based system would have no way of dealing with such unforeseen cases – if no-one recorded a sample for that behaviour, it can’t synthesise it.

The other problem with many of the synthesis methods is one of control. Even if it were possible to get them to generate the desired sound, it’s not always very obvious how to do it. FM synthesisers, for example, take a bewildering array of parameters, many of which can seem like “magic numbers” that don’t bear any obvious relation to the sound being generated. To play a note, sound envelopes and frequencies need to be set for every operator, the waveforms can be adjusted, and the overall configuration of the operators also needs to be set. Hardly intuitive stuff for people accustomed to thinking in terms of instruments and notes.

Physical Modelling Synthesis

A newer synthesis method has the potential to solve both the realism problem and the control problem, giving musicians virtual instruments that not only sound more realistic but are much easier to “play” and will correctly handle all situations, even ones that weren’t envisaged when the synth was designed. This is called Physical Modelling Synthesis, and it’s the basis for the project I’m working on just now.

The basic idea is that instead of doing something abstract that just happens to give the result you want (as FM synthesis does), or “cheating” with recordings to give a better sounding result (as sample-based synthesis does), you simulate exactly how a real instrument would behave. This means building a mathematical model of the entire instrument as well as anything else that’s relevant (the surrounding air, for example). Real instruments create sound because they vibrate in a certain audible way when they are played – whether that’s by hitting them, bowing them, plucking their strings, blowing into them, or whatever. Physical modelling synthesis works by calculating exactly how the materials that make up the instrument would vibrate given certain inputs.

How do we model an instrument mathematically? It can get very complex, especially for instruments that are made up of lots of different parts (for example, a piano has hundreds of strings, a sound board, and a box filled with air surrounding them all). But let’s start by looking at something simpler: a metal bar that could be, for example, one note of a glockenspiel.

glockdiagram1

To simulate the behaviour of the bar, we can divide it into pieces called elements. Then for each element we store a number, which will represent the movement of that part of the bar as it vibrates. To begin with, the bar will be still and not vibrating, so all these numbers will be zero:

glockdiagram2

We also need something else in this setup – we need a way to hear what’s going on, otherwise the whole exercise would be a bit pointless. So, we’ll take an output from towards the right hand end of the bar:

glockdiagram3

Think of this like a sort of “virtual microphone” that can be placed anywhere on our instrument model. All it does is take the number from the element it’s placed on – it doesn’t care about any of the other elements at all. At the moment the number (like all the others) is stuck at zero, which means the microphone will be picking up silence. As it should be, because a static, non-moving bar doesn’t make any sound.

Now we need to make the bar vibrate so that it does generate some sound. To do this, we will simulate hitting the bar with a beater near its left hand end:

glockdiagram4

What happens when the beater hits the bar? Essentially, it just makes the bar move slightly. So now, instead of all zeroes in our element numbers, we have a non-zero value in the element that’s just been hit by the beater, to represent this movement:

glockdiagram5

But the movement of the bar won’t stay confined to this little section nearest where the beater hit. Over time, it will propagate along the whole length of the bar, causing it to vibrate at its resonant frequency. After some short length of time, the bar might look like this:

glockdiagram6

and then like this:

glockdiagram7

then this:

glockdiagram8

As you can see, the value from the beater strike has “spread out” along the bar so now the majority of the bar is displaced in one direction or another. The details of how this is done depend on the material and exactly how the bar is modelled, but basically each time the computer updates the bar, the number in each box is calculated based on the previous numbers in all the surrounding boxes. (The values that were in those boxes immediately before the update are the most influential, but for some models numbers from longer ago come into play as well). Sometimes the boxes at the ends of the bar are treated differently from the other boxes – in fact, they are different, because unlike the boxes in the middle they only have a neighbouring box on one side of them, not both. There are various different ways of treating the edge boxes, and these are referred to as the model’s boundary conditions. They can get quite complex so I won’t say more about them here.

Above I said “some short length of time”, but that’s quite vague. We actually want to wait a very specific length of time, called the timestep, between updates to the bar. The timestep is generally chosen to match the sampling rate of the audio being output, so that the microphone can just pick up one value each time the bar is updated and output it. So, for a CD quality sample rate of 44100Hz, a timestep lasts 1/44100th of a second, or 0.0000226757 seconds.

If the model is working properly, the result of all this will be that the bar vibrates at its resonant frequency – just like the bar of a real glockenspiel. Every timestep, the “microphone” will pick up a value, and when this sequence of values is played back through speakers, it should sound like a metal bar being hit by a beater.
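As a rough sketch of the whole loop described above (not the project's actual code), the Python below divides a bar into a row of elements, "hits" one of them, updates every element once per timestep, and reads one value per timestep from a virtual microphone. The update rule is an assumption: it is the simplest linear rule of this kind, the standard finite-difference form of the 1D wave equation, which really behaves more like a string than a stiff metal bar. The names `simulate_row`, `strike_index`, `mic_index` and `courant` are made up for illustration, and holding the two end elements at zero is just one possible boundary condition.

```python
SAMPLE_RATE = 44100
TIMESTEP = 1.0 / SAMPLE_RATE   # seconds between updates, one audio sample each


def simulate_row(num_elements=20, strike_index=2, mic_index=16,
                 num_steps=200, courant=0.9):
    """Update a 1D row of elements and return what the 'microphone' reads.

    courant (hypothetical parameter) sets how strongly neighbours influence
    each other per timestep; this rule is only stable for values up to 1.
    """
    previous = [0.0] * num_elements   # values one timestep ago
    current = [0.0] * num_elements    # values now: a still, silent bar
    current[strike_index] = 1.0       # the beater displaces a single element
    c2 = courant * courant
    output = []
    for _ in range(num_steps):
        output.append(current[mic_index])   # one sample per timestep
        nxt = [0.0] * num_elements          # end elements stay at zero
        for i in range(1, num_elements - 1):
            # New value depends on this element's last two values
            # and on its two neighbours' most recent values.
            nxt[i] = (2.0 * current[i] - previous[i]
                      + c2 * (current[i - 1] - 2.0 * current[i] + current[i + 1]))
        previous, current = current, nxt
    return output


samples = simulate_row()
print(f"timestep = {TIMESTEP:.10f} s")
print("microphone values around the arrival of the disturbance:",
      [round(v, 3) for v in samples[12:18]])
```

With these settings the microphone reads exact zeros for the first few timesteps, because the disturbance can only travel one element per update and needs time to reach the microphone's element.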

Here are the first 21 values picked up by the microphone: 0, 0, 0.022, -0.174, -0.260, 0.111, 0.255, 0.123, 0.426, 0.705, 0.495, 0.342, 0.293, 0.116, 0.016, 0.009, 0.033, -0.033, -0.312, -0.321, -0.030

and here’s a graph showing the wave produced by them:

pmgraph

To simulate a whole glockenspiel, we can model several of these bars, each one a slightly different length so as to produce a different note, and take audio outputs from all of them. Then if we hit them with our virtual beater at the right times, we can hear our test sample, this time generated by physical modelling synthesis:

pmsynth
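The bar lengths above were found by trial and error, but a starting guess can be calculated. The sketch below assumes an idealised uniform bar vibrating in bending, for which the pitch is proportional to 1 / length², so doubling the frequency means dividing the length by √2. That scaling is an assumption about the model: a simpler string-like model would instead scale as 1 / length. The reference bar (0.10 m sounding A4) and the function names are hypothetical.

```python
import math


def note_frequency(semitones_from_a4):
    """Equal-temperament frequency, taking A4 as 440 Hz."""
    return 440.0 * 2.0 ** (semitones_from_a4 / 12.0)


def bar_length_for(frequency_hz, ref_length_m, ref_frequency_hz):
    """Length giving frequency_hz, assuming frequency is proportional to 1 / length**2."""
    return ref_length_m * math.sqrt(ref_frequency_hz / frequency_hz)


# Hypothetical reference: a 0.10 m bar that sounds A4 (440 Hz).
for name, semitones in [("A4", 0), ("B4", 2), ("C#5", 4), ("A5", 12)]:
    freq = note_frequency(semitones)
    length = bar_length_for(freq, 0.10, 440.0)
    print(f"{name}: {freq:7.2f} Hz -> {length * 1000:.1f} mm")
```

Lengths found this way would still need fine tuning by ear, since a simulated bar on a coarse grid will not match the idealised formula exactly.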

I used a very primitive version of physical modelling synthesis to generate this sample, so it doesn’t sound amazing. I also used a bit of trial and error tweaking to get the bar lengths I wanted, so the tuning isn’t perfect. Both the project and my knowledge of this type of synthesis are still in fairly early stages just now! In the next section I’ll talk about what we can do to improve the accuracy of the models, and therefore also the quality of the sound produced.

Accuracy and model complexity

In our project we are mainly going for quality rather than speed. We want to try and generate the best quality of sound that we can from these models; if it takes a minute (or even an hour) of computer time to generate a second of audio, we don’t see that as a huge problem. But obviously we’d like things to run as fast as possible, and if it’s taking days or weeks to generate short audio samples, that is a problem. So I’ll say a bit about how we’re trying to improve the quality of the models, as well as how we hope to keep the compute time from becoming unmanageable.

A long thin metal bar is one of the simplest things to model and we can get away with using a one-dimensional row of elements (as demonstrated above) for this. But for other instruments (or parts of instruments), more complex models may be required. To model a cymbal, for example, we will need a two-dimensional grid of elements spaced across the surface of the cymbal. And for something big and complicated like a whole piano, we would most likely need individual 1D models for each string, a 2D model for the sound board, and a 3D model for the air surrounding everything, all connected and interacting with each other in order to get an accurate synthesis. In fact, any instrument model can generally be improved by embedding it in a 3D space model, so that it is affected by the acoustics of the room it is in.
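To give a feel for what moving from a 1D row to a 2D grid involves, here is a minimal sketch of one update of a square grid. It assumes the simplest linear rule, the finite-difference 2D wave equation, which behaves like a stretched membrane rather than a stiff cymbal, so it illustrates the data layout and not a realistic cymbal model. `step_grid` and `courant` are hypothetical names, and the edge elements are simply held at zero.

```python
def step_grid(previous, current, courant=0.5):
    """Return the grid one timestep later; edge elements are held at zero.

    This rule is only stable for courant values up to about 0.707 (1/sqrt(2)).
    """
    rows, cols = len(current), len(current[0])
    c2 = courant * courant
    nxt = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Each element now has four neighbours instead of two.
            neighbours = (current[r - 1][c] + current[r + 1][c]
                          + current[r][c - 1] + current[r][c + 1])
            nxt[r][c] = (2.0 * current[r][c] - previous[r][c]
                         + c2 * (neighbours - 4.0 * current[r][c]))
    return nxt


size = 9
previous = [[0.0] * size for _ in range(size)]
current = [[0.0] * size for _ in range(size)]
current[4][4] = 1.0                     # strike the centre element
for _ in range(3):
    previous, current = current, step_grid(previous, current)
for row in current:
    print(" ".join(f"{v:6.3f}" for v in row))
```

The cost per update grows with the number of elements, which is why a finely spaced 2D or 3D grid needs so many more calculations than a single row.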

There are also different ways of updating the model’s elements each timestep. Simple linear models are very easy and fast to compute and are sufficient for many purposes (for example, modelling the vibration of air in a room). Non-linear models are much more complicated to update and need more compute time, but may be necessary in order to get accurate sound from gongs, brass instruments, and others.

Inputs (for example, striking, bowing, blowing the model instruments) and how they are modelled can have an effect as well. The simplest way to model a strike is to add a number to one of the elements of the model for just a single timestep as shown in the example above, but it’s more realistic to add a force that gradually increases and then gradually decreases again across several timesteps. Bowing and blowing are more complicated. With most of these there is some kind of trade-off between the accuracy of the input and the amount of computational resources needed to model it.
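One way to get a force that "gradually increases and then gradually decreases" is a raised-cosine pulse, sketched below. The pulse shape is an assumption (the text does not say which shape the project uses), and the function name, the 44-timestep duration (about one millisecond at 44100Hz) and the 10N peak are illustrative values only.

```python
import math


def strike_force(step, start_step, duration_steps, peak_force):
    """Force at a given timestep: zero outside the strike, a smooth hump inside it."""
    if step < start_step or step >= start_step + duration_steps:
        return 0.0
    phase = (step - start_step) / duration_steps          # runs from 0 up to 1
    return 0.5 * peak_force * (1.0 - math.cos(2.0 * math.pi * phase))


# A strike starting at timestep 0 and lasting 44 timesteps, peaking at 10 N.
for step in (0, 11, 22, 33, 44):
    print(step, round(strike_force(step, 0, 44, 10.0), 3))
```

In a simulation this force would be added to the struck element on every timestep of the strike, instead of writing a single number into it once.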

2D models and especially 3D models can consume a lot of memory and take a huge number of calculations to update. For CD quality audio, quite a finely spaced grid is required and even a moderately sized 3D room model can easily max out the memory available on most current computers. Accurately modelling the acoustics of a larger room, such as a concert hall, using this method is currently not realistic due to lack of memory, but should become feasible within a few years.
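A back-of-envelope estimate shows why memory runs out so quickly. The sketch below assumes the simplest 3D scheme (each point updated from its six nearest neighbours), for which the grid spacing can be no smaller than √3 × speed of sound × timestep if the simulation is to stay stable. It also assumes 344 m/s for the speed of sound, two stored time levels, and 8-byte values. The room sizes and the function name are hypothetical.

```python
import math


def room_grid_memory(width_m, depth_m, height_m, sample_rate_hz=44100,
                     speed_of_sound=344.0, bytes_per_value=8, time_levels=2):
    """Return (grid spacing in metres, number of points, bytes needed)."""
    timestep = 1.0 / sample_rate_hz
    # Smallest spacing the simple scheme allows at this timestep.
    spacing = math.sqrt(3.0) * speed_of_sound * timestep
    points = (width_m / spacing) * (depth_m / spacing) * (height_m / spacing)
    return spacing, points, points * bytes_per_value * time_levels


for label, dims in [("living room", (5.0, 4.0, 3.0)),
                    ("rehearsal room", (10.0, 8.0, 4.0)),
                    ("concert hall", (40.0, 25.0, 15.0))]:
    spacing, points, nbytes = room_grid_memory(*dims)
    print(f"{label}: spacing {spacing * 1000:.1f} mm, "
          f"{points:.2e} points, {nbytes / 1e9:.1f} GB")
```

Because the number of points grows with the volume of the room divided by the cube of the spacing, doubling every dimension of the room multiplies the memory needed by eight.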

The number of calculations required to update large models is also a challenge, but not an insurmountable one. Especially for the 3D acoustic models, the largest ones, we usually want to do the same (or very similar) calculations again and again and again on a massive number of points. Fortunately, there is a type of computer hardware that is very good at doing exactly this: the GPU.

GPU stands for graphics processing unit, and these processors were indeed originally designed for generating graphics, where the same relatively simple calculations need to be applied to every polygon or every pixel on the screen many, many times. In the last few years there has been a lot of interest in using GPUs for other sorts of calculations, for example scientific simulations, and now many of the world’s most powerful supercomputers contain GPUs. They are ideal for much of the processing in our synthesis project where the simple calculations being applied to every point in a 3D room model closely parallel the calculations being applied to every pixel on the screen when rendering an image.
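The property that makes these models suit a GPU is that each point's new value depends only on values from previous timesteps, never on other points' new values. The sketch below is plain Python running on the CPU, not GPU code, and uses an assumed simple linear 1D rule with hypothetical function names. It shows that the points can be updated in any order with the same result, which is what allows thousands of them to be handled at once.

```python
import random


def update_point(previous, current, i, courant_squared=0.81):
    """New value for element i, computed only from the previous two timesteps."""
    if i == 0 or i == len(current) - 1:
        return 0.0   # end elements held at zero
    return (2.0 * current[i] - previous[i]
            + courant_squared * (current[i - 1] - 2.0 * current[i] + current[i + 1]))


def step_in_order(previous, current, order):
    """Apply update_point to every element, visiting them in the given order."""
    nxt = [0.0] * len(current)
    for i in order:
        nxt[i] = update_point(previous, current, i)
    return nxt


previous = [0.0] * 16
current = [0.0] * 16
current[3] = 1.0

forward = step_in_order(previous, current, range(16))
shuffled_order = list(range(16))
random.shuffle(shuffled_order)
shuffled = step_in_order(previous, current, shuffled_order)
print(forward == shuffled)   # True, whatever order the points are visited in
```

On a GPU, something like `update_point` would run as one small program instance per grid point, all reading the old grids and each writing its own entry in the new one.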

Advantages of Physical Modelling Synthesis

You might wonder, when sample-based synthesis is getting so good and is so much easier to perform, why bother with physical modelling synthesis? There are three main reasons:

  • Sound quality. With a good enough model, physical modelling synthesis can theoretically sound just as good as a real instrument. Even with simpler models, certain instrument types (e.g. brass) can sound a lot better than sample-based synthesis.
  • Flexibility. If you want to do something more unusual, for example hitting the strings of a violin with the wooden side of the bow instead of bowing them with the hair, or playing a wind instrument with the valves half-open, you are probably going to be out of luck with a sample-based synthesiser. Unless whoever designed the synthesiser foresaw exactly what you want and included samples of it, there will be no way to do it. But a physical modelling synthesiser can handle it – you can use the same instrument model and just modify the inputs however you want.
  • Ease of control. I mentioned at the beginning that older types of synthesiser can be hard to control – although they may theoretically be able to generate the sound you want, it might not be at all obvious how to get them to do it, because the input parameters don’t bear much obvious relation to things in the “real world”. FM is particularly bad for this – to play a note you might have to do something like: “Set the frequency of operator 1 to 1000Hz, set its waveform type to full sine wave, set its attack rate to 32, its decay rate to 18, its sustain level to 5 and its release rate to 4. Now set operator 2’s frequency to 200Hz, its attack rate to 50, decay rate 2, sustain level 14, release rate 3. Now chain the operators together so that 2 is modulating 1”. (In reality the quoted text would be some kind of programming language rather than English, but you get the idea). Your only options for getting the sound you want are likely to be trial and error, or using a library of existing sounds that someone else came up with by trial and error.

Contrast this with how you might play a note on a physical modelling synthesiser: “Hit the left hand bar of my glockenspiel model with the virtual beater 10mm from its front end, with a force of 10N”. Much better, isn’t it? You might still use a bit of trial and error to find the optimum location and force for the hit, but the model’s input parameters are a lot closer to things we understand from the real world, so it will be a lot less like groping around in the dark. This is because we are trying to model the real world as accurately as possible, unlike FM and sample-based synthesisers which are abstract systems attempting to generate sound as simply as possible.

Here’s a link to the Next Generation Sound Synthesis project website. The project’s been running for a year and has four years still to go. We’re investigating several different areas, including how to make good quality mathematical models for various types of instruments, how to get them to run as fast as possible, and also how to make them effective and easy to use for musicians.

Of course, whatever happens I doubt we will be able to synthesise the bassoon ;).

Sound Synthesis III: Early Synthesis Methods

Digital Sound Synthesis

Before I delve into describing different types of synthesis, I should start with a disclaimer: I’m coming at this mainly from the angle of how old computers (and video game systems) used to synthesise sound rather than talking about music synthesisers, because that’s where most of my knowledge is. Although I have owned various keyboards, I don’t have a deep knowledge of exactly how they work as I’m more of a pianist than a keyboard player really. There is quite a bit of overlap between methods used in computers and methods used in musical instruments though, especially more recently.

To illustrate the different synthesis methods, I’m going to be using the same example sound over and over again, synthesised in different ways. It’s the glockenspiel part from the opening of Sonic Triangle’s sort-of Christmas song “It Could Be Different”. For comparison to the synthesised versions, here it is played (not particularly well, but you should get the idea!) on a real glockenspiel:

glockenspiel

(In fact, in the original recording of the song, it isn’t a real glockenspiel. It’s the sample-based synthesis of my Casio keyboard… there’ll be more about that sort of synthesis later).

If you have trouble hearing the sounds in this post, try right clicking the links, saving them to your hard drive and opening them from there. Seriously, I can’t believe that in 2013 there still isn’t an easy way of putting sounds on web pages that works on all major browsers. Grrrr!

Primitive Methods

As we saw last time, digital sound recordings (which include CDs, DVDs, and any music files on a computer) are just very long lists of numbers that were created by feeding a sound wave into an analogue-to-digital converter. To play them back, we feed the numbers into a digital-to-analogue converter and then play back the resulting sound using a loudspeaker. But what if, instead of using a list of numbers that was previously recorded, we used a computer program to generate a list of numbers and then played them back in the same way? This is the basis of digital sound synthesis – creating entirely new sounds that never existed in reality.
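As a small illustration of "a program generating a list of numbers", the sketch below builds a list of values for a plain 440Hz tone and packs it into the WAV format using Python's standard `wave` module. The 16-bit mono format, the tone itself and the function name are choices made for this example, not anything specified above.

```python
import io
import math
import struct
import wave


def numbers_to_wav_bytes(values, sample_rate=44100):
    """Pack a list of numbers in the range -1..1 as a 16-bit mono WAV file (as bytes)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)          # mono
        wav.setsampwidth(2)          # 2 bytes = 16 bits per number
        wav.setframerate(sample_rate)
        frames = b"".join(
            # Clamp to -1..1, then scale to the 16-bit integer range.
            struct.pack("<h", int(max(-1.0, min(1.0, v)) * 32767))
            for v in values)
        wav.writeframes(frames)
    return buffer.getvalue()


# Half a second of a 440 Hz sine tone: 22050 numbers that were never recorded.
tone = [0.5 * math.sin(2.0 * math.pi * 440.0 * n / 44100) for n in range(22050)]
wav_bytes = numbers_to_wav_bytes(tone)
print(len(tone), "numbers ->", len(wav_bytes), "bytes of WAV data")
```

Writing `wav_bytes` to a file opened in binary mode would give something any audio player can feed to the digital-to-analogue converter.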

Very old (1980s) home computers and games consoles tended to only be able to generate very primitive, “beepy” sounding music. This was because they were generating basic sound wave shapes that aren’t like anything you’d get from a real musical instrument. The simplest of all, used by a lot of early computers, is a square wave:

synth3_1

square wave sound

Another option is the triangle wave, with a slightly softer sound:

synth3_2

triangle wave sound

The sound could be improved by giving each note a “shape” (known as its envelope), so that a glockenspiel sound, for example, would start loud and then die away, like a real glockenspiel does:

synth3_3

triangle wave with envelope sound

None of these methods sound particularly nice, and it’s hard to imagine any musician using them now unless they were deliberately going for a retro electronic sort of effect. But they have the advantage of being very easy to synthesise, requiring only a simple electronic circuit or a few lines of program code. (I wrote a program to generate the sound samples in this section from scratch in about half an hour). The square wave, for example, only has two possible levels, so all the computer has to do is keep track of how long to go before switching to the other level. The length of time spent on each level determines the pitch of the sound produced, and the difference in height between the levels determines the volume.
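The square-wave bookkeeping described above really does fit in a few lines. This is a minimal sketch rather than the program used for the samples in this post: it counts samples, flips between two levels when the time is up, and then applies a simple decaying envelope. The function names and the exponential shape of the envelope are assumptions.

```python
import math


def square_wave(frequency_hz, duration_s, volume=0.5, sample_rate=44100):
    """Square wave made by counting samples and flipping between two levels."""
    half_period = sample_rate / (2.0 * frequency_hz)   # samples spent on each level
    level = volume                                     # level height sets the volume
    next_flip = half_period
    samples = []
    for n in range(int(duration_s * sample_rate)):
        if n >= next_flip:          # time to switch to the other level
            level = -level
            next_flip += half_period
        samples.append(level)
    return samples


def apply_decay(samples, decay_s=0.3, sample_rate=44100):
    """Simple envelope: start at full volume and die away exponentially."""
    return [s * math.exp(-n / (decay_s * sample_rate))
            for n, s in enumerate(samples)]


note = apply_decay(square_wave(880.0, 1.0))
print(len(note), "samples; first:", note[0], "last:", round(note[-1], 4))
```

A shorter `half_period` gives a higher pitch, and a larger `volume` gives a bigger difference between the two levels and so a louder sound.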

FM Synthesis

I remember being very excited when we upgraded from our old ZX Spectrum +3, which could only do square wave synthesis, to a PC and a Sega Megadrive that were capable of FM (Frequency Modulation) Synthesis. They could actually produce the sounds of different instruments! Looking back now, they didn’t sound very much like the instruments they were supposed to, but it was still a big improvement on square waves.

FM synthesis involves combining two (or sometimes more) waves together to produce a single, more complex wave. The waves are generally sine waves and the combination process is called frequency modulation – it means the frequency of one wave (the “carrier”) is altered over time in a way that depends on the other wave (the “modulator”) to produce the final sound wave. So, at low points on the modulator wave, the carrier wave’s peaks will be spread out with a longer distance between them, while at the high points of the modulator they will be bunched up closer together, like this:

synth3_4

Some FM synthesisers can combine more than two waves together in various ways to give a richer range of possible sounds.
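The two-wave case can be sketched with the textbook formula sin(2π·f_c·t + I·sin(2π·f_m·t)), where f_c is the carrier frequency, f_m the modulator frequency and I the modulation index (how strongly the modulator bends the carrier). Strictly this nudges the carrier's phase, which for a sine-wave modulator comes to the same thing as varying its frequency. It is only an illustration of the principle, not a model of any particular chip, and the function name and example values are hypothetical.

```python
import math


def fm_tone(carrier_hz, modulator_hz, mod_index, duration_s, sample_rate=44100):
    """Two-operator FM: a sine carrier whose phase is pushed around by a sine modulator."""
    samples = []
    for n in range(int(duration_s * sample_rate)):
        t = n / sample_rate
        modulator = math.sin(2.0 * math.pi * modulator_hz * t)
        samples.append(math.sin(2.0 * math.pi * carrier_hz * t
                                + mod_index * modulator))
    return samples


plain = fm_tone(440.0, 110.0, 0.0, 1.0)    # index 0: just the carrier sine wave
bright = fm_tone(440.0, 110.0, 5.0, 1.0)   # larger index: a more complex wave
print(len(plain), len(bright))
```

Changing `mod_index` and the ratio between the two frequencies changes the character of the sound, which is where the trial and error in programming FM sounds comes from.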

Here’s our glockenspiel snippet synthesised in FM:

fm sound

(In case you’re curious, this was done using DOSBox, which emulates the Yamaha OPL-2 FM synthesiser chip used in the old Adlib and SoundBlaster sound cards common in DOS PCs, and the Allegro MIDI player example program. Describing how to get an ancient version of Allegro up and running on a modern computer would make a whole blog post in itself, but probably not a very interesting one).

It’s certainly a step up from the square wave and triangle wave versions. But it still sounds unnatural; you would be unlikely to mistake it for a real glockenspiel.

FM synthesis is a lot more complicated to perform than the older primitive methods, but by the 90s FM synthesiser chips were cheap enough to put in games consoles and add-in sound cards for PCs. Contrary to popular belief, they are not analogue (or hybrid analogue-digital) synths; they are fully digital devices apart from the final conversion to analogue at the end of the process.

In case you were wondering, this is pretty much the same “frequency modulation” process that is used in FM radio. The main difference between the two is that in FM radio, you have a modulator wave that is an audio signal, but the carrier wave is a very high frequency radio wave (up in the megahertz, millions-of-hertz range). In FM synthesis, both the carrier and modulator are audio frequency waves.

Sample-based Synthesis

Today, when you hear decent synthesised sound coming from a computer or a music keyboard, it’s very likely to be using sample-based methods. (This is often referred to as “wavetable synthesis”, but strictly speaking this term refers to only a quite specific subset of the sample-based methods). Sample-based synthesis is not really true synthesis in the same way that the other methods I’ve talked about are – it’s more a clever mixture of recording and synthesis.

Sample-based synthesis works by using short recordings of real instruments and manipulating and combining them to generate the final sound. For example, it might contain a recording of someone playing middle C on a grand piano. When it needs to play back a middle C, it can play back the recording unchanged. If it needs the note below, it will “stretch out” the sample slightly to increase its wavelength and lower its frequency. Similarly, for the note above it can “compress” the sample so that its frequency increases. It can also adjust the volume if the desired note is louder or quieter than the original recording. If a chord needs to be played, several instances of the sample can be played back simultaneously, adjusted to different pitches.
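The "stretching" and "compressing" can be sketched as reading through the recording at a different speed and interpolating between stored values. The code below assumes equal temperament (each semitone is a frequency ratio of 2^(1/12)) and simple linear interpolation, and the function names are hypothetical. Real synths use more careful interpolation and loop the sample to sustain long notes.

```python
def repitch(sample, semitones):
    """Resample a recording so it plays 'semitones' higher (lower if negative)."""
    ratio = 2.0 ** (semitones / 12.0)   # read speed: >1 compresses, <1 stretches
    out = []
    position = 0.0
    while position < len(sample) - 1:
        i = int(position)
        frac = position - i
        # Linear interpolation between the two nearest stored values.
        out.append(sample[i] * (1.0 - frac) + sample[i + 1] * frac)
        position += ratio
    return out


def mix(*voices):
    """Play several notes at once by adding them together sample by sample."""
    length = max(len(v) for v in voices)
    return [sum(v[n] for v in voices if n < len(v)) for n in range(length)]


recording = [float(n) for n in range(100)]   # stand-in for a recorded middle C
octave_up = repitch(recording, 12)           # about half as long, twice the frequency
octave_down = repitch(recording, -12)        # about twice as long, half the frequency
chord = mix(recording, repitch(recording, 4), repitch(recording, 7))
print(len(recording), len(octave_up), len(octave_down), len(chord))
```

Note that shifting the pitch this way also changes the length of the note, which is one reason a heavily shifted sample stops sounding like the original instrument.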

This synthesis method is not too computationally intensive; sound cards capable of sample-based synthesis (such as the Gravis Ultrasound and the SoundBlaster AWE 32/64) became affordable in the mid 90s and today’s computers can easily do it in software. Windows, for example, has a built-in sample-based synthesiser that is used to play back MIDI sound if there isn’t a hardware synth connected. Sound quality can be very good for some instruments – it is typically very good for percussion instruments, reasonable for ensemble sounds (like a whole string section or a choir), and not so good for solo string and wind instruments. The quality also depends on how good the samples themselves are and how intelligent the synth is at combining them.

Here’s the glockenspiel phrase played on a sample-based synth (namely my Casio keyboard):

sample based

This is a big step up from the other synths – this time we have something that might even be mistaken for a real glockenspiel! But it’s not perfect… if you listen carefully, you’ll notice that all of the notes sound suspiciously similar to each other, unlike the real glockenspiel recording where they are noticeably different.

Next time I’ll talk about the limitations of the methods I’ve described in this post, and what can be done about them.