From argh to articulation

Human speech is an incredibly complex form of communication, but we are finally cracking the code and learning how it evolved.

Illustration by Kallum Best

Have you ever stopped mid-sentence and considered how truly amazing the act of speaking is? We take meaningless sounds uttered by blowing hot air, and shape them into oohs, ahhs and sharp consonants. Then we string these arbitrary sounds together to form meanings that are perceived and deciphered by others.

Human speech is the most complex form of auditory communication in the animal kingdom. Only a few other species, such as whales and songbirds, even come close. So how is it that we have evolved this complex skill when other animals tend to rely on sight, smell and simple sounds to convey their messages? To understand the evolution of human speech, we have to go back to our beginnings and back to basics.

There is significant debate as to when modern human speech actually evolved, with most estimates suggesting 100,000 to 150,000 years ago. That figure could, however, be much greater: a 2013 study estimated the origin to be as old as 1.75 million years.

There are a few reasons why these estimates are so varied. First and foremost, the organs involved in speech production don’t fossilise, meaning they leave no physical trace for archaeologists to dig up. Secondly, though we can study genes and anatomical features that enable speech, being physically able to speak doesn't necessarily mean that our ancestors were actually doing it.

The evolution of human speech is a long story, spanning millions of years of incremental adaptations and changes. Like a set of bagpipes, with a compressible air sack and pipes leading to versatile chanters, the human body has many separate components that have come together to enable sound production.

Even the simplest conversations are only possible thanks to millions of years of evolution. Anna Vander Stel/Unsplash (CC0)

Let’s start at the base of the human lungs, with the diaphragm. In most vertebrates capable of emitting sound, the underlying mechanism of sound production is the contraction of the lungs and the expulsion of air through the trachea, over the larynx and pharynx. Although the larynx is an ancestral structure present in most animals, only mammals and a few lizards actually possess vocal cords in the larynx. This is why many reptiles do not communicate vocally.

In non-mammalian animals that do vocalise, convergent evolution has produced sound organs similar to vocal cords, such as the avian syrinx or ridges in the trachea of frogs. This process of converting breath into audible sounds, no matter the method, is called phonation. In mammals, the vocal cords rapidly open and close like valves, converting the energy of the airflow from the lungs into sound.

Charles Darwin was one of the first people to note the strange arrangement of the human vocal tract. The horizontal section of the tract, the oral cavity, leads into the throat at roughly a 90-degree angle. This increases your likelihood of choking, which isn't exactly what one would consider a beneficial trait.

But there's also an important upside to this design. The two sections of the vocal tract are of roughly equal lengths, and this, coupled with the immensely versatile movement of the tongue, means that we can shape our vocal tract into a wide variety of diameters. Like the various pipes on a church organ, this shaping allows us to produce different sound frequencies, forming the basis of all the most common vowel sounds in every spoken language.
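
The organ-pipe comparison can be made concrete with a standard textbook simplification that is not part of the studies described here: treat the vocal tract as a uniform tube, closed at the vocal cords and open at the lips. The sketch below computes the resonant frequencies of such a tube. The function name, the tract lengths and the speed of sound (350 m/s) are illustrative assumptions, and the model ignores the changes in diameter along the tract that the tongue produces and that distinguish real vowels.

```python
def tube_resonances(length_m, speed_of_sound=350.0, count=3):
    """Resonant frequencies (Hz) of a uniform tube closed at one end.

    Quarter-wavelength model: F_n = (2n - 1) * c / (4 * L).
    All default values are illustrative assumptions, not measurements.
    """
    return [
        (2 * n - 1) * speed_of_sound / (4 * length_m)
        for n in range(1, count + 1)
    ]


if __name__ == "__main__":
    # Two hypothetical tract lengths, in metres: a shorter tube
    # resonates at higher frequencies, like a shorter organ pipe.
    for length in (0.175, 0.140):
        rounded = [round(f) for f in tube_resonances(length)]
        print(f"{length:.3f} m tube: {rounded} Hz")
```

Even in this crude model the resonances depend strongly on the tube's dimensions. In real speech the tongue and lips go further, narrowing and widening different parts of the tract to move these resonances independently of one another.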

In contrast to humans, other primates have vocal tracts with a much longer horizontal component and a smaller, less mobile tongue. This means that they can’t form the different diameters needed for many basic vowel sounds, and are limited to the “oohs” and “aahs” that we would emit when monkeying around as children.

Non-human primates are less able to produce complex sounds. T.W. Wood/Wikimedia Commons (public domain)

Moving onwards and upwards from the lungs and the larynx, we reach the collection of eight separate muscles that form the tongue. In most vertebrates, the tongue is long, flat, and contained mainly within the oral cavity. By comparison, the human tongue is incredibly dexterous and extends down our throat with both an oral and a pharyngeal section.

One of the most distinctive features of the human tongue is our ability to use it for multiple functions simultaneously. Although apes and other non-human primates have tongues with a similar structure to ours, they also have a neural 'off switch' that prevents them from vocalising and eating at the same time. Humans have no such limitation.

From tongue to cheek, we finally arrive at the last part of the mouth important for speech production: the lips. The immense pliability and tactility of our lips allow us to do everything from smiling to kissing, but they weren’t always so supple.

One of the basic features that distinguishes the different lineages of primates is the presence of the rhinarium, or 'wet nose'; this is mainly seen in more primitive, and generally nocturnal, species such as lemurs and lorises. More 'modern' primates have lost the wet nose and their reliance on olfaction, in favour of exemplary vision more suited for diurnal activity.

As the wet nose and snout receded, the facial muscles and particularly the lips could be employed for things such as facial expressions and sound manipulation. Both of these were very useful to our ancestors, who were moving around in groups in daylight as opposed to creeping around alone at night.

Our lips are very moist and tactile, which helps us produce speech. Anna Sastre/Unsplash (CC0)

Although the word “language” derives from the Latin lingua, meaning tongue, we are not quite at the end of the list of specialised physical structures needed for human speech production. Situated above the tongue, mouth and lips is, of course, our most complex organ: the brain.

Two main regions of the brain are considered particularly important for language. Broca’s area is located in the frontal lobe, and Wernicke’s area is located in the superior temporal gyrus, where the temporal and parietal lobes meet. Both regions are intimately connected by a bundle of nerves called the arcuate fasciculus.

Broca’s area is generally considered vital for speech production, as damage to this region causes a particular type of aphasia wherein the afflicted can only produce short and simple sentences mainly consisting of nouns and verbs. Words such as “the” and “to”, and grammatical endings such as “-ing” and “-ed”, are often omitted. This results in sentences such as “Dog look cat”, as opposed to “The dog is looking at the cat”.

Wernicke’s area, on the other hand, seems to be much more involved in speech comprehension, the interpretation of the sounds of human speech. A person exhibiting Wernicke’s aphasia will string together perfectly fluent and often grammatically correct sentences that are nonetheless nonsensical. For example, although the sentence “The moon marmalade dissolved the octopus cat” is technically well formed, it doesn’t actually make much sense. This lack of meaning also applies in reverse, as people with this aphasia often have a hard time understanding the speech of others.

The journey we have taken so far has followed our breath from our lungs through our trachea, past vibrating vocal cords, and around our tongue and lips. Though it has all been controlled by various brain functions, both automatic and deliberate, there is one final part to this story. The real key to speech, our most basic component, is our DNA.

While much of the information described so far has been known for decades or even centuries, research into the genes responsible for speech has only just begun. One of the most exciting and illuminating findings so far was the discovery of the strangely named Forkhead box protein P2 (or FOXP2).

This protein, coded by the gene FOXP2 on chromosome 7, was discovered after studying an English family with a particular speech impediment. About half of the members of the family, known only as KE, exhibited dyspraxia, a condition where the brain has problems coordinating the various parts of the mouth and throat involved in speech production. This results in issues with pronunciation, where sounds and phonemes are often interchanged and the resulting words are unintelligible.

Dyspraxia had been in the KE family for over three generations, giving researchers the opportunity to explore its hereditary causes. In a 2001 study, Cecilia Lai and colleagues presented evidence that FOXP2 is vital for the development of speech and language. Not only that, but the gene is also necessary for healthy lung and brain development, and is expressed in many parts of the brain, including the basal ganglia and frontal cortex.

The structure of the FOXP2 protein, which is important for the development of speech. Emw/Wikimedia Commons (CC BY-SA 3.0) 

The FOXP2 gene seems to work by allowing us to transform experiences into automatic associations. This is vital for anyone learning a language. Take, for example, an infant presented with a picture of a dog. As their parents are showing them the picture and saying the word “dog”, the FOXP2 gene enables the neural encoding of the word into an automatic association between dog-like things and the word they just heard.

Although true language is the exclusive domain of humans, a different form of the FOXP2 gene is also found in other vertebrates. Studies have begun looking into the effects of the gene in different animals. In mice, experimentally knocking out the FOXP2 gene led to abnormal brain formation, severe motor impairment, and loss of vocalisations. Conversely, artificial insertion of the human FOXP2 gene resulted in mice that could produce more complex and frequent alarm calls. Additionally, these 'humanised' mice were better than normal mice at running a maze, after they were allowed to commit the maze to routine memory.

In songbirds, the gene likely has effects very similar to those it has in humans. Knocking out FOXP2 in zebra finches, which rely heavily on complex songs for communication, results in faulty songs that are incomplete or inaccurately copied from their original models. This implies that in birds, as in humans, the FOXP2 gene is involved in vocal learning. As birds are one of the best model systems available in the animal kingdom for studying vocal learning, these results may be seminal for further studies looking into the evolution of general vocal communication.

The FOXP2 gene also affects vocal learning in songbirds, such as this zebra finch. Maurice van Bruggen/Wikimedia Commons (CC BY-SA 3.0) 

Over millions of years, the accumulation of these physiological and genetic features allowed humans to start speaking. But exactly why these changes occurred is still a mystery.

Animals use a variety of different modes of communication, ranging from tactile and olfactory through to visual and auditory. Humans have, however, lost much of our reliance on senses other than sight and sound ‒ and when communicating anything even remotely complex, we tend to rely on sounds rather than visual signals. So why have we evolved to rely so heavily on vocalisations as our main mode of communication?

You will probably be entirely unsurprised to learn that there are a myriad of ideas as to why human speech evolved. Some early theories proposed by linguist Max Müller in the 1800s went by the eccentric names of the 'bow-wow', 'pooh-pooh' and 'ding-dong' theories. However, modern researchers have realised that the origins of human speech are a bit more complex than the simple sound imitations Müller was describing.

Multiple factors probably drove the evolution of human speech. One of the most compelling theories is the 'gestural theory', which suggests that speech developed as an extension of earlier, gesture-based communication. According to this idea, we once used hand gestures a lot more than we currently do, but this mode of communication came with a cost: it requires direct visibility and full use of the hands. When we began to use tools, our hands were required more and more for functions other than gestures, and so the benefits of speech outweighed those of gesturing.

There is significant support for the gestural theory, as both gestures and vocal language rely on similar neural circuits and are regulated by similar genes. Additionally, many non-human primates do exhibit gesturing communication, and humans who have speech impairments or suffer from deafness readily fall back on gesturing to communicate.

The evolution of human speech has not been a simple and straightforward journey. The English language has about 44 distinctive phonemes, and other languages have as many as 150. Amazingly, we can take these arbitrary sounds and create communication as elaborate as Shakespeare’s sonnets, convey ideas as complex as string theory, and express abstract emotions and thoughts.

This system of communication – human speech – is arguably the most complex of any living species. It is certainly the most complex mode of vocal communication. It is understandable, then, that this intricate sound system requires an intricate set of tools. From our lungs to our larynx, from our tongue to our brain, and the genes underlying it all, speech is a fascinating phenomenon with plenty of mysteries yet to be discovered.

Edited by Andrew Katsis and Ellie Michaelides