The OEC: Facts about the language

The 20-volume historical Oxford English Dictionary is the largest record of words used in English, past and present. It contains words that are now obsolete or rare (such as xenagogue 'a person who guides strangers' and vicine 'neighbouring or adjacent') in addition to the latest coinages such as phishing and podcast.
The second edition of the OED, published in 1989 and consisting of twenty volumes, contains more than 615,000 entries, and the third, available online, is expanding all the time, with batches of 2,500 new and revised words and phrases being added in regular quarterly updates.
How many words are there in English? It is a question often asked, but not so easily answered. Even the OED does not set out to include every specialized technical term or slang or dialect expression ever used. New words are constantly being invented, developed from existing words, or adopted from other languages. Most will be used rarely, or only by a small group of people. This means that an unlimited number of words may occur in speech and writing which will never be recorded in even the largest dictionary.
Furthermore, what exactly is a word? Clearly we should include single units such as cat and dog. But are the plurals cats and dogs separate words? Should we include compounds such as walking stick, which are made up of two existing words? There are an almost unlimited number of such two-word compounds, which can't all be included in a dictionary. And what about abbreviations like BBC and Dr, or proper names such as London, Nelson, and Harry Potter: are they words? As you can see, the question is not a straightforward one.
Although it may be impossible to know the number of words in English, the Oxford English Corpus can help us assess the number of words in current use.
Instead of talking about words, it's more useful in this context to talk about lemmas, a lemma being the base form of a word. For example, climbs, climbing, and climbed are all examples of the one lemma climb. Just ten different lemmas (the, be, to, of, and, a, in, that, have, and I) account for a remarkable 25% of all the words used in the Oxford English Corpus. If you were to read through the corpus, one word in four (ignoring proper names) would be an example of one of these ten lemmas. Similarly, the 100 most common lemmas account for 50% of the corpus, and the 1,000 most common lemmas account for 75%. But to account for 90% of the corpus you would need a vocabulary of 7,000 lemmas, and to get to 95% the figure would be around 50,000 lemmas.
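To make the idea of counting by lemma concrete, here is a minimal sketch in Python. It is not how the Oxford English Corpus is actually processed: the lookup table `LEMMA_OF`, the helper names `lemma_counts` and `share_of_corpus`, and the toy sentence are all assumptions made for illustration. A real corpus would rely on a full lemmatizer rather than a hand-written table.

```python
from collections import Counter

# Hypothetical, hand-written lookup table: inflected form -> lemma.
# It covers only the toy sentence below; a real corpus needs a lemmatizer.
LEMMA_OF = {
    "climbs": "climb",
    "climbing": "climb",
    "climbed": "climb",
    "is": "be",
    "was": "be",
    "has": "have",
}


def lemma_counts(tokens):
    """Count tokens by lemma, falling back to the lowercased token itself."""
    return Counter(LEMMA_OF.get(t.lower(), t.lower()) for t in tokens)


def share_of_corpus(counts, lemmas):
    """Return the fraction of all tokens that belong to the given lemmas."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(counts[lemma] for lemma in lemmas) / total


tokens = "She climbs the hill that he climbed and I was climbing the wall".split()
counts = lemma_counts(tokens)

# The three inflected forms are counted together under the lemma "climb".
print(counts["climb"])
print(share_of_corpus(counts, ["climb", "the"]))
```

In this toy sentence, climbs, climbed, and climbing all fall under the single lemma climb, which is the sense in which the figures above count lemmas rather than word forms.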
The remaining 5% of the corpus consists of a very large number of lemmas which occur rarely: words like moidore or parados, which may occur only once every several million words. Like all natural languages, English consists of a small number of very common words, a larger number of intermediate ones, and then an indefinitely long 'tail' of very rare terms.
| Vocabulary size (no. lemmas) | % of content in OEC | Example lemmas |
|---|---|---|
| 10 | 25% | the, of, and, to, that, have |
| 100 | 50% | from, because, go, me, our, well, way |
| 1,000 | 75% | girl, win, decide, huge, difficult, series |
| 7,000 | 90% | tackle, peak, crude, purely, dude, modest |
| 50,000 | 95% | saboteur, autocracy, calyx, conformist |
| >1,000,000 | 99% | laggardly, endobenthic, pomological |
The long tail means that to account for 99% of the Oxford English Corpus you would need a vocabulary of more than a million lemmas. This would include some words which may occur only once or twice in the whole corpus: highly technical terms like chondrogenesis or dicarboxylate, and one-off coinages like bootlickingly or unsurfworthy that people would probably understand but would be unlikely to use.
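The relationship between vocabulary size and coverage can be sketched in a few lines of Python. This is an illustrative sketch only, not the method used to produce the OEC figures: the function name `lemmas_needed` and the toy frequency counts are assumptions, and the counts are invented to keep the arithmetic easy to follow.

```python
from collections import Counter


def lemmas_needed(counts, percent):
    """Return the smallest number of most-frequent lemmas whose combined
    count reaches `percent` (an integer from 0 to 100) of all tokens."""
    total = sum(counts.values())
    running = 0
    for n, (_, freq) in enumerate(counts.most_common(), start=1):
        running += freq
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return n
    return len(counts)


# Invented toy frequencies (100 tokens in total), not real OEC data.
toy = Counter({
    "the": 50,
    "be": 25,
    "cat": 10,
    "dog": 8,
    "calyx": 4,
    "moidore": 2,
    "parados": 1,
})

for percent in (50, 75, 90, 99):
    print(percent, lemmas_needed(toy, percent))
```

Even in this tiny example the pattern is the same as in the table: the first lemma or two cover most of the text, while each further step in coverage requires proportionally more of the rare lemmas in the tail.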
If we decide that around 90-95% of the corpus gives a reasonable idea of an average vocabulary, we are left with a figure somewhere in the range of 7,000-50,000 lemmas: say, 25,000. What does a vocabulary of this size represent? It represents the set of most significant words in English: those which occur reasonably frequently and which account for all but a small part of everything we may encounter in speech or writing. It includes all the words that we actively use in general everyday life.
It's interesting to note that most reasonably sized dictionaries contain significantly more than 25,000 lemmas. The 11th edition of the Concise Oxford English Dictionary, for example, lists more than 75,000 single-word lemmas, which means that the majority of its entries must belong to the long tail of extremely rare words. This makes good sense: such terms occur very infrequently, but when they do they are likely to be crucial to what's being said, and the reader might well want to look them up. The idea of a quantifiable vocabulary should be seen in this light: the words we ignore for the purposes of the exercise may be very rare, but in context they may be very important.
Based on the evidence of the Oxford English Corpus, which currently contains over 2 billion words, the 100 commonest English words found in writing around the world are as follows:
| Ranks 1-25 | Ranks 26-50 | Ranks 51-75 | Ranks 76-100 |
|---|---|---|---|
| 1 the 2 be 3 to 4 of 5 and 6 a 7 in 8 that 9 have 10 I 11 it 12 for 13 not 14 on 15 with 16 he 17 as 18 you 19 do 20 at 21 this 22 but 23 his 24 by 25 from | 26 they 27 we 28 say 29 her 30 she 31 or 32 an 33 will 34 my 35 one 36 all 37 would 38 there 39 their 40 what 41 so 42 up 43 out 44 if 45 about 46 who 47 get 48 which 49 go 50 me | 51 when 52 make 53 can 54 like 55 time 56 no 57 just 58 him 59 know 60 take 61 people 62 into 63 year 64 your 65 good 66 some 67 could 68 them 69 see 70 other 71 than 72 then 73 now 74 look 75 only | 76 come 77 its 78 over 79 think 80 also 81 back 82 after 83 use 84 two 85 how 86 our 87 work 88 first 89 well 90 way 91 even 92 new 93 want 94 because 95 any 96 these 97 give 98 day 99 most 100 us |
It's noticeable that many of the most frequently used words are short ones whose main purpose is to join other, longer words rather than determine the meaning of a sentence. These are known as 'function words'. It could be said that it's more interesting to explore the frequency of 'content words', as shown in the list below:
| Nouns | Verbs | Adjectives |
|---|---|---|
| 1 time 2 person 3 year 4 way 5 day 6 thing 7 man 8 world 9 life 10 hand 11 part 12 child 13 eye 14 woman 15 place 16 work 17 week 18 case 19 point 20 government 21 company 22 number 23 group 24 problem 25 fact | 1 be 2 have 3 do 4 say 5 get 6 make 7 go 8 know 9 take 10 see 11 come 12 think 13 look 14 want 15 give 16 use 17 find 18 tell 19 ask 20 work 21 seem 22 feel 23 try 24 leave 25 call | 1 good 2 new 3 first 4 last 5 long 6 great 7 little 8 own 9 other 10 old 11 right 12 big 13 high 14 different 15 small 16 large 17 next 18 early 19 young 20 important 21 few 22 public 23 bad 24 same 25 able |
The commonest nouns are time, person, and year, followed by way and day (month is 40th). The majority of the top 25 nouns (15) are from Old English, and of the remainder, most came into medieval English from Old French, and before that from Latin. Notice that many of these words are very common because they have more than one meaning: way and part, for example, are listed in the Concise OED as having 18 and 16 different meanings respectively. They often also form part of common phrases: some of the frequency of time, for example, comes from its use in adverbial phrases like on time, in time, last time, next time, this time, etc.
As you would expect, the commonest verbs express basic concepts. Strikingly, the 25 most frequent verbs are all one-syllable words; the first two-syllable verbs are become (26th) and include (27th). Of these 25, 20 are Old English words, and three more, get, seem, and want, entered English from Old Norse in the early medieval period. Only try and use came from Old French. It seems that English prefers terse, ancient words to describe actions or occurrences.
Again, most of the top adjectives are one-syllable words, and 17 out of 25 derive from Old English: only different, large, and important are from Latin. In terms of the words' meanings, great is higher in the ranking than big, probably because of its informal sense 'very good'. Little is surprisingly high at 7, as compared with small at 15. Bad is unexpectedly low at 23: is this because we have such a large choice of synonyms available for expressing 'bad things'?