Introduction
Welcome to version 6.0.2 of 12dicts, a
collection of English word lists. It differs in several important
ways from most of the other free word lists you can download.
- The 12dicts lists are
oriented towards common words. If you're looking for
myriads of archaic, scientific or computer jargon words, you should
look elsewhere.
- The 12dicts lists have been rigorously checked
for errors. (This is not to
say that they are error-free, merely that enough care has been taken
that errors
are rather infrequent.)
- 12dicts contains a variety of lists, of
different sizes and characteristics.
One size does not fit all. Because each list has different
characteristics, I do
not recommend combining them, except as noted below.
Originally, 12dicts was composed of lists derived from a specific set
of 12 source
dictionaries. In addition to these "classic" lists, 12dicts now
includes lists derived
from other sources. It would perhaps be appropriate to rename 12dicts
to something
more generic, such as BAWL (Beale's Assorted Word Lists), but I have
not done so in
order to preserve continuity.
This is release 6.0.2 of 12dicts, released June 2016.
This is an update to release 6.0. The following is a brief rundown of the
changes and additions in release 6.0 and beyond:
- A number of new lists, based on 6 "advanced
learner's" ESL
dictionaries, have been added. The sources are reasonably balanced
between American and British English. In addition to 3of6game.txt and
3of6all.txt, which are more or less traditional word lists,
6phrase.txt, a list of multi-word phrases, was added.
- The 5desk.txt list has been augmented with words
from two of
the advanced learner's dictionaries, and renamed 5d+2a.txt to
reflect this change.
- The lemmatized lists have been augmented by
adding words
from the new advanced learner list 3of6game.txt along with some
commonly-used hyphenated words from both 2of12.txt and 3of6all.txt.
These lists have been renamed from 2+2lemma.txt and 2+2gfreq.txt to
2+2+3lem.txt and 2+2+3frq.txt to reflect this change.
- Word frequency information for the lemmatized
frequency list
is now obtained from a BYU corpus-derived frequency list rather than
from Google web data. A small number of abbreviations and proper names
have been added to the list.
- Two new small lists of especially common or
important words have been added: 2of5core.txt and 2+2+3cmn.txt.
- The annotations of the 6of12.txt list have been
reworked.
- Minor corrections have been made to the
"classic" lists.
- The neologism file, containing words too recent
or
controversial to be listed in many of the source dictionaries, has been
updated.
- Slight changes were made to the list of
6of12.txt signature
words after it was determined that a few of them should have been
present as regular (non-signature) words in the
main body of the list but were omitted due to compilation errors.
- The files were organized into directories to
make them more manageable given their increased number.
- The 2of4brif.txt list is being "deprecated". I
will continue
to distribute it, but will not be changing or maintaining it. I
consider the 3of6game.txt list to be a complete replacement.
- Version 6.0 of 12dicts had been out for less than a week
before I discovered a number of embarrassing typos in 5d+2a.txt. These
have been corrected (along with a minor omission in the 2+2+3 lists)
in version 6.0.1.
- Version 6.0.2 of 12dicts makes numerous changes to the
lemmatized lists, including improvements to the lemmatization, tweaks
to improve the frequency data for words which are also proper names,
and additional signature words for the 2+2+3cmn list.
Some general observations
With the exception of the neol2016 list, all the 12dicts
lists were assembled in a similar fashion. Words were extracted from a
set of source dictionaries and, in most cases, a list was assembled by
selecting all words and phrases present in some number of the sources
meeting certain criteria. For instance, the 2of12 list comprises
lower-case and hyphenated words present in at least two of twelve
source dictionaries. For some lists, rules are added establishing
exceptions for certain words or classes of words - for instance,
the 2of12 list contains the upper-case words I and O as exceptions to
its general exclusion of upper-case words and names.
Some lists contain annotations, which are special characters
appended to certain words. For instance, the ":" character is used in
some lists to identify abbreviations which are ordinarily used without
a terminating period. This annotation allows these abbreviations to be
distinguished from possibly similar regular words. Another annotation,
used in the 3of6game and 3of6all lists, is the "$" character,
indicating a word that was placed in the list even though fewer than
three of the sources mention it. The "+" and "!" annotations are used
to identify signature words and neologisms, as described below. Note
that it is possible for a word to have more than one annotation, though
this is uncommon. For instance, in the 6of12 list, the word boldfaced~= has both
a "~" and a "=" annotation, signifying that the word was an arbitrary
choice between two equally attested forms (boldfaced
and bold-faced),
and that it was not given a separate definition in a majority of the
sources listing it.
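A minimal sketch of how a program might separate a word from its annotations. The set of annotation characters is an assumption gathered from the annotations this document describes, and split_annotations is a hypothetical helper name, not something shipped with 12dicts.

```python
# Annotation characters described in this document. Treating them as one
# fixed set for every list is an assumption of this sketch.
ANNOTATION_CHARS = ":&#=<^~+!%$"


def split_annotations(entry):
    """Return (word, annotations) for one line of a 12dicts list."""
    entry = entry.strip()
    # rstrip removes any run of trailing annotation characters.
    word = entry.rstrip(ANNOTATION_CHARS)
    return word, entry[len(word):]


print(split_annotations("boldfaced~="))  # ('boldfaced', '~=')
```

Applying the helper to every line of a file and keeping only the first element of each pair gives a plain word list.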
A number of the lists contain signature words. These are words (or
phrases) which do not meet the formal criteria for inclusion in a
list, but which I have chosen to add anyway, as words which "ought to
be" present. Whether a list contains signature words depends on the
specific list. Usually, but not always, a signature word is present in
some of
the sources used for a list, but not enough of them to qualify for
inclusion on that basis. Some lists may "inherit" signature words from
other lists from which they were assembled. For instance, the 6phrase
list includes the signature words from the 3of6all list. In most
cases, signature words are marked with the "+" annotation.
The neol2016 list contains
neologisms, words which are not listed in
some or all of the source dictionaries for 12dicts, generally for one
of two reasons. First, many of the words are recent coinages which were
not yet fully recognized by mainstream lexicographers when the 12dicts
sources were published. Examples of such words are selfie, Obamacare, emoji
and snarky.
Other so-called neologisms are well-established, often well-known,
words which are
considered scandalous, such as sexual slang and ethnic slurs, and which are
often deliberately omitted from dictionaries. (I will not give any
examples of this sort
of word here, but you will find some in the neol2016 list.) Note that
the neologism list has been accumulating for about fifteen years now,
and
some of its words have become almost old-fashioned, such as spam and dotcom. The
neologism list is provided so that some or all of its words can be
added to the other lists where the intended usage makes that
appropriate. However, I have added the single-word neologisms to the
2of12inf and 3of6game lists, as these are the most likely to be used in
coding word games, where it is desirable to recognize the very
latest hot vocabulary. In these lists, neologisms are
annotated with the "!" character.
One other observation worth making is about diacritics. Some
dictionaries will tell you that there are English words correctly
spelled café, naïve, façade and piñata,
and I do not wish to disagree with these authorities. But as a
practical matter, Americans do not like to use diacritics. Furthermore,
they use keyboards which do not contain accented letters, and are often
unfamiliar with the clumsy techniques that their software
provides to use such characters. For this reason, 12dicts drops all the
accents from its English vocabulary. This is particularly valuable for
coding word games, where expecting players to accent the e in cafe is not going to
make them happy. (I cannot help pointing out that Scrabble® contains
no É tiles.) I apologize to those who consider it a matter of some
emotional importance that resume
and résumé
should be differently spelled.
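For anyone who needs to match user input or another word list against the unaccented 12dicts spellings, the following is one possible sketch of the same normalization. It uses Python's standard unicodedata module. The function name strip_diacritics is hypothetical, and this is not necessarily how the 12dicts lists themselves were prepared.

```python
import unicodedata


def strip_diacritics(word):
    """Remove accents by decomposing letters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for accented in ["café", "naïve", "façade", "piñata", "résumé"]:
    print(accented, "->", strip_diacritics(accented))
```

This approach only handles letters that Unicode decomposes into a base letter plus a combining mark. Letters such as "ø" have no such decomposition and pass through unchanged.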
The organization of 12dicts
The 12dicts lists are organized into four directories,
grouping
lists with similar characteristics together. The remainder of this
document follows this organization as well. For each directory, a
section of the documentation describes in detail the lists it contains.
Most users of 12dicts end up using only a single list. If it is clear
which directory will contain the list you need, you can go directly to
the appropriate documentation.
The four directories are:
- American.
The lists in this directory contain primarily American English
words.
- International.
The lists in this directory contain words from both American
English and British English.
- Lemmatized.
The lists in this directory combine other lists, and are formatted in a way that clarifies word
relationships.
- Special.
The lists in this directory are special-purpose lists that do not fit
into the other directories.
Picking a list to use
If you are not certain which directory might contain the
kind of
list you are looking for, here is a breakdown of the 12dicts lists by
size and purpose which may be helpful. If it does not help you find what you are looking
for, you might want to check out the table put together by Kevin Atkinson,
which summarizes the characteristics of all the 12dicts files. Also, I suggest reading the
introduction to the documentation for each directory listed in the previous section, each
of which contains a table summarizing exactly what you can expect from
each list in that directory.
- Lists for use in word games: 2of12inf (American), 3of6game (International).
- A list ordered by word frequency: 2+2+3frq (Lemmatized).
- Small lists of common words: 2of5core (Special, very small), 3esl (American), 2+2+3cmn
(Lemmatized).
- Medium-sized lists: 6of12
(American, smaller, includes phrases), 2of12
(American, larger, no phrases).
- Large lists: 3of6all
(International, includes phrases), 5d+2a
(International, no phrases, many obscure words), 2+2+3lem
(Lemmatized, very large).
- A list of phrases: 6phrase
(Special).
The 12dicts project began as the n-dicts project, n being a variable
whose
value finally stabilized as 12. The purpose of the project was to
create a
list of words approximating the common core of the vocabulary of
American
English.
The methodology of the project was to record and
correlate the words
listed in a number of small dictionaries. The number of dictionaries
so recorded ended up as 12, comprising 8 ESL (English as a Second
Language)
dictionaries and 4 "desk dictionaries". The dictionaries chosen
varied widely by publisher, by style, by completeness and by depth. All
of them were dictionaries of American
English (three from British publishers). The smallest of them contained
about 20,000 entries, and the largest 46,000. (All totaled, there are
about 75,000 entries, many of which appeared in only a single
dictionary.)
All but two of the sources were published between 1992 and 1999, when
12dicts
was first released.
The following table summarizes the contents of each
of the classic lists, located in the American directory, ordered by
size in words:
                  | 3esl   | 6of12  | 2of12  | 2of12inf |
Size (Words)      | 22,000 | 32,000 | 41,000 | 82,000   |
Number of Sources | 3      | 12     | 12     | 12       |
American English  | Y      | Y      | Y      | Y        |
British English   | –      | –      | –      | –        |
Ordinary words    | Y      | Y      | Y      | Y        |
Inflections       | –      | –      | –      | Y        |
Hyphenations      | Y      | Y      | Y      | –        |
Phrases           | Y      | Y      | –      | –        |
Names             | Y      | Y      | –      | –        |
Abbreviations     | Y      | Y      | –      | –        |
Acronyms          | Y      | Y      | –      | –        |
Prefixes/Suffixes | –      | –      | –      | –        |
Signature words   | –      | Y      | Y      | *        |
Neologisms        | –      | –      | –      | Y        |
Annotations       | Y      | Y      | Y      | Y        |
A * in the "Signature Words" row means that
signature
words associated with some other list may be present, but there are no
signature words associated specifically with that list.
I initially tried two different ways of winnowing the 12dicts data to
produce lists of common words. Both produced interesting results.
One list, the 6of12 list, contained all words and phrases
listed in 6 of the 12 dictionaries. One way of describing this list
is that it contains those words and phrases which a (seeming) majority
of lexicographers believe are relevant to people learning English,
and/or to everyday usage. This list contained about 32,000 words and
phrases. The other list, the 2of12 list, was more inclusive in that it
included words listed in as few as two of the source dictionaries, but
less inclusive in that it excluded items of various sorts, including
multi-word phrases, proper names and abbreviations. This list contained
about 41,000 words. It was likely more suitable for use in areas
like spell checking or word games than the 6of12 list. (Honesty
compels me to admit that neither of these lists is, by itself, a good
choice for spell checking, due to the absence of inflections, proper
names, Roman numerals, etc.)
A third list, 2of12inf.txt, developed later, was of
a rather different
character, and is discussed in a later section.
A more precise description of the criteria by which
the above lists
were composed is as follows:
6of12 list word selection
- The 6of12 list contains all non-excluded words
and phrases which
appear in 6 or more of the source dictionaries.
- Prefixes and suffixes are excluded.
Abbreviations are included;
however, if they are entirely lower-case and alphabetic, they are
terminated with a colon (":") so they can be easily distinguished
from regular words.
- Inflections of included words are not themselves
included unless
they are separately defined or irregular.
- It sometimes occurs that a word is listed in
several forms (e.g.,
with and without hyphenation) in 6 or more dictionaries, even though
no single form is so listed. In this case, if one spelling is clearly
more accepted, this spelling and this spelling only is listed. If all
spellings seem equally accepted, one spelling has been selected
arbitrarily for inclusion.
- The 6of12 list contains a significant number of
signature words, as discussed below. All of these words are
listed in at least one of the source dictionaries.
- In addition to the ":" suffix discussed above,
other annotations are used to mark words with certain characteristics,
as discussed below.
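The core selection rule ("appears in N or more of the sources") can be sketched as follows. This is only an illustration of the counting step, under the assumption that each source has already been reduced to a collection of normalized headwords. The function name select_by_source_count is hypothetical, and the exclusions, variant-spelling handling and signature words described above are not modeled.

```python
from collections import Counter


def select_by_source_count(sources, minimum):
    """Return the sorted words that occur in at least `minimum` sources."""
    counts = Counter()
    for source in sources:
        # Use a set so a word repeated within one source counts only once.
        counts.update(set(source))
    return sorted(word for word, n in counts.items() if n >= minimum)


# Toy data standing in for three dictionaries.
toy_sources = [
    ["cat", "dog"],
    ["cat", "emu"],
    ["cat", "dog", "dog"],
]
print(select_by_source_count(toy_sources, 2))  # ['cat', 'dog']
```

With twelve real sources, a minimum of 6 would correspond to the basic 6of12 threshold and a minimum of 2 to the basic 2of12 threshold.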
2of12 list word selection
- The 2of12 list contains all non-excluded words
which appear in at
least 2 of the source dictionaries.
- This list excludes capitalized words, multi-word
phrases, and
abbreviations, as well as prefixes and suffixes. It does not
exclude hyphenated words or contractions. If a word occurs in
both a hyphenated and an unhyphenated form, the unhyphenated
form is listed, even if the hyphenated form is generally
preferred.
- The list excludes spellings which are considered
(by a majority
of the dictionaries listing them) to be non-American usage. It
also excludes secondary spellings which are mentioned by fewer
than four of the source dictionaries.
- Inflections of included words are not themselves
included unless
they are separately defined, or irregular.
- Several of the source dictionaries include
listings for obscure
currencies, such as ringgit, khoum and ngwee.
I was unable to regard such words as part of the English "core
vocabulary",
and so I required citation in over a third of the dictionaries for
inclusion of such monetary units. A side-effect was the elimination
of the word lepton, which, in addition to its use
in particle
physics, is also a monetary unit worth 0.01 Greek drachma.
- This list also includes a small number of
signature words, as
discussed below.
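The form-based exclusions above can be pictured as a simple predicate. The sketch below is an illustration only. It assumes that abbreviations in the raw data end with a period and that prefixes and suffixes are written with a leading or trailing hyphen, neither of which is stated in this document. The name eligible_for_2of12 is hypothetical. The rules about non-American spellings, secondary spellings, hyphenated/unhyphenated pairs and monetary units need per-dictionary information and are not modeled.

```python
# The upper-case words I and O are the exceptions named in this document.
UPPER_CASE_EXCEPTIONS = {"I", "O"}


def eligible_for_2of12(entry):
    """Apply the form-based 2of12 exclusions to one candidate headword."""
    if entry in UPPER_CASE_EXCEPTIONS:
        return True
    if " " in entry:                # multi-word phrase
        return False
    if entry != entry.lower():      # capitalized word or name
        return False
    if entry.endswith("."):         # assumed marker of an abbreviation
        return False
    if entry.startswith("-") or entry.endswith("-"):
        return False                # assumed prefix/suffix notation
    return True                     # hyphenated words and contractions pass


candidates = ["well-being", "can't", "Cancer", "thank you", "un-", "etc.", "I"]
print([c for c in candidates if eligible_for_2of12(c)])
```

A word passing this test would still need to appear in at least 2 of the source dictionaries to be listed.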
Signature words
As indicated, both lists have been augmented with words
(and, in the
case of the 6of12 list, phrases) which fail to meet the formal
requirements for inclusion. In the case of the 6of12 list, 1024
words were added (about 3% of the total). These are all words which,
in the judgment of the compiler, are as familiar as many of the words
which did meet the criteria for inclusion. Examples of some of the
sorts
of words which were added are:
- Words of the same category as other included
words. An example is
the astrological sign Cancer, which alone of all
the
astrological signs fails to appear in 6 or more of the dictionaries.
Similarly added was the omitted holiday Christmas Eve.
- Vulgarities, sexual terms and insults. Some such
words were
already included, but most of the source dictionaries were quite
squeamish about them. These words are very widely known indeed;
I hold that any list of "common" words which does not include the
infamous f-word is simply discredited thereby. Some may feel that
it would have been better to leave some or all of these terms
unmentioned. Nevertheless, the expression of blasphemy,
unwarranted contempt and perverse lust, whether in words or in
deeds, is a very human trait. Suppressing the evidence of these
aspects of the human condition in our language makes no more sense
than excluding leprosy, gangrene and dementia,
no matter how unpleasant they may be to contemplate.
- Conventional conversational phrases so common as
to be practically
invisible to native speakers. Examples are thank you, good
night, uh-huh, of course and gesundheit.
- Sports terminology, especially for football and
baseball. (If I,
who am practically sports-blind, noticed this deficiency, it must
be of major proportions indeed.)
Note that the signature words in the 6of12 list can be
identified via
the annotation "+", and eliminated if desired.
A much smaller set of words (49) was added to the
2of12 list. These
were of two sorts:
- Signature words from the 6of12 list which were
not already present
in the 2of12 list, and which are not excluded due to being
abbreviations, phrases, etc.
- Inflections of irregular verbs not explicitly
mentioned in 2
source dictionaries, such as outfought and reheard.
These words are not marked with suffix characters.
Annotations
Some of the 6of12 list entries are annotated with a suffix
character,
giving additional information about the associated word. The
annotations can be easily removed with an editor or a script if
they are unwanted.
These annotations are:
: | The word is an otherwise unmarked abbreviation. This suffix always occurs before any other suffix. |
& | The word is primarily a non-American usage. |
# | The word is generally held to be a variant or less preferred form of another word. |
= | Roughly, this indicates a "second class" word, as described below. |
< | This form of a word is held to be the primary form by fewer dictionaries than some other form of the word. |
^ | This form of the word was selected as the most commonly listed of a set of variant spellings. |
~ | This form of a word is one of a set of variant spellings, none of which was clearly preferred. |
+ | The word is a signature word. |
The reasons a word might be marked with the =
annotation
are:
- The word is an inflection which was defined in
the same
entry as the base word.
- The word is a derived word (usually ending with -ly,
-ness or -er/or) which was
not defined in a separate
entry.
- The word appeared in a list of undefined words
with a
common prefix, such as un- or re-.
Note that, in the determination of the "<", "^", and
"~" suffixes, only certain very close spelling variations are
considered, namely single word vs. hyphenated word vs. multi-word,
differences in capitalization, and presence or absence of a terminating
period for abbreviations. The words tenderhearted
and tender-hearted
are close variants by this definition, but judgment and judgement are not.
The words in the 2of12 list are not annotated.
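As an illustration of the kind of script mentioned above, the following sketch decodes the suffix characters of a 6of12 entry into short descriptions. The descriptions are paraphrased from the annotation table, and the names MEANINGS and decode_6of12 are hypothetical.

```python
# Short paraphrases of the 6of12 annotation table.
MEANINGS = {
    ":": "abbreviation",
    "&": "primarily non-American usage",
    "#": "variant or less preferred form",
    "=": "second-class word",
    "<": "primary form in fewer dictionaries than another form",
    "^": "most commonly listed of a set of variant spellings",
    "~": "one of several variants, none clearly preferred",
    "+": "signature word",
}


def decode_6of12(entry):
    """Return (word, [(mark, meaning), ...]) for one 6of12 entry."""
    word = entry.strip()
    marks = []
    # Peel annotation characters off the end, one at a time.
    while word and word[-1] in MEANINGS:
        marks.append(word[-1])
        word = word[:-1]
    marks.reverse()  # restore the order in which they were written
    return word, [(m, MEANINGS[m]) for m in marks]


word, notes = decode_6of12("boldfaced~=")
print(word)
for mark, meaning in notes:
    print(mark, meaning)
```

To drop the signature words, for example, a script could skip every entry whose decoded marks include "+".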
The 2of12inf list is of a rather different character from the two
original "classic" lists. Conceptually,
it is simple. It consists of all the unhyphenated words in the 2of12
list, plus
their inflections, amounting to about 82,000 words. This list may
be more useful than the other lists for applications like word games.
It was created to help Kevin Atkinson in his Aspell and SCOWL projects.
Unlike the 6of12 and
2of12 lists, this list was not based exclusively on the contents of my
12 source dictionaries, and for this reason it has, I feel, less
authority than the other classic 12dicts lists. It also probably has a
significantly higher error rate than the other lists, for reasons
explained below.
The criteria defining the 2of12inf list are as
follows:
- The 2of12inf list contains all non-excluded
words which appear in
at least 2 of the source dictionaries.
- This list excludes capitalized words, multi-word
phrases,
abbreviations, contractions, hyphenated words and single-letter
words, as well as prefixes and suffixes.
- The list does not exclude secondary spellings,
non-American usages
or monetary units.
- The list includes inflections of all included
words. Any
inflection mentioned or clearly implied by any of the source
dictionaries is included (i.e., two citations are not required).
Additionally, some inflections have been added from other sources.
- Plurals of "uncountable" nouns were included,
annotated with the
"%" suffix character. See below for an extended discussion of
the inclusion of these words.
- Qualifying signature words from the other lists,
plus their
inflections, were
added. No other signature words were added.
- Qualifying neologisms from the neol2016 list,
including their inflections, were added. The neologisms are indicated
by a "!" suffix.
Though the 2of12inf list still consists mostly of very common words,
criteria 3 through 5 above cause the 2of12inf list to contain a greater
proportion of unfamiliar and unusual words than the other classic
12dicts lists.
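Because the uncountable plurals and the neologisms are marked, a short script can remove either group. The sketch below is one hedged way to do it. The function name and its parameters are hypothetical, and as a precaution it accepts the "%" and "!" marks at either end of an entry rather than assuming their exact position.

```python
def filter_2of12inf(entries, keep_uncountable_plurals=False, keep_neologisms=True):
    """Return bare words from 2of12inf-style entries, optionally dropping
    plurals of uncountable nouns ("%") and neologisms ("!")."""
    words = []
    for entry in entries:
        entry = entry.strip()
        if not entry:
            continue
        if "%" in entry and not keep_uncountable_plurals:
            continue
        if "!" in entry and not keep_neologisms:
            continue
        # Remove the marks themselves, whichever end they are on.
        words.append(entry.strip("%!"))
    return words


# Illustrative entries only, not copied from the actual file.
sample = ["zeal", "zeals%", "selfie!", ""]
print(filter_2of12inf(sample))  # ['zeal', 'selfie']
```

Setting both options to True simply strips the marks and keeps every word, which may suit word games that accept any plural.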
The 2of12inf list was not derived directly from the
12 source
dictionaries. The starting point was a subset of Kevin Atkinson's
AGID list, a list of words, parts of speech and inflections derived
from public-domain sources, notably Moby Words and WordNet. (See the
file agid.txt in the 12dicts archive, which is a copy of the AGID
"readme",
for more information on the antecedents of AGID.) 2of12inf was created
by a process of editing the AGID subset to remove spurious entries and
those which reflected a more esoteric English vocabulary than the other
12dicts lists, and to add inflections which AGID failed to identify.
This process required significantly less effort than would have been
needed to derive the list directly from the source dictionaries.
Unfortunately, a side effect of the process was that the result is
probably somewhat less reliable than the other 12dicts lists.
In particular, Moby Words is notoriously unreliable, and I find it
unlikely that I have successfully identified all the spurious
inflections its use has introduced. It would be nice to
release another edition of 2of12inf which is not derived from AGID,
and therefore not "infected" by Moby Words, but I haven't done so in 15
years, and so it probably won't happen.
After the first version of the 2of12inf list was
released, I replaced
one of the source dictionaries, officially an international dictionary
but in actuality rather British in its orientation, with a more
American dictionary by the same publisher. It was not practical
(nor necessarily desirable) for me to go through the list removing
inflections endorsed only by the superseded dictionary. For this
reason, the 2of12inf list has a slightly more international character
than the other 12dicts lists. It is not altogether clear that this
is a bad thing.
Selection of inflections
Ideally, the 2of12inf list would contain only inflections listed in
one of the 12dicts source dictionaries. This proved not to be
practical. The reason for this has to do with the nature of these
sources, which are mostly ESL dictionaries. An ESL dictionary might
well list the word esophagus, but, because an
English learner is
unlikely to need to talk about this organ in the plural, it will
probably not bother to list the plural form esophagi.
For words of
this sort, I therefore needed to obtain their inflections from other
sources. Obviously, the decisions on when to include additional
inflections were judgment calls, as were the choices of which
inflections to add.
Adjectival inflections (comparatives and
superlatives) proved to be
an especially annoying problem. Only 2 of my 12 source dictionaries
provided remotely reliable information of this sort. In fact, such
information is sparse and inconsistent in most dictionaries of any
size. I relied on a small set of additional dictionaries for this
information, which was mostly disjoint from the sources for plurals
and verb forms. Several of these sources were Scrabble®-related,
and therefore inclined to include forms of little plausibility such
as iller/illest or fertiler/fertilest.
Accordingly, I ended up rejecting some of the documented inflections on
grounds of implausibility. I have no doubt that, in the process, I made
a number of errors of both inclusion and exclusion and, in any case,
many
of the forms listed have no connection with any of the 12dicts source
dictionaries.
One additional problem in the creation of the
2of12inf list was that
of "uncountable" nouns and their plurals. Some English dictionaries,
especially ESL dictionaries, as well as other linguistic sources
attest to the existence of nouns which cannot be counted or used in
the plural. Examples of such nouns include mud, rayon,
oregano,
chess, fairness, wisdom, aluminum, training, materialism
and chickenpox. This is an entirely commonsense
notion, but a
difficulty is the fact that the boundary between the countable and the
uncountable is extremely vague and ill-defined. For example, the word
coffee is ordinarily uncountable, but not when
ordering in a
restaurant, as is the word symmetry, except in
physics or math.
In general, it is possible to contrive a context where use of the
plural of any noun whatsoever is reasonable.
An alternate position, therefore, is that in fact
no nouns are
uncountable, and that any noun which is not already plural possesses
a plural. This position is especially useful in the context of word
games, where words such as zeals and anthraxes
may produce large scores. For this reason, the official Scrabble
dictionaries list words such as thens, onces and
mankinds, which most people find
rather implausible. The fact that the 2of12inf list might well be
useful in gaming contexts, together with the fact that the boundary
between countable and uncountable nouns is so ill-defined, served as
a powerful argument for inclusion of all plural forms, whether
commonly used or not, while its derivation from ESL sources argued
for including only the plurals of countable nouns, however
distinguished.
As I prepared the list for release, I was unable to
resolve this dilemma,
and adopted a
compromise. The 2of12inf list includes all plurals, but with the
plurals of uncountable nouns marked, making it easy to remove them
if they are not wanted. That left the issue of how to establish
countability. Six of my source dictionaries included information
on countability, which was adequate to decide the status of most of
the included nouns. As for the rest, as usual, I used my best
judgment. I will confess to occasionally overriding the source
dictionaries when I believed they were clearly incorrect. (For
instance, I chose not to mark the word hatreds as
an
uncountable plural, in defiance of the opinion of all my sources,
on the grounds that it has been used in too many news stories from
Bosnia to be considered unusual.) It is interesting to note that
most of the plurals I added from auxiliary sources were of words
considered uncountable. I also note that at some point after the
release of the 2of12inf list, I decided that it would have been better
to have left the Scrabble plurals out, and, while I was not
comfortable with removing them, no list I've created
since then which lists inflections includes them.
The difficulties listed above, and the fact that I
was forced to
exercise personal judgment frequently in creating it, emphasize a
fundamental difference between this list and the other classic 12dicts
lists. I have tried to make the 6of12 and 2of12 lists reflect only the
source dictionaries, and to keep my own judgments and opinions out of
the picture (except for my addition of signature words). This has
proved impossible to achieve for the 2of12inf list, which accordingly
represents a less authoritative and more arbitrary collection.
Additionally, the 2of12inf list has undergone less proofreading and
validation than the other lists, and I suspect the error rate is
somewhat higher than the idealistic goal of 0.02% I adopted for this
project. Nevertheless, I hope it may prove to be
of some use and interest.
I wish to offer my special thanks to Kevin
Atkinson, for supplying me
with the AGID list, and for encouraging me to add the inflections. Of
course, any errors that remain in the 2of12inf list are my own
responsibility, and should not be blamed on Kevin, AGID, or even on
Moby.
The 3esl list represents another attempt to produce an English "core
vocabulary" list. It is about 2/3 of the size of the 6of12 list,
which it resembles in terms of the sorts of words included.
The 3esl list is a far more subjective list than
any of the classic
12dicts lists. It was compiled from 3 small ESL dictionaries, using
the same criteria for eligibility as the 6of12 list. I started with
a list composed of all words from the smallest of the 3 sources, plus
all words contained in both of the others. This list was then edited
in the following ways:
- I removed alternate spellings for included
words, such as grey
and off-stage. I also removed very similar synonyms
for the
same concept, for instance, removing cable television
as a
duplicate of cable TV.
- I added one form of each word which would have
been included if
the sources had agreed on spelling, such as shortchange
and back seat.
- I removed some words which were present in the
smallest of the
sources but seemed too esoteric, such as the symbols of chemical
elements. I did this only for words which were not present in the
other sources.
- I added some words which were present in only
one of the two
larger sources, but which seemed appropriate to add. These words
were frequently of the sort added to the 6of12 list as signature
words, as well as some inflections that often function as words
with meanings of their own, such as comforting and notes.
All of these changes were quite subjective in nature, and quite
numerous. Probably more than 10% of the candidate words were added
or removed in this way. For this reason, it is pointless to speak
of signature words for this list; the composition of the list is too
arbitrary for the term to make any sense. (I will note that the list
is still not entirely arbitrary, as I added only words found in
some form in one of the sources, and removed no words present in two
of the sources other than duplicates. Thus, words like front
page were not added, no matter how familiar, and words such
as lugubrious were not removed, despite clearly not
being
part of anyone's "core vocabulary".)
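The starting point for the 3esl list, before the subjective edits, amounts to a small set expression: every word from the smallest source, plus every word found in both of the other two. A minimal sketch, with hypothetical names and toy data:

```python
def seed_candidates(smallest, larger_a, larger_b):
    """All words in the smallest source, plus words shared by both others."""
    return set(smallest) | (set(larger_a) & set(larger_b))


# Toy word sets standing in for the three ESL dictionaries.
smallest = {"cat", "dog"}
larger_a = {"cat", "emu", "fox"}
larger_b = {"dog", "fox", "gnu"}
print(sorted(seed_candidates(smallest, larger_a, larger_b)))  # ['cat', 'dog', 'fox']
```

Here "fox" qualifies because both larger sources list it, while "emu" and "gnu" appear in only one of them and would be candidates only for the hand-picked additions described above.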
Like the 6of12 list, the 3esl list marks lower-case
abbreviations
with a ":" suffix, to prevent them from being mistaken for regular
English words.
One final note on this list. The 3esl list contains
about 1500 words
not present in the 6of12 list. Because these two lists have the same
rules for the kinds of words included, one could easily combine
the two to produce a slightly larger list including a number of words
whose omission from 6of12 is rather surprising. Be warned that in a
few cases, the spelling chosen for words with multiple spellings is
different in the two lists, and I would recommend that the duplicates
be removed. (I'll be happy to provide a list of the duplicates if
anyone wants one.)
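For anyone combining the two lists, the following sketch merges them and flags likely spelling duplicates for manual review. It assumes annotations have already been stripped from both lists. The notion of a "close variant" (hyphenation, spacing, capitalization, terminating period) follows the definition given for the 6of12 annotations, and the names variant_key and combine_lists are hypothetical.

```python
def variant_key(word):
    """Collapse hyphenation, spacing, case and a terminating period."""
    return word.lower().replace("-", "").replace(" ", "").rstrip(".")


def combine_lists(list_a, list_b):
    """Return (combined words, groups of close spelling variants)."""
    combined = set(list_a) | set(list_b)
    groups = {}
    for word in combined:
        groups.setdefault(variant_key(word), []).append(word)
    duplicates = sorted(sorted(g) for g in groups.values() if len(g) > 1)
    return sorted(combined), duplicates


merged, suspects = combine_lists(["back seat", "cat"], ["backseat", "cat", "dog"])
print(merged)
print(suspects)
```

The flagged groups are only candidates. Some pairs that differ in capitalization alone may be distinct words, so the final choice of which spelling to keep is best made by hand.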
The international 12dicts lists
Four 12dicts lists contain a more cosmopolitan vocabulary
than the classic lists. Two of these lists, 2of4brif and 5d+2a
(previously called 5desk), were released over ten years ago. The
2of4brif list was derived from four British dictionaries, and has now
been deprecated, as I believe the 3of6game list to be a superior
implementation of the same concept, compiled from more recent sources.
The 5d+2a list was originally compiled from a variety of sources, but
was extensively revised for this release by addition of several fairly
recently published sources.
For release 6, two new international lists were added to 12dicts:
3of6game and 3of6all. These were based on 6 "advanced learner's" ESL
dictionaries, released by both American and British publishers,
most of which covered both strains of English. The
3of6game list
is intended primarily for use in word games, and can be compared to
2of12inf in its general approach. The 3of6all list includes more forms
of
words (hyphenated, capitalized, multi-word phrases, etc.), and can be
compared to 6of12 in its general approach.
Two other more unusual lists were derived from these sources: 6phrase
and 2of5core. 6phrase is a collection of all the multi-word phrases from
any of the six dictionaries. Five of the six international sources flag
some words as being the most important words for an English beginner to
master. The 2of5core list collects those words that are flagged in at least two
of these dictionaries. Both of these lists are discussed in a little
more detail in the "Specialized Lists"
section of this document.
The following table summarizes the contents of
each
of the lists in the International directory, ordered
by size in words:
|                   | 2of4brif | 3of6game | 5d+2a        | 3of6all |
| Size (Words)      | 60,000   | 65,000   | 68,000       | 83,000  |
| Number of Sources | 4        | 6        | 7 (+5 minor) | 6       |
| American English  | Some     | Y        | Y            | Y       |
| British English   | Y        | Y        | Y            | Y       |
| Ordinary words    | Y        | Y        | Y            | Y       |
| Inflections       | Y        | Y        | –            | Y       |
| Hyphenations      | –        | –        | –            | Y       |
| Phrases           | –        | –        | –            | Y       |
| Names             | –        | –        | Y            | Y       |
| Abbreviations     | –        | –        | –            | Y       |
| Acronyms          | –        | –        | Y            | Y       |
| Prefixes/Suffixes | –        | –        | –            | Y       |
| Signature words   | –        | Y        | –            | Y       |
| Neologisms        | –        | Y        | –            | –       |
| Annotations       | –        | Y        | –            | Y       |
All of the classic 12dicts lists are unabashedly oriented towards
American English. After receiving a few expressions of interest in a
British English list, I put together the 2of4brif list. This list
was compiled from 4 large "international" ESL dictionaries, published
by British publishers. To this American, they are more British than
they are international; quite possibly, they seem more American than
international to British readers. It is interesting to note that,
although there were only a third as many sources for this list as for
the 12dicts lists, these dictionaries resembled each other far more
closely than their American counterparts, which could mean that the
2of4brif list is as good an approximation of a "core" British English
vocabulary as the 6of12 list is for American English. (Or, alternately,
it may simply mean that my choice of sources was too narrow.)
The criteria for inclusion in this list were
basically those of the
2of12inf list. In particular, inflections are included for all words,
but hyphenated words, contractions, phrases, proper names and
abbreviations are all excluded. One important difference between
the two is the way in which inflections were determined for inclusion.
The 2of12inf list includes some inflections found in one (or even none)
of its sources. Further, as discussed in detail above,
it includes plurals for words which are not normally
considered to have plurals. The 2of4brif list differs in both of
these regards. It includes only inflections endorsed by two or more
of the sources, specifically excluding any plural forms for nouns
listed as uncountable.
The 2of4brif list includes no signature words as
such. I made a small
number of adjustments for consistency, such as making sure that
-ise and -ize spellings were
equally
represented, and adding plurals for ordinal numbers. (Why
fourteenth would be defined as a fraction, but not
seventeenth, I must simply regard as a mystery.)
These
edits were so few, and so clearly harmless, that I have not marked
them.
Prospective users of the 2of4brif list should
realize that it was
compiled by an American. If my sources contained any glaring errors
(and most dictionaries have a few), I might well not have noticed,
and perpetuated them in the list. The fact that two citations were
required is some protection against such an event, but no guarantee.
As the 2of4brif list is very similar in makeup to
the 2of12inf list,
a user who wants a larger, more international list than either could
reasonably merge the two. If you do this, you should remove the
unusual plurals (marked with a "%") from the 2of12inf list in the
process, for consistency.
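As a rough illustration of the merge just described, here is a minimal Python sketch. It assumes each list is supplied as an iterable of lines with one entry per line, and that "%" is the only annotation that needs special handling; any other annotation is left attached to its word. The function name merge_inf_lists and the sample entries are invented for the example.

```python
def merge_inf_lists(lines_2of12inf, lines_2of4brif):
    """Return the sorted union of both lists, minus "%"-marked entries."""
    merged = set()
    for line in lines_2of12inf:
        word = line.strip()
        # Skip blank lines and the unusual plurals flagged with "%".
        if word and not word.endswith("%"):
            merged.add(word)
    for line in lines_2of4brif:
        word = line.strip()
        if word:
            merged.add(word)
    return sorted(merged)


# Invented sample entries, for illustration only.
sample_2of12inf = ["bravery", "braveries%", "color"]
sample_2of4brif = ["bravery", "colour"]
print(merge_inf_lists(sample_2of12inf, sample_2of4brif))
```

Open file objects can be passed in place of the sample lists, since iterating over a text file also yields one line at a time.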
Note that I have deprecated the 2of4brif list. I
believe that any applications of this list would be better off using
the 3of6game list in its place.
The 3of6 lists
The lists 3of6game and 3of6all are new with version 6 of
12dicts. Both were derived from a set of six advanced learner's ESL
dictionaries. The dictionaries can be broken down as follows:
- One strongly American-oriented dictionary.
- Two somewhat British-oriented dictionaries.
- Three international dictionaries, one from an
American publisher, two from a British publisher.
This provided a good balance between British and American
usage. My goal was to produce lists that contained blancmange and swede as well as applesauce and boysenberry. Note
that
some of the British dictionaries include words from Australian, Indian,
African and Caribbean English, and a fraction of this vocabulary made
it into the 3of6 lists.
In previous versions of 12dicts, I asked users to tell me what they
were doing with the lists. The most common answer was that they were
used to supply the vocabulary for a word game. The 3of6game list was
designed to fulfill this purpose. It contains only the sort of words
likely to be used in a word game (no hyphenated words, proper names,
abbreviations, contractions or phrases), but does contain inflections.
In general, words must appear in three of the sources to be
included. The rules, however, do provide for a number of (annotated)
exceptions, including uncommon inflections and words whose most common
form is either hyphenated or phrasal. Details are below.
The 3of6all list is a larger list, basically containing any kind of
word you can imagine, if found in three of the sources. As with
3of6game, some additional words were added as exceptions, but
there are not as many of them, as the goal of this list is to be as
faithful as reasonable to the sources.
Both the 3of6game and 3of6all lists contain signature words/phrases.
The 3of6game list also contains neologisms, as game players are likely
to want to play recently coined or popularized words.
The 3of6game list
The 3of6game list contains words which are listed in 3 of
the 6 advanced learner's dictionaries described above. Only words
suitable for play in most word games are included, excluding hyphenated
words, multi-word phrases, capitalized words, abbreviations and
contractions. There are no restrictions on length - in particular, it
contains four one-letter words: a,
x (a verb
meaning to cross out), I
and O, the
last two of which are included despite their capitalization (which is
an English spelling phenomenon entirely disconnected from
logic). In certain cases, words are present in this list despite being
listed in fewer than three sources. This serves the purpose of
offering game players more words in situations where lexicographers
differ about what word forms are correct. Some exceptional situations
are:
- A word is one of a set of close variants, none
of which is present in three of the sources. These words are marked
with a "^" suffix. An example is the word aqualung, which is
sometimes capitalized or hyphenated.
- The word is a British spelling of an American
word listed in three sources, or an American spelling of a British word
from three sources. These words are marked with a "&" suffix.
Examples include prolog,
an American form of the British prologue,
and hyaena,
a British spelling of the American hyena.
- A word is a plural of a word which only
two of the sources describe as countable, such as boyhoods. Similarly,
adjectival inflections are added if as few as two of the sources attest
to it, as with frillier
and frilliest.
- A word is an unusual inflection of a word where
at least three sources agree that some inflection is called for,
such as the less common plural planetaria
of planetarium.
- A word is an inflection for a word used as an
unusual part of speech, whose meaning is closely related to a more
common meaning. Examples are the verb forms autopsied and autopsying, whose
meanings are closely related to the common meaning of the noun autopsy.
- A word is an unhyphenated form of a word normally
hyphenated or written phrasally such as ballgame, which is
more commonly written ball
game.
Words not present in three of the source dictionaries are
marked with the "$" suffix character if the "^" and "&"
annotations do not apply.
The 3of6game list includes both signature words and neologisms, marked
with a "+" or "!" respectively. There are 520
signature words for this list, representing words
that I feel "ought to be" included. Each signature word is present in
at least one of the source dictionaries. Virtually all of these words
are American English, as I am not qualified to tell whether an
interesting Britishism like tosspot
is used often enough to justify its addition as a signature word. Note
that the presence of annotations allows a user to remove these
extra words if she finds their addition unjustified.
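One way to use the annotations is sketched below in Python. It is only an illustration: it assumes one entry per line, with the marker characters "^", "&", "$", "+" and "!" appearing only as trailing suffixes. The names split_entry and strict_words are hypothetical, and the "+" and "!" sample entries are invented placeholders rather than real list contents.

```python
GAME_MARKERS = "^&$+!"


def split_entry(entry):
    """Split an entry such as 'aqualung^' into ('aqualung', '^')."""
    word = entry.strip()
    marks = ""
    while word and word[-1] in GAME_MARKERS:
        marks = word[-1] + marks
        word = word[:-1]
    return word, marks


def strict_words(entries, drop="+!"):
    """Keep bare words, skipping entries that carry any marker in `drop`."""
    kept = []
    for entry in entries:
        word, marks = split_entry(entry)
        if word and not any(mark in drop for mark in marks):
            kept.append(word)
    return kept


# "madeup+" and "novelword!" are invented placeholders.
sample = ["planet", "prolog&", "madeup+", "novelword!"]
print(strict_words(sample))              # drops signature words and neologisms
print(strict_words(sample, drop="^&$+!"))  # keeps only unannotated words
```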
The 3of6game list could be combined with the 2of12inf list (minus the
uncountable plurals) and/or 2of4brif if a larger list is required. Note
that because 2of12inf is very strongly American, such a combination will
be less balanced between American and British English than 3of6game
itself.
The 3of6all list
The 3of6all list contains words which are listed in three of
the six advanced learner's dictionaries. In contrast to the 3of6game
list, no words are excluded, not even abbreviations, prefixes or
suffixes. Most words have their inflections included. An exception is
made for phrasal verbs and other verb phrases, whose inflections are
completely predictable from the initial word of the phrase.
The 3of6all list contains many phrasal verbs, such as let down, take after, sound off and make out, whose
meanings are often quite hard for inexperienced
students of English to guess. Phrasal verbs are marked by the ";"
suffix
character. Only four of the six source dictionaries provide phrasal
verb information in an easy-to-collect way. For this
reason, I put a phrasal verb into the 3of6all list even if I found it
in only two of the sources.
The 3of6all list contains some other words present in fewer than three
of the
dictionaries, though not as many as 3of6game. All such words are
marked. The cases where this occurs are as follows:
- As described for the 3of6game list, a word is
one of a set of close variants, none of which is present
in three of the sources. These words are marked with a "^" suffix. For
this list, in addition to differences in hyphenation or
single/multi-word format, variants only in capitalization or (for
abbreviations) the presence or absence of a period are considered close.
- As described for the 3of6game list, a word
is a British spelling of an American word listed in three
sources, or an American spelling of a British word from three sources.
These words are marked with a "&" suffix.
- A few other words present in fewer than three of
the
dictionaries are added. Usually, this occurs when a word is found by
three sources to have the same part of speech, but the sources fail to
agree on the spelling of the inflection(s). An example is the word Grammy, whose plural
is claimed by two sources to be Grammies,
and by two others to be Grammys.
These words are annotated with the "$" suffix.
There is one other situation where an annotation suffix is
used. This occurs when a word is shown by a majority of the sources as
being used only in a few
specific phrases, even though other dictionaries may give it a regular
definition. An example is the word bated,
which is shown by most of the sources as used only in the phrase with bated breath.
In this case, the word is flagged with a ">" suffix. A search on
a word so flagged will reveal the key phrase(s) elsewhere in the list.
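The search described above can be sketched in Python as follows. This is an illustration under assumptions, not a description of how the list was built: it assumes one entry per line, that the annotation characters for this list (":", ";", ">", "^", "&", "$" and "+") occur only as trailing suffixes, and that a key phrase contains the flagged word as a separate space-delimited token. The names strip_marks and key_phrases are hypothetical.

```python
ALL_MARKERS = ":;>^&$+"


def strip_marks(entry):
    """Split 'bated>^' into ('bated', '>^'); handles stacked suffixes."""
    text = entry.strip()
    end = len(text)
    while end > 0 and text[end - 1] in ALL_MARKERS:
        end -= 1
    return text[:end], text[end:]


def key_phrases(entries):
    """Map each '>'-flagged word to the phrases in the list that contain it."""
    parsed = [strip_marks(entry) for entry in entries]
    phrases = [text for text, _ in parsed if " " in text]
    found = {}
    for text, marks in parsed:
        if ">" in marks:
            found[text] = [p for p in phrases if text in p.split()]
    return found


# Small illustrative sample; only the entries discussed in the text are real.
sample = ["bated>", "breath", "let down;", "with bated breath"]
print(key_phrases(sample))
```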
Recall that, sometimes, a word may have more than one suffix. An
abbreviation shown with the ":" suffix (indicating the absence of a
final period) may be followed by another suffix, and the combination
">^" appears upon occasion.
The 3of6all list contains signature phrases, but no neologisms. The
signature phrases are marked with the "+" suffix. The 629 3of6all
signatures are all basic conversational idioms and common connective
phrases, like I told you
so, in
front of and on
the other hand. Though these phrases often show up in the
sources in lists of idioms, they generally do not appear as separate
headwords, which kept me from easily recording their presence. I
believe, however, that all of these phrases are extremely common, and
deserve to be included in this list.
I created the 5d+2a list (originally called 5desk) in an attempt to do
a better /usr/dict/words
(the failings of which were a large part of my motivation for doing
12dicts in the first place).
The sorts of words admitted are the same sorts that /usr/dict/words
traditionally contains. Though somewhat larger in size than many
versions of
/usr/dict/words, this is still a short word list, striving for
inclusion
of words one is likely to encounter rather than the complete jargon of
every possible scientific, artistic or occult endeavor.
The original 5desk list was assembled primarily
from five "desk
dictionaries". It
was augmented by words from five minor sources, including a "vocabulary
builder" and a collection of proper names. It excluded
prefixes, suffixes, phrases, hyphenated words, contractions and most
abbreviations and acronyms. There was no requirement for multiple
listings; all qualifying words from each of the sources were included.
Inflections of included words were not included themselves except when
irregular, or separately defined. Variant and non-American spellings
were not excluded, and no signature words were added.
Words commonly considered to be
abbreviations/acronyms were included
if they contained at least one upper case character, and were defined
with an explicit part of speech. This excluded items like Mr
and
Feb, which are abbreviations in the classic sense,
but allowed words
like DNA and ATM, which are
used far more frequently than that
which they abbreviate. While there is a trend in modern dictionaries
to list such words as nouns (or occasionally verbs, adverbs, etc.),
it is a trend in progress, and rather inconsistently applied. For
this reason, the set of such words in the 5desk list is somewhat
incoherent, including SPCA but not PETA,
AIDS but not SAD,
KGB
but
not CIA, and PDQ but not ASAP.
When version 6 of 12dicts was released, the 5desk
list was
augmented by adding qualifying words from two advanced learner's ESL
dictionaries, and as a result renamed to 5d+2a.txt. Both of the
additional dictionaries had a strongly international vocabulary,
causing the new list to have a less American and more cosmopolitan
character. This increased the size of the list by about 20% to about
68,000 words.
One class of commonly-used words is regrettably
absent from the 5desk
list, because I was unable to find a satisfactory source for them.
This is the class of commercial names such as Exxon, Tylenol,
Pepsi and Chevy. This is probably
forgivable,
as this class of names is as ephemeral and transitory as teenage slang.
The one-time household words Kool, Ovaltine, Philco
and
Ipana serve now only as answers to trivia questions,
with modern wonders like Starbucks, Google, Ritalin
and TiVo taking their place on the tongues of the
trendy.
The 5d+2a list contains no signature words. I did
take the liberty of adding the personal names of around thirty
well-known individuals, mostly statesmen and politicians. Though the
original 5desk list contained many such names from all periods of human
history, I have not found a useful source to bring the list into the
twenty-first century. At the same time, I felt that distributing a list
full of
names that did not include Cheney and Obama was not
reasonable. So I compromised by adding a few names whose historical
significance was clear to me, until such time as a better source than
my own memories of the last 15 years can be found.
The 5d+2a list has clearly moved beyond any "core
vocabulary" concept.
It includes quite esoteric words (ogee, pleonastic),
very
uncommon spellings (thiamine, yuppy), and obscure
geographical
and historical names (Paricutin, Nevelson). Like
the traditional /usr/dict/words, it is frequently inconsistent and
arbitrary, but I
hope at the least I have avoided including spelling errors, and
overlooking the stuff of everyday conversation. Perhaps it will be
useful as a compromise between basic lists such as 3esl, and truly
massive lists like Mendel Cooper's ENABLE.
The lemmatized 12dicts lists
Version 6 of 12dicts provides three lemmatized lists
combining words from the 2of12inf, 3of6game and 2of4brif lists. The
word "lemmatized" is a rare
word, which you will find in none of these lists, but what it means is
that these lists are formatted as a collection of word sets, called
lemmas (or lemmata, if you're into irregular plurals), each set
composed of a headword and some number (possibly zero) of closely
related
words. Two of these lists were introduced in version 5 of 12dicts, but
they have undergone major revisions since then.
The three lists are 2+2+3lem (originally 2+2lemma), 2+2+3frq
(originally 2+2gfreq) and 2+2+3cmn. 2+2+3lem simply arranges
the words of the three source lists into lemmas and lists them
alphabetically by headword. 2+2+3frq arranges the same lemmas by
approximate order of their frequency of usage, computed with the help
of a frequency list obtained from Brigham Young University (BYU),
omitting those words and lemmas whose usage is so small that they fail
to show up in the BYU material. 2+2+3cmn extracts a subset of the
lemmas of 2+2+3lem, namely those lemmas with a certain minimum level of
usage (approximately the level of the word butterscotch), and
lists them alphabetically by headword. This is yet another attempt in
12dicts to generate a core English vocabulary.
The advantage of a lemmatized presentation of words is that it puts
related words together, even when spellings differ greatly, as for be, are, is and were. A moderate
disadvantage is that the same word can appear in more than one lemma,
such as putting,
which is present in the lemmas headed by both put and putt. Overall, I
find the lemmatized format to be clearer and more useful than a simple
alphabetized list, and I rather wish I had released the other lists
which include inflections in that format.
The following table summarizes the contents of
each
of the lists in the Lemmatized directory, ordered
by size in words:
|                   | 2+2+3cmn | 2+2+3frq | 2+2+3lem |
| Size (Words)      | 25,000   | 34,000   | 84,000   |
| Number of Sources | 21       | 21       | 21       |
| American English  | Y        | Y        | Y        |
| British English   | Some     | Some     | Y        |
| Ordinary words    | Y        | Y        | Y        |
| Inflections       | Some     | Some     | Y        |
| Hyphenations      | Some     | Some     | Y        |
| Phrases           | –        | –        | –        |
| Names             | Some     | Some     | –        |
| Abbreviations     | Some     | Some     | –        |
| Acronyms          | Some     | Some     | –        |
| Prefixes/Suffixes | –        | –        | –        |
| Signature words   | Y        | *        | *        |
| Neologisms        | A few    | A few    | Y        |
| Annotations       | Y        | Y        | Y        |
A * in the "Signature Words" row means that
signature
words associated with some other list may be present, but there are no
signature words associated specifically with that list.
The 2+2+3lem list
The list 2+2+3lem.txt contains the words in the
2of12inf, 2of4brif and 3of6game lists.
Also, the new words from the neol2016.txt list have
been added, marked with a "!" if they would not have otherwise been
included. (Marking the new words permits them to be removed if it is
preferred for this list to be in synch with the other 12dicts lists.)
Furthermore, some high-frequency hyphenated words from 2of12.txt and
3of6all have been added. These words were originally added to the
lemmatized frequency list (see below),
and I liked the results so much that I added them to this list as well.
Finally, British forms of words in
the 2of12inf list not already in the other lists have been added.
Words
marked with a % in the 2of12inf list ("Scrabble plurals") have
however been omitted.
In the previous version of 12dicts, the 2+2+3lem list was
called 2+2lemma. The only significant changes were the addition of new
words, and switching from "+" to "!" to mark neologisms in the list.
The 2+2+3lem list is not formatted as a simple list
of words.
It is composed of entries of 1 or 2 lines each. The
first
line contains a headword, and the second line, which is indented if
present, contains an alphabetized list of related words. A
simple example:
funny
    funnier, funnies, funniest, funnily, funniness
The list of related words contains three sorts of
entries.
1. Inflections.
2. Variant spellings.
3. Words formed with certain suffixes.
In addition to true variant spellings such
as grey
for gray
and thru
for through,
item 2 also includes words
which, though pronounced differently, are clearly variants
of the headword. Thus, hooray is considered
a variant of hurrah
(but mere synonyms like furze
and gorse
remain
independent).
Item 3 is based on a small list of suffixes,
producing closely
and consistently related words. These suffixes are -ful, -ish,
-less, -like, -ly, -most and -ness. -ally is also
allowed, if
there is no -al
word to apply the -ly
suffix to. (For instance, basically is
considered to be derived from basic, because there
is
no word basical.) When
one of these suffixes is used in an
unusual way, the resulting word is considered independent.
For
instance, likely
is not considered to be derived from like, nor bashful
from bash.
There are some rather difficult questions
here, such as how closely slavish
is related to slave,
or sluggish
to slug.
In general, I have chosen the course of
least surprise by treating such pairs as independent.
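The mechanical part of the suffix rule can be sketched in Python as below. This is only an approximation under stated assumptions: the vocabulary is a plain set of words, the only respelling handled is a stem-final "i" standing for "y" (as in funniness), and the judgment calls about unusual uses of a suffix are supplied by the caller as a set of words to treat as independent. The function name suffix_base and that parameter are hypothetical.

```python
SUFFIXES = ("ful", "ish", "less", "like", "ly", "most", "ness")


def suffix_base(word, vocabulary, independent=frozenset()):
    """Return the word that `word` would be filed under, or None."""
    if word in independent:
        return None  # e.g. "likely", which is not treated as like + -ly
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            candidates = [stem]
            if stem.endswith("i"):
                # Assumed respelling: funni- stands for funny.
                candidates.append(stem[:-1] + "y")
            for candidate in candidates:
                if candidate in vocabulary:
                    return candidate
    if word.endswith("ally"):
        al_word = word[:-2]  # "basically" -> "basical"
        stem = word[:-4]     # "basically" -> "basic"
        # -ally applies only when there is no -al word to take the -ly.
        if al_word not in vocabulary and stem in vocabulary:
            return stem
    return None


vocab = {"basic", "music", "musical", "funny", "like"}
print(suffix_base("basically", vocab))
print(suffix_base("musically", vocab))
print(suffix_base("likely", vocab, independent={"likely"}))
```

A sketch like this can only propose candidates. Pairs such as slavish and slave still need a human decision.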
Here are some other notes on the determination of
what words are related.
Certain uses of the suffixes -ed and -s are treated as
inflections, even though technically they are not.
Thus, talented
is treated as derived from talent,
and optics
from optic.
Words ending with the suffix -ability/ibility are
treated as relatives of the corresponding -able/ible word.
Sometimes, the choice of which variant to treat as
the headword
is somewhat arbitrary. I have consistently chosen an American
spelling over a British spelling here. This has some effect on
the number of headwords. I treat cheque as a variant
of check,
whereas, to an observer with a British bias, they would no doubt be
separate headwords.
No distinction is made of different meanings of the
same word,
even when they are so different that dictionaries list them
separately. wind
the noun and wind
the verb are considered as a
single word, as are second
the adjective, second
the noun and second
the verb.
It may sometimes happen that two different words
have the same inflection (putting
derives both from putt
and put; holier relates
to holey
as well as holy),
or that an inflection
is a headword in its own right (as with wound, the past
tense of wind,
or crooked,
the past tense of crook).
These
situations are noted in the 2+2+3lem list as cross-references to the
alternate headword. There are two specific situations
which might not be obvious where
inflections are treated as different words.
These occur when a present participle or a -ness word has a
plural inflection, as with meaning
and weakness.
Such words
are always made headwords, even when the relationship to the original
root is very close. Here is an example showing how
cross-references are indicated:
base
    based, baseless, basely, baseness, baser, bases -> [basis], basest, basing
Almost always, a given word has only one
cross-reference - the
biggest exception is the incredible tangle shown in the example below:
slue -> [slough]
    slew -> [slay, slew, slough], slewed, slewing, slews -> [slew, slough], slued, slues -> [slough], sluing
where 4 uncommon words mostly pronounced sloo have become
thoroughly confused.
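For anyone wanting to read this format programmatically, here is a minimal Python sketch of a parser. It assumes what the description above states: a headword line, optionally followed by an indented line of comma-separated related words, with cross-references written as "-> [a, b]". It does not strip other annotations such as "!". The function names and the shape of the returned dictionary are my own choices for the example.

```python
def split_top_level(text):
    """Split on commas that are not inside [...] cross-reference brackets."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts


def parse_item(item):
    """Split 'bases -> [basis]' into ('bases', ['basis'])."""
    if "->" not in item:
        return item.strip(), []
    word, refs = item.split("->", 1)
    refs = refs.strip().strip("[]")
    return word.strip(), [r.strip() for r in refs.split(",") if r.strip()]


def parse_lemmas(lines):
    """Return {headword: {'refs': [...], 'related': {word: [refs]}}}."""
    lemmas = {}
    headword = None
    for raw in lines:
        if not raw.strip():
            continue
        if raw[0].isspace() and headword is not None:
            # Indented line: related words of the current headword.
            for item in split_top_level(raw.strip()):
                word, refs = parse_item(item)
                lemmas[headword]["related"][word] = refs
        else:
            headword, refs = parse_item(raw.strip())
            lemmas[headword] = {"refs": refs, "related": {}}
    return lemmas


sample = [
    "base",
    "    based, baseless, bases -> [basis], basing",
    "slue -> [slough]",
    "    slew -> [slay, slew, slough], slued",
]
print(parse_lemmas(sample)["slue"])
```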
The 2+2+3frq list
In the previous version of 12dicts, there was
a file called
2+2gfreq.txt. This file has been completely replaced by a new
implementation of the same idea. Like the older list, the 2+2+3frq list
presents the lemmas of 2+2+3lem in bands of lemmas
with about
the same frequency of use. However, there are the following major
differences from what was done before:
- In the previous version, word frequency
information was
obtained from data collected from the World Wide Web supplied by
Google. This data was very voluminous, but was quite distorted by the
Web's emphasis on computerese, pornography and marketing. I am now
using a commercial word frequency database, supplied by Brigham Young
University, based on its Corpus of Contemporary American English (COCA).
This data is less voluminous than the Google data, but is far more
balanced and seemingly trustworthy. It has some other advantages,
discussed below.
- High-frequency hyphenated words from 2of12
and 3of6all
have been added. I liked the effect of this so much that I added the
same words to the 2+2+3lem list.
- A certain number of high frequency
abbreviations,
contractions and capitalized words were added. Some of these words were
not to be found in any other 12dicts list, for which reason I did not
also add them to 2+2+3lem.
- The list was shortened by omitting all lemmas
which did not appear at all in the BYU data.
- Individual lemmas were shortened by omitting
very infrequent
words and all regular inflections, except when they were used
frequently as a part of speech different from the headword, such as disappointed as an
adjective rather than a verb form.
The lemmas of 2+2+3frq are grouped into bands by the
combined
number of occurrences in the BYU data of the words in the lemmas. Band
21 contains lemmas whose words together appear between 16 and 31 times
in the BYU data. Each other band contains lemmas of twice the frequency
of the following band, that is, each lemma in band 20 appears in the
BYU data between 32 and 63 times, and so on. The first band contains
the three lemmas most frequently used in the English language
(according to BYU), namely the,
be (plus its
inflections) and to.
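The doubling rule can be written as a small Python function. This is a sketch of the arithmetic only: the text gives the ranges for bands 21 and 20, and the code extrapolates the same doubling to the other bands. Treating counts below 16 as belonging to no band, and clamping very large counts to band 1, are assumptions of mine. The name band_for_count is hypothetical.

```python
import math

LAST_BAND = 21
LAST_BAND_MIN_COUNT = 16  # band 21 covers counts 16 through 31


def band_for_count(count):
    """Return the band number for a combined lemma count, or None."""
    if count < LAST_BAND_MIN_COUNT:
        return None  # assumed: too rare to be placed in any band
    # Each doubling of the count moves the lemma up one band.
    doublings = math.floor(math.log2(count / LAST_BAND_MIN_COUNT))
    return max(1, LAST_BAND - doublings)  # assumed: band 1 is open-ended


for count in (16, 31, 32, 63, 64):
    print(count, band_for_count(count))
```

A count of 16 or 31 gives band 21, 32 or 63 gives band 20, and 64 gives band 19.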
As already noted, some words are found in multiple lemmas. One helpful
aspect of the BYU data is that it separates frequency data for a word
by parts of speech, and notes the base word for inflected words. This
often allows the frequency counts for a word like building to be
accumulated under the correct lemma (either build or building).
In the event that the BYU data is unable to completely resolve the
appropriate lemma for a word, its frequency count is divided equally
among the various candidates.
2+2+3frq is divided into bands by lines like this:
----- 5 -----
The lemmas in each band are presented in alphabetical
order, not by the frequency of the individual lemma.
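A minimal Python sketch for reading the band structure follows. It assumes that band separators are lines made of dashes around a number, as shown above, and that headwords are the lines that do not start with whitespace; entries are otherwise returned exactly as written. The function name headwords_by_band is hypothetical and the sample entries are invented.

```python
import re

# A separator line such as "----- 5 -----".
BAND_HEADER = re.compile(r"^-+\s*(\d+)\s*-+$")


def headwords_by_band(lines):
    """Return {band number: [headword lines in file order]}."""
    bands = {}
    current = None
    for line in lines:
        if not line.strip():
            continue
        match = BAND_HEADER.match(line.strip())
        if match:
            current = int(match.group(1))
            bands.setdefault(current, [])
        elif current is not None and not line[0].isspace():
            bands[current].append(line.strip())
    return bands


# Invented sample, for illustration only.
sample = ["----- 1 -----", "be", "    are, is", "the", "to",
          "----- 2 -----", "and"]
print(headwords_by_band(sample))
```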
Note that because the BYU data was extracted from a corpus of American
English, the 2+2+3frq file tilts in an American direction, though some
British words like bloke,
colour and lorry have made it
through.
A useful attribute of the BYU
data is that it,
unlike the Google data, includes hyphenated words, as well as some
abbreviations, contractions and capitalized words. The two cases are
rather different. The inclusion of hyphenated words is explicitly
intended. However, the BYU documentation states that proper names have
been excluded where possible, while admitting that, in many cases, the
software processing the data was unable to be sure whether a word was a
proper name or not, in which case the word was included. The effect is
that there are many words generally considered to be proper names
present, notably the names of months of the year and days of the week,
plus those of religions, nationalities and ideologies. You will not
find names like linda,
picasso, vladivostok, microsoft or rumpelstiltskin in
the data, but you will find november,
buddhist, peruvian and marxist,
to the extent that I wonder if BYU might have used a different
definition of "proper name" than the one I was taught in school. As for
abbreviations, the BYU documentation makes no mention of them, but
there are some very familiar abbreviations in the data. There are not a
lot of them, which makes me wonder whether their presence was
intentional or a processing error. Either way, I have no reason to
doubt their frequency counts.
I decided that I wanted to add high-frequency hyphenated words, proper
names and abbreviations to the frequency list, as I consider this data
to be very interesting. When I did so, I discovered in band 17 the
words atlantean
and klingon.
I really don't think that these words have anywhere close to the same
frequency as armband
and carpool,
which are also present in band 17. This makes me suspect that, for
words of this frequency or less, the BYU data is starting to become
less reliable. For this reason, I decided to stop adding hyphenated
words, capitalized words, contractions and abbreviations after band 17.
In the case of hyphenated words, I added them to the 2+2+3frq list only
if they were present in either 2of12.txt or 3of6all.txt. I also added
these words to the 2+2+3lem list. In the case of abbreviations and
capitalized words, there were not all that many of them, and some of
them were not present in any other 12dicts list, such as Americanist, Thatcherism and, of
course, Klingon.
For this reason, when I added capitalized words, contractions and
abbreviations to 2+2+3frq, I parenthesized them to indicate that their
presence had nothing to do with any source but the BYU data. The same
consideration led me to omit these words from the 2+2+3lem list.
I should note that, though the BYU data is superior to the previous
Google web data, it is not without its flaws. Three issues of
particular importance are difficulties with part of speech information
for words like painting and filling, an inconsistent approach to words which are also proper names like rose, king and miller, and a tendency to combine data for words and common acronyms, such as eta/ETA and sac/SAC.
I have attempted to tweak the frequencies in such cases, using various
public word frequency sources, whenever I observed them, which is to
say whenever the results of taking the BYU data at face value led to
implausible results.
The 2+2+3frq list is considerably smaller than the previous 2+2gfreq
list due to my decision to drop lemmas which were absent from the BYU
data, especially since the BYU data was considerably less voluminous
and so left out many more words than the Google data. In addition, I
observed that many high-frequency lemmas contained unusual spellings
and archaic forms that were not present in the BYU data, such as cocoanut, iodin and didst,
and decided to drop non-headwords from the lemmas unless their
frequency was at or above the level of band 17. A similar decision was
made to drop regular inflections from the lemmas in the 2+2+3frq list
unless they had high frequency with a different part of speech, for
example, loving
as an adjective or fighting
as a noun. Finally, I chose to drop the word/lemma cross-references
from the 2+2+3frq list, replacing them with a * indicating that a word
was to be found under another headword (though it might have been
suppressed if it was a regular inflection).
As an example of how this works out in practice, here is the lemma for time from 2+2+3lem:
time
    timed, timeless, timelessly, timelessness, times, timing -> [timing]
and here is the condensed version from 2+2+3frq:
time
    timed, timeless
The words timelessly
and timelessness
are not used often enough (according to BYU) to mention in the
frequency list, while the word times
was not frequently used except as a form of time, and, while the
word timing
was frequently used as a noun, its counts were collected under the
lemma timing
rather than time.
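For readers who want to process lemma entries like the one for time programmatically, here is a minimal Python sketch. It assumes that a headword starts in the first column, that its related words follow on an indented, comma-separated line, and that a cross-reference is written as "word -> [headword]". The function name parse_lemmas is hypothetical and the sample text is typed in by hand rather than read from the actual file. Cross-references naming more than one headword are not handled.

```python
import re

def parse_lemmas(text):
    """Map each headword to a list of (word, cross_reference_or_None) pairs."""
    lemmas = {}
    head = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if not line[0].isspace():
            # Unindented line: assumed to be a new headword.
            head = line.strip()
            lemmas[head] = []
        elif head is not None:
            # Indented line: assumed to list words belonging to the headword.
            for item in line.split(","):
                item = item.strip()
                if not item:
                    continue
                match = re.fullmatch(r"(.+?)\s*->\s*\[(.+)\]", item)
                if match:
                    lemmas[head].append((match.group(1), match.group(2)))
                else:
                    lemmas[head].append((item, None))
    return lemmas

# Hand-typed sample mirroring the entry for "time" (layout is an assumption).
sample = (
    "time\n"
    "    timed, timeless, timelessly, timelessness, times, timing -> [timing]\n"
)
entries = parse_lemmas(sample)
for word, xref in entries["time"]:
    print(word, "(see also lemma %s)" % xref if xref else "")
```

Keeping the cross-reference as a separate field makes it easy to follow a word such as timing to its own lemma.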
The 2+2+3cmn list
The 2+2+3cmn list is a relatively simple transformation of
the
2+2+3frq list, in yet another attempt to produce a "core English" word
list. It is composed of the lemmas of the 2+2+3frq list from bands 1
through 17, sorted in alphabetical order by headword. Minor formatting
differences are that the "!" is removed from neologisms, and
the
parentheses are removed from capitalized words, abbreviations and
contractions.
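The transformation just described can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used to build the list: the (band, headword) pairs, the function names, the sample words and their band numbers are all made up for the example, and case-insensitive sorting is an assumption about what "alphabetical order" means here.

```python
def clean_headword(token):
    """Strip the neologism marker "!" and any enclosing parentheses."""
    token = token.strip().rstrip("!")
    if token.startswith("(") and token.endswith(")"):
        token = token[1:-1]
    return token

def common_headwords(banded, max_band=17):
    """Keep headwords in bands 1..max_band, cleaned and sorted by headword."""
    kept = [clean_headword(word) for band, word in banded if band <= max_band]
    return sorted(kept, key=str.lower)  # case-insensitive order is an assumption

# Illustrative data only; these band numbers are invented.
sample = [(1, "time"), (12, "(Thatcherism)"), (15, "selfie!"), (19, "timelessness")]
result = common_headwords(sample)
print(result)
```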
I have added 77 signature words to 2+2+3cmn, which are
abbreviations, contractions and capitalized words (mostly
contractions) which I know to be extremely high frequency, but which
were not present in the BYU data, words such as can't, Mr. and DVD. These words are
marked with a + to indicate their absence from the 2+2+3frq source data.
Like 2+2+3frq, 2+2+3cmn tilts strongly in the direction of American
English.
Because all the words of 2+2+3cmn are of moderately high frequency
(assuming the BYU data is to be trusted), it probably has a better
claim than either 2of5core or 3esl to represent a core English
vocabulary, at least of the American variety.
Specialized 12dicts lists
The following table summarizes the contents of
each
of the lists in the Special directory, ordered
by size in words:
| | neol2016 | 2of5core | 6phrase |
|---|---|---|---|
| Size (Words) | 600 | 4,700 | 22,000 |
| Number of Sources | 0 | 5 | 6 |
| American English | Y | Y | Y |
| British English | A little | Y | Y |
| Ordinary words | Y | Y | – |
| Inflections | Y | – | – |
| Hyphenations | Y | A few | – |
| Phrases | Y | A few | Y |
| Names | Y | A few | A few |
| Abbreviations | Y | A few | A few |
| Acronyms | Y | A few | – |
| Prefixes/Suffixes | – | – | – |
| Signature words | – | – | * |
| Neologisms | Y | – | – |
| Annotations | Y | N | Y |
A * in the "Signature Words" row means that
signature
words associated with some other list may be present, but there are no
signature words associated specifically with that list.
The neol2016 list
The neol2016 list is a very simple list of new or newly
recognized words, as described above.
It consists of three parts, separated by blank lines.
The first part lists regular (non-hyphenated, non-capitalized) words
together with their inflections and
variants, laid out similarly to the 2+2+3lem list. It includes plurals
for uncountable nouns, marked with a "%" suffix. These words (except
for the uncountable plurals) have been pre-added to the 2of12inf and
3of6game lists, suffixed with "!", allowing them to be easily
removed if desired.
The second part of the file is a small set of words for which
additional inflections have been added. This portion of the file is in
the same format as the first list. These inflections have also been
added to the 2of12inf and 3of6game lists.
The third part of the file contains new words and phrases which are not
regular words: hyphenated words, multi-word phrases, proper
names, abbreviations and acronyms. These words have not been pre-added
to any other list.
In all cases, users are encouraged to add some or all of these words to
any of the other lists, as they feel appropriate.
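As a rough illustration of working with these conventions, the Python sketch below splits a text into its blank-line-separated parts and filters the "!"-suffixed neologisms out of a word list. The function names and the sample data are hypothetical. The sketch relies only on what is stated above, namely that the parts are separated by blank lines and that the pre-added neologisms carry a "!" suffix.

```python
def split_parts(text):
    """Split text into groups of lines separated by one or more blank lines."""
    parts, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            parts.append(current)
            current = []
    if current:
        parts.append(current)
    return parts

def drop_neologisms(words):
    """Remove entries marked with the "!" suffix, leaving the rest unchanged."""
    return [w for w in words if not w.endswith("!")]

# Made-up stand-in for the three-part layout of neol2016.
sample = "alpha\n    alphas\n\nbeta\n    betas\n\nGamma-ray\n"
parts = split_parts(sample)
print(len(parts), "parts")

# Made-up stand-in for a list with pre-added neologisms.
print(drop_neologisms(["time", "selfie!", "timed"]))
```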
The 2of5core list
Five of the six advanced learner's ESL dictionaries from
which the 3of6 lists were compiled mark a subset of their words as
being important words which every student of English should master.
These subsets vary widely from dictionary to dictionary. As one of the
original goals of the 12dicts project was to compile a list
representing the
English core vocabulary, I thought it would be interesting to combine
these lists. My original thought was to provide a list that was simply
the union of the marked subsets for each source. However, one
particular dictionary had at least twice as many words in its subset as
any of the others, and in many cases the words seemed to me to be
poorly chosen. (Do moor
and cash flow
seem like key English language concepts to you?) So I chose when
assembling my list to require that all words be marked as important
words by at least two of the sources. The result was the 2of5core list,
which contains about 4,700 words.
While most words selected in this way were the same in American and
British English, some belonged to one variant or the other. In some
cases, a word appeared in two forms, such as center and centre. When I
observed that a word was present in two forms, I combined them into a
single line, for example center/centre.
No other changes were made to the list.
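The selection rule can be sketched as follows. This is a hedged reconstruction, not the actual build procedure. The five sample sets are invented, the function names are hypothetical, and the table of variant pairs is assumed to be supplied by hand (the text above says only that variants were combined when they were noticed). Whether variants were counted together before or after applying the two-source threshold is not stated, so the sketch applies the threshold first.

```python
from collections import Counter

def core_words(marked_sets, min_sources=2):
    """Return words marked as important by at least min_sources sources."""
    counts = Counter(word for source in marked_sets for word in set(source))
    return sorted(w for w, n in counts.items() if n >= min_sources)

def merge_variants(words, variant_pairs):
    """Combine spelling variants present in both forms, e.g. center/centre."""
    present = set(words)
    merged = set(words)
    for first, second in variant_pairs:
        if first in present and second in present:
            merged -= {first, second}
            merged.add(f"{first}/{second}")
    return sorted(merged)

# Invented sample data standing in for the five marked subsets.
sources = [
    {"center", "time", "moor"},
    {"centre", "time"},
    {"center", "centre"},
    {"time"},
    {"centre", "cash flow"},
]
core = core_words(sources)
final = merge_variants(core, [("center", "centre")])
print(final)
```

With this sample, moor and cash flow are dropped because only one source marks each of them.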
Due to the way in which the list was constructed, it seems somewhat
haphazard. You may want to check out the Oxford 3000™, a list of 3000
words available from Oxford University Press, which is a core vocabulary
created by lexicographers and, to my eye, superior to the 2of5core list.
The 6phrase list
When I was compiling the 3of6all list, I noticed something
interesting. There were an extraordinary number of phrases listed by
only one of the sources. Many of these were extremely common phrases,
which I would expect most experienced English speakers to understand.
So, naturally, I decided to compile them all into a list.
The 6phrase list contains all multi-word phrases from any of the six
advanced learner's dictionaries which were used as sources for 3of6all,
all 22,000 of them. The list does not include inflections, except in a
few cases where a plural cannot easily be guessed from the words in a
phrase. Usually, this happens for phrases of non-English origin, such
as eau de cologne,
whose plural is eaux de
cologne. The list includes phrasal verbs, which are
suffixed by the ";" character, as in the 3of6all list. The list is
sorted in a different order than the lexicographical ordering used by
the other lists, in order to group all phrases starting with the same
word together.
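To illustrate the two conventions mentioned here, the following Python sketch groups phrases by their first word and recognizes phrasal verbs by the ";" suffix. The function names and the sample phrases are chosen for illustration only, and lowercasing the first word before grouping is an assumption, not a documented property of the list's sort order.

```python
def is_phrasal_verb(entry):
    """Phrasal verbs are marked with a trailing ";"."""
    return entry.endswith(";")

def first_word(entry):
    """First word of a phrase, ignoring the ";" marker (lowercased by assumption)."""
    return entry.rstrip(";").split()[0].lower()

def group_by_first_word(phrases):
    """Collect phrases under their first word, keeping the input order."""
    groups = {}
    for phrase in phrases:
        groups.setdefault(first_word(phrase), []).append(phrase)
    return groups

# Illustrative phrases; not copied from the actual file.
sample = ["eau de cologne", "give up;", "give way", "eaux de cologne"]
groups = group_by_first_word(sample)
for word, members in groups.items():
    print(word, members)
```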
You will observe that the same phrase will often be repeated several
times in the list, with slightly different spelling, capitalization
and/or hyphenation. No attempt was made to edit the list to remove or
reduce such "clutter".
The 6phrase list includes the 3of6all signature phrases. These are not
marked with a suffix.
In contrast to most of the other lists, I am unable to think of any
applications of the 6phrase list. But I find it rather interesting,
which is why I'm bothering to include it. At the very least, it may
serve as an illustration of the incredible richness of the English
language, without even venturing into vocabulary too esoteric to be
included in a learner's dictionary.
It may have occurred to some to wonder about how
something like
the 12dicts project came to be (though I assume that anyone who bothers
to download this archive must already have some idea that such a
project could be of interest).
Many years ago, there was a post to the sci.crypt
Usenet newsgroup,
on the subject of creating PGP passphrases using randomly selected
entries from a supplied list of very short words. (If this sounds
interesting, follow
this link for an expanded version of the post.) The word
list,
which was extracted from /usr/dict/words on some UNIX system, seemed
to me ill-suited to its intended purpose. It included arcane acronyms
(bstj, fmc), misspellings (diety, ouvre)
and
words of amazing obscurity (bhoy, kombu).
I decided
I
could do better, and eventually did.
This caused me to start downloading English word lists, of which there
were many, from the Internet. I was not impressed by the overall
quality of these lists, and the few which were high-quality were
all-inclusive, burying the everyday words under a mountain of archaisms
and esoterica.
This was a long time ago, and an Internet search
for word lists
now turns up lists of higher quality than back then (thanks in part to
the influence of 12dicts), so I will limit myself to two brief
criticisms of the various lists available at that time. First, they contained
far too many misspellings and typos, and had obviously never been
proofread. Additionally, their approach to vocabulary was scattershot, omitting
common words while adding a random selection of highly technical words,
often associated with UNIX and academic computer science. (My favorite
is the list which included bremsstrahlung,
but omitted log
and beer.)
Due to my original purpose of finding a list of short, common words, I
found this sort of thing particularly frustrating.
One result of my frustration with this situation was my working with
Mendel Cooper on ENABLE, a large Scrabble®-oriented list, which was
close to unique in having an active
caretaker who was clearly concerned with quality, and in being oriented towards
American rather than British English. But ENABLE was an
all-encompassing
list and, even if it had been complete at the time I started my search
for a list of common words, it would not have been what I wanted for
that reason. (The ENABLE web site is no longer online, but a Google
search will turn up places where you can still download it.)
I finally decided that only starting from scratch
with a systematic
approach was likely to get me what I was looking for, and that
dictionaries intended for non-native speakers of English were the
best possible source for words that are in some cases so familiar
that we never think of them. This has led to the 12dicts lists,
which I hope have managed to avoid the flaws recited above.
My other projects
During the intervals between releases of 12dicts, I have
been fooling
around with English spelling reform. One of the results of
this
activity is the development of CAAPR and ABCD, both of which may be
downloaded from my website, www.wyrdplay.org.
CAAPR is the Combined Anglo-American Pronunciation Reference, a
fancy name for a bi-dialectal pronunciation dictionary whose word list
is derived primarily from the 12dicts 6of12 list. ABCD, Alan's
Basic Codes with Diacritics, is also a pronunciation dictionary, of a
somewhat different sort - the notation is designed to clarify when a
word is spelled in accordance with normal English spelling
patterns (as with fault
or tunnel),
and when it is not (as with fought
or colonel).
Though these files were developed as a
result of my interest in spelling reform, they may be of interest to
other
"word nerds" unconcerned with that particular quixotic pastime.
When I released the first version of 12dicts in
1999, I assumed
I was
done with it. It hasn't worked out that way. I now think I'm pretty
much done with it again, though an occasional update to neol20xx.txt might
be called for. Perhaps in ten more years I'll have reached version 9, and be
laughing uncontrollably at the thought that I might have finished
earlier, but for the present I don't see what else might be both useful
and fun to add.
Feel free to send comments, suggestions,
inquiries and/or large sums of money to me at 12dicts@pobox.com.
(Actually, the bit about money is a joke. Do not send me even small
amounts of money; 12dicts is free wordware.)
After making this request in previous versions, I have been
delighted to see the interest in these lists for projects ranging from
interactive games to literacy programs. And I have been
particularly pleased to occasionally hear of first-year Computer
Science assignments specifying a 12dicts list rather than
/usr/dict/words for their input. Keep up the good work, and do let
me know what you're doing. (Oh, and please put "12dicts" in
the
subject line when you email me. This will allow me to easily
notice your mail even if it is misclassified by an overzealous filter
as spam. Speaking of
spam, the publication of my email address in this package has led to a
marked increase in the amount of spam I receive and, ironically, much
of it contains subject lines which appear to have been
extracted at random from my own lists. This is a use of 12dicts of
which I
do not approve!)
The 12dicts lists were compiled by Alan Beale. I explicitly release
them to the public domain, but request acknowledgment of their use.
(Actually, the dependency of the 2of12inf list and the 2+2+3 lists on
AGID prevents their
release into the public domain. However, I do not impose any additional
requirements on their use beyond those imposed by AGID and its sources,
as described in agid.txt.)
- Alan Beale -