Soundexing and Genealogy
by Gary Mokotoff
What can be more frustrating to a genealogist than to look through an alphabetic
index of records, fail to locate what is being sought, and then find out later, perhaps years later, that the data was
there but misspelled? How do you locate the towns of immigrant non-English-speaking ancestors when the only information
available was passed down orally through the generations and, no matter how you try to spell it, you cannot find
the town on a map?
A major solution to these problems was provided more than 80 years ago when Robert C. Russell of Pittsburgh, Pennsylvania
was issued patent number 1,261,167 on April 2, 1918 for having "invented certain new and useful Improvements
in Indexes...as will enable others skilled in the art to which it appertains to make and use the same." The
idea of indexing information by how it sounds rather than alphabetically was born. It has become known as "soundexing."
The Russell Soundex System
Russell noted in his patent the
reason for this system. "There are certain sounds which form the nucleus of the English language, and those
sounds are inadequately represented merely by the letters of the alphabet, as one sound may sometimes be represented
by more than one letter or combination of letters, and one letter or combination of letters may represent two or
more sounds. Because of this, a great many names may have two or more different
spellings which in an alphabetic index, or an index which separates names according to the sequence of their contained
letters in the alphabet, necessitates their filing in widely separate places"
(emphasis added).
He knew that letters of the alphabet were divided, phonetically, into categories. To each category, he assigned
a numeric value. As his patent describes:
1. The vowels (he called them "oral resonants") a, e, i, o, u, y.
2. The labials and labio-dentals b, f, p, v.
3. The gutturals and sibilants c, g, k, q, s, x, z.
4. The dental-mutes d, t.
5. The palatal-fricative l.
6. The labio-nasal m.
7. The dento or lingua-nasal n.
8. The dental fricative r.
There were only a few additional rules:
- The initial letter of the word is always kept.
- Two consecutive letters that have the same code are considered a single letter
(e.g., `tt' is coded the same as `t').
- The combination `gh', and an `s' or `z' that ends the word, are discarded.
- Only the first occurrence of a vowel (Group 1) is counted.
Thus, Smith and Smyth coded the same because the letters `i' and `y' had the same
value, namely `1,' even though in a normal alphabetic index, they would be far apart.
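The grouping and rules above can be sketched in a few lines of Python. This is only an illustration of the rules as summarized here, not a transcription of the patent: the function name `russell_code` is hypothetical, and the sketch assumes that letters outside the eight groups (h, j, w) are skipped, that the kept initial letter does not count as the first vowel, and that the code has no fixed length.

```python
# Russell's eight groups, numbered 1 to 8 in the order listed above.
GROUPS = ["aeiouy", "bfpv", "cgkqsxz", "dt", "l", "m", "n", "r"]
CODE = {ch: str(n) for n, group in enumerate(GROUPS, start=1) for ch in group}


def russell_code(name):
    """Initial letter followed by group numbers (hypothetical helper)."""
    word = "".join(ch for ch in name.lower() if "a" <= ch <= "z")
    if not word:
        return ""
    # Discard `gh' and a final `s' or `z' (the initial letter is always kept).
    body = word[1:].replace("gh", "")
    if body.endswith(("s", "z")):
        body = body[:-1]
    digits = []
    previous = CODE.get(word[0], "")  # assumption: compare with the initial too
    seen_vowel = False                # assumption: the initial is not counted
    for ch in body:
        code = CODE.get(ch, "")       # assumption: h, j, w carry no code
        if code == "1" and seen_vowel:
            code = ""                 # only the first vowel is counted
        elif code == "1":
            seen_vowel = True
        if code and code != previous:
            digits.append(code)       # consecutive equal codes collapse to one
        previous = code
    return word[0].upper() + "".join(digits)


print(russell_code("Smith"), russell_code("Smyth"))
```

Under these assumptions both spellings produce the same code, because `i' and `y' both fall in group 1.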
A patent exists for two reasons: first, to give the inventor exclusive rights to his creation and, second, by forcing
the inventor to publicly disclose his work, to encourage other persons to improve on the invention and thus create
superior products based on the original idea.
Through the years, Russell's system has been improved upon. Those familiar with the current soundex system used
by the U.S. government will see that the original system was changed by combining the letters `m' and `n', dropping
vowels altogether unless they are the initial letter of the word, and dropping the rule regarding `gh' and words that
end with `s' or `z'.
The American Soundex System
The soundex code consists of the first letter of the name followed by three digits. These three digits are determined
by dropping the letters a, e, i, o, u, h, w and y and taking three digits from the remaining letters of the name
according to the table below. There are only two additional rules. (1) If two or more consecutive letters have
the same code, they are coded as one letter. (2) If there is an insufficient number of letters to make the three
digits, the remaining digits are set to zero.
Soundex Table
| Code | Letters |
| --- | --- |
| 1 | b, f, p, v |
| 2 | c, g, j, k, q, s, x, z |
| 3 | d, t |
| 4 | l |
| 5 | m, n |
| 6 | r |
Examples:
| Name | Code |
| --- | --- |
| Miller | M460 |
| Peterson | P362 |
| Peters | P362 |
| Auerbach | A612 |
| Uhrbach | U612 |
| Moskowitz | M232 |
| Moskovitz | M213 |
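The rules and table above translate into a short routine. The following Python sketch is one reading of them, not an official implementation: the name `soundex` is illustrative, and it assumes that "consecutive" means adjacent in the original spelling, so a dropped letter between two letters of the same code keeps them separate. It also assumes the kept initial letter takes part in that comparison.

```python
# Digit assigned to each coded letter, taken from the Soundex Table.
CODES = {}
for digit, group in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in group:
        CODES[ch] = str(digit)


def soundex(name):
    """First letter plus three digits, following the rules described above."""
    letters = [ch for ch in name.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    digits = []
    previous = CODES.get(letters[0], "")  # assumption: initial letter counts
    for ch in letters[1:]:
        code = CODES.get(ch, "")          # a, e, i, o, u, h, w, y have no code
        if code and code != previous:
            digits.append(code)           # rule 1: equal neighbours code once
        previous = code
    # Rule 2: pad with zeros, then keep the letter and three digits.
    return (letters[0].upper() + "".join(digits) + "000")[:4]


for name in ["Miller", "Peterson", "Peters", "Auerbach", "Uhrbach",
             "Moskowitz", "Moskovitz"]:
    print(name, soundex(name))
```

Run on the example names, the sketch reproduces the codes listed above, including the different codes for Moskowitz and Moskovitz.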
The Daitch-Mokotoff Soundex System
The latest significant improvement to soundexing is the Daitch-Mokotoff soundex
system. In 1985, this author indexed the names of some 28,000 persons who legally changed their names while living
in Palestine from 1921 to 1948, most of whom were Jews with Germanic or Slavic surnames. It was obvious there were
numerous spelling variants of the same basic surname and the list should be soundexed. Using the conventional U.S.
government system, which is based on the Russell system, many Eastern European Jewish names which sound the same
did not soundex the same. The most prevalent were those names spelled interchangeably with the letter w or v, for example, the names Moskowitz and Moskovitz.
A modification to the U.S. soundex system was then created and published in the first issue of Avotaynu, the journal of Jewish genealogy, in an article titled "Proposal for a Jewish Soundex
Code." The proposal included the following
improvements over the conventional system:
- The initial letter was encoded just as any other letter within the name. If the
initial letter was a vowel, it was given the code `0'.
- Certain double letter combinations, which represent single sounds, namely ts, tz, and tc
were coded as a single code (the same as the letter s).
Just as disclosing a scheme by patenting encourages others to improve on your invention, publishing the article
caused another genealogist, Randy Daitch, to improve on the system. To the above rules, he added:
- The first six (rather than four) significant codes are created. This means that
in large databases, names which sound the same initially, but differ at the end, are coded differently, giving
the researcher a smaller list of data to be searched. For example, Peters and Peterson code identically in the
U.S. system, but differently in the new system.
- Other multiple-letter combinations, in addition to the ts, tz and tc combinations listed above, were
added, all of Slavic or Germanic origin.
- If a combination of letters could have two possible sounds, then it is coded
in both manners. For example, the letters ch
can have a soft sound such as in Chicago
or a hard sound as in Christmas.
The new scheme was published a year later in Avotaynu by Daitch under the title "The Jewish Soundex: A Revised Format." This new system
has become known, after its authors, as the Daitch-Mokotoff Soundex System. It has been mistakenly called the Jewish
Soundex System, the Eastern European Soundex System or the European Soundex System because of its origins, but
its new rules are independent of geographic or ethnic considerations and its correct name is the Daitch-Mokotoff
Soundex System.
The D-M system has become the standard for all indexing projects done by Jewish genealogical organizations. It has
been accepted by the Hebrew Immigrant Aid Society (HIAS), a social welfare organization, as its standard soundex
system for retrieving case histories and is the standard at the U.S. Holocaust Memorial Museum in Washington, DC.
It is used to search the Ellis Island database of 24 million immigrants at Stephen P. Morse's "Searching the Ellis
Island Database in One Step" site.
To reiterate, the major improvements of the Daitch-Mokotoff Soundex are:
- Information is coded to the first six meaningful letters rather than four.
- The initial letter is coded rather than kept as is.
- Where two consecutive letters have a single sound, they are coded as a single
number.
- When a letter or combination of letters may have two different sounds, it is
double coded under the two different codes.
- A letter or combination of letters maps into ten possible codes rather than seven.
Rules of the Daitch-Mokotoff Soundex System
The rules for converting names into D-M code numbers are listed below. They
are followed by the Coding Chart. Turn to the chart briefly to familiarize yourself with the concept and then return
to the specific instructions on this page.
- Town names are coded to six digits, each digit representing a sound listed in
the Coding Chart below.
- The letters A, E, I, O, U, J and Y are always coded at the beginning of a name,
as in Augsburg (054795). In any other situation, they are ignored
except when two of them form a pair and the pair comes before a vowel, as in Breuer (791900),
but not Freud. The letter "H"
is coded at the beginning of a name, as in Halberstadt
(587943) or preceding a vowel
as in Mannheim (665600), otherwise it is not coded.
- When adjacent sounds can combine to form a larger sound, they are given the code
number of the larger sound, as in Chernowitz, which is not coded Chernowi-t-z (496734) but Chernowi-tz (496740).
- When adjacent letters have the same code number, they are coded as one sound,
as in Cherkassy, which is not coded Cherka-s-sy (495440) but Cherkassy
(495400). Exceptions to this rule are the letter combinations "MN" and "NM" whose letters are
coded separately, as in Kleinman
which is coded 586660 not 586600.
- When a name consists of more than one word, it is coded as if one word, such
as Nowy Targ, which is treated as Nowytarg.
- Several letters and letter combinations pose the problem that they may sound
in one of two ways. The letters and letter combinations CH, CK, C, J and RZ (see chart below) are assigned two
possible code numbers. Be sure to try both possibilities.
- When a name lacks enough coded sounds to fill the six digits, the remaining digits
are coded "0" as in Berlin (798600)
which has only four coded sounds (B-R-L-N).
Examples:
- C e n io w (467000); Ts e n yu v (467000)
- H o l u b i c a (587400); G o l u b i ts a (587400)
- P rz e m y s l (746480); P sh e m e sh i l (746480)
- R o s o ch o w a c ie c (944744) or (945744); R o s o k ho v a ts e ts (945744)
Daitch-Mokotoff Soundex Coding Chart
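| Letter | Alternate Letter(s) | Start of a Name | Before a Vowel | All Other |
| --- | --- | --- | --- | --- |
| AI | AJ, AY | 0 | 1 | N/C |
| AU | | 0 | 7 | N/C |
| A | | 0 | N/C | N/C |
| B | | 7 | 7 | 7 |
| CHS | | 5 | 54 | 54 |
| CH | | Try KH (5) and TCH (4) | | |
| CK | | Try K (5) and TSK (45) | | |
| CZ | CS, CSZ, CZS | 4 | 4 | 4 |
| C | | Try K (5) and TZ (4) | | |
| DRZ | DRS | 4 | 4 | 4 |
| DS | DSH, DSZ | 4 | 4 | 4 |
| DZ | DZH, DZS | 4 | 4 | 4 |
| D | DT | 3 | 3 | 3 |
| EI | EJ, EY | 0 | 1 | N/C |
| EU | | 1 | 1 | N/C |
| E | | 0 | N/C | N/C |
| FB | | 7 | 7 | 7 |
| F | | 7 | 7 | 7 |
| G | | 5 | 5 | 5 |
| H | | 5 | 5 | N/C |
| IA | IE, IO, IU | 1 | N/C | N/C |
| I | | 0 | N/C | N/C |
| J | | Try Y (1) and DZH (4) | | |
| KS | | 5 | 54 | 54 |
| KH | | 5 | 5 | 5 |
| K | | 5 | 5 | 5 |
| L | | 8 | 8 | 8 |
| MN | | 66 | 66 | 66 |
| M | | 6 | 6 | 6 |
| NM | | 66 | 66 | 66 |
| N | | 6 | 6 | 6 |
| OI | OJ, OY | 0 | 1 | N/C |
| O | | 0 | N/C | N/C |
| P | PF, PH | 7 | 7 | 7 |
| Q | | 5 | 5 | 5 |
| RZ | RS | Try RTZ (94) and ZH (4) | | |
| R | | 9 | 9 | 9 |
| SCHTSCH | SCHTSH, SCHTCH | 2 | 4 | 4 |
| SCH | | 4 | 4 | 4 |
| SHTCH | SHCH, SHTSH | 2 | 4 | 4 |
| SHT | SCHT, SCHD | 2 | 43 | 43 |
| SH | | 4 | 4 | 4 |
| STCH | STSCH, SC | 2 | 4 | 4 |
| STRZ | STRS, STSH | 2 | 4 | 4 |
| ST | | 2 | 43 | 43 |
| SZCZ | SZCS | 2 | 4 | 4 |
| SZT | SHD, SZD, SD | 2 | 43 | 43 |
| SZ | | 4 | 4 | 4 |
| S | | 4 | 4 | 4 |
| TCH | TTCH, TTSCH | 4 | 4 | 4 |
| TH | | 3 | 3 | 3 |
| TRZ | TRS | 4 | 4 | 4 |
| TSCH | TSH | 4 | 4 | 4 |
| TS | TTS, TTSZ, TC | 4 | 4 | 4 |
| TZ | TTZ, TZS, TSZ | 4 | 4 | 4 |
| T | | 3 | 3 | 3 |
| UI | UJ, UY | 0 | 1 | N/C |
| U | UE | 0 | N/C | N/C |
| V | | 7 | 7 | 7 |
| W | | 7 | 7 | 7 |
| X | | 5 | 54 | 54 |
| Y | | 1 | N/C | N/C |
| ZDZ | ZDZH, ZHDZH | 2 | 4 | 4 |
| ZD | ZHD | 2 | 43 | 43 |
| ZH | ZS, ZSCH, ZSH | 4 | 4 | 4 |
| Z | | 4 | 4 | 4 |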
N/C = not coded
Copyright © 1997, 2007 Gary Mokotoff. All rights reserved.
No portion of this article may be reproduced or transmitted in any form or by any means, electronic or mechanical,
including photocopying, recording or information retrieval system, without prior written permission of the copyright
owner. Brief passages may be quoted with proper attribution.
Send comments to: info@avotaynu.com