What can be more frustrating to a
genealogist than to look through an alphabetic index of records and not
locate what is being sought, and then find out later, perhaps years
later, that the data was there but misspelled? How do you locate the towns
of immigrant non-English-speaking ancestors when the only information
available was passed down orally through the generations and, no matter
how you try to spell it, you cannot find the town on a map?
A major solution to these problems was provided more than 80 years ago
when Robert C. Russell of Pittsburgh, Pennsylvania was issued patent
number 1,261,167 on April 2, 1918 for having "invented certain new and
useful Improvements in Indexes...as will enable others skilled in the
art to which it appertains to make and use the same." The idea of
indexing information by how it sounds rather than alphabetically was
born. It has become known as "soundexing."
Russell noted in his patent the reason for this system. "There are
certain sounds which form the
nucleus of the English language, and those sounds are inadequately
represented merely by the letters of the alphabet, as one sound may
sometimes be represented by more than one letter or combination of
letters, and one letter or combination of letters may represent two or
more sounds. Because of this, a great many names may have two or
more different spellings which in an alphabetic index, or an index
which separates names according to the sequence of their contained
letters in the alphabet, necessitates their filing in widely separate
places" (emphasis added).
He knew that letters of the alphabet were divided, phonetically, into
categories. To each category, he assigned a numeric value. As his
patent describes:
1. The vowels (he called them "oral resonants") a, e, i, o, u, y.
2. The labials and labio-dentals b, f, p, v.
3. The gutturals and sibilants c, g, k, q, s, x, z.
4. The dental-mutes d, t.
5. The palatal-fricative l.
6. The labio-nasal m.
7. The dento or lingua-nasal n.
8. The dental fricative r.
There were only a few additional rules:
- The initial letter of the word is always kept.
- Two consecutive letters that had the same code are considered as a
single letter (e.g., `tt' coded the same as `t').
- The combination `gh,' and `s' or `z' if they ended the word, are
discarded.
- Only the first occurrence of a vowel (Group 1) is counted.
Thus, Smith and Smyth coded the same because the letters `i' and `y' had
the same value, namely `1,' even though in a normal alphabetic index, they
would be far apart.
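The rules above can be sketched in a few lines of Python. This is only an illustration of the rules as summarized in this article, not a transcription of the patent. The function name `russell_code`, the output format (initial letter followed by digits), the choice to skip letters that belong to no group (such as h, j and w), and the choice to let the initial letter count toward the "same code" and "first vowel" rules are all assumptions.

```python
# Russell's eight phonetic groups as listed above (group number -> letters).
RUSSELL_GROUPS = {
    "1": "aeiouy",
    "2": "bfpv",
    "3": "cgkqsxz",
    "4": "dt",
    "5": "l",
    "6": "m",
    "7": "n",
    "8": "r",
}
LETTER_TO_CODE = {ch: code for code, letters in RUSSELL_GROUPS.items() for ch in letters}


def russell_code(name):
    """Hypothetical helper: code a name by the four rules listed above."""
    word = "".join(ch for ch in name.lower() if ch.isalpha())
    if not word:
        return ""
    # Rule: discard `gh', and a final `s' or `z'.
    rest = word[1:].replace("gh", "")
    if rest.endswith(("s", "z")):
        rest = rest[:-1]
    digits = []
    prev = LETTER_TO_CODE.get(word[0])  # code of the preceding letter
    vowel_seen = prev == "1"            # assumption: an initial vowel counts
    for ch in rest:
        code = LETTER_TO_CODE.get(ch)   # None for letters in no group
        repeat = code == prev           # two consecutive letters, same code
        later_vowel = code == "1" and vowel_seen
        if code is not None and not repeat and not later_vowel:
            digits.append(code)
        if code == "1":
            vowel_seen = True
        prev = code
    # Rule: the initial letter is always kept.
    return word[0].upper() + "".join(digits)


print(russell_code("Smith"), russell_code("Smyth"))  # S614 S614
```

Under these assumptions Smith and Smyth receive the same code, which is the behavior the patent was after.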
A patent exists for two reasons: first, to give the inventor exclusive
rights to his creation and, second, by forcing the inventor to publicly
disclose his work, to encourage other persons to improve on the
invention and thus create superior products based on the original idea.
Through the years, Russell's system has been improved upon. Those
familiar with the current soundex system used by the U.S. government
will see that the original system was changed by combining the letters
`m' and `n', dropping vowels altogether unless one is the initial letter
of the word, and dropping the rule regarding `gh' and words that end with
`s' or `z'.
The soundex code consists of the first letter of the name followed by
three digits. These three digits are determined by dropping the letters
a, e, i, o, u, h, w and y and coding the remaining letters of the name
according to the table below, up to three digits. There are only two
additional rules. (1) If two or more consecutive letters have the same
code, they are coded as one letter. (2) If there is an insufficient
number of letters to make the three digits, the remaining digits are
set to zero.
Soundex Table
Code  Letters
1     b, f, p, v
2     c, g, j, k, q, s, x, z
3     d, t
4     l
5     m, n
6     r

Examples:
Miller      M460
Peterson    P362
Peters      P362
Auerbach    A612
Uhrbach     U612
Moskowitz   M232
Moskovitz   M213
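A minimal Python sketch of this coding follows. The function name `soundex` is illustrative. The article leaves one detail open: whether two letters with the same code still count as "consecutive" when a dropped letter sits between them. The sketch assumes that a dropped vowel (or y) separates them while h and w do not, and that the initial letter's own code takes part in the comparison. Both are assumptions, although they reproduce every example above.

```python
# Digit table from the Soundex Table above.
SOUNDEX_TABLE = {
    "1": "bfpv",
    "2": "cgjkqsxz",
    "3": "dt",
    "4": "l",
    "5": "mn",
    "6": "r",
}
CODES = {ch: digit for digit, letters in SOUNDEX_TABLE.items() for ch in letters}


def soundex(name):
    """Return the initial letter plus three digits, e.g. 'M460'."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    digits = []
    prev = CODES.get(letters[0])  # assumption: the initial's code counts
    for ch in letters[1:]:
        if ch in "hw":            # assumption: h and w do not separate codes
            continue
        code = CODES.get(ch)
        if code is None:          # a, e, i, o, u, y: dropped, but they
            prev = None           # separate equal codes (assumption)
            continue
        if code != prev:          # rule (1): equal neighbours code once
            digits.append(code)
        prev = code
    # Rule (2): pad with zeros, then keep the letter and three digits.
    return (letters[0].upper() + "".join(digits) + "000")[:4]


for surname in ("Miller", "Peterson", "Peters", "Auerbach", "Uhrbach",
                "Moskowitz", "Moskovitz"):
    print(surname, soundex(surname))
```

Note that Moskowitz and Moskovitz come out as M232 and M213: the w is dropped while the v is coded, so two names that sound alike land in different places.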
The latest significant improvement to soundexing is the Daitch-Mokotoff
soundex system. In 1985, this author indexed the names of some 28,000
persons who legally changed their names while living in Palestine from
1921 to 1948, most of whom were Jews with Germanic or Slavic surnames.
It was obvious there were numerous spelling variants of the same basic
surname and the list should be soundexed. Using the conventional U.S.
government system, which is based on the Russell system, many Eastern
European Jewish names which sound the same did not soundex the same.
The most prevalent were those names spelled interchangeably with the
letter w or v, for example, the names Moskowitz and Moskovitz.
A modification to the U.S. soundex system was then created and published
in the first issue of Avotaynu, the journal of Jewish
genealogy, in an article titled "Proposal for a Jewish Soundex Code."
Randy Daitch read the article and expanded on the rules of the new
system. The original proposal included the following improvements over
the conventional system:
- The initial letter was encoded just as any other letter within the
name. If the initial letter was a vowel, it was given the code `0'.
- Certain double-letter combinations that represent single sounds, namely
ts, tz, and tc, were coded as a single code (the same as the letter s).
Just as disclosing a scheme by patenting encourages others to improve
on your invention, publishing the article caused another genealogist,
Randy Daitch, to improve on the system. To the above rules, he added:
- The first six (rather than four) significant codes are created. This
means that in large databases, names which sound the same initially, but
differ at the end, are coded differently, giving the researcher a
smaller list of data to be searched. For example, Peters and Peterson
code identically in the U.S. system, but differently in the new system.
- Other multiple-letter combinations, in addition to those shown in the
second rule above, were added, all of Slavic or Germanic origin.
- If a combination of letters could have two possible sounds, then it is
coded in both manners. For example, the letters ch can have a soft sound
such as in Chicago or a hard sound as in Christmas.
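Coding a name "in both manners" amounts to generating every combination of the alternative codes. The short Python sketch below illustrates only that idea. The helper name `expand_codes` is hypothetical, and the name Cherkassy is split into sounds by hand, using code values taken from the Coding Chart later in this article (CH is 5 or 4, R is 9, K is 5, SS is 4).

```python
from itertools import product


def expand_codes(sound_codes, length=6):
    """sound_codes holds one list of alternative codes per sound, in order."""
    results = set()
    for combination in product(*sound_codes):
        # Join one choice per sound, pad with zeros, keep six digits.
        results.add(("".join(combination) + "0" * length)[:length])
    return sorted(results)


# Cherkassy as CH-R-K-SS: only CH has two possible sounds.
print(expand_codes([["5", "4"], ["9"], ["5"], ["4"]]))
# ['495400', '595400']
```

A researcher would search the index under each code returned.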
The new scheme was published a year later in Avotaynu
by Daitch under the title "The Jewish Soundex: A Revised Format." This
new system has become known, after its authors, as the Daitch-Mokotoff
Soundex System. It has been mistakenly called the Jewish Soundex
System, the Eastern European Soundex System or the European Soundex
System because of its origins, but its new rules are independent of
geographic or ethnic considerations and its correct name is the
Daitch-Mokotoff Soundex System.
The D-M system has become the standard for all indexing projects done by
Jewish genealogical organizations. It has been accepted by the Hebrew
Immigrant Aid Society (HIAS), a social welfare organization, as its
standard soundex system for retrieving case histories and is the
standard at the U.S. Holocaust Memorial Museum in Washington, DC. It is
used to search the Ellis Island database of 24 million immigrants at
the Stephen P. Morse Searching the Ellis Island Database in One Step
site.
To reiterate, the major improvements of the Daitch-Mokotoff Soundex are:
- Information is coded to the first six meaningful letters rather than
four.
- The initial letter is coded rather than kept as is.
- Where two consecutive letters have a single sound, they are coded as a
single number.
- When a letter or combination of letters may have two different sounds,
it is double coded under the two different codes.
- A letter or combination of letters maps into ten possible codes rather
than seven.
The rules for converting names into D-M code numbers are listed below.
They are followed by the Coding Chart. Turn to the chart briefly to
familiarize yourself with the concept and then return to the specific
instructions on this page.
- Town names are coded to six digits, each digit representing a sound
listed in the Coding Chart below.
- The letters A, E, I, O, U, J and Y are always coded at the beginning of
a name, as in Augsburg (054795). In any other situation, they are ignored
except when two of them form a pair and the pair comes before a vowel, as
in Breuer (791900), but not Freud. The letter "H" is coded at the
beginning of a name, as in Halberstadt (587943), or preceding a vowel, as
in Mannheim (665600); otherwise it is not coded.
- When adjacent sounds can combine to form a larger sound, they are given
the code number of the larger sound, as in Chernowitz, which is not coded
Chernowi-t-z (496734) but Chernowi-tz (496740).
- When adjacent letters have the same code number, they are coded as one
sound, as in Cherkassy, which is not coded Cherka-s-sy (495440) but
Cherkassy (495400). Exceptions to this rule are the letter combinations
"MN" and "NM", whose letters are coded separately, as in Kleinman, which
is coded 586660, not 586600.
- When a name consists of more than one word, it is coded as if one word,
such as Nowy Targ, which is treated as Nowytarg.
- Several letters and letter combinations pose the problem that they may
sound in one of two ways. The letters and letter combinations CH, CK, C,
J and RZ (see chart below) are assigned two possible code numbers. Be
sure to try both possibilities.
- When a name lacks enough coded sounds to fill the six digits, the
remaining digits are coded "0", as in Berlin (798600), which has only
four coded sounds (B-R-L-N).
Examples:
C e n io w                  (467000)
Ts e n yu v                 (467000)
H o l u b i c a             (587400)
G o l u b i ts a            (587400)
P rz e m y s l              (746480)
P sh e m e sh i l           (746480)
R o s o ch o w a c ie c     (944744) or (945744)
R o s o kh o v a ts e ts    (945744)
Coding Chart
                                   Start of   Before    All
Letter    Alternate Letter(s)      A Name     A Vowel   Other
AI        AJ, AY                   0          1         N/C
AU                                 0          7         N/C
A                                  0          N/C       N/C
B                                  7          7         7
CHS                                5          54        54
CH                                 Try KH (5) and TCH (4)
CK                                 Try K (5) and TSK (45)
CZ        CS, CSZ, CZS             4          4         4
C                                  Try K (5) and TZ (4)
DRZ       DRS                      4          4         4
DS        DSH, DSZ                 4          4         4
DZ        DZH, DZS                 4          4         4
D         DT                       3          3         3
EI        EJ, EY                   0          1         N/C
EU                                 1          1         N/C
E                                  0          N/C       N/C
FB                                 7          7         7
F                                  7          7         7
G                                  5          5         5
H                                  5          5         N/C
IA        IE, IO, IU               1          N/C       N/C
I                                  0          N/C       N/C
J                                  Try Y (1) and DZH (4)
KS                                 5          54        54
KH                                 5          5         5
K                                  5          5         5
L                                  8          8         8
MN                                 66         66        66
M                                  6          6         6
NM                                 66         66        66
N                                  6          6         6
OI        OJ, OY                   0          1         N/C
O                                  0          N/C       N/C
P         PF, PH                   7          7         7
Q                                  5          5         5
RZ        RS                       Try RTZ (94) and ZH (4)
R                                  9          9         9
SCHTSCH   SCHTSH, SCHTCH           2          4         4
SCH                                4          4         4
SHTCH     SHCH, SHTSH              2          4         4
SHT       SCHT, SCHD               2          43        43
SH                                 4          4         4
STCH      STSCH, SC                2          4         4
STRZ      STRS, STSH               2          4         4
ST                                 2          43        43
SZCZ      SZCS                     2          4         4
SZT       SHD, SZD, SD             2          43        43
SZ                                 4          4         4
S                                  4          4         4
TCH       TTCH, TTSCH, THS         4          4         4
TH                                 3          3         3
TRZ       TRS                      4          4         4
TSCH      TSH                      4          4         4
TS        TTS, TTSZ, TC            4          4         4
TZ        TTZ, TZS, TSZ, TS        4          4         4
T                                  3          3         3
UI        UJ, UY                   0          1         N/C
U         UE                       0          N/C       N/C
V                                  7          7         7
W                                  7          7         7
X                                  5          54        54
Y                                  1          N/C       N/C
ZDZ       ZDZH, ZHDZH              2          4         4
ZD        ZHD                      2          43        43
ZH        ZS, ZSCH, ZSH            4          4         4
Z                                  4          4         4
N/C = not coded
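
For readers who want to experiment, the rules and the Coding Chart can be sketched in Python. This is an illustration under stated assumptions, not a reference implementation. The function name `dm_soundex` and the compact `CHART` text are this sketch's own. A "before a vowel" test that looks only at A, E, I, O, U is an assumption, as is treating a code as repeated when the previous code ends with it. The sketch also assumes that "Try Y" for the letter J means J takes Y's code in each position, and that the longest matching letter combination is always taken first.

```python
# Rows transcribed from the Coding Chart: patterns, then the codes for
# start of name / before a vowel / all other. "-" means not coded and
# "|" separates the two codes to try.
CHART = """
AI,AJ,AY 0 1 -; AU 0 7 -; A 0 - -; B 7 7 7
CHS 5 54 54; CH 5|4 5|4 5|4; CK 5|45 5|45 5|45
CZ,CS,CSZ,CZS 4 4 4; C 5|4 5|4 5|4
DRZ,DRS 4 4 4; DS,DSH,DSZ 4 4 4; DZ,DZH,DZS 4 4 4; D,DT 3 3 3
EI,EJ,EY 0 1 -; EU 1 1 -; E 0 - -; FB 7 7 7; F 7 7 7; G 5 5 5; H 5 5 -
IA,IE,IO,IU 1 - -; I 0 - -; J 1|4 -|4 -|4
KS 5 54 54; KH 5 5 5; K 5 5 5; L 8 8 8
MN 66 66 66; M 6 6 6; NM 66 66 66; N 6 6 6
OI,OJ,OY 0 1 -; O 0 - -; P,PF,PH 7 7 7; Q 5 5 5
RZ,RS 94|4 94|4 94|4; R 9 9 9
SCHTSCH,SCHTSH,SCHTCH 2 4 4; SCH 4 4 4; SHTCH,SHCH,SHTSH 2 4 4
SHT,SCHT,SCHD 2 43 43; SH 4 4 4; STCH,STSCH,SC 2 4 4
STRZ,STRS,STSH 2 4 4; ST 2 43 43; SZCZ,SZCS 2 4 4
SZT,SHD,SZD,SD 2 43 43; SZ 4 4 4; S 4 4 4
TCH,TTCH,TTSCH,THS 4 4 4; TH 3 3 3; TRZ,TRS 4 4 4; TSCH,TSH 4 4 4
TS,TTS,TTSZ,TC 4 4 4; TZ,TTZ,TZS,TSZ 4 4 4; T 3 3 3
UI,UJ,UY 0 1 -; U,UE 0 - -; V 7 7 7; W 7 7 7; X 5 54 54; Y 1 - -
ZDZ,ZDZH,ZHDZH 2 4 4; ZD,ZHD 2 43 43; ZH,ZS,ZSCH,ZSH 4 4 4; Z 4 4 4
"""

RULES = {}
for row in CHART.replace("\n", ";").split(";"):
    if row.strip():
        patterns, *columns = row.split()
        for pattern in patterns.split(","):
            RULES[pattern] = [col.replace("-", "").split("|") for col in columns]

MAX_LEN = max(len(pattern) for pattern in RULES)
VOWELS = set("AEIOU")  # assumption: only these count as "a vowel"


def dm_soundex(name):
    """Hypothetical helper: return every six-digit code for a name, sorted."""
    # Multi-word names are coded as one word; non-letters are dropped.
    word = "".join(ch for ch in name.upper() if ch.isalpha())
    branches = [("", "")]  # (digits so far, code of the previous sound)
    i = 0
    while i < len(word):
        # Take the longest letter combination listed in the chart.
        longest = min(MAX_LEN, len(word) - i)
        size = next((n for n in range(longest, 0, -1) if word[i:i + n] in RULES), 0)
        if size == 0:  # letter missing from the chart: skip it
            i += 1
            continue
        if i == 0:
            column = 0  # start of a name
        elif word[i + size:i + size + 1] in VOWELS:
            column = 1  # before a vowel
        else:
            column = 2  # all other
        grown = []
        for digits, prev in branches:
            for code in RULES[word[i:i + size]][column]:
                # Not-coded sounds add nothing; equal neighbours code once.
                repeated = code == "" or prev.endswith(code)
                grown.append((digits if repeated else digits + code, code))
        branches = list(dict.fromkeys(grown))  # drop duplicate branches
        i += size
    # Pad with zeros and keep six digits.
    return sorted({(digits + "000000")[:6] for digits, _ in branches})


print(dm_soundex("Augsburg"))    # ['054795']
print(dm_soundex("Kleinman"))    # ['586660']
print(dm_soundex("Chernowitz"))  # ['496740', '596740']
```

Names containing one of the double-coded combinations return more than one code, and each should be tried. Under this sketch Moskowitz and Moskovitz share the code 645740, while Peters and Peterson no longer code identically.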
Copyright © 1997, Gary Mokotoff. All rights reserved.
No portion of this article may be reproduced or transmitted in any form
or by any means, electronic or mechanical, including photocopying,
recording or information retrieval system, without prior written
permission of the copyright owner. Brief passages may be quoted with
proper attribution.
Send comments to: info@avotaynu.com