Soundexing and Genealogy 
by Gary Mokotoff

What can be more frustrating to a genealogist than to look through an alphabetic index of records, fail to locate what is being sought, and then find out later, perhaps years later, that the data was there but misspelled? How do you locate the towns of immigrant non-English-speaking ancestors when the only information available was passed down orally through the generations and, no matter how you try to spell it, you cannot find the town on a map?

A major solution to these problems was provided more than 80 years ago when Robert C. Russell of Pittsburgh, Pennsylvania was issued patent number 1,261,167 on April 2, 1918 for having "invented certain new and useful Improvements in [Indexes, which] will enable others skilled in the art to which it appertains to make and use the same." The idea of indexing information by how it sounds rather than alphabetically was born. It has become known as "soundexing."

The Russell Soundex System
Russell noted in his patent the reason for this system. "There are certain sounds which form the nucleus of the English language, and those sounds are inadequately represented merely by the letters of the alphabet, as one sound may sometimes be represented by more than one letter or combination of letters, and one letter or combination of letters may represent two or more sounds. Because of this, a great many names may have two or more different spellings which in an alphabetic index, or an index which separates names according to the sequence of their contained letters in the alphabet, necessitates their filing in widely separate places" (emphasis added).

He knew that letters of the alphabet were divided, phonetically, into categories. To each category, he assigned a numeric value. As his patent describes:

  1. The vowels (he called them "oral resonants") a, e, i, o, u, y.
  2. The labials and labio-dentals b, f, p, v.
  3. The gutturals and sibilants c, g, k, q, s, x, z.
  4. The dental-mutes d, t.
  5. The palatal-fricative l.
  6. The labio-nasal m.
  7. The dento- or lingua-nasal n.
  8. The dental fricative r.

There were only a few additional rules:

  • The initial letter of the word is always kept.
  • Two consecutive letters that had the same code are considered as a single letter (e.g., `tt' coded the same as `t').
  • The combination `gh,' and `s' or `z' if they ended the word, are discarded.
  • Only the first occurrence of a vowel (Group 1) is counted.

Thus, Smith and Smyth coded the same because the letters `i' and `y' had the same value, namely `1,' even though in a normal alphabetic index, they would be far apart.
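
Russell's rules as summarized above can be sketched in a few lines of Python. This is only an illustration, not the procedure as the patent states it. The summary does not say how letters outside the eight groups (h, j, w) are handled or in what order the rules apply, so the sketch assumes that such letters are skipped, that `gh' and a final `s' or `z' are removed before coding, and that the kept initial letter is not itself coded. The name russell_code is made up for this example.

```python
# Russell's eight groups as listed above; group n gets the digit n.
GROUPS = ["aeiouy", "bfpv", "cgkqsxz", "dt", "l", "m", "n", "r"]
RUSSELL = {ch: str(n) for n, group in enumerate(GROUPS, start=1) for ch in group}


def russell_code(name):
    """Illustrative Russell-style code: initial letter plus group digits."""
    word = "".join(c for c in name.lower() if c.isalpha())
    if not word:
        return ""
    # Keep the initial letter; drop `gh' from the remainder (assumed order).
    first, rest = word[0], word[1:].replace("gh", "")
    # Drop a word-final s or z.
    if rest.endswith(("s", "z")):
        rest = rest[:-1]
    digits = []
    prev = ""
    vowel_seen = False
    for ch in rest:
        code = RUSSELL.get(ch)
        if code is None:
            continue  # h, j, w belong to no group; assumed to be skipped
        repeat = code == prev
        prev = code
        if repeat:
            continue  # consecutive letters with the same code count once
        if code == "1":
            if vowel_seen:
                continue  # only the first vowel is counted
            vowel_seen = True
        digits.append(code)
    return first.upper() + "".join(digits)


print(russell_code("Smith"), russell_code("Smyth"))  # S614 S614
```

Under these assumptions Smith and Smyth both come out as S614, which is the behavior the text describes.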

A patent exists for two reasons: first, to give the inventor exclusive rights to his creation, and second, by forcing the inventor to publicly disclose his work, to encourage other persons to improve on the invention and thus create superior products based on the original idea.

Through the years, Russell's system has been improved upon. Those familiar with the current soundex system used by the U.S. government will see that the original system was changed by combining the letters `m' and `n', dropping vowels altogether unless they are the initial letter of the word, and dropping the rule regarding `gh' and words that end with `s' or `z'.

The American Soundex System
The soundex code consists of the first letter of the name followed by three digits. These three digits are determined by dropping the letters a, e, i, o, u, h, w and y and adding three digits from the remaining letters of the name according to the table below. There are only two additional rules. (1) If two or more consecutive letters have the same code, they are coded as one letter. (2) If there is an insufficient number of letters to make the three digits, the remaining digits are set to zero.

Soundex Table

  1  b, f, p, v
  2  c, g, j, k, q, s, x, z
  3  d, t
  4  l
  5  m, n
  6  r


  Miller      M460
  Peterson    P362
  Peters      P362
  Auerbach    A612
  Uhrbach     U612
  Moskowitz   M232
  Moskovitz   M213
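
The American rules can be tried out with a short Python sketch. It is an illustration under stated assumptions rather than an official implementation: the function name american_soundex is invented here, and the sketch follows the common convention that a vowel between two letters of the same code keeps them separate while h or w does not, a detail the description above does not spell out.

```python
# Digit for each coded letter, taken from the Soundex Table above.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def american_soundex(name):
    """Return the first letter plus three digits, e.g. 'M460' for Miller."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    # Assumption: the first letter's own code takes part in the
    # "consecutive letters with the same code" rule.
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code:
            if code != prev:
                digits.append(code)
            prev = code
        elif ch not in "hw":
            prev = ""  # assumption: a vowel or y ends a run; h and w do not
    # Pad with zeros and keep the letter plus three digits.
    return (first.upper() + "".join(digits) + "000")[:4]


for surname in ["Miller", "Peterson", "Peters", "Auerbach",
                "Uhrbach", "Moskowitz", "Moskovitz"]:
    print(surname, american_soundex(surname))
```

Run as written, the loop reproduces the seven codes listed above, including the split between Moskowitz (M232) and Moskovitz (M213) that motivates the next section.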

The Daitch-Mokotoff Soundex System
The latest significant improvement to soundexing is the Daitch-Mokotoff soundex system. In 1985, this author indexed the names of some 28,000 persons who legally changed their names while living in Palestine from 1921 to 1948, most of whom were Jews with Germanic or Slavic surnames. It was obvious that there were numerous spelling variants of the same basic surname and that the list should be soundexed. Under the conventional U.S. government system, which is based on the Russell system, many Eastern European Jewish names that sound the same did not soundex the same. The most prevalent were names spelled interchangeably with the letter w or v, for example, Moskowitz and Moskovitz.

A modification to the U.S. soundex system was then created and published in the first issue of Avotaynu, the journal of Jewish genealogy, in an article titled "Proposal for a Jewish Soundex Code." Randy Daitch read the article and expanded on the rules of the new system. The resulting system included the following improvements over the conventional system:

  1. The initial letter was encoded just as any other letter within the name. If the initial letter was a vowel, it was given the code `0'.
  2. Certain double-letter combinations that represent single sounds, namely ts, tz, and tc, were coded as a single code (the same as the letter s).
    Just as disclosing a scheme by patenting encourages others to improve on your invention, publishing the article caused another genealogist, Randy Daitch, to improve on the system. To the above rules, he added:
  3. The first six (rather than four) significant codes are created. This means that in large databases, names which sound the same initially, but differ at the end, are coded differently, giving the researcher a smaller list of data to be searched. For example, Peters and Peterson code identically in the U.S. system, but differently in the new system.
  4. Other multiple-letter combinations, in addition to those shown in 2 above, were added, all of Slavic or Germanic origin.
  5. If a combination of letters could have two possible sounds, then it is coded in both manners. For example, the letters ch can have a soft sound such as in Chicago or a hard sound as in Christmas.

The new scheme was published a year later in Avotaynu by Daitch under the title "The Jewish Soundex: A Revised Format." This new system has become known, after its authors, as the Daitch-Mokotoff Soundex System. It has been mistakenly called the Jewish Soundex System, the Eastern European Soundex System or the European Soundex System because of its origins, but its new rules are independent of geographic or ethnic considerations, and its correct name is the Daitch-Mokotoff Soundex System.

The D-M system has become the standard for all indexing projects done by Jewish genealogical organizations. It has been accepted by the Hebrew Immigrant Aid Society (HIAS), a social welfare organization, as its standard soundex system for retrieving case histories and is the standard at the U.S. Holocaust Memorial Museum in Washington, DC. It is used to search the Ellis Island database of 24 million immigrants at Stephen P. Morse's "Searching the Ellis Island Database in One Step" site.

To reiterate, the major improvements of the Daitch-Mokotoff Soundex are:

  • Information is coded to the first six meaningful letters rather than four.
  • The initial letter is coded rather than kept as is.
  • Where two consecutive letters have a single sound, they are coded as a single number.
  • When a letter or combination of letters may have two different sounds, it is double coded under the two different codes.
  • A letter or combination of letters maps into ten possible codes rather than seven.

Rules of the Daitch-Mokotoff Soundex System
The rules for converting names into D-M code numbers are listed below. They are followed by the Coding Chart. Turn to the chart briefly to familiarize yourself with the concept and then return to the specific instructions on this page.

  1. Town names are coded to six digits, each digit representing a sound listed in the Coding Chart below.
  2. The letters A, E, I, O, U, J and Y are always coded at the beginning of a name, as in Augsburg (054795). In any other situation, they are ignored except when two of them form a pair and the pair comes before a vowel, as in Breuer (791900), but not Freud. The letter "H" is coded at the beginning of a name, as in Halberstadt (587943) or preceding a vowel as in Mannheim (665600), otherwise it is not coded.
  3. When adjacent sounds can combine to form a larger sound, they are given the code number of the larger sound, as in Chernowitz, which is not coded Chernowi-t-z (496734) but Chernowi-tz (496740).
  4. When adjacent letters have the same code number, they are coded as one sound, as in Cherkassy, which is not coded Cherka-s-sy (495440) but Cherkassy (495400). Exceptions to this rule are the letter combinations "MN" and "NM" whose letters are coded separately, as in Kleinman which is coded 586660 not 586600.
  5. When a name consists of more than one word, it is coded as if one word, such as Nowy Targ, which is treated as Nowytarg.
  6. Several letters and letter combinations pose the problem that they may sound in one of two ways. The letters and letter combinations CH, CK, C, J and RZ (see chart below) are assigned two possible code numbers. Be sure to try both possibilities.
  7. When a name lacks enough coded sounds to fill the six digits, the remaining digits are coded "0" as in Berlin (798600) which has only four coded sounds (B-R-L-N).

    C e n io w (467000)                             Ts e n yu v (467000)
    H o l u b i c a (587400)                        G o l u b i ts a (587400)
    P rz e m y s l (746480)                         P sh e m e sh i l (746480)
    R o s o ch o w a c ie c (944744) or (945744)    R o s o kh o v a ts e ts (945744)

                                  Start of  Before   All
    Letter   Alternate Letter(s)  a Name    a Vowel  Other

    AI       AJ, AY               0         1        N/C
    AU                            0         7        N/C
    A                             0         N/C      N/C
    B                             7         7        7
    CHS                           5         54       54
    CH       Try KH (5) and TCH (4)
    CK       Try K (5) and TSK (45)
    CZ       CS, CSZ, CZS         4         4        4
    C        Try K (5) and TZ (4)
    DRZ      DRS                  4         4        4
    DS       DSH, DSZ             4         4        4
    DZ       DZH, DZS             4         4        4
    D        DT                   3         3        3
    EI       EJ, EY               0         1        N/C
    EU                            1         1        N/C
    E                             0         N/C      N/C
    FB                            7         7        7
    F                             7         7        7
    G                             5         5        5
    H                             5         5        N/C
    IA       IE, IO, IU           1         N/C      N/C
    I                             0         N/C      N/C
    J        Try Y (1) and DZH (4)
    KS                            5         54       54
    KH                            5         5        5
    K                             5         5        5
    L                             8         8        8
    MN                            66        66       66
    M                             6         6        6
    NM                            66        66       66
    N                             6         6        6
    OI       OJ, OY               0         1        N/C
    O                             0         N/C      N/C
    P        PF, PH               7         7        7
    Q                             5         5        5
    RZ, RS   Try RTZ (94) and ZH (4)
    R                             9         9        9
    SCH                           4         4        4
    SHT      SCHT, SCHD           2         43       43
    SH                            4         4        4
    STCH     STSCH, SC            2         4        4
    STRZ     STRS, STSH           2         4        4
    ST                            2         43       43
    SZCZ     SZCS                 2         4        4
    SZT      SHD, SZD, SD         2         43       43
    SZ                            4         4        4
    S                             4         4        4
    TH                            3         3        3
    TRZ      TRS                  4         4        4
    TSCH     TSH                  4         4        4
    TS       TTS, TTSZ, TC        4         4        4
    TZ       TTZ, TZS, TSZ, TS    4         4        4
    T                             3         3        3
    UI       UJ, UY               0         1        N/C
    U        UE                   0         N/C      N/C
    V                             7         7        7
    W                             7         7        7
    X                             5         54       54
    Y                             1         N/C      N/C
    ZDZ      ZDZH, ZHDZH          2         4        4
    ZD       ZHD                  2         43       43
    ZH       ZS, ZSCH, ZSH        4         4        4
    Z                             4         4        4
    N/C = not coded
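
For readers who want to experiment, the seven rules and the chart can be combined into a short Python sketch. It is one possible reading of the rules and not a reference implementation. The names dm_soundex, _SINGLE, _BRANCHING and RULES are invented for the example. The sketch assumes that "before a vowel" means before A, E, I, O or U, that the longest matching chart entry is always taken first, that J's first alternative follows the Y row (1, N/C, N/C), and that a sound is dropped as a repeat when the previous code ends with the same digits. It contains only the entries printed in the chart above.

```python
# (start of name, before a vowel, all other) -> chart entries; "" means N/C.
_SINGLE = {
    ("0", "1", ""): "AI AJ AY EI EJ EY OI OJ OY UI UJ UY",
    ("0", "7", ""): "AU",
    ("0", "", ""): "A E I O U UE",
    ("1", "1", ""): "EU",
    ("1", "", ""): "IA IE IO IU Y",
    ("7", "7", "7"): "B FB F P PF PH V W",
    ("5", "5", "5"): "G KH K Q",
    ("5", "5", ""): "H",
    ("5", "54", "54"): "CHS KS X",
    ("4", "4", "4"): "CZ CS CSZ CZS DRZ DRS DS DSH DSZ DZ DZH DZS SCH SH SZ S "
                     "TRZ TRS TSCH TSH TS TTS TTSZ TC TZ TTZ TZS TSZ ZH ZS ZSCH ZSH Z",
    ("3", "3", "3"): "D DT TH T",
    ("8", "8", "8"): "L",
    ("66", "66", "66"): "MN NM",
    ("6", "6", "6"): "M N",
    ("9", "9", "9"): "R",
    ("2", "43", "43"): "SHT SCHT SCHD ST SZT SHD SZD SD ZD ZHD",
    ("2", "4", "4"): "STCH STSCH SC STRZ STRS STSH SZCZ SZCS ZDZ ZDZH ZHDZH",
}
# The "Try ... and ..." rows: two alternatives each (rule 6).
_K, _S = ("5", "5", "5"), ("4", "4", "4")
_BRANCHING = {
    "CH": [_K, _S], "C": [_K, _S],
    "CK": [_K, ("45", "45", "45")],
    "J": [("1", "", ""), _S],  # assumption: the Y alternative uses the Y row
    "RZ": [("94", "94", "94"), _S], "RS": [("94", "94", "94"), _S],
}
RULES = {key: [codes] for codes, keys in _SINGLE.items() for key in keys.split()}
RULES.update(_BRANCHING)
_MAXLEN = max(len(key) for key in RULES)


def dm_soundex(name):
    """Return the sorted list of six-digit codes the rules allow for name."""
    s = "".join(c for c in name.upper() if c.isalpha())  # rule 5: one word
    branches = {("", "")}  # (digits so far, code of the previous sound)
    i = 0
    while i < len(s):
        # Rule 3: prefer the longest chart entry starting at position i.
        key = next((s[i:i + n] for n in range(_MAXLEN, 0, -1)
                    if s[i:i + n] in RULES), None)
        if key is None:
            i += 1  # letter not in the chart: skipped
            continue
        nxt = s[i + len(key):i + len(key) + 1]
        if i == 0:
            col = 0
        elif nxt and nxt in "AEIOU":
            col = 1
        else:
            col = 2
        new = set()
        for digits, last in branches:
            for alt in RULES[key]:
                code = alt[col]
                if code and not last.endswith(code):
                    new.add((digits + code, code))
                else:
                    new.add((digits, code))  # N/C, or a repeat (rule 4)
        branches = new
        i += len(key)
    # Rules 1 and 7: exactly six digits, padded with zeros.
    return sorted({(digits + "000000")[:6] for digits, _ in branches})


print(dm_soundex("Chernowitz"))  # ['496740', '596740']
print(dm_soundex("Berlin"))      # ['798600']
```

Because of rule 6 the function returns every candidate code rather than a single value, so a search should match on any of them. With these assumptions the sketch reproduces the codes quoted in the rules for Augsburg, Breuer, Mannheim, Kleinman and Berlin, and its candidate lists for Halberstadt, Chernowitz, Cherkassy, Przemysl and Rosochowaciec include the codes shown in the text.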
