NashCoding Yet Another Artificial Intelligence Blog


Implementation of Sukhotin’s Algorithm

I just read an article on HackerNews about a little-known algorithm called Sukhotin's Algorithm. The algorithm takes a dictionary of words and tries to figure out what the vowels are, based on the assumption that vowels are typically next to consonants. This sounded really cool, so I downloaded a big list of English words and implemented it.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace SukhotinsAlgorithm
    /// <summary>
    /// An implementation of Sukhotin's Algorithm for vowel identification.
    /// Based on the psuedo-code article by Matia Giovannini:
    /// Default English wordlist taken from:
    /// Note: I assume you're using the English alphabet of 26 letters, just to make
    ///       it easier on me.
    /// Author: Wesley Tansey (
    /// Date: July 7, 2010
    /// </summary>
    class Program
        const string DEFAULT_WORDS_FILE = "words.txt";
        static void Main(string[] args)
            //the first argument is the words file, or the default words.txt file if no arguments are given
            var wordsFile = args.Length == 0 ? DEFAULT_WORDS_FILE : args[0];

            //read in the words from file
            string[] words;
            using (TextReader reader = new StreamReader(wordsFile))
                words = reader.ReadAllLines();

            //frequency of each letter, called 'r' in the psuedo-code
            int[] freqs = new int[26];

            //symmetric digraph frequencies, called 'd' in the psuedo-code
            int[,] sdFreqs = new int[26, 26];

            //calculate frequencies
            foreach (var word in words)
                for (int i = 0; i < word.Length; i++)
                    //skip words with hyphens -- just an issue with my default word dictionary
                    if (word.Contains('-'))

                    //get the array index of this character
                    int cur = word[i] - 'a';

                    //increment the overall frequency of this letter

                    //increment the frequency of the current character following the previous character
                    if (i > 0)
                        int prev = word[i - 1] - 'a';
                        sdFreqs[cur, prev]++;

                        //we're generating a symmetric matrix, so we need to increment the other
                        //side unless we're on the diagnol
                        if (cur != prev)
                            sdFreqs[prev, cur]++;

            //calculate the letter with maximum frequency, called 'r.m' in the psuedo-code
            int maxFreq = 0;
            for (int i = 0; i < freqs.Length; i++)
                if (freqs[maxFreq] < freqs[i])
                    maxFreq = i;

            //The stack of vowels to find
            Stack<int> v = new Stack<int>();

            //calculate the vowels
            while (freqs[maxFreq] > 0)
                //the highest frequency letter is a vowel

                //the next maximum frequency to find, called 'j' in the psuedo-code
                int nextMaxFreq = 0;

                //remove all occurences of the current vowel
                for (int i = 0; i < freqs.Length; i++)
                    if (i != maxFreq)
                        freqs[i] -= 2 * sdFreqs[i, maxFreq];
                        freqs[i] = 0;

                    if (freqs[nextMaxFreq] < freqs[i])
                        nextMaxFreq = i;

                maxFreq = nextMaxFreq;

            //print out the results
            Console.WriteLine("Vowels found are:");
            foreach (var i in v)
                Console.WriteLine((char)('a' + i));

So does it work?

Vowels found are:

Yep! Very cool little algorithm!

Comments (3) Trackbacks (0)
  1. Very interesting algorithm,but at row 33 i have an error! I am using visual studio 2010 express and i downloaded the list in in the parentheses and typed inside the parentheses the path to the list .Here’s a screenshot of the problem:
    Can you help me,please???

  2. Looks like you’ve got escape characters in your string. If you’re going to use the \ character in a string, you need to escape the characters or make it a string literal. Just put @ before the first quotation mark. :)

  3. Thank you for your answer ! If i may,i would also like to point out one aspect that surprised me ! This algorithm is a statistical one,and i remarked that some of the facts stated in the article in which the article is presented ( tend to make this algorithm quite inexact. For example,it is mentioned in the article that,in the English language,Sukhotin’s algorithm can return “t” as a vocal due to the fact that this letter tends to occur very often near a consonant.
    Also,i can tell you that this also applies to Romanian. From what i remarked,i can tell that the number of times a letter occurs near a vowel or near a consonant determines whether if it is a vowel or a consonant.Thus,the more times a letter occurs near a vowel,the lower the possibility of that letter being a vowel.The same thing applies to consonants.
    Also,could you do a tutorial on image manipulation in C# (i know that there are some projects on the web,but they are very vague and a beginner,such as i am,can’t understand them).
    Thanks again for your response !

Leave a comment


Trackbacks are disabled.