NashCoding Yet Another Artificial Intelligence Blog

7Jul/103

Implementation of Sukhotin’s Algorithm

I just read an article on HackerNews about a little-known algorithm called Sukhotin's Algorithm. The algorithm takes a dictionary of words and tries to figure out what the vowels are, based on the assumption that vowels are typically next to consonants. This sounded really cool, so I downloaded a big list of English words and implemented it.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace SukhotinsAlgorithm
{
    /// <summary>
    /// An implementation of Sukhotin's Algorithm for vowel identification.
    /// Based on the psuedo-code article by Matia Giovannini:
    /// http://alaska-kamtchatka.blogspot.com/2010/07/sukhotins-algorithm.html
    /// 
    /// Default English wordlist taken from: http://www.mieliestronk.com/wordlist.html
    /// 
    /// Note: I assume you're using the English alphabet of 26 letters, just to make
    ///       it easier on me.
    ///       
    /// Author: Wesley Tansey (wes@nashcoding.com)
    /// Date: July 7, 2010
    /// </summary>
    class Program
    {
        const string DEFAULT_WORDS_FILE = "words.txt";
        static void Main(string[] args)
        {
            //the first argument is the words file, or the default words.txt file if no arguments are given
            var wordsFile = args.Length == 0 ? DEFAULT_WORDS_FILE : args[0];

            //read in the words from file
            string[] words;
            using (TextReader reader = new StreamReader(wordsFile))
                words = reader.ReadAllLines();

            //frequency of each letter, called 'r' in the psuedo-code
            int[] freqs = new int[26];

            //symmetric digraph frequencies, called 'd' in the psuedo-code
            int[,] sdFreqs = new int[26, 26];

            //calculate frequencies
            foreach (var word in words)
                for (int i = 0; i < word.Length; i++)
                {
                    //skip words with hyphens -- just an issue with my default word dictionary
                    if (word.Contains('-'))
                        continue;

                    //get the array index of this character
                    int cur = word[i] - 'a';

                    //increment the overall frequency of this letter
                    freqs[cur]++;

                    //increment the frequency of the current character following the previous character
                    if (i > 0)
                    {
                        int prev = word[i - 1] - 'a';
                        sdFreqs[cur, prev]++;

                        //we're generating a symmetric matrix, so we need to increment the other
                        //side unless we're on the diagnol
                        if (cur != prev)
                            sdFreqs[prev, cur]++;
                    }
                }

            //calculate the letter with maximum frequency, called 'r.m' in the psuedo-code
            int maxFreq = 0;
            for (int i = 0; i < freqs.Length; i++)
                if (freqs[maxFreq] < freqs[i])
                    maxFreq = i;

            //The stack of vowels to find
            Stack<int> v = new Stack<int>();

            //calculate the vowels
            while (freqs[maxFreq] > 0)
            {
                //the highest frequency letter is a vowel
                v.Push(maxFreq);

                //the next maximum frequency to find, called 'j' in the psuedo-code
                int nextMaxFreq = 0;

                //remove all occurences of the current vowel
                for (int i = 0; i < freqs.Length; i++)
                {
                    if (i != maxFreq)
                        freqs[i] -= 2 * sdFreqs[i, maxFreq];
                    else
                        freqs[i] = 0;

                    if (freqs[nextMaxFreq] < freqs[i])
                        nextMaxFreq = i;
                }

                maxFreq = nextMaxFreq;
            }

            //print out the results
            Console.WriteLine("Vowels found are:");
            foreach (var i in v)
                Console.WriteLine((char)('a' + i));
        }
    }
}

So does it work?


Vowels found are:
y
u
o
a
i
e

Yep! Very cool little algorithm!

Comments (3) Trackbacks (0)
  1. Very interesting algorithm,but at row 33 i have an error! I am using visual studio 2010 express and i downloaded the list in in the parentheses and typed inside the parentheses the path to the list .Here’s a screenshot of the problem: http://www.4shared.com/photo/u_jKrUhv/error.html?
    Can you help me,please???

  2. Looks like you’ve got escape characters in your string. If you’re going to use the \ character in a string, you need to escape the characters or make it a string literal. Just put @ before the first quotation mark. :)

  3. Thank you for your answer ! If i may,i would also like to point out one aspect that surprised me ! This algorithm is a statistical one,and i remarked that some of the facts stated in the article in which the article is presented (http://www.4shared.com/office/dKlzPwpB/guy1991vowelIdentification.html?) tend to make this algorithm quite inexact. For example,it is mentioned in the article that,in the English language,Sukhotin’s algorithm can return “t” as a vocal due to the fact that this letter tends to occur very often near a consonant.
    Also,i can tell you that this also applies to Romanian. From what i remarked,i can tell that the number of times a letter occurs near a vowel or near a consonant determines whether if it is a vowel or a consonant.Thus,the more times a letter occurs near a vowel,the lower the possibility of that letter being a vowel.The same thing applies to consonants.
    Also,could you do a tutorial on image manipulation in C# (i know that there are some projects on the web,but they are very vague and a beginner,such as i am,can’t understand them).
    Thanks again for your response !


Leave a comment

 

Trackbacks are disabled.