How To: Count word occurrences in a String or File using C#

The number of times a word occurs in a piece of text (often called the term frequency) is a good starting point for all kinds of text analysis metrics, such as TF*IDF or Latent Semantic Indexing.

I recently needed to perform some TF*IDF document analysis using C#, so this is how I did it:

Please note: This code is deliberately bare-bones and “just-get-the-job-done” in order to keep things as simple as possible. You don’t need to tell me that it’s non-optimal or doesn’t use exception handling etc. Also, I deliberately avoided using regex as the code was to be given to inexperienced programmers to work with.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
 
namespace SimpleTermFrequencyAnalyser
{
	class Program
	{
		static void Main(string[] args)
		{
			// Read a file into a string (this file must live in the same directory as the executable)
			string filename = "Alice-in-Wonderland.txt";
			string inputString = File.ReadAllText(filename);
 
			// Convert our input to lowercase
			inputString = inputString.ToLower();        
 
			// Define characters to strip from the input and do it
			string[] stripChars = { ";", ",", ".", "-", "_", "^", "(", ")", "[", "]",
						"0", "1", "2", "3", "4", "5", "6", "7", "8", "9" };
			foreach (string character in stripChars)
			{
				inputString = inputString.Replace(character, "");
			}
 
			// Replace newlines, tabs and carriage returns with spaces rather than removing
			// them outright, otherwise words would be concatenated across line breaks
			// (e.g. "the\nworld" would become "theworld")
			string[] whitespaceChars = { "\n", "\t", "\r" };
			foreach (string character in whitespaceChars)
			{
				inputString = inputString.Replace(character, " ");
			}
 
			// Split on spaces into a List of strings (skipping the empty entries that
			// runs of consecutive spaces would otherwise produce)
			List<string> wordList = inputString.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).ToList();
 
			// Define and remove stopwords
			string[] stopwords = new string[] { "and", "the", "she", "for", "this", "you", "but" };
			foreach (string word in stopwords)
			{
				// While there's still an instance of a stopword in the wordList, remove it.
				// If we don't use a while loop on this, each call to Remove simply removes a
				// single instance of the stopword from our wordList, and we can't call Replace
				// on the entire string (as opposed to the individual words in the string) as it's
				// too indiscriminate (i.e. removing 'and' would turn words like 'bandage' into 'bdage'!)
				while ( wordList.Contains(word) )
				{
					wordList.Remove(word);
				}
			}
 
			// Create a new Dictionary object
			Dictionary<string, int> dictionary = new Dictionary<string, int>();
 
			// Loop over all of the words in our wordList...
			foreach (string word in wordList)
			{
				// If the length of the word is at least three letters...
				if (word.Length >= 3) 
				{
					// ...check if the dictionary already has the word.
					if ( dictionary.ContainsKey(word) )
					{
						// If we already have the word in the dictionary, increment the count of how many times it appears
						dictionary[word]++;
					}
					else
					{
						// Otherwise, if it's a new word then add it to the dictionary with an initial count of 1
						dictionary[word] = 1;
					}
 
				} // End of word length check
 
			} // End of loop over each word in our input
 
			// Create a list of entries sorted by value (i.e. how many times a word occurs).
			// We keep this as a list rather than converting it back to a Dictionary, as a
			// Dictionary makes no guarantees about the order its entries are enumerated in.
			var sortedEntries = (from entry in dictionary orderby entry.Value descending select entry).ToList();
 
			// Loop through the sorted dictionary and output the top 10 most frequently occurring words
			int count = 1;
			Console.WriteLine("---- Most Frequent Terms in the File: " + filename + " ----");
			Console.WriteLine();
			foreach (KeyValuePair<string, int> pair in sortedEntries)
			{
				// Output the most frequently occurring words and the associated word counts
				Console.WriteLine(count + "\t" + pair.Key + "\t" + pair.Value);
				count++;
 
				// Only display the top 10 words then break out of the loop!
				if (count > 10)
				{
					break;
				}
			}
 
			// Wait for the user to press a key before exiting
			Console.ReadKey();
 
		} // End of Main method
 
	} // End of Program class
 
} // End of namespace

Running this on Alice’s Adventures in Wonderland gives output like this:

---- Most Frequent Terms in the File: Alice-in-Wonderland.txt ----
1       said    427
2       was     309
3       alice   262
4       that    211
5       her     204
6       with    185
7       all     161
8       had     158
9       not     130
10      very    123

Now that we have the term frequency, if we want to calculate the TF*IDF score of a document we just have the IDF (Inverse Document Frequency) part to go. The IDF is simply the log of the total number of documents divided by the number of documents containing a given keyword.

TF*IDF Worked Example

So, let’s say we search through a paper and find the term “depression” occurs 20 times:

TF = 20

Then, let’s say we have 50 papers in our corpus (i.e. collection of papers), and that 15 of those papers also contain the word “depression”:

IDF = log (50 / 15) = log (3.33333) = 0.52288

However, if the search term does not occur in any of the documents in our corpus this will lead to a divide by zero, so it's common to divide by 1 plus the number of documents containing the search term instead, which makes our IDF:

IDF = log (50 / (1 + 15)) = log (50 / 16) = log (3.125) = 0.49485

Finally, our TF*IDF value is exactly that – the TF value multiplied by the IDF value:

TF*IDF = 20 * 0.49485 = 9.897
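
If it helps to see that as code, here's a minimal C# sketch of the calculation. Note that the method name and parameters are my own invention, and base-10 logarithms are used to match the figures above:

using System;
 
class TfIdfExample
{
	// TF*IDF with "add one" smoothing applied to the document count, as described
	// above - uses base-10 logarithms to match the worked example
	static double TfIdf(int termFrequency, int totalDocuments, int documentsContainingTerm)
	{
		double idf = Math.Log10((double)totalDocuments / (1 + documentsContainingTerm));
		return termFrequency * idf;
	}
 
	static void Main()
	{
		// TF = 20, corpus of 50 papers, 15 of which contain "depression"
		Console.WriteLine(TfIdf(20, 50, 15)); // Prints approximately 9.897
	}
}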

To paraphrase Wikipedia:

A high weight in TF*IDF is reached by a high term frequency (in a given document) and a low frequency in the number of documents that contain that term. As the term appears in more documents, the ratio inside the logarithm approaches 1, bringing the IDF and TF-IDF closer to zero.
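
To illustrate that point with some made-up document counts (again using the smoothed, base-10 IDF from the worked example and a 50-document corpus), the following sketch prints IDF values falling from roughly 0.92 down to zero as more and more documents contain the term:

using System;
 
class IdfTrend
{
	static void Main()
	{
		int totalDocuments = 50;
 
		// As more documents contain the term, the ratio inside the
		// logarithm approaches 1 and the IDF approaches zero
		foreach (int docsWithTerm in new[] { 5, 15, 30, 49 })
		{
			double idf = Math.Log10((double)totalDocuments / (1 + docsWithTerm));
			Console.WriteLine(docsWithTerm + " docs contain the term -> IDF = " + idf.ToString("F5"));
		}
	}
}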

Wrap Up

The TF*IDF workings above are correct to the best of my knowledge – however if you know any better then please don’t hesitate to chime in and let me know if anything’s not quite 100%.

Also, note that newline (\n), tab and carriage return characters are replaced with spaces rather than stripped outright. If they were simply removed, words could be concatenated accidentally: for example, if “the\n” is the last word on one line and “world” is the first word on the next line down, stripping “\n” would leave “theworld” as a single word.

Aside from that – cheers & happy word counting! =D

17 thoughts on “How To: Count word occurrences in a String or File using C#”

  1. It is very helpful, but I want to convert the TF-IDF to categorical data (a discretization concept). Can you give me any hint, or maybe an example of it, please?

    1. Hi Yityal,

      I’ve never had to do anything like that, but if you can give me an example of the type of data you’re working with and the way in which you want it categorised I could have a think about it.

    1. Well, it’s probably not going to do good things on sets of continuous numbers, but it’ll count the frequencies of pretty much any kind of character sequence.

  2. Really great work! What if I need to return the frequency of a specific word that I specify, like the frequency of “say”? Could you help me please?

  3. Once the dictionary has been built up (i.e. after the loop over all the words in wordList in the main code segment), then something like this should do it:

    string desiredWord = "say";
    int numOccurrences = dictionary[desiredWord];
    Console.WriteLine(numOccurrences);

    You could wrap this up into a method something like this:

    int getWordOccurrences(Dictionary<string, int> dictionary, string word)
    {
        if ( dictionary.ContainsKey(word) ) { return dictionary[word]; }
        return 0; // If the word's not in the dictionary then return zero!
    }

    Then use it via:

    int numOccurrences = getWordOccurrences(dictionary, "say");

    Hope this helps!

    Some further reading: http://www.dotnetperls.com/dictionary.

  4. Hi, this is great work. Can you please help me with how to use the content of the dictionary to form a tree (say, a binary tree), so that each word can be encoded with a unique binary code? Thank you.

  5. So how do I determine whether the TF-IDF value means a word is important or not in a document? What are considered high or low scores?

    1. A high TF*IDF value indicates a word is strongly prevalent in a given document while not prevalent in other documents in your corpus, which would mean that the document with the highest score is likely to be the most relevant one regarding that particular word.

      “High” and “Low” scores will be relative to your particular corpus for any given word.
