How To: Count word occurrences in a String or File using C#
r3dux | October 12, 2012
The number of times a word occurs in a piece of text (often called the term frequency) is a good starting point for all kinds of different text analysis metrics like TF*IDF or Latent Semantic Indexing.
I recently needed to perform some TF*IDF document analysis using C#, so this is how I did it:
Please note: This code is deliberately bare-bones and “just-get-the-job-done” in order to keep things as simple as possible. You don’t need to tell me that it’s non-optimal or doesn’t use exception handling etc. Also, I deliberately avoided using regex as the code was to be given to inexperienced programmers to work with.
```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace SimpleTermFrequencyAnalyser
{
    class Program
    {
        static void Main(string[] args)
        {
            // Read a file into a string (this file must live in the same directory as the executable)
            string filename = "Alice-in-Wonderland.txt";
            string inputString = File.ReadAllText(filename);

            // Convert our input to lowercase
            inputString = inputString.ToLower();

            // Define characters to strip from the input and do it
            string[] stripChars = { ";", ",", ".", "-", "_", "^", "(", ")", "[", "]",
                                    "0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
                                    "\n", "\t", "\r" };
            foreach (string character in stripChars)
            {
                inputString = inputString.Replace(character, "");
            }

            // Split on spaces into a List of strings
            List<string> wordList = inputString.Split(' ').ToList();

            // Define and remove stopwords
            string[] stopwords = new string[] { "and", "the", "she", "for", "this", "you", "but" };
            foreach (string word in stopwords)
            {
                // While there's still an instance of a stopword in the wordList, remove it.
                // If we don't use a while loop on this each call to Remove simply removes a single
                // instance of the stopword from our wordList, and we can't call Replace on the
                // entire string (as opposed to the individual words in the string) as it's
                // too indiscriminate (i.e. removing 'and' will turn words like 'bandage' into 'bdage'!)
                while ( wordList.Contains(word) )
                {
                    wordList.Remove(word);
                }
            }

            // Create a new Dictionary object
            Dictionary<string, int> dictionary = new Dictionary<string, int>();

            // Loop over all the words in our wordList...
            foreach (string word in wordList)
            {
                // If the length of the word is at least three letters...
                if (word.Length >= 3)
                {
                    // ...check if the dictionary already has the word.
                    if ( dictionary.ContainsKey(word) )
                    {
                        // If we already have the word in the dictionary, increment the count of how many times it appears
                        dictionary[word]++;
                    }
                    else
                    {
                        // Otherwise, if it's a new word then add it to the dictionary with an initial count of 1
                        dictionary[word] = 1;
                    }
                } // End of word length check
            } // End of loop over each word in our input

            // Create a dictionary sorted by value (i.e. how many times a word occurs)
            var sortedDict = (from entry in dictionary
                              orderby entry.Value descending
                              select entry).ToDictionary(pair => pair.Key, pair => pair.Value);

            // Loop through the sorted dictionary and output the top 10 most frequently occurring words
            int count = 1;
            Console.WriteLine("---- Most Frequent Terms in the File: " + filename + " ----");
            Console.WriteLine();
            foreach (KeyValuePair<string, int> pair in sortedDict)
            {
                // Output the most frequently occurring words and the associated word counts
                Console.WriteLine(count + "\t" + pair.Key + "\t" + pair.Value);
                count++;

                // Only display the top 10 words then break out of the loop!
                if (count > 10)
                {
                    break;
                }
            }

            // Wait for the user to press a key before exiting
            Console.ReadKey();
        } // End of Main method
    } // End of Program class
} // End of namespace
```
Running this on Alice’s Adventures in Wonderland gives output like this:
```
---- Most Frequent Terms in the File: Alice-in-Wonderland.txt ----

1	said	427
2	was	309
3	alice	262
4	that	211
5	her	204
6	with	185
7	all	161
8	had	158
9	not	130
10	very	123
```
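One thing worth hedging about the sort step in the listing: `Dictionary<TKey, TValue>` does not document any particular enumeration order, so copying the sorted query back into a dictionary relies on how the class happens to behave rather than on a guarantee. A minimal sketch of an alternative, assuming the same kind of `Dictionary<string, int>` of counts, is to enumerate the ordered query directly and let `Take` pick the top N. The names `TopTerms` and `MostFrequent` are made up for illustration and are not part of the original program.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class TopTerms
{
    // Hypothetical helper: returns the n most frequent words, highest count first.
    public static List<KeyValuePair<string, int>> MostFrequent(Dictionary<string, int> counts, int n)
    {
        return counts.OrderByDescending(entry => entry.Value)
                     .Take(n)
                     .ToList();
    }

    static void Main()
    {
        // A few counts borrowed from the sample output above
        var counts = new Dictionary<string, int>
        {
            { "was", 309 }, { "alice", 262 }, { "said", 427 }
        };

        int rank = 1;
        foreach (KeyValuePair<string, int> pair in MostFrequent(counts, 2))
        {
            Console.WriteLine(rank + "\t" + pair.Key + "\t" + pair.Value);
            rank++;
        }
    }
}
```

Because the result is a list, the order you print in is the order the sort produced, and the manual counter-and-break logic is no longer needed.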
Now that we have the term frequency, calculating the TF*IDF score of a document just needs the IDF (Inverse Document Frequency) part, which is simply the log of the total number of documents divided by the number of documents containing a given keyword. The worked example below uses log base 10.
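Written as a formula, the description above looks roughly like this. The symbols are my own shorthand rather than anything from the original code: N is the total number of documents in the corpus, n_t is the number of documents containing the term t, and the base-10 logarithm is assumed because that is what the numbers in the worked example correspond to.

```latex
\mathrm{IDF}(t) = \log_{10}\!\left(\frac{N}{n_t}\right),
\qquad
\mathrm{TF{\ast}IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)
```

Here TF(t, d) is the raw count of the term t in document d, which is what the program above produces.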
TF*IDF Worked Example
So, let’s say we search through a paper and find the term “depression” occurs 20 times:
TF = 20
Then, let’s say we have 50 papers in our corpus (i.e. collection of papers), and that 15 of those papers also contain the word “depression”:
IDF = log (50 / 15) = log (3.33333) = 0.52288
However, if the search term does not occur in any of the documents in our corpus, the calculation becomes a divide by zero, so it's common to use 1 + the number of documents containing the search term as the divisor, which makes our IDF:
IDF = log (50 / (1 + 15)) = log (50 / 16) = log (3.125) = 0.49485
Finally, our TF*IDF value is exactly that – the TF value multiplied by the IDF value:
TF*IDF = 20 * 0.49485 = 9.897
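The arithmetic above can be sketched in C# as follows. This is only an illustration under the same assumptions as the worked example (base-10 log, and 1 added to the document count to avoid dividing by zero). The class and method names (`TfIdf`, `Idf`, `Score`) are invented for this sketch.

```csharp
using System;

static class TfIdf
{
    // IDF with 1 added to the divisor, as in the worked example.
    // totalDocs:    number of documents in the corpus
    // docsWithTerm: number of those documents containing the term
    public static double Idf(int totalDocs, int docsWithTerm)
    {
        // Cast to double so we don't get integer division
        return Math.Log10((double)totalDocs / (1 + docsWithTerm));
    }

    // TF*IDF is simply the term frequency multiplied by the IDF.
    public static double Score(int termFrequency, int totalDocs, int docsWithTerm)
    {
        return termFrequency * Idf(totalDocs, docsWithTerm);
    }

    static void Main()
    {
        // The "depression" example: 20 occurrences, 50 papers, 15 containing the term
        Console.WriteLine("IDF    = " + Idf(50, 15));
        Console.WriteLine("TF*IDF = " + Score(20, 50, 15));
    }
}
```

With the example's figures this should come out at roughly 0.49485 for the IDF and roughly 9.897 for the TF*IDF score, give or take floating-point digits.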
To paraphrase Wikipedia:
A high weight in TF*IDF is reached by a high term frequency (in a given document) and a low frequency in the number of documents that contain that term. As the term appears in more documents, the ratio inside the logarithm approaches 1, bringing the IDF and TF-IDF closer to zero.
Wrap Up
The TF*IDF workings above are correct to the best of my knowledge – however if you know any better then please don’t hesitate to chime in and let me know if anything’s not quite 100%.
Also, newline (\n) characters have to be stripped, but one implication is that words could be concatenated accidentally. For example, if “the\n” is the last word on one line and “world” is the first word on the next line down, stripping “\n” MIGHT leave “theworld” as a single word. I haven’t gone through the entire list looking for obvious word-wrap concatenations, so if you care about your results you should probably check that out.
Aside from that, cheers & happy word counting! =D
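One possible way around the concatenation problem, offered as a sketch rather than as what the original program does, is to treat newlines, carriage returns and tabs as word separators instead of stripping them. `String.Split` accepts an array of separator characters plus `StringSplitOptions.RemoveEmptyEntries`, which also drops the empty strings that runs of whitespace would otherwise produce. The names `Tokeniser` and `SplitWords` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class Tokeniser
{
    // Hypothetical helper: lowercases the input and splits it on any whitespace
    // character in the list, so "the\nworld" becomes two words rather than one.
    public static List<string> SplitWords(string input)
    {
        char[] separators = { ' ', '\n', '\r', '\t' };
        return input.ToLower()
                    .Split(separators, StringSplitOptions.RemoveEmptyEntries)
                    .ToList();
    }

    static void Main()
    {
        // Windows-style line ending between two words
        foreach (string word in SplitWords("Hello the\r\nworld"))
        {
            Console.WriteLine(word);
        }
    }
}
```

If you go this way, "\n", "\t" and "\r" would come out of the list of characters to strip, since the split already deals with them. Punctuation and digits would still need stripping as before.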

Neat!
http://koti.mbnet.fi/thamnoph/photos/garbage/benderneat.jpg
Thanks, buddy =D
It is very helpful, but I want to convert the TF-IDF to categorical data (discretization concept), so can you give me any hint or maybe an example of it, please?
Hi Yityal,
I’ve never had to do anything like that, but if you can give me an example of the type of data you’re working with and the way in which you want it categorised I could have a think about it.
The same code will work on any type of dataset, huh??
Well, it’s probably not going to do good things on sets of continuous numbers, but it’ll count frequencies of pretty much any kind of character sequences.
Pls, can you post C# code for calculating conceptual term frequency..
I don’t have any code for calculating conceptual term frequency… You could probably build your own code from this paper if you’re up to the job: (researchgate.net) A concept based model for enhancing text categorization.