How To: Count word occurrences in a String or File using C#

r3dux | October 12, 2012

The number of times a word occurs in a piece of text (often called the term frequency) is a good starting point for all kinds of different text analysis metrics like TF*IDF or Latent Semantic Indexing.

I recently needed to perform some TF*IDF document analysis using C#, so this is how I did it:

Please note: This code is deliberately bare-bones and “just-get-the-job-done” in order to keep things as simple as possible. You don’t need to tell me that it’s non-optimal or doesn’t use exception handling etc. Also, I deliberately avoided using regex as the code was to be given to inexperienced programmers to work with.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
 
namespace SimpleTermFrequencyAnalyser
{
	class Program
	{
		static void Main(string[] args)
		{
			// Read a file into a string (a relative filename like this is looked up in the current
			// working directory, which is normally the directory the executable is run from)
			string filename = "Alice-in-Wonderland.txt";
			string inputString = File.ReadAllText(filename);
 
			// Convert our input to lowercase
			inputString = inputString.ToLower();        
 
			// Define characters to strip from the input and do it
			string[] stripChars = { ";", ",", ".", "-", "_", "^", "(", ")", "[", "]",
						"0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "\n", "\t", "\r" };
			foreach (string character in stripChars)
			{
				inputString = inputString.Replace(character, "");
			}
 
			// Split on spaces into a List of strings
			List<string> wordList = inputString.Split(' ').ToList();
 
			// Define and remove stopwords
			string[] stopwords = new string[] { "and", "the", "she", "for", "this", "you", "but" };
			foreach (string word in stopwords)
			{
				// While there's still an instance of a stopword in the wordList, remove it.
				// If we don't use a while loop on this each call to Remove simply removes a single
				// instance of the stopword from our wordList, and we can't call Replace on the
				// entire string (as opposed to the individual words in the string) as it's
				// too indiscriminate (i.e. removing 'and' will turn words like 'bandage' into 'bdage'!)
				while ( wordList.Contains(word) )
				{
					wordList.Remove(word);
				}
			}
 
			// Create a new Dictionary object
			Dictionary<string, int> dictionary = new Dictionary<string, int>();
 
			// Loop over all of the words in our wordList...
			foreach (string word in wordList)
			{
				// If the length of the word is at least three letters...
				if (word.Length >= 3) 
				{
					// ...check if the dictionary already has the word.
					if ( dictionary.ContainsKey(word) )
					{
						// If we already have the word in the dictionary, increment the count of how many times it appears
						dictionary[word]++;
					}
					else
					{
						// Otherwise, if it's a new word then add it to the dictionary with an initial count of 1
						dictionary[word] = 1;
					}
 
				} // End of word length check
 
			} // End of loop over each word in our input
 
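			// Create a list of entries sorted by value (i.e. how many times a word occurs).
			// We keep the sorted entries in a List because the order in which a Dictionary
			// returns its items is not guaranteed, so copying them into a new Dictionary could lose the sort.
			var sortedDict = (from entry in dictionary orderby entry.Value descending select entry).ToList();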
 
			// Loop through the sorted dictionary and output the top 10 most frequently occurring words
			int count = 1;
			Console.WriteLine("---- Most Frequent Terms in the File: " + filename + " ----");
			Console.WriteLine();
			foreach (KeyValuePair<string, int> pair in sortedDict)
			{
				// Output the most frequently occurring words and the associated word counts
				Console.WriteLine(count + "\t" + pair.Key + "\t" + pair.Value);
				count++;
 
				// Only display the top 10 words then break out of the loop!
				if (count > 10)
				{
					break;
				}
			}
 
			// Wait for the user to press a key before exiting
			Console.ReadKey();
 
		} // End of Main method
 
	} // End of Program class
 
} // End of namespace

Running this on Alice’s Adventures in Wonderland gives output like this:

---- Most Frequent Terms in the File: Alice-in-Wonderland.txt ----
1       said    427
2       was     309
3       alice   262
4       that    211
5       her     204
6       with    185
7       all     161
8       had     158
9       not     130
10      very    123
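
The listing above reads its input from a file, but the same counting step works on any string. Here is a minimal sketch of that idea, assuming a hypothetical helper called CountWords (not part of the original program) and leaving out the character stripping and stopword removal to keep it short:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class TermCounter
{
	// Hypothetical helper (not in the original listing): counts how often each word of
	// at least minLength characters occurs in a string. Like the original, it splits on spaces only.
	public static Dictionary<string, int> CountWords(string text, int minLength = 3)
	{
		Dictionary<string, int> counts = new Dictionary<string, int>();

		foreach (string word in text.ToLower().Split(' '))
		{
			// Skip short words (this also skips the empty strings produced by double spaces)
			if (word.Length < minLength)
			{
				continue;
			}

			// TryGetValue leaves current at 0 when the word has not been seen before
			int current;
			counts.TryGetValue(word, out current);
			counts[word] = current + 1;
		}

		return counts;
	}

	static void Main()
	{
		Dictionary<string, int> counts = CountWords("the cat sat on the mat with the other cat");

		// Print the three most frequent words and their counts
		foreach (KeyValuePair<string, int> pair in counts.OrderByDescending(p => p.Value).Take(3))
		{
			Console.WriteLine(pair.Key + "\t" + pair.Value);
		}
	}
}
```

In the sample sentence “the” is counted 3 times and “cat” twice, while “on” is dropped by the length check. To count a file instead, you could pass the result of File.ReadAllText to the same helper.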

Now that we have the term frequency, the only thing left to work out for the TF*IDF score of a document is the IDF (Inverse Document Frequency) part, which is simply the log of the total number of documents divided by the number of documents containing a given keyword.
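
Written out as a formula, that description looks roughly like the following. The symbols are my own labels rather than anything from a particular library: N is the total number of documents and n_t is the number of documents containing the term t. Base 10 is an assumption, chosen because it reproduces the figures in the worked example below.

```latex
\mathrm{idf}(t) = \log_{10}\!\left(\frac{N}{n_t}\right),
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t)
```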

TF*IDF Worked Example

So, let’s say we search through a paper and find the term “depression” occurs 20 times:

TF = 20

Then, let’s say we have 50 papers in our corpus (i.e. collection of papers), and that 15 of those papers contain the word “depression”. Taking the log to base 10:

IDF = log (50 / 15) = log (3.33333) = 0.52288

However, if the search term does not occur in any of the documents in our corpus this will lead to a divide by zero, so it’s common to divide by 1 + the number of documents containing the search term instead, which makes our IDF:

IDF = log (50 / (1 + 15)) = log (50 / 16) = log (3.125) = 0.49485

Finally, our TF*IDF value is exactly that – the TF value multiplied by the IDF value:

TF*IDF = 20 * 0.49485 = 9.897
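
The same arithmetic can be sketched in C#. The class and method names here are made up for illustration, and the sketch assumes the base 10 log and the “1 + document count” divisor used in the worked example:

```csharp
using System;
using System.Globalization;

static class TfIdfExample
{
	// Hypothetical helper: IDF with the "1 + document count" divisor, i.e. log10(N / (1 + n))
	public static double InverseDocumentFrequency(int totalDocuments, int documentsWithTerm)
	{
		// Cast to double so we don't accidentally perform integer division
		return Math.Log10((double)totalDocuments / (1 + documentsWithTerm));
	}

	// Hypothetical helper: TF*IDF is just the term frequency multiplied by the IDF
	public static double TfIdf(int termFrequency, int totalDocuments, int documentsWithTerm)
	{
		return termFrequency * InverseDocumentFrequency(totalDocuments, documentsWithTerm);
	}

	static void Main()
	{
		// Values from the worked example: TF = 20, 50 papers in the corpus, 15 containing the term
		double idf = InverseDocumentFrequency(50, 15);
		double score = TfIdf(20, 50, 15);

		Console.WriteLine("IDF    = " + idf.ToString("F5", CultureInfo.InvariantCulture));
		Console.WriteLine("TF*IDF = " + score.ToString("F3", CultureInfo.InvariantCulture));
	}
}
```

Run with the numbers from the example, this should reproduce the IDF of 0.49485 and the TF*IDF score of 9.897.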

To paraphrase Wikipedia:

A high weight in TF*IDF is reached by a high term frequency (in a given document) and a low frequency in the number of documents that contain that term. As the term appears in more documents, the ratio inside the logarithm approaches 1, bringing the IDF and TF-IDF closer to zero.

Wrap Up

The TF*IDF workings above are correct to the best of my knowledge – however if you know any better then please don’t hesitate to chime in and let me know if anything’s not quite 100%.

Also, newline (\n) characters have to be stripped, but the implication is that words can be concatenated accidentally. For example, if “the” is the last word on one line and “world” is the first word on the next line down, stripping “\n” MIGHT leave “theworld” as a single word. I haven’t gone through the entire list looking for obvious word-wrap concatenations, so if you care about your results you should prolly check that out.
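
One possible way around the concatenation problem (a variation, not what the listing above does) is to leave "\n", "\t" and "\r" out of the list of characters to strip and split on them instead. A minimal sketch, with a made-up helper name:

```csharp
using System;

static class WhitespaceSplitExample
{
	// Hypothetical helper: splits on spaces, newlines, tabs and carriage returns,
	// so words on either side of a line break stay separate
	public static string[] SplitIntoWords(string text)
	{
		char[] separators = { ' ', '\n', '\t', '\r' };

		// RemoveEmptyEntries drops the empty strings left by runs of separators (e.g. "\r\n")
		return text.Split(separators, StringSplitOptions.RemoveEmptyEntries);
	}

	static void Main()
	{
		string[] words = SplitIntoWords("hello the\nworld");

		// Prints: hello|the|world
		Console.WriteLine(string.Join("|", words));
	}
}
```

Be aware that the word counts would then differ from the output shown earlier, because previously joined words get counted separately.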

Aside from that, cheers & happy word counting! =D

8 Responses to “How To: Count word occurrences in a String or File using C#”

  1. shetboy says:
    October 13, 2012 at 12:11 am

    Neat!
    http://koti.mbnet.fi/thamnoph/photos/garbage/benderneat.jpg

    • r3dux says:
      October 13, 2012 at 8:34 am

      Thanks, buddy =D

  2. Yityal says:
    January 31, 2013 at 1:03 am

    it is very helpful but i want to convert the TF-IDF to categorical data (discritaization consent) so can u give me any hint or may be example on it please

    • r3dux says:
      January 31, 2013 at 11:33 am

      Hi Yityal,

      I’ve never had to do anything like that, but if you can give me an example of the type of data you’re working with and the way in which you want it categorised I could have a think about it.

  3. japes says:
    January 31, 2013 at 8:40 pm

    same coding will work of any type of dataset huh??

    • r3dux says:
      February 1, 2013 at 7:06 am

      Well, it’s probably not going to do good things on sets of continuous numbers, but it’ll count frequencies of pretty much any kind of character sequences.

  4. japes says:
    February 4, 2013 at 3:30 pm

    pls, can you post c# coding for calculating conceptual term frequency..

    • r3dux says:
      February 5, 2013 at 11:17 am

      I don’t have any code for calculating conceptual term frequency… You could probably build your own code from this paper if you’re up to the job: (researchgate.net) A concept based model for enhancing text categorization.
