How To: Count word occurrences in a String or File using C#

The number of times a word occurs in a piece of text (often called the term frequency) is a good starting point for all kinds of different text analysis metrics like TF*IDF or Latent Semantic Indexing.

I recently needed to perform some TF*IDF document analysis using C#, so this is how I did it:

Please note: This code is deliberately bare-bones and “just-get-the-job-done” in order to keep things as simple as possible. You don’t need to tell me that it’s non-optimal or doesn’t use exception handling etc. Also, I deliberately avoided using regex as the code was to be given to inexperienced programmers to work with.
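The original code listing hasn't survived the page formatting here, but a bare-bones sketch along the lines described (no regex, no exception handling; the sample text is just a placeholder) would look something like this:

```csharp
using System;
using System.Collections.Generic;

class WordCounter
{
    static void Main()
    {
        // For a file you'd load the text with File.ReadAllText(path) instead
        string text = "The world is all that is the case.";

        // Split on whitespace and basic punctuation so "case." and "case"
        // count as the same word
        char[] separators = { ' ', '\n', '\r', '\t', '.', ',', ';', ':',
                              '!', '?', '"', '(', ')' };
        string[] words = text.ToLower()
                             .Split(separators, StringSplitOptions.RemoveEmptyEntries);

        // Tally each word into a dictionary of word -> occurrence count
        var wordCounts = new Dictionary<string, int>();
        foreach (string word in words)
        {
            if (wordCounts.ContainsKey(word))
                wordCounts[word]++;
            else
                wordCounts[word] = 1;
        }

        foreach (KeyValuePair<string, int> pair in wordCounts)
            Console.WriteLine(pair.Key + ": " + pair.Value);
    }
}
```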

Running this on Alice’s Adventures in Wonderland gives output like this:

Now that we have the term frequency, if we want to calculate the TF*IDF score of a document we just have the IDF (Inverse Document Frequency) part to go – which is simply the log of the total number of documents divided by the number of documents containing a given keyword.
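In code that's a one-liner (the numbers and variable names below are made up for illustration, and I'm assuming the natural log):

```csharp
using System;

class IdfExample
{
    static void Main()
    {
        int totalDocuments = 1000;     // hypothetical corpus size
        int docsContainingTerm = 10;   // hypothetical number of matching documents

        // IDF = log(total documents / documents containing the keyword)
        double idf = Math.Log((double)totalDocuments / docsContainingTerm);

        Console.WriteLine(idf); // log(100), roughly 4.605
    }
}
```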

TF*IDF Worked Example

So, let’s say we search through a paper and find the term “depression” occurs 20 times – that gives us a term frequency (TF) of 20.

Then, let’s say we have 50 papers in our corpus (i.e. collection of papers), and that 15 of those papers also contain the word “depression” – which makes our IDF log(50 ÷ 15).

However, if the search term does not occur in any of the documents in our corpus we get a divide by zero, so it’s common to divide by 1 + the number of documents containing the search term instead – which makes our IDF log(50 ÷ (1 + 15)) = log(50 ÷ 16).

Finally, our TF*IDF value is exactly that – the TF value multiplied by the IDF value: 20 × log(50 ÷ 16).
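Putting the worked example into code (the post's figures don't specify a log base, so treating it as the natural log is an assumption on my part):

```csharp
using System;

class TfIdfWorkedExample
{
    static void Main()
    {
        double tf = 20.0;           // "depression" occurs 20 times in the paper
        double totalDocs = 50.0;    // papers in the corpus
        double docsWithTerm = 15.0; // papers containing "depression"

        // Plain IDF -- breaks with a divide-by-zero if docsWithTerm is 0
        double idf = Math.Log(totalDocs / docsWithTerm);               // ~1.204

        // Smoothed IDF -- add 1 to the document count to dodge the divide-by-zero
        double idfSmoothed = Math.Log(totalDocs / (1 + docsWithTerm)); // ~1.139

        // TF*IDF is just the two multiplied together
        double tfIdf = tf * idfSmoothed;                               // ~22.79

        Console.WriteLine("TF*IDF: " + tfIdf);
    }
}
```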

To paraphrase Wikipedia:

A high weight in TF*IDF is reached by a high term frequency (in a given document) and a low frequency in the number of documents that contain that term. As the term appears in more documents, the ratio inside the logarithm approaches 1, bringing the IDF and TF-IDF closer to zero.

Wrap Up

The TF*IDF workings above are correct to the best of my knowledge – however, if you know better then please don’t hesitate to chime in and let me know if anything’s not quite 100%.

Also, newline (\n) characters have to be stripped, but the implication is that words could be accidentally concatenated – for example, if “the\n” is the last word on one line and “world” is the first word on the next line down, stripping “\n” might leave “theworld” as a single word. I haven’t gone through the entire list looking for obvious word-wrap concatenations, so if you care about your results you should probably check that out.
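One way to sidestep that concatenation problem entirely is to treat “\n” as a split delimiter rather than stripping it out of the string first – a quick sketch:

```csharp
using System;

class SplitExample
{
    static void Main()
    {
        // "the" ends one line, "world" starts the next
        string text = "the\nworld";

        // Splitting on '\n' (and friends) keeps "the" and "world" separate,
        // whereas replacing "\n" with "" would yield "theworld"
        string[] words = text.Split(new[] { ' ', '\n', '\r', '\t' },
                                    StringSplitOptions.RemoveEmptyEntries);

        Console.WriteLine(words.Length); // 2
    }
}
```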

Aside from that – cheers & happy word counting! =D

15 thoughts on “How To: Count word occurrences in a String or File using C#”

  1. it is very helpful but I want to convert the TF-IDF to categorical data (discretization concept), so can you give me any hint or maybe an example of it please

    1. Hi Yityal,

      I’ve never had to do anything like that, but if you can give me an example of the type of data you’re working with and the way in which you want it categorised I could have a think about it.

    1. Well, it’s probably not going to do good things on sets of continuous numbers, but it’ll count frequencies of pretty much any kind of character sequences.

  2. really great work – what if I need to return the frequency of a specific word I choose, like the frequency of “say”? Could you help me please

  3. Once the dictionary has been built up (for example, after line 69 in the main code segment), then something like this should do it:
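    (The snippet itself didn’t survive the page formatting – assuming the counts live in a Dictionary<string, int> called wordCounts, it would be something along these lines, with stand-in data:)

```csharp
using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        // Stand-in for the dictionary built up by the word-counting code
        var wordCounts = new Dictionary<string, int> { { "say", 7 }, { "alice", 12 } };

        // Look up the count for one specific word
        int count;
        if (wordCounts.TryGetValue("say", out count))
            Console.WriteLine("\"say\" occurs " + count + " times.");
        else
            Console.WriteLine("\"say\" does not occur.");
    }
}
```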

    You could wrap this up into a method something like this:

    Then use it via:
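    (Again the original snippets didn’t survive the formatting – a sketch of the method plus a call-site might look like this; GetWordFrequency is just an illustrative name, and the dictionary contents are stand-in data:)

```csharp
using System;
using System.Collections.Generic;

class Program
{
    // Returns how many times a word occurs, or 0 if it isn't in the dictionary
    static int GetWordFrequency(Dictionary<string, int> wordCounts, string word)
    {
        int count;
        return wordCounts.TryGetValue(word, out count) ? count : 0;
    }

    static void Main()
    {
        // Stand-in for the dictionary built by the word-counting code
        var wordCounts = new Dictionary<string, int> { { "say", 7 } };

        Console.WriteLine(GetWordFrequency(wordCounts, "say"));  // 7
        Console.WriteLine(GetWordFrequency(wordCounts, "plum")); // 0
    }
}
```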

    Hope this helps!

    Some further reading: http://www.dotnetperls.com/dictionary.

  4. Hi, this is great work. Can you please help me with how to use the content of the dictionary to form a tree (say, a binary tree), so that each word can be encoded with a unique binary code? Thank you.

  5. So how do I determine whether the TF-IDF value means a word is important or not in a document? What are considered high or low scores?

    1. A high TF*IDF value indicates a word is strongly prevalent in a given document while not prevalent in other documents in your corpus, which would mean that the document with the highest score is likely to be the most relevant one regarding that particular word.

      “High” and “Low” scores will be relative to your particular corpus for any given word.
