20081113

Text Analysis - Continued

Following on from my previous post, here is a new screenshot of a tag cloud sorted alphabetically.
Still using the text from Alice in Wonderland, we now get a much clearer view of which words repeat most often.

Here is the alphabetically sorted cloud:

I have also started eliminating very common words by building up an array of those that provide little or no associative meaning. Even so, many words still fall below the low-frequency threshold, as can be seen by the red tags.
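A minimal sketch of that kind of filter, in Python. The stop-word list here is a tiny made-up sample rather than the list actually used, the name `remove_common` is hypothetical, and a set stands in for the array because membership checks are cheaper:

```python
# Hypothetical sample of common words; a real list would be much longer.
STOP_WORDS = {"the", "and", "a", "to", "of", "it", "in", "was"}

def remove_common(words, stop_words=STOP_WORDS):
    """Return the words that are not in the stop-word set (case-insensitive)."""
    return [w for w in words if w.lower() not in stop_words]

words = "the rabbit and the queen sat in the garden".split()
filtered = remove_common(words)
print(filtered)
```

Filtering before counting keeps the common words out of the average frequency as well, which shifts the threshold for everything else.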

Lastly, no points for style; I am definitely no designer.

Next on the table is some research into psychological concepts: affect in text, emotion portrayed through words, and how to determine what feeling is being conveyed.

20081112

Data + Context = Information

Lately I have found myself gaining a renewed interest in data, how it can be analyzed, and how one can apply context to a data set and extract useful information. As a developer, online user-driven text content especially tickles my brain.

So much information lies hidden underneath mountains of ASCII, where reading between the lines turns up more interesting facts than the text itself.

So, as step #1 into the info analysis game, I decided to code a basic tag cloud generator.
Input a wad of text, split it into an array of words, and count how often each word occurs; each unique word then goes into a new array along with its frequency.
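The counting step could look roughly like this in Python. This is a sketch, not the generator itself: the tokenizing rule (lowercase, runs of letters and apostrophes) and the name `word_frequencies` are assumptions, and `collections.Counter` takes the place of the hand-built array of unique words.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Split text into lowercase words and count how often each one occurs."""
    # Assumed tokenizing rule: a word is a run of letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

sample = "Alice was beginning to get very tired. Alice was bored."
freqs = word_frequencies(sample)
print(freqs.most_common(3))
```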

So now we have a new set of data about the text... hardly useful at this point.

I then assign a color and size weighting based on the frequency of each word and print the contents to the screen. Now I'm seeing something my brain can make some sense of.

Here's what I have so far, based on the plain text version of Alice in Wonderland.

As can be seen, I have not filtered out common words in this screenshot; the filter is there, but I disabled it to get the screen filled nicely. Words below the viewing threshold are those with a frequency of less than half the average frequency across all words.
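As a rough illustration of that threshold in Python: the average is taken over the unique words, and anything under half of it is set aside. The function name `split_by_threshold` and the sample counts are made up for the example, and what happens to the words below the threshold (hidden, or just styled differently) is left open.

```python
def split_by_threshold(freqs):
    """Split word counts into (at or above, below) half the average frequency."""
    if not freqs:
        return {}, {}
    average = sum(freqs.values()) / len(freqs)
    threshold = average / 2
    above = {w: c for w, c in freqs.items() if c >= threshold}
    below = {w: c for w, c in freqs.items() if c < threshold}
    return above, below

# Made-up counts: the average is 16 / 5 = 3.2, so the threshold is 1.6.
freqs = {"alice": 8, "rabbit": 4, "queen": 2, "tea": 1, "hat": 1}
above, below = split_by_threshold(freqs)
print(above)
print(below)
```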

This average frequency is also used as the weighting for color and size on each word.
I kept the weighting dynamic, so that a small body of text produces output similar to that of a larger one.
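One way to keep the weighting dynamic is to style each word by the ratio of its count to the average, so only relative frequency matters. The sketch below assumes pixel sizes and color buckets that are purely illustrative (only red for low-frequency words comes from the screenshots); `tag_style` is a hypothetical name.

```python
def tag_style(count, average, base_px=14, min_px=10, max_px=48):
    """Return (font size in px, color) for a word, relative to the average count."""
    ratio = count / average
    # Scale the base size by the ratio, clamped to a readable range.
    size = max(min_px, min(max_px, int(base_px * ratio)))
    if ratio < 0.5:
        color = "red"    # below the low-frequency threshold
    elif ratio < 2:
        color = "gray"   # assumed bucket for near-average words
    else:
        color = "blue"   # assumed bucket for frequent words
    return size, color

print(tag_style(4, 4))   # a word with exactly the average count
print(tag_style(1, 4))   # a rare word
print(tag_style(40, 4))  # a very frequent word
```

Because only the ratio is used, a word that occurs 4 times in a text averaging 4 gets the same style as a word that occurs 8 times in a text averaging 8.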

Sorting this list alphabetically, rather than by frequency as seen above, makes for a very interesting visual experience, more akin to tag clouds.
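The two orderings differ only in the sort key. A small Python sketch with made-up counts:

```python
freqs = {"queen": 2, "alice": 8, "rabbit": 4}

# Most frequent first, as in the screenshot above.
by_frequency = sorted(freqs.items(), key=lambda item: item[1], reverse=True)

# Alphabetical by word, the usual tag cloud layout.
alphabetical = sorted(freqs.items(), key=lambda item: item[0])

print(by_frequency)
print(alphabetical)
```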

My next post will likely have more screens with different interpretations of the same data set, and thoughts on where to go next in extracting further useful knowledge from plain text.