20081112

Data + Context = Information

Lately I have found myself with a renewed interest in data: how it can be analyzed, and how one can apply context to a data set to extract useful information. As a developer, user-driven online text content especially tickles my brain.

So much information lies hidden underneath mountains of ASCII, where reading between the lines turns up more interesting facts than the text content itself.

So as step #1 into the info analysis game, I decided to code a basic tag cloud generator.
Input a wad of text, split it into an array of words, and count the frequency of each word. Each unique word is then placed into a new array along with its frequency.
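
To make the counting step concrete, here is a minimal sketch in Python. It is not the code behind the screenshot; the function name `word_frequencies`, the lowercasing, and the rule for what counts as a word are all assumptions made for illustration.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Split text into words and count how often each unique word occurs.

    Assumption: words are runs of letters, optionally with one inner
    apostrophe (so "don't" stays whole), compared case-insensitively.
    """
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(words)

# Hypothetical sample text, just to show the shape of the result.
freqs = word_frequencies("The cat and the hat and the bat")
for word, count in freqs.most_common():
    print(word, count)
```

The result maps each unique word to its count, which is the "new array of unique words along with its frequency" described above.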

So now we have a new set of data about the text... hardly useful at this point.

I then assign a color and size weighting based on the frequency of each word and print the contents to the screen. Now I'm seeing something my brain can make some sense of.

Here's what I have so far, based on the plain text version of Alice in Wonderland.

As can be seen, I have not filtered out common words in this screenshot. The filter is there, but I disabled it to get the screen filled nicely. Words below the viewing threshold are those with a frequency less than half the average frequency across all words.
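
The threshold rule can be sketched roughly like this in Python. The function name `visible_words` and the sample counts are made up for illustration, and "average" is assumed to mean the total of all counts divided by the number of unique words.

```python
def visible_words(freqs):
    """Keep only words whose count is at least half the average count.

    Assumption: the average is the sum of all counts divided by the
    number of unique words.
    """
    if not freqs:
        return {}
    average = sum(freqs.values()) / len(freqs)
    threshold = average / 2
    return {word: count for word, count in freqs.items() if count >= threshold}

# Hypothetical counts, not taken from the real Alice in Wonderland run.
sample = {"alice": 10, "said": 6, "rabbit": 3, "croquet": 1}
shown = visible_words(sample)
print(shown)
```

With these sample numbers the average is 5, so the threshold is 2.5 and only "croquet" drops out of view.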

This average frequency is also used as the weighting for color and size on each word.
I kept the weighting dynamic, so that a small body of text produces output similar to a much larger one.
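
One way such a dynamic weighting could work is sketched below in Python. The exact mapping used for the screenshot is not spelled out here, so everything in this block is an assumption: the function name `style_for`, the pixel sizes, the red-to-blue color scale, and the choice that four times the average counts as the "hottest" word.

```python
def style_for(count, average, base_px=12, min_px=8, max_px=48):
    """Return a (font size in px, hex color) pair for one word.

    The weight is relative to the average frequency, so it does not
    depend on how long the text is. All constants are illustrative.
    """
    weight = count / average
    size_px = max(min_px, min(max_px, int(base_px * weight)))
    # Assumption: words at 4x the average or more get the full "hot" color.
    heat = min(1.0, weight / 4)
    red = int(255 * heat)
    blue = 255 - red
    return size_px, f"#{red:02x}00{blue:02x}"

# A word at twice the average looks the same in a short and a long text.
small_text_style = style_for(10, average=5)
large_text_style = style_for(1000, average=500)
print(small_text_style, large_text_style)
```

Because only the ratio of count to average matters, the two calls return the same style, which is the point of keeping the weighting dynamic.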

Sorting this list alphabetically, rather than by frequency as seen above, makes for a very interesting visual experience more akin to tag clouds.
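
As a small illustration of the two orderings, here is a Python sketch. The counts are invented and the variable names are only for this example.

```python
# Hypothetical word counts, for illustration only.
freqs = {"queen": 7, "alice": 10, "rabbit": 3, "hatter": 5}

# Ordered by frequency, most frequent first.
by_frequency = sorted(freqs.items(), key=lambda item: item[1], reverse=True)

# Ordered alphabetically, the usual tag cloud layout.
alphabetical = sorted(freqs.items())

print([word for word, _ in by_frequency])
print([word for word, _ in alphabetical])
```

In the alphabetical layout the size and color still carry the frequency, so big and small words end up mixed together rather than sorted into a gradient.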

My next post will likely have more screens with different interpretations of the same data set, along with thoughts on where to go next in extracting further useful knowledge from plain text.
