20090121

Barack Obama's Inaugural Speech

With the inauguration of, and speech from, Barack Hussein Obama, I could not resist the urge to once again play with my (very incomplete) tag cloud generator.

The speech transcript contained 893 unique words; I filtered out the most common words that carry little meaning and sorted the remainder alphabetically.

Size and color reflect how frequently each word was used across the entire speech.

So without any more beating around the Bush (har har):

The result gives a clear indication of what his main concerns and topics of future conversation are going to be, with almost all of his most frequently used words carrying a positive connotation.
An interesting observation: the word war was used only twice, while peace appeared four times.

Here follows a list of the most frequently recurring words and their counts (a short sketch for producing such a grouping follows the list):
  • (4 times) crisis, come, through, power, words, seek, women, peace, up, things, before, whether, greater, men, long, end, meet.
  • (5 times) know, nor, generation, spirit, only, day, more.
  • (6 times) cannot, time, work, world, too, now, common.
  • (7 times) less, people, no, today, america.
  • (8 times) because, been, every, must.
  • (9 times) these, all.
  • (11 times) new.
  • (12 times) nation.
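
For anyone curious how such a grouped listing can be produced: a minimal sketch in Python, assuming the word counts have already been computed and filtered (the function name and cutoff are illustrative, not my actual code):

```python
from collections import defaultdict

def group_by_frequency(counts, minimum=4):
    """Bucket words by how often they occurred, dropping rare ones."""
    groups = defaultdict(list)
    for word, n in counts.items():
        if n >= minimum:
            groups[n].append(word)
    # One line per frequency bucket, least frequent first.
    for n in sorted(groups):
        print(f"({n} times) {', '.join(sorted(groups[n]))}.")
```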

20081112

Data + Context = Information

Lately I have found myself gaining a renewed interest in data: how it can be analyzed, and how one can apply context to a data set and extract useful information. As a developer, online user-driven text content especially tickles my brain.

So much information lies hidden underneath mountains of ASCII, where reading between the lines turns up more interesting facts than the text content itself.

So as step #1 into the info analysis game I decided to code a basic tag cloud generator.
Input a wad of text, split it into an array of words, count the frequency of each word, and store each unique word in a new array along with its frequency.
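
A minimal sketch of that step in Python; the stop-word set here is a stand-in for my real common-word filter:

```python
import re
from collections import Counter

# Stand-in stop-word list; the real common-word filter is much longer.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "is", "it"}

def word_frequencies(text, filter_common=True):
    """Split text into lowercase words and count each unique word."""
    words = re.findall(r"[a-z']+", text.lower())
    if filter_common:
        words = [w for w in words if w not in STOP_WORDS]
    return Counter(words)  # maps each unique word to its frequency
```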

So now we have a new set of data about the text... hardly useful at this point.

I then assign each word a color and size weighting based on its frequency and print the contents to the screen; now I'm seeing something my brain can make some sense of.

Here's what I have so far, based on the plain-text version of Alice in Wonderland.

As can be seen, I have not filtered out common words in this screenshot; the filter is there, but I disabled it to fill the screen nicely. Words below the viewing threshold are those with a frequency less than half the average of all the word frequencies.

This average frequency is also used as the basis for each word's color and size weighting.
I kept the weighting dynamic, so that smaller bodies of text produce output similar to that of larger ones.
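
In code, that weighting scheme might look roughly like this; the font-size range and grayscale ramp are illustrative choices, with only the half-average threshold and average-relative weighting coming from the description above:

```python
def weight_words(counts):
    """Assign each word a size and color relative to the average frequency."""
    average = sum(counts.values()) / len(counts)
    threshold = average / 2  # viewing threshold: below this, a word is hidden
    cloud = {}
    for word, n in counts.items():
        if n < threshold:
            continue
        weight = n / average                     # 1.0 means exactly average
        size = min(10 + int(weight * 8), 48)     # font size in pixels
        shade = max(0, 160 - int(weight * 40))   # darker = more frequent
        cloud[word] = (size, f"rgb({shade}, {shade}, {shade})")
    return cloud
```

Because everything is scaled against the average rather than against absolute counts, a short text and a long one yield similar-looking output, which is the dynamic part.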

Sorting this list alphabetically rather than by frequency, as seen above, makes for a very interesting visual experience, more akin to traditional tag clouds.

My next post will likely feature more screenshots with different interpretations of the same data set, along with thoughts on where to go next in extracting further useful knowledge from plain text.