So much information lies hidden beneath mountains of ASCII, where reading between the lines turns up more interesting facts than the text itself.
So as step #1 in the info analysis game, I decided to code a basic tag cloud generator.
Input a wad of text, split it into an array of words, and count how often each word occurs. The result is a new array of unique words, each paired with its frequency.
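Here's a minimal sketch of that counting step in Python. It's not my actual code: the function name `count_words`, the lowercasing, and the regex used to split words are all assumptions made for the sake of the example.

```python
import re
from collections import Counter

def count_words(text):
    """Split text into lowercase words and count how often each occurs.

    Assumption: a "word" is a run of letters and apostrophes; the real
    splitting rule may differ.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

counts = count_words("Alice was beginning to get very tired. Alice was bored.")
```

`Counter` behaves like a dictionary of unique words mapped to their frequencies, which is the same shape as the "new array" described above.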
So now we have a new set of data about the text... hardly useful at this point.
I then assign a color and size weighting based on the frequency of each word and print the contents to the screen. Now I'm seeing something my brain can make some sense of.
Here's what I have so far, based on the plain text version of Alice in Wonderland.
As can be seen, I have not filtered out common words in this screenshot. The filter is there, but I disabled it to get the screen filled nicely. Words fall below the viewing threshold when their frequency is less than half the average of all the word frequencies.
This average frequency is also used as the weighting for color and size on each word.
I kept the weighting dynamic, so that a smaller body of text produces output similar to a larger one.
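To make the threshold and weighting concrete, here's a rough Python sketch. The function name `weight_words`, the `base_size` and `max_size` values, and the choice to scale font size linearly are assumptions for illustration; color could be derived from the same weight in a similar way.

```python
def weight_words(counts, base_size=16, max_size=48):
    """Return {word: font_size} for words at or above the viewing threshold.

    base_size and max_size are illustrative values, not taken from the
    original program.
    """
    if not counts:
        return {}
    average = sum(counts.values()) / len(counts)
    threshold = average / 2  # words below half the average are hidden
    sizes = {}
    for word, count in counts.items():
        if count < threshold:
            continue
        # Weight is relative to this text's own average, so short and
        # long texts end up with a similar spread of sizes.
        weight = count / average
        sizes[word] = min(max_size, round(base_size * weight))
    return sizes

sample = {"the": 8, "alice": 4, "rabbit": 2, "hole": 1, "tea": 1}
sizes = weight_words(sample)
```

In this sample the average frequency is 3.2, so the threshold is 1.6 and the two words that appear only once are dropped.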
Sorting this list alphabetically, rather than by frequency as seen above, makes for a very interesting visual experience more akin to tag clouds.
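The two orderings are easy to compare side by side. This is just a sketch with made-up sample counts, not the Alice in Wonderland data:

```python
# Hypothetical sample counts, for illustration only.
counts = {"rabbit": 2, "the": 8, "alice": 4}

# Most frequent word first, as in the screenshot.
by_frequency = sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Alphabetical by word, the more traditional tag cloud layout.
alphabetical = sorted(counts.items())
```

Since size and color already carry the frequency, the alphabetical order loses no information and makes individual words easier to find.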
My next post will likely have more screens with different interpretations of the same data set, and some thoughts on where to go next with extracting further useful knowledge from plain text.