As of late I’ve been playing with lots of techniques to churn through a corpus of information related to me that I have scraped from the web. Many of these techniques pull back topics, phrases, documents and tell me that they have been identified as important; it is then left to me to work out why. I’ve been thinking about methods I could use that could give me more of an insight into what the topics, phrases and documents that are pulled back actually mean.
While working on data mining MOOC I came across the C4.5 algorithm, an algorithm used to generate a decision tree. Wondering how exactly the algorithm chooses where attributes should sit in the tree, I decided to poke around and found it does so using a concept called information entropy.
While I’m still in kick the tires mode at the moment I find the concept very interesting and am wondering how I could plug something around this into the researcher api/personal corpus. From what I can gather information entropy quantifies how much information is in a message within a communications system. It’s like a scale for information, measuring in units called bits
Communication can take many forms, this blog post, my youtube videos, a picture, somebody whistles, a drawing. etc etc and it seems that the problem in information theory is transmission and receiving of data through these form of communication.
Now I have a large array of data that is essentially communication data and want to find out how important these bits of communications are to the things I want to say. From what I can gather information theory decomposes my communication data into two fields, efficient and reliability. Efficiency is how the communication is compressed, is my method of communication through text, a video, whatever (I guess you may think winzip?) and reliability is about techniques that helps me understand things that have gone wrong/missing in the communication (think about the technologies that let you watch a DVD even after your little sister has scratched it). These fields are broken up further into information theory and coding theory. Where do my R scripts fit in?
|Efficiency (Compression?)||Reliability (Error Correction?)|
|Information theory||Lossless: Source Coding TheormKraft-McMillon
|Noisy Channel Coding Thm|
|Coding theory||HuffmanArithmetic coding
|Hemming CodesBCH Coddes