Thursday, July 11, 2013

BIG data and data mining

In my household, big data is most directly related to the piles of LEGOs (or LEGO-system building components) that my boys have scattered around the house. Needles in haystacks are more often used as examples. My library of books around the house would be another example. In each case, big data basically means a lot of data.

A lot of anything, of course, is subjective. There are thousands of pieces of straw in a haystack. There are a few thousand books around my house. My boys have a couple of thousand LEGOs. However, in the world of business (and surveillance), big data usually refers to hundreds of thousands (or even millions) of records -- each of which may contain many minutes (audio) or many members (items sold in purchase records or words in emails, for example). Big data is just a way to describe lots of data.

Data mining is the process of finding that special yellow 2 by 2 LEGO in the pile, or finding the needle in the haystack, or finding a specific audio record that talks about things that are considered suspicious or dangerous.

Data mining has three basic components -- collection, storage, and analysis. These are not necessarily discrete stages but we'll discuss them separately (calling out exceptions).

As evidenced by the physical examples at the beginning of this post, big data has always existed. Consider the stacks of paper birth certificates, or other historical documents, that exist and may need, from time to time, to be searched. Handling and using big data effectively has become much easier since electronic formats became standard.

  • Collection. Collection usually occurs at the time of transmission (when data moves from its origin to a destination). This might be a phone call. It could be at a point-of-sale (POS) cash register after the order has been finalized. It might be the registration record for a class. Collection may occur either at the intended destination (the company invoice/purchase order database) or via interception. Interception is collection that occurs somewhere other than the intended destination -- "wiretapping", people looking over your shoulder when you enter your credit card security information, and so forth.

    Collection can be anonymous or personalized. Personalized basically means that the record is associated with a corporate or living entity. In the case of a sale at a grocery store, the data will be associated with that store (and, possibly, that cashier and cash register). If you use a credit/debit card or a store "club" card, then the data can (and probably will) also be associated with you. Generally, anonymous collection is considered innocuous while personalized collection is not. This does not mean there are no "legitimate" (proper, honorable) reasons to collect personal data, but it does mean that the person may have concerns about the purpose and safety of the data.

  • Storage. This always occurs at some point. However, it may be transitory if the data are removed upon receipt and analysis. Consider a "normal" phone call. The audio message exists (and is stored) from the origination (talking) until the receiving person analyzes it. If the message is redirected (to voice mail, for example), intercepted, or copied, this may turn into a permanent record requiring long-term storage.

    Transactions (purchases, registration, email correspondence) where the data needs to be used in the future are almost all "permanently" stored. Of course, they can still be deleted in the future -- but, without advance knowledge of when, or if, this will occur they must be considered permanent.

  • Analysis. This can occur during the process of collection or it may occur later (after storage). Anonymous data is often analyzed statistically. How many of product X were sold by store Y in city Z? How many of product X were sold in state B? How long is the average voice call within a state? Trends can be analyzed over time. Store Y in city Z sold NN of product X at one price and GG of product X at another price (this can be used to determine overall profit by weighing margin against quantity sold). Product F sells very well during periods D through G but not very well during periods H through M (a seasonal item to be stocked differently depending on the time of year).

    Analysis can also be personalized. Customer ABC buys a lot of product F. Product G is similar but has a greater profit margin -- send Customer ABC coupons for product G to get them to start buying it on a regular basis. Or Customer DEF only buys product F if the price is below $ZZ.ZZ. Customer BEF is now buying baby products -- pass their contact information to baby supply companies.

    Finally, analysis can be triggered. Surveillance can use trigger words, or sequences of words (either written or audio), to divert records to further analysis. If you start buying diabetes-related foods and medicines, the data CAN be forwarded to your insurance company (and yes -- if the data is associated with you, then they CAN find your insurance company).

Big data does not change the stages but it does change the methods. There will often be multiple layers of analysis so that each step reduces the number of records to be analyzed. Analysis upon collection will specifically affect the manner in which the data are sorted and stored. And so forth.
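
To make the idea of layered, triggered analysis a little more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the records, the trigger words, and the three passes are invented, and real systems are certainly far more elaborate. It only shows how each layer can shrink the pile that the next layer has to look at.

```python
from collections import Counter

# Hypothetical records: (record_id, customer_id, text). All data is made up.
RECORDS = [
    (1, "ABC", "bought glucose test strips and sugar-free snacks"),
    (2, "DEF", "bought bread, milk, and eggs"),
    (3, "ABC", "bought insulin syringes"),
    (4, None, "bought glucose tablets"),  # anonymous: no customer attached
    (5, "BEF", "bought diapers and baby formula"),
]

# Assumed trigger words; a real list would be much larger and carefully tuned.
TRIGGERS = {"glucose", "insulin"}


def words(text):
    """Lowercase a text and split it into words, ignoring commas and periods."""
    return set(text.lower().replace(",", " ").replace(".", " ").split())


def first_pass(records, triggers):
    """Cheap layer: keep any record whose text contains a trigger word."""
    return [r for r in records if words(r[2]) & triggers]


def second_pass(records):
    """Narrower layer: keep only records tied to an identifiable customer."""
    return [r for r in records if r[1] is not None]


def third_pass(records, min_hits=2):
    """Narrowest layer: keep customers who tripped a trigger repeatedly."""
    counts = Counter(r[1] for r in records)
    return sorted(c for c, n in counts.items() if n >= min_hits)


layer1 = first_pass(RECORDS, TRIGGERS)
layer2 = second_pass(layer1)
flagged = third_pass(layer2)
print(len(RECORDS), len(layer1), len(layer2), flagged)
```

Five records go in, three survive the trigger words, two of those are personalized, and one customer ends up flagged. Note that nothing here knows why Customer ABC bought those items, which is exactly where false conclusions can creep in.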

People usually don't object to anonymous statistical analysis. They may start feeling threatened by personalized statistical analysis, although they may also benefit from the results.
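
For a sense of how little the anonymous kind of analysis needs to know, here is a small Python sketch of the "how many of product X were sold by store Y in city Z?" type of question. The sales records and the field names are invented for the example; the point is only that no customer appears anywhere in the data.

```python
from collections import defaultdict

# Hypothetical anonymous sales records: no customer is attached to any of them.
SALES = [
    {"store": "Y", "city": "Z", "state": "B", "product": "X", "qty": 3},
    {"store": "Y", "city": "Z", "state": "B", "product": "X", "qty": 2},
    {"store": "W", "city": "Q", "state": "B", "product": "X", "qty": 4},
    {"store": "Y", "city": "Z", "state": "B", "product": "F", "qty": 1},
]


def units_sold(sales, *keys):
    """Total the quantity sold for each combination of the given fields."""
    totals = defaultdict(int)
    for sale in sales:
        totals[tuple(sale[k] for k in keys)] += sale["qty"]
    return dict(totals)


# How many of product X were sold by store Y in city Z?
by_store = units_sold(SALES, "product", "store", "city")
print(by_store[("X", "Y", "Z")])

# How many of product X were sold in state B?
by_state = units_sold(SALES, "product", "state")
print(by_state[("X", "B")])
```

Adding a customer field to each record is all it would take to turn the same totals into the personalized kind of analysis.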

They will often feel threatened by triggered analysis because their "private" data are being used without explicit permission and can be exploited in some way. In addition, triggers can lead to false conclusions quite easily (you were actually buying diabetes supplies for your great-aunt, or you have been reading a book about bad thing XXX and were discussing it with a friend). Big data methods are particularly susceptible to false initial triggers (although, hopefully, further analysis will filter more appropriately).
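
A bit of arithmetic shows why false initial triggers matter so much when the pile is large. The numbers below are pure assumptions chosen for illustration (a million records, one in a thousand actually of interest, a trigger that catches 99% of those and wrongly fires on 1% of everything else); they are not measurements of any real system.

```python
def trigger_outcomes(total, relevant, hit_rate_pct, false_alarm_pct):
    """Return (true hits, false hits) for a trigger applied to `total` records.

    Rates are whole percentages so the arithmetic stays in exact integers.
    """
    true_hits = relevant * hit_rate_pct // 100
    false_hits = (total - relevant) * false_alarm_pct // 100
    return true_hits, false_hits


# Assumed numbers, for illustration only.
true_hits, false_hits = trigger_outcomes(
    total=1_000_000, relevant=1_000, hit_rate_pct=99, false_alarm_pct=1
)
share_false = false_hits / (true_hits + false_hits)
print(true_hits, false_hits, round(share_false, 2))
```

Under these assumptions the trigger flags 990 records that matter and 9,990 that do not, so roughly nine out of ten flagged records are false alarms even though the trigger looks very accurate on paper. That is why the later layers of analysis carry so much of the burden.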
