A recent exchange on Mathbabe’s blog about the meaning of Big Data led me to some insights about where decisions need human judgment and analysis, and where we can turn decisions over to automated data mining. For example, serving up “you might also like X” in a web store will work a lot better than estimating how many people have flu. Why?
Here’s what I wrote. (Not clear if her WordPress interface picked it up.)
Cathy, big data in your sense does not work widely. If you say that “no human judgment is needed,” this is approximately equivalent to “the relationships do not need to be supported by causal theory, just by raw correlation.” This works great in certain domains. But the underlying correlations have to be changing relatively slowly, compared to the amount of data that is available. With enough data for “this month,” an empirical relationship which holds for multiple months can be data mined (discovered) and used to make decisions, without human judgment.
But many of the world’s important problems don’t have that much stability. For example, trying to use searches to track the spread of an annual flu, at the state-by-state level, won’t be very reliable without human judgment. The correlation between search terms and flu incidence in 2012 is not likely to be the same in 2013. One reason is that news cycles vary from year to year, so in some years people are more frightened of the flu than in other years, and do more searches. Consider the following experiment: use the “big data relationships” from 2010 to track the incidence of flu in 2014. It won’t work very well, will it?
On the other hand, if you could get accurate weekly data about flu incidence, the same methods might work much better. Using the correlations between search terms and flu in November might give reasonably accurate estimates in December.
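To make the stale-versus-fresh contrast concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the data are simulated, the "searches per flu case" ratios (5 in a calm news year, 8 in a scare year) are made-up numbers, and the model is just a least-squares line through the origin rather than whatever a real flu-tracking system uses. It fits the search-to-flu relationship once on an old year and once on the previous month, then compares how well each estimates December.

```python
import random


def fit_slope(searches, flu):
    """Least-squares slope through the origin: flu ~ slope * searches."""
    num = sum(s * f for s, f in zip(searches, flu))
    den = sum(s * s for s in searches)
    return num / den


def mean_abs_error(slope, searches, flu):
    """Average absolute gap between estimated and true flu counts."""
    return sum(abs(slope * s - f) for s, f in zip(searches, flu)) / len(flu)


def simulate(weeks, searches_per_case, rng):
    """Hypothetical weekly data: searches scale with flu cases, plus 10% noise."""
    flu = [rng.uniform(100, 1000) for _ in range(weeks)]
    searches = [searches_per_case * f * rng.uniform(0.9, 1.1) for f in flu]
    return searches, flu


rng = random.Random(0)

# Assumed ratios: people search more per case when the news is alarming.
old_s, old_f = simulate(20, 5.0, rng)  # an earlier, calm year
nov_s, nov_f = simulate(4, 8.0, rng)   # November of a scare year
dec_s, dec_f = simulate(4, 8.0, rng)   # December of the same year

stale = fit_slope(old_s, old_f)  # relationship mined from the old year
fresh = fit_slope(nov_s, nov_f)  # relationship mined from last month

stale_err = mean_abs_error(stale, dec_s, dec_f)
fresh_err = mean_abs_error(fresh, dec_s, dec_f)

print(f"December error using the old year's relationship: {stale_err:.1f}")
print(f"December error using November's relationship:     {fresh_err:.1f}")
```

In this toy setup the old relationship overestimates December flu, because it reads the extra fear-driven searches as extra cases, while the November fit stays close. The catch is the one noted above: the fresh fit depends on having accurate recent flu counts to fit against.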
Automated systems based on data mining are a form of closed-loop decision system. (Closed loop basically means “no human in the loop.”) Closed-loop feedback works great under certain conditions, and very poorly under others. A key difference is whether the system designer has sufficient (accurate) knowledge about the system’s true behavior.
Once again “it all comes back to knowledge.”