Simplifying data, and why it can be dangerous

Posted on 05/04/2013



Is simplicity in data always desirable?

Part 1

I recently stumbled on a really cool blog post looking at the alignment of buildings in the OpenStreetMap (OSM) database.  The author loaded the OSM database for the British Isles, and then calculated the azimuth (the compass bearing) of each building's walls.  His results are interesting.
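
I don't have the author's code, but the computation is easy to sketch.  A minimal version, assuming building outlines arrive as lists of (lat, lon) vertices (the function and parameter names here are my own, hypothetical ones), walks each outline, takes the compass bearing of every wall segment, and tallies total wall length per bearing:

```python
import math
from collections import Counter

def wall_azimuth(lat1, lon1, lat2, lon2):
    """Compass bearing of a wall segment, in degrees East of North.
    Degrees of longitude shrink with latitude, so the East-West
    component is scaled by cos(latitude) first."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    de = (lon2 - lon1) * math.cos(mean_lat)  # East-West ground extent
    dn = lat2 - lat1                         # North-South ground extent
    # A wall has no direction, so fold 0-360 degrees down to 0-180.
    return math.degrees(math.atan2(de, dn)) % 180

def azimuth_histogram(buildings, bin_deg=2):
    """Total wall length per azimuth bin, so long walls count for
    more -- matching the 'lengths of the building perimeter'
    weighting the post describes."""
    hist = Counter()
    for outline in buildings:
        for (lat1, lon1), (lat2, lon2) in zip(outline, outline[1:]):
            az = wall_azimuth(lat1, lon1, lat2, lon2)
            mean_lat = math.radians((lat1 + lat2) / 2)
            length = math.hypot((lon2 - lon1) * math.cos(mean_lat),
                                lat2 - lat1)
            hist[int(az // bin_deg) * bin_deg] += length
    return hist
```

Plotted on a polar axis, that histogram gives the kind of chart below.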

[Figure: plot of total building wall length by azimuth for the British Isles]

What does this tell us?  He noted that,

If buildings were arranged randomly you would expect the plot to be more-or-less circular: with roughly equal lengths of the building perimeter heading in every direction.

Instead, many buildings are aligned along the North-South and East-West axes.  Specifically, there is a

tendency to align buildings so that longer walls run roughly East-West and shorter walls run roughly North-South. Hence the oval shape in the chart, with the major axis running horizontally.

While he found that roads and other man-made features also tend to run along an east-west axis, the spikes at the main compass points are unique to buildings.  The comments suggest a number of reasons: churches were built East-West, and newer buildings sprouted around them; buildings are arranged to maximize sunlight; and so on.

However, the author eventually concluded that there were errors in the way the data was computed: a projection problem had skewed the apparent bearings.
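
I don't know the specific error he found, but a classic mistake of exactly this kind is to take bearings straight from raw longitude/latitude differences, as if the map were a flat grid.  At British latitudes a degree of longitude covers only about 60% of the ground distance of a degree of latitude, so the naive angle gets rotated toward East-West, which would exaggerate the oval.  A hypothetical sketch:

```python
import math

def naive_azimuth(lat1, lon1, lat2, lon2):
    # Buggy: treats degrees of longitude and latitude as equal units
    # (an unprojected "plate carree" grid), stretching East-West.
    return math.degrees(math.atan2(lon2 - lon1, lat2 - lat1)) % 180

def corrected_azimuth(lat1, lon1, lat2, lon2):
    # Scale the East-West component by cos(latitude) first, so both
    # axes are in comparable ground units.
    mean_lat = math.radians((lat1 + lat2) / 2)
    de = (lon2 - lon1) * math.cos(mean_lat)
    return math.degrees(math.atan2(de, lat2 - lat1)) % 180

# A hypothetical ~150 m wall near London (lat 51.5) running exactly
# North-East, i.e. a true bearing of 45 degrees:
lat1, lon1 = 51.5000, -0.1000
d = 0.0010                                      # northward extent, degrees
lat2 = lat1 + d
lon2 = lon1 + d / math.cos(math.radians(51.5))  # equal eastward ground extent

print(naive_azimuth(lat1, lon1, lat2, lon2))      # ~58: skewed toward East-West
print(corrected_azimuth(lat1, lon1, lat2, lon2))  # ~45: the true bearing
```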

Part 2

This kind of scared me.  Before reading through the comments, I really liked the simplicity of the presentation.  In fact, it was the simplicity itself that was convincing.  This reminded me of an article I had seen in the New York Times entitled “The Mind of a Con Man.”  It describes the rise and fall of Dutch professor Diederik Stapel, a social psychologist who admitted to falsifying data and even fabricating entire studies.

[Photo: Diederik Stapel, looking sad]

One quote in particular stood out to me:

He soon realized that journal editors preferred simplicity. “They are actually telling you: ‘Leave out this stuff. Make it simpler,’ ” Stapel told me. Before long, he was striving to write elegant articles…His lifelong obsession with elegance and order, he said, led him to concoct sexy results that journals found attractive. “It was a quest for aesthetics, for beauty — instead of the truth,” he said.

This is striking to me particularly because much of this blog is focused on the aesthetics of data.  Now, I am not at all intimating that the author of the OSM graph intentionally falsified data, or even put beauty above truth.  Not even close: he himself noted in the comments that he had, in fact, made a small mistake.  What surprised me was rather how easily I was convinced that there was some hidden meaning in the graph.  I took it at face value.  At a time when some very fundamental findings in economics are being debunked by re-analyzing data (I'm referring here to the UMass grad student who got hold of Reinhart and Rogoff's austerity data and found that they had left out some important bits in Excel), I was taken aback by my own willingness to accept “nice looking” data.  Even though I often work with big data, I got caught up in the simplicity of his results.  This brings me back to my original question: is simplicity desirable?

A basic Google search for “simplifying data” returns nearly 15 million hits.  IBM and other firms explicitly advertise simplicity in their database solutions, so this is clearly an important topic.  But on the conceptual end, it seems that simplicity comes at a price.  Because our brains are pattern-recognition machines, we are wired to see patterns, even when the factors producing those patterns are hidden or not what they seem.

Part 3

As we become increasingly capable of dealing with big data, I think it might be a good idea to keep in mind that our brains are attracted to patterns.  We like simplicity because it reveals patterns more readily.  And of course, we need to be able to interpret the truly incredible amount of data that is becoming available.  But as we do so, we should stay cognizant of the dangers of simplicity, and remember that it does not necessarily equate with clarity.  A single variable can skew data (which is why any social scientist would be wary of an r of .99; see the toy example below), but there is always a part of us that desires that perfect correlation, that hidden relationship, that elegance of simplicity.
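
To make that wariness about an r of .99 concrete, here is a toy sketch (synthetic random numbers, purely illustrative) of how a single extreme observation can turn pure noise into a near-perfect correlation:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = rng.normal(size=50)   # independent noise, so r should be small

print(np.corrcoef(x, y)[0, 1])   # near zero

# Append one extreme observation to both series:
x_out = np.append(x, 100.0)
y_out = np.append(y, 100.0)
print(np.corrcoef(x_out, y_out)[0, 1])   # jumps to roughly 0.99
```

The outlier dominates both variances and the covariance, so the “relationship” is an artifact of one point.  Elegant-looking, and almost meaningless.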

Thoughts?
