Monday, August 24, 2020

Data is the Story

In middle school English, I remember, we did a drama activity in which the class divided into groups. Each group wrote, produced, and performed a play. The point of the activity was that each part of the process (the writers, the actors, the stage crew/promoters, etc.) was important to putting on a good play. I got it, and when I talk to my friends in Hollywood, many of whom work in production (although I know a few writers as well), they emphasize how critical the sound engineer’s work, for example, is to making a decent movie. Conversely, a bad sound engineer or video editor can absolutely ruin a movie. Many of these functions go unnoticed – a prop master has done a good job if the viewer never even thinks about the props because they fit seamlessly into the story.

All well and good, but I didn’t buy it, because without a writer you have nothing. Seeing a tremendous production of a Shakespeare play is a sublime experience. But reading a Shakespeare play is pretty great in and of itself. The same cannot be said about the other components of a play (or movie). The writing is the engine, the essential thing.

Go figure that I aspired to be a writer…

 

Now, in my data-analytics-adjacent professional world, this lesson has returned to me. AI/ML is all the rage these days. Yet high-end analytics are useless without data. At the same time, with data, even simple analytics can be extremely useful and provide important insights. Mapping to the drama comparison: the data is the script, the analytics is the acting, and production is everything else – visualization, communication, etc.

 

Analytics can play a critical role in drawing meaning from the data. Visualization and communication (the latter is my part of the business) are critical for making the findings interpretable to decision-makers and meaningful to the public. The importance of both of these elements should not be underplayed. But without data, there is nothing. I’ve been to meetings where potential users get very excited over a flashy visualization, not recognizing the dearth of substance beneath it.

 

People outside of this world should understand that, in many cases, the data on a particular issue or problem is not collected, or is not collected consistently. There is a range of social and policy issues around this. We are seeing it now in the pandemic response, where data on COVID-19 cases is collected inconsistently. There are divergences in timing, delivery, and data format, plus a plethora of other issues that complicate obtaining an accurate read on the national situation.

 

Caveat: This is not to fuel conspiracy theories that COVID-19 isn’t a big deal. It is. The data we have is imperfect but undeniable – millions have been infected and over 170,000 people have died, most needlessly. The point is that more accurate data would enable a more effective policy response: identifying points of origin, tracking the trajectory of spread, and making more accurate projections of future outbreaks.

 

In many cases, what is called algorithmic bias is really a matter of data bias. The data was collected improperly, or annotated and curated improperly. This is a vast issue. The way data is collected reflects institutional priorities, and the way instances are categorized into types shapes what kinds of analytics can be run. Bad analytics can generate lousy findings from good data, just as a lousy actor can ruin a great scene. Glossy visualizations and smooth patter can sell weak conclusions (who hasn’t gone to a lousy movie because of a great-looking trailer?). But if you don’t start with good data – data that, as much as possible, reflects reality and takes into account the multiple facets of reality – you are unlikely to end up with quality results.

 

This is hard to do, just like writing is hard: I complain about it constantly. But good writing and good data both share a quality of the real and can unlock deep truths.
