25 Jul 2013

The “Not Ready For Prime Time” Classifier

At the Strata 2013 conference last February, I presented a felon classifier (here’s the video, slides, and conference interview) that estimated whether a person has felonies based on other data in their criminal record, like misdemeanors and publicly available information about them.

The training data, features, and resulting classifier (but not the underlying machine learning software) were released on the inome Github repo at Strata. To see how you might be classified, check out the widget to the right, which is just a JavaScript implementation of the classifier data (yaml, pdf) released at the conference.

I was also interviewed by Strata Conference Chair Edd Dumbill for the inaugural issue of Big Data Journal on whether big data inferences, like criminal profiling, should be outlawed — effectively defining a new category of machine-perpetrated thoughtcrime.

Update: Bloomberg’s Jordan Robertson wrote an article and did an on-camera interview on this work. Dave Merrill also created the very cool infographic linked below. 

One clarifying tweak I’d make to my quote in Jordan’s piece is to add the word “alone”: 

"Because geeks like me can do stuff like this, we can make stuff work - it’s not our job ALONE to figure out if it’s right or not. We often don’t know.”

My point is that geeks alone can’t make the tough policy calls without the help of the more wonkish humanitarians among us — historians, philosophers, activists, ethicists, anthropologists, economists, and journalists. Technologists, too often, just don’t have the training or experience.

Update 2: But that’s not all! Jordan also wrote a companion piece on how London police used low-tech predictive policing in the 1990s to reduce rapes. Essentially, they noticed that clothesline theft was a gateway crime to rape. Woah.

Update 3: The Takeaway with John Hockenberry did a radio story too.

Why open this can of worms?

First, I wanted to show what a “big data inference” might look like in real life. At last Fall’s Strata+Hadoop, I tried to frame the privacy issues (video, slides) around big data inferences. Essentially, thoughts can only be crimes when they are acted upon, except in the aberrant case of criminal conspiracy. Thoughts, without acts, are not crimes whether committed inside our skulls or by machines.

Second, we need to ask some hard, interdisciplinary questions of technologists, business people, and policy thinkers — what I affectionately refer to as the geeks, suits, and wonks. Criminal profiling is a very sensitive subject. It is riddled with statistical and cognitive biases. I’ll examine some of these biases here, since I’m a bit of a geek, wonk, and suit myself.

Data scientists are sometimes compared to artists because of the skilled creativity that their novel work demands. Well, data science needs data critics as much as art needs art critics. Like good art, good data science elicits as many questions as answers. These questions can be harsh but make the science better and, ultimately, lead to better outcomes. Friction sharpens the blade. As Cathy O’Neil, Kate Crawford, and Monica Rogati have discussed, a healthy dose of data skepticism can be, well, healthy.

So, let’s ask a few critical questions of our felon classifier to see why it’s not ready for prime time.

Does the classifier predict if someone will become a felon?

No, the classifier is not a Minority Report precog. I don’t refer to the classifier as a felon predictor, because it’s not. The classifier is a descriptor of the data, not a predictor from it. It describes who might have felonies based on misdemeanors and personal information available from their criminal record and other publicly-available information.

A predictive crime model could be trained using, say, a timeline of misdemeanors and then asking whether the next crime would be a felony. A Hidden Markov Model (HMM) might be a reasonable start at such a predictive model — a homework assignment for the studious reader would be to define possible HMM states.
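
For the studious reader, here’s a minimal sketch of how such a predictive HMM might be framed. The hidden states, transition probabilities, and emission probabilities below are invented purely for illustration; a real model would define its own states and estimate the probabilities from dated offense timelines.

```python
# Hypothetical HMM sketch: hidden "trajectory" states emit observed offenses.
# All states and probabilities here are made up for illustration only.
import numpy as np

states = ["desisting", "escalating"]        # hypothetical hidden states
observations = ["misdemeanor", "felony"]    # what the record actually shows

start_p = np.array([0.7, 0.3])              # P(initial hidden state)
trans_p = np.array([[0.8, 0.2],             # P(next state | current state)
                    [0.3, 0.7]])
emit_p = np.array([[0.9, 0.1],              # P(observed offense | state)
                   [0.4, 0.6]])

def predict_next_offense(offense_sequence):
    """Forward algorithm: filter the hidden state from a timeline of offenses,
    then return the distribution over the *next* observed offense."""
    obs = [observations.index(o) for o in offense_sequence]
    alpha = start_p * emit_p[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ trans_p) * emit_p[:, o]
    alpha /= alpha.sum()                    # P(current hidden state | timeline)
    return (alpha @ trans_p) @ emit_p       # P(next observed offense | timeline)

probs = predict_next_offense(["misdemeanor", "misdemeanor", "felony"])
print({o: round(float(p), 3) for o, p in zip(observations, probs)})
```

Defining what the hidden states should actually be (escalation, desistance, something else entirely) is exactly the homework assignment.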

Is the classifier any good?

We used an Alternating Decision Tree (ADTree) to learn how to find felons from the defendant data. ADTrees have the advantage that the classifier tree is interpretable — it’s easy to see how the classifier is weighing the features to make its decision. Here’s a visualization of the felon classifier ADTree. When powerful players (like governments) make decisions, there must be transparency into their deliberation process. That’s why we released the training data and resulting classifier.
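
To give a flavor of that interpretability, here’s a simplified, hypothetical sketch of how an ADTree-style classifier scores a defendant: a base score plus a signed contribution from each rule it checks, with the sign of the running sum deciding the class. The rules and weights below are invented (and a real ADTree also nests rules under preconditions); they are not the ones in the released yaml.

```python
# Hypothetical, flattened ADTree-style scorer: the decision is a running sum of
# small rule contributions, so every classification can be traced rule by rule.
# Rules and weights are invented for illustration, not taken from the released model.

BASE_SCORE = 0.4  # hypothetical prior lean toward "felon" (the training data is ~81% felons)

# Each rule: (description, test, weight if the test is true, weight if false)
RULES = [
    ("hair color is missing",  lambda d: d.get("hair_color") is None,      +0.9, -0.6),
    ("3+ prior misdemeanors",  lambda d: d.get("misdemeanors", 0) >= 3,    +0.5, -0.2),
    ("skin color is recorded", lambda d: d.get("skin_color") is not None,  +0.3, -0.1),
]

def score(defendant):
    """Return (total score, per-rule trace); a positive total means 'felon'."""
    total, trace = BASE_SCORE, []
    for name, test, w_true, w_false in RULES:
        fired = test(defendant)
        w = w_true if fired else w_false
        total += w
        trace.append((name, fired, w))
    return total, trace

total, trace = score({"hair_color": None, "misdemeanors": 1, "skin_color": "white"})
for name, fired, w in trace:
    print(f"{name:<24} {str(fired):<5} contributes {w:+.1f}")
print(f"total {total:+.2f} => {'felon' if total > 0 else 'not felon'}")
```

Because the whole decision is a sum of legible contributions, anyone can audit exactly which facts pushed the score which way.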

ADTrees have disadvantages, as well. Like all machine learning classifiers, they tend to do better as you feed them more data. We trained the classifier with a data set of only 14,932 defendants, which may be too small a training set. See the next section for more scrutiny on the training data itself.

And then there are the features we chose. Our paltry eight features likely don’t capture all the relevant information in the data, resulting in high estimator bias. Also, we don’t know if any of the features we did choose are correlated with each other. If, for example, hair color and eye color rise and fall together among felons, eye color and hair color are effectively the same feature. Then we have just seven features, and the ADTree model might learn more slowly because of the overhead of dealing with the overlapping features.
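
One quick way to check for that kind of redundancy is a pairwise association measure such as Cramér’s V across the categorical features. A sketch, assuming hypothetical file and column names rather than whatever is in the released training data:

```python
# Sketch: measure how redundant two categorical features are with Cramér's V.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(a: pd.Series, b: pd.Series) -> float:
    """Cramér's V in [0, 1]; values near 1 mean the two features are redundant."""
    table = pd.crosstab(a, b)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    r, k = table.shape
    return float(np.sqrt(chi2 / (n * (min(r, k) - 1))))

df = pd.read_csv("kentucky_defendants.csv")   # hypothetical file name
print(cramers_v(df["hair_color"], df["eye_color"]))
```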

Is the training data any good?

Noise creeps into the data at every stage. Sample bias, reporting error, and target leakage all contribute to the classifier’s performance.

To train the classifier, we arbitrarily chose the state of Kentucky because the data set was small enough to easily download, but big enough to provide reasonable results. We could only use the data for 14,932 of the 63,079 defendants because we could only decipher the offense severity (felony or non-felony) for these 14,932 defendants. We also had a lot more felons (12,131) than non-felons (2,801). These hard choices and harsh realities introduce a sample bias that restricts the classifier’s competency. If you don’t statistically look like the Kentuckians we looked at, the classifier’s decision should come with several grains of salt.
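
To put a number on that imbalance: a “classifier” that simply labels everyone a felon is already right about 81% of the time, which is the baseline any real model has to beat (and a reason to report per-class precision and recall rather than raw accuracy).

```python
# Majority-class baseline implied by the class imbalance in the training data.
felons, non_felons = 12_131, 2_801
baseline_accuracy = felons / (felons + non_felons)
print(f"always-guess-felon accuracy: {baseline_accuracy:.1%}")  # ~81.2%
```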

Once we looked at the data, we wondered whether the government’s data entry for felons was better or worse than the data for non-felons. For example, one thought was that felons are in the system longer, so maybe their records were more complete. Here are the fill rates for a few of our classifier features:

[Chart: fill rates of the skin color, hair color, and eye color fields for felons vs. non-felons]

Skin color is recorded nine times more often for felons, but eye and hair color are recorded more often for non-felons. In fact, the ADTree uses this information in the first feature that it learns. If the person’s hair color is unknown, the ADTree sees that as a strong indicator of a felon. To see this in action, try the Could You Be A Felon? classifier widget (top of the post) without entering a hair color, and it’ll classify you as a felon. Who knows why the data fill rate is so varied? Feel free to vent your theories of incompetence or conspiracy in the comments.

Also note that when skin color is filled with anything, there’s a 20% greater chance that the defendant is a felon. This is an example of target leakage where the data leaks how the classifier should make its decision.
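
A quick way to hunt for this kind of leakage is to compare the felony rate when a field is filled in against when it’s blank, feature by feature. A sketch, again with hypothetical file and column names:

```python
# Leakage check: does the mere presence of a field predict the label?
import pandas as pd

df = pd.read_csv("kentucky_defendants.csv")   # hypothetical file name
is_felon = df["is_felon"].astype(bool)        # hypothetical label column

for col in ["skin_color", "hair_color", "eye_color"]:
    filled = df[col].notna()
    print(f"{col:<12} felony rate when filled: {is_felon[filled].mean():.1%}  "
          f"when missing: {is_felon[~filled].mean():.1%}")
```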

Is the classifier learning to find felons, or just learning prejudices in the justice system?

A particular area of concern is to understand what the classifier is really learning. Ideally, it’s learning to identify felons, without bias, based on prior misdemeanor criminal records and personal information. But we just showed that that can’t be true. Models are simplifications of reality, riddled with bias.

What’s worse is that police may then use the biased classifier to identify felony suspects and successfully convict them. The classifier will then be trained on this new conviction data. The new training reinforces its prior predictions, resulting in new predictions that point police to similar criminal profiles. An insidious loop, of machine-learned confirmation bias, results.

Bernard Harcourt, a professor at the University of Chicago Law School, calls this ratcheting — over time, police increase the profiling based on increased success rates. So, what’s wrong with successfully arresting guilty criminals? Harcourt gives the example of Chicago police attempting to control drug use by targeting the heavily black South Side drug supply while forsaking the largely white North Side drug demand. South Side drug dealers may be largely unresponsive to policing, while the North Side may be very responsive. So, if the goal is to reduce overall drug trafficking, the trick may be to police the entire market, not just one side of it.

Cathy O’Neil calls this feedback loop the death spiral of modeling. I think the hazards of this self-fulfilling prophecy were best put by North Carolina Appeals Judge Charles Becton in 1987 when he invoked Alexander Pope: “… all looks yellow to the jaundiced eye.”

Final thoughts

Like it or not, we now live in David Brin’s Transparent Society. So, we need to ask tough questions about where technology is taking our culture. When does freedom of the public (e.g., security, speech, press, history) trump freedom of the person (e.g., liberty, privacy, equal protection, due process)? As Franklin chided us, if we trade essential liberty for a little temporary security, we deserve neither and will lose both. Knowledge is easy, wisdom is hard. The geeks, suits, and wonks need to figure this out sooner than later.

Acknowledgements: Thanks to Ang Sun and Andrew Borthwick for reviewing some of the technical issues.
