The mess that was the women’s USO final has spurred several questions and debates. Perhaps the one most core to the integrity of the sport is this: are umps handing out code violations without discrimination on the basis of race or sex? New York Times reporter Chris Clarey made no secret of feeling that Serena Williams’s claim to having been the victim of sexism was off-base. Now he’s written what he considers to be definitive proof refuting her claim.
The article breaks down the types of code violations handed down in Grand Slam events by gender of recipient and concludes that, because men are fined more frequently than women, there is no gender discrimination from the chair.
Sigh. It is sad to think that one has to explain basic statistics to a sports reporter, but here we are. Clarey would likely calculate a batter’s average by counting the number of times he hit the ball and never dividing by his at-bats. It’s frustrating, and maddening, that this article made it to press, because several days before, Clarey tweeted out the aggregate number of code violations given to men and women at this USO, and was told over and over that it was a meaningless statistic.
Props to Amy Lucas, PhD in sociology, who broke down exactly what is wrong with Clarey’s methodology in a tweet thread:
Clarey’s data tells us nothing because it doesn’t tell us when a person should have gotten a warning and didn’t, Amy goes on to explain. But then she goes on to propose a solution that actually might work. Observe the matches over a long period of time and take note of each interaction, counting violations that received codes and violations that didn’t.
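To make the difference between a raw count and a rate concrete, here is a minimal sketch in Python. Every number in it is invented purely for illustration; none of it comes from Clarey’s article or from any real match data, and the `Tally` class is a hypothetical helper of my own. It only shows how the group with more code violations can still be the group that is penalized less often per offense.

```python
from dataclasses import dataclass


@dataclass
class Tally:
    """Observer counts for one group of players (hypothetical data)."""
    coded: int    # offenses that drew a code violation from the chair
    uncoded: int  # offenses observers logged that the ump let slide

    @property
    def rate(self) -> float:
        """Share of observed offenses that were actually penalized."""
        total = self.coded + self.uncoded
        return self.coded / total if total else 0.0


# Made-up numbers, chosen only to illustrate the point.
men = Tally(coded=60, uncoded=240)
women = Tally(coded=30, uncoded=45)

for name, tally in (("men", men), ("women", women)):
    print(f"{name}: {tally.coded} codes, penalized on {tally.rate:.0%} of offenses")
```

In this invented scenario the men collect twice as many codes, yet the women are penalized on a larger share of their offenses. The raw count alone cannot tell those two stories apart, which is why the uncoded offenses have to be observed and counted too.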
Does this sound laborious? It is. And I’m going to suggest we do it. Because this is what researchers in Education, Sociology, Linguistics, Communications and related fields do ALL THE TIME. And it works.
Maybe you think the interactions between ump and player are too varied to characterize reliably. They’re not. Let me tell you a story. My mentor in graduate school became fascinated with student-teacher communication. It turns out that a frightening percentage of classrooms are characterized by what we now understand as IRE discourse: Teacher Initiation, Student Response, and Teacher Evaluation. You know how this goes. Teacher asks a question they already know the answer to, like, “What is the capital of Wisconsin?” Student responds, “Madison!” Teacher says, “Very good!” Or maybe the Student gets it wrong and says, “Green Bay!” and the Teacher says, “Are you sure about that?”
The problem with IRE discourse is that it is nothing like a real conversation. In a real conversation, people generally ask questions they don’t already know the answer to. For example, a person will ask, “What time is it?” An interlocutor responds, “3 pm.” What the person then does not do is say, “Are you sure about that?” Because they actually wanted to know the time. My advisor theorized that “authentic” questions would encourage more learning because they would invite new knowledge into the classroom.
But he had to prove it. So he trained people to recognize patterns of discourse. And then sent them out into 500 classrooms. And collected tapes of classroom interactions over the course of a semester. They coded all the interactions, and then compared them with achievement tests given at the end of the semester. Voila, his proof.
If you’re still with me, I hope your immediate reaction to this is, wow, that guy must be crazy. He was a little crazy. But he also knew his methodology. He knew the only way to prove a point is to collect interactions in natural environments over a long period of time, code those interactions, have multiple people do the coding so you can control for subjectivity (inter-rater reliability testing it’s called), and then compare with a measure of achievement.
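For the curious, here is roughly what an inter-rater reliability check can look like. This is a sketch under my own assumptions: I use Cohen’s kappa, one common agreement statistic for two coders, and I am not claiming it is the one my advisor used. The labels and the two coders’ answers are invented for illustration, and `cohens_kappa` is a helper written for this post.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two coders, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must code the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each coder labeled at random with their own label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented example: two coders label the same eight teacher questions.
coder_1 = ["authentic", "known", "known", "authentic", "known", "known", "authentic", "known"]
coder_2 = ["authentic", "known", "known", "known", "known", "known", "authentic", "authentic"]

kappa = cohens_kappa(coder_1, coder_2)
print(f"kappa = {kappa:.2f}")
```

A kappa of 1 means the coders agree perfectly, and a kappa near 0 means they agree about as often as chance alone would predict. The two coders above agree on six of eight questions, but because both use “known” so often, a fair amount of that agreement would happen by chance, and kappa discounts it accordingly.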
So now let’s imagine how this works in tennis. As Amy (let’s call her Dr. Lucas) pointed out in her thread, you have to get all the interactions. You can’t just listen to the TV. You need people in the stadiums coding the gestures and speech of both umps and players. You need the audio feed from the ump’s chair to catch all the dialogue on changeovers. You would need to decide what your discourse categories are. Obviously, as we saw during the women’s final, expletives weren’t what ticked Carlos off. Instead, it was demands (“you owe me an apology”), pejoratives (“thief,” “liar”), and gestures (pointing).
It would be fascinating to do this full analysis of what exactly umpires consider verbal abuse before we talk about it. And it would be interesting to see whether factors other than the player involved determine when umps snap and hand out a code. It might be something we don’t expect. A study of judges handing down sentences found that the factor that most determined the severity of the punishment was whether or not the judge had just eaten lunch. The methodology of that study has been widely critiqued, but the point still stands that other factors should be considered when we look for patterns of behavior.
So let’s do the work. It’s been done before. It could, and should, be done here. ITF, my fees are reasonable. I’m sure Dr. Lucas’ are as well. Give us a shout.
UPDATE: John Burn-Murdoch, stats reporter for the Financial Times (the only good newspaper left in my opinion) breaks down what’s wrong with Clarey’s stats (and what’s revealing in people’s responses to them) in a tweet thread this morning. Give it a look.