This week’s readings reminded me of my first week as a data analyst at a political organization. I was introduced to our membership database, which supplemented user-supplied data with official voter file records and with commercial and modeled data from Comscore. I was surprised to find that I had two records—apparently because, when I moved for college, I had created a new voter registration record rather than simply updating my existing one.
On one level, I was crushed that I wouldn’t be classified as a voter with a 100% voting rate (i.e., one who has voted in all four of the past four general elections they were eligible for). Looking closer, I also saw that my demographic data differed between the two records. My New Jersey record showed my race as Asian (which is accurate), whereas my Pennsylvania record showed me as Black (which is not). I immediately attributed the error to a bad model, especially given my name.
As I learned more about our models (and modelers), I found that nearly all of their explanatory power (I was told 95%) came from basic demographic properties: age, gender, state, zip code, and so on. Even though we had access to massive troves of commercial, political, and civic data, demographics were usually enough to produce satisfactory models. It seemed ironic that even though these models were expensive in many ways—including the fees consultants charged to present long decks of complex analyses and visualizations, as well as the cost of the systemic extractions documented by Crawford & Joler—it simply wasn’t worth the computational cost of evaluating, say, names to increase accuracy by a minuscule amount at a national scale. So I was likely coded as Black in Pennsylvania simply because my name was classified as racially ambiguous and my zip code had a large Black population.
Well, supposedly it does. Census records undercount student populations and immigrants—which meant that the Asian population (among others) in my zip code was heavily undercounted. So the racial bias embedded in the Census Bureau’s measurement of my zip code was reiterated and compounded, as Cheney-Lippold describes, in the statistical model that predicted my race. And, interestingly, even though race was a modeled value that we knew could be inaccurate, it was an immutable property in our database, so we couldn’t overwrite it locally even when someone told us their actual race.
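A toy version of this kind of name-plus-geography race model (in the spirit of Bayesian Improved Surname Geocoding—though I’m only guessing at our vendor’s actual method, and every number and name below is made up) shows how an uninformative surname lets the zip code’s measured composition decide the label:

```python
# Hypothetical sketch of a BISG-style race prediction.
# All shares and category names are illustrative, not real data.
ZIP_RACE_SHARE = {  # zip-level racial composition, per (undercounted) Census data
    "19104": {"Black": 0.55, "White": 0.30, "Asian": 0.15},
}
SURNAME_RACE_PRIOR = {  # prior implied by the surname classification
    "ambiguous": {"Black": 1 / 3, "White": 1 / 3, "Asian": 1 / 3},
}

def predict_race(surname_class: str, zip_code: str) -> tuple[str, float]:
    """Combine surname prior with zip composition: posterior ∝ prior × share."""
    prior = SURNAME_RACE_PRIOR[surname_class]
    geo = ZIP_RACE_SHARE[zip_code]
    unnorm = {race: prior[race] * geo[race] for race in prior}
    total = sum(unnorm.values())
    label = max(unnorm, key=unnorm.get)
    return label, unnorm[label] / total

# With a flat ("ambiguous") surname prior, the zip code alone decides:
# predict_race("ambiguous", "19104") picks "Black" with posterior ≈ 0.55
```

If the surname prior were informative—say, heavily weighted toward Asian—it would pull the posterior the other way; with a flat prior, the (biased) zip-level measurement is all the model has.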
While this is a relatively innocuous case compared to Chun’s examples drawn from O’Neil and Eubanks, I often wonder about the macro-level stakes of being misclassified, especially since the Pennsylvania record is the one I have transferred as I’ve moved across states. If my record is still coded as racially Black, that affects how campaigns target me, which non-profit organizations get my name when they purchase lists, how much money fundraisers ask me for, and so on. And beyond electoral politics, commercial data brokers draw heavily from voter files, so my record must shape commercial marketing on some level as well.
Since I’ve spent so many years wondering about my metadata, I enjoyed reading Cheney-Lippold’s conceptualization of the measurable type. It expands on the idea of a datafied self to attend to the algorithmic construction not of an identity but of a distinct object. This clarifies that my voter record—and its modeled demographics—does not actually describe me but rather an object that refers to me, one that is less concerned with the truth of my life than with matching its own version of the “ground truth” (or deep fake, per Chun).
At the same time, the stakes can be both ambiguous and high, as Cheney-Lippold demonstrates with his description of targeted drone strikes based on signatures. His discussion of signatures reminded me of digital fingerprinting, in which individuals are identified by device and session data—such as time zone, operating system version, and screen resolution—rather than by user or behavioral data. For a while, I used a tool to mask this information in my browser, but I eventually realized that my null or default values might actually attract more scrutiny if my metadata correlate highly with those of people trying to evade surveillance for more sinister reasons. It can be challenging, and quite stressful, to keep track of all my measurable types, which might have nothing in common other than myself as a referent. At the same time, it’s quite satisfying to know the extent to which these systems misunderstand me, as Hyejoo discussed. For example, I’m pretty sure I was in the Being Confused After Waking Up From Naps Facebook group. I hope that threw off Kosinski et al.’s predictive model, among others. Maybe I’ll re-join that group now just to corrupt one of my algorithmic constructions.
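The fingerprinting logic itself is easy to sketch: concatenate a handful of device and session attributes into a canonical string and hash it. The attribute names and values below are purely illustrative, but they show why my masking tool backfired: blanking a value to a default doesn’t remove the fingerprint, it just produces a different (and possibly rarer) one.

```python
# Illustrative sketch of device fingerprinting; attributes are made up,
# not drawn from any particular tracker's implementation.
import hashlib

def fingerprint(attrs: dict) -> str:
    """Serialize attributes in a stable order, then hash.
    The same configuration always yields the same ID."""
    canonical = "|".join(f"{key}={attrs[key]}" for key in sorted(attrs))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

device = {
    "timezone": "America/New_York",
    "os_version": "macOS 14.2",
    "screen": "2560x1440",
    "language": "en-US",
}
# Masking one attribute doesn't hide the device; it yields a new,
# potentially more distinctive, fingerprint.
masked = {**device, "timezone": "UTC"}
assert fingerprint(device) != fingerprint(masked)
```

Real trackers combine dozens of such attributes (installed fonts, canvas rendering, plugins), which is why even small configuration quirks—including privacy tools themselves—can make a device highly identifiable.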