Thursday, October 6, 2022

The Unreasonable Effectiveness of Extraction at Scale - Will Core #2

In 2009, Google researchers Halevy, Norvig, and Pereira published an article titled "The Unreasonable Effectiveness of Data". In it, they detail how, holding all else equal, machine learning systems can be made more efficient and effective by increasing the scale of the data on which they are trained. They argue that large-scale datasets can "drown out" any noise, errors, or inconsistencies in the data, resulting in a veracious representation of reality. Consequently, they suggest that researchers should make use of the "plentiful" unlabelled data available to be scraped online, and of unsupervised learning to analyze it (finding "network neighborhoods", as Chun calls them). I came across this article recently and can see many parallels with our readings for this week. Indeed, in this week's readings, we see this drive for scale play out on multiple fronts: on a technical level through the large-scale scraping of Facebook pages, on a material level through the extensive mining of non-renewable resources, and on a social level through the mass alienation of workers who label and generate data. As AI systems continue to improve and become household items, the imperative for scale continues to grow.

While data may be "unreasonably effective" for Google researchers, this week's readings highlight the multiple layers of extraction and exploitation upon which these advancements are grounded. First, I loved the Crawford and Joler piece for taking on the impossible task of mapping the layers of extraction behind a single Amazon Echo (the map is on show as a full-wall installation at MoMA, if anyone finds themselves in NY). Contemporary analyses often interrogate the "here and now" of AI systems; namely, their outputs and their impacts (such as the COMPAS system shown in class recently). Crawford and Joler instead trace the "planetary network" of exploitation upon which every output relies. By gazing backwards to uncover the material chains of production, they look forward to project the cost of our own large-scale appetite for AI systems.

Chun, on the other hand, looks at the social costs of prediction. Tracing the history of mathematical correlation, Chun details how mathematical techniques for understanding people have always been intertwined with racism and eugenics. (The history of Pearson particularly shocked me, something that has been conveniently omitted from my stats classes.) While the Google researchers detail how more data can lead to more "effective" predictions about people, Chun asks us to reflect on what exactly is being optimised, and what is lost in the process. By looking back, Chun highlights how these systems will only reproduce the harmful racist, sexist, and tribalist discourses that inform them. Rather than resting on homophily as an "axiom" of data science, Chun encourages us to look within the group to "pull together all the elements of the tribe", such as the people, the land, and the characters, throughout time. And yet, trying to grapple with the scale of these predictive tools, I wonder how effective this can be in resisting the homogenising lens of clustering.
