Human workers have to categorize posts to create training data for AI systems
A new report from Reuters reveals that contract workers are looking at private posts on Facebook and Instagram in order to label them for AI systems.
Like many tech companies, Facebook uses machine learning and AI to sort content on its platforms. But in order to do this, the software needs to be trained to identify different types of content. During training, these algorithms analyze sample data, all of which needs to be categorized and labeled by humans, a process known as “data annotation.”
Reuters’ report focuses on Indian outsourcing firm WiPro, which has employed up to 260 workers to annotate posts according to five categories. These include the content of the post (is it a selfie, for example, or a picture of food); the occasion (is it for a birthday or a wedding); and the author’s intent (are they making a joke, trying to inspire others, or organizing a party).
Employees at WiPro have to sort a range of content from Facebook and Instagram, including status updates, videos, photos, shared links, and Stories. Each piece of content is checked by two workers for accuracy and workers annotate roughly 700 items each day.
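As a rough illustration of the double-checking step described above, the following Python sketch models one annotated item and lists the categories on which two workers disagree. The field names, label values, and comparison rule are assumptions for illustration only; the report names three of the five categories and does not describe how Facebook or WiPro store labels or resolve disagreements.

```python
from dataclasses import dataclass
from typing import List


# Hypothetical label record. The three fields mirror the categories the
# report names (content, occasion, author's intent). The report mentions
# five categories in total; the other two are not named, so they are omitted.
@dataclass(frozen=True)
class Annotation:
    worker_id: str
    content: str
    occasion: str
    intent: str


def disagreements(first: Annotation, second: Annotation) -> List[str]:
    """Return the names of the categories on which two workers disagree."""
    fields = ("content", "occasion", "intent")
    return [f for f in fields if getattr(first, f) != getattr(second, f)]


# Two workers label the same (made-up) post.
a = Annotation("worker_1", "selfie", "birthday", "joke")
b = Annotation("worker_2", "selfie", "wedding", "joke")
print(disagreements(a, b))
```

In a setup like this, an empty list would mean the two workers agree on every category, while a non-empty list would flag the item for another look.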
Facebook confirmed to Reuters that the content being examined by WiPro’s workers includes private posts shared to a select number of friends, and that the data sometimes includes users’ names and other sensitive information. Facebook says it has 200 such content-labeling projects worldwide, employing thousands of people in total.
“It’s a core part of what you need,” Facebook’s Nipun Mathur, director of product management for AI, told Reuters. “I don’t see the need going away.”
Such data annotation projects are key to developing AI, and have become a little like call center work — outsourced to countries where human labor is cheaper.
In China, for example, huge offices of people label images from self-driving cars in order to train them on how to identify cyclists and pedestrians. Most internet users have performed this sort of work without even knowing. Google’s CAPTCHA system, which asks you to identify objects in pictures to “prove” you’re human, is used to digitize info and train AI.
This sort of work is necessary, but troubling when the data in question is private. Recent investigations have highlighted how teams of workers label sensitive information collected by Amazon Echo devices and Ring security cameras. When you talk to Alexa, you don’t imagine someone else will listen to your conversation, but that’s exactly what can happen.
The issue is even more troubling when the work is outsourced to companies that might have lower standards of security and privacy than big tech firms.
Facebook says its legal and privacy teams approve all data-labeling efforts, and the company told Reuters that it recently introduced an auditing system “to ensure that privacy expectations are being followed and parameters in place are working as expected.”
However, the company could still be infringing the European Union’s recent GDPR rules, which set strict limits on how companies can collect and use personal data.
Facebook says the data labeled by human workers is used to train a number of machine learning systems. These systems recommend content in the company’s Marketplace shopping feature; describe photos and videos for visually impaired users; and sort posts so certain adverts don’t appear alongside political or adult content.
At some point last year, Google’s constant requests to prove I’m human began to feel increasingly aggressive. More and more, the simple, slightly too-cute button saying “I’m not a robot” was followed by demands to prove it by selecting all the traffic lights, crosswalks, and storefronts in an image grid. Soon the traffic lights were buried in distant foliage, the crosswalks warped and half around a corner, the storefront signage blurry and in Korean. There’s something uniquely dispiriting about being asked to identify a fire hydrant and struggling at it.
These tests are called CAPTCHA, an acronym for Completely Automated Public Turing test to tell Computers and Humans Apart, and they’ve reached this sort of inscrutability plateau before. In the early 2000s, simple images of text were enough to stump most spambots. But a decade later, after Google had bought the program from Carnegie Mellon researchers and was using it to digitize Google Books, texts had to be increasingly warped and obscured to stay ahead of improving optical character recognition programs — programs which, in a roundabout way, all those humans solving CAPTCHAs were helping to improve.
Because CAPTCHA is such an elegant tool for training AI, any given test could only ever be temporary, something its inventors acknowledged at the outset. With all those researchers, scammers, and ordinary humans solving billions of puzzles just at the threshold of what AI can do, at some point, the machines were going to pass us by. In 2014, Google pitted one of its machine learning algorithms against humans in solving the most distorted text CAPTCHAs: the computer got the test right 99.8 percent of the time, while the humans got a mere 33 percent.