As artificial intelligence gets more powerful, the workers training the top generative AI systems are largely underpaid, overworked and stuck in precarious employment situations

ILLUSTRATION BY KUBA FERENC/THE GLOBE AND MAIL

Kristin left her job in veterinary services a few years ago, as the emotional toll of dealing with distraught pet owners left her drained. Instead, in 2021, she took an independent contractor position working from her home in the United States for Telus International, which is based in Vancouver. It was supposed to be a side gig, but with few work-from-home jobs available during the COVID-19 pandemic, it became her main source of income.

Kristin, 35, is what’s known as a rater. Tech giants such as Google and Microsoft outsource countless tasks to keep their algorithms running smoothly, relying on raters to evaluate websites or compare two sets of search results.

Earlier this year, Kristin started getting a new task on her computer. She’ll often see a question or a prompt, including fanciful requests such as, “Write a children’s story about a dinosaur.” Alongside are two responses, and Kristin selects the one that best fits the guidelines she’s given, which usually concern accuracy. She doesn’t know where the text comes from or what will be done with her feedback, other than that it helps improve an artificial intelligence model, such as the one underlying ChatGPT. Kristin is not explicitly told who she’s doing this for, either (Telus International is a contractor in this regard), though she’s deduced the client is Google. A spinoff of Telus Corp., Telus International has said on conference calls that Google, owned by Alphabet Inc., is its third-largest client.

Because Kristin is fact-checking text – chatbots can make things up – she sometimes hops over to Google or Wikipedia. Telus International provides an estimate of how long each task should take, usually a few minutes, and she generally has to stick to it. “Unfortunately because of the time constraints given to us, sometimes that research is thorough and sometimes it’s not,” she told me. Kristin earns about US$14 an hour, receives no health benefits or sick leave, and sees no prospect for advancement. “I’m going to keep this gig for as long as they’ll let me,” she said, “but it’s certainly not something I can plan on.” (The Globe and Mail is identifying her and other workers by pseudonyms because they’re worried about losing employment.)

Kristin is part of a spectral global work force that helps make AI models more accurate, capable and powerful. These freelancers draw boxes around images of cars and pedestrians to train self-driving vehicles; rank responses from chatbots and converse with them in real time; edit AI-written copy and catch mistakes; decide which AI-generated image is best; and teach AI about biology, chemistry, computer science, history, law, marketing, physics, poetry, creative writing and other domains, a massive transfer of centuries of discovery and humanity into impenetrable AI models to be mapped, parsed and regurgitated.

AI companies regularly use reinforcement learning from human feedback, or RLHF, on large language models, which power ChatGPT and other chatbots. Richard Drew/The Associated Press

AI, at its best, can seem like sorcery. A lot of us experienced an uncanny thrill first seeing ChatGPT write a coherent paragraph or an image generator render a photorealistic picture. But somewhere along the line, people had to label a data set of images so that an AI model could learn that a cat is a cat, and a tree is a tree. Chatbots seem competent because people have made them so, often by choosing between two blocks of text, over and over, as Kristin has done.

“It’s really important to stress that artificial intelligence is actually based on human intelligence – thousands of hours of human intelligence,” said Sasha Luccioni, an AI researcher in Montreal with Hugging Face Inc., which develops open-source machine learning tools. “It’s becoming a whole gig economy.”

The boom in generative AI has created an opportunity for companies to round up workers and co-ordinate feedback to fine-tune generative AI technology for OpenAI, Google, Meta Platforms Inc. and others. One of these companies, San Francisco-based Scale AI Inc., was valued at US$7.3-billion last year. Another called Surge AI boasts of its “elite workforce,” while Telus International oversees people in more than 150 countries who deliver “human-powered data.”

A dive into how AI models are trained, and interviews with 20 gig workers, shows there are flaws with this approach. For one thing, many of the problems of the gig economy are inherent in building AI. While some freelancers enjoy it and earn decent income, these roles can also be low-paying, precarious and unreliable. Assignments are sporadic, and because people are usually compensated by the task, they can end up wasting time waiting for work. Moreover, the millions and millions of annotations and scraps of feedback collected from these workers are still not enough to overcome the limitations of AI today.

When you look behind the curtain to glimpse how some AI models are made, it’s clear how crucial humans are to the process. We are the ghost in the machine – but maybe not for long.


A Tesla Model 3 vehicle uses Autopilot Full Self Driving Beta software to navigate a city road in Encinitas, Calif., in February, 2023. Companies developing self-driving vehicle technology heavily depend on data annotation and labelling services. MIKE BLAKE/Reuters

Data is the lifeblood of artificial intelligence. Algorithms learn by discerning patterns in large volumes of information, and quality is crucial. Consider how in 2017, a group of medical researchers built an algorithm that determined whether a skin lesion was malignant based on a photograph. It worked, but there was a hiccup. The algorithm was more likely to conclude a lesion was cancerous when a ruler, used to measure the skin aberration, also appeared in the picture. “Thus the algorithm inadvertently ‘learned’ that rulers are malignant,” the researchers wrote.
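How does an algorithm “learn” something like that? Here is a minimal, hypothetical sketch in Python: a synthetic “ruler present” flag appears mostly in photos of malignant lesions, and a simple classifier comes to lean on that flag rather than the weak genuine signal. The data and model are invented for illustration and bear no relation to the researchers’ actual system.

```python
# Sketch of shortcut learning: a spurious feature ("ruler present") that
# correlates with the label ends up dominating a classifier's decisions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
malignant = rng.integers(0, 2, n)                  # ground-truth labels
lesion_signal = malignant + rng.normal(0, 2.0, n)  # weak genuine signal
# Rulers were photographed with 90% of malignant lesions, never benign ones.
ruler_present = (malignant & (rng.random(n) < 0.9)).astype(float)

X = np.column_stack([lesion_signal, ruler_present])
model = LogisticRegression().fit(X, malignant)
print(dict(zip(["lesion_signal", "ruler_present"], model.coef_[0])))
# The "ruler_present" weight dwarfs the real one: the model has, in effect,
# learned that rulers are malignant.
```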

Such issues have helped create a business opportunity to collect and label data to ensure AI models learn from high-quality material. Telus paid $1.2-billion in 2020 for U.S. data annotation outfit Lionbridge AI, and then bought an Indian company that annotates data for computer vision algorithms.

Telus International also collects original material for clients, and its job postings show the unending need of AI developers to capture every facet of human existence. One posting sought participants to film themselves miming security footage by skulking around a room “with or without a disguise.” In Kenya, Costa Rica, Mexico and the U.S., Telus International needed to find parents with minor children for a “facial video data collection project.” Children aged 13 to 17 were to record videos wearing props such as sunglasses, face masks and hats.

Companies developing self-driving vehicle technology depend heavily on data annotation and labelling services. Autonomous vehicles have to spot cars and pedestrians, among other road objects and hazards, and understand how they move through space.

In many cases, that work has fallen to people in developing countries. Aesop Khaemba, who lives in Nairobi, first heard in 2020 about a website called Remotasks, a crowdsourcing platform operated by Scale AI. When he logged in, he would see pictures and videos of street scenes, along with images generated by an autonomous vehicle’s various sensors, and painstakingly draw bounding boxes around cars, pedestrians, traffic lights and other objects. One scene could take a few hours or as long as two days to complete.

It was tedious, but the bigger problem was that the volume of work was unpredictable. Sometimes Mr. Khaemba could spend the whole day labelling, whereas other times he would be lucky to snag an hour. “It was quite frustrating,” he said. “Once you do this kind of work for a while, you get used to it and you start depending on it for paying your bills.”

In a good week, he earned up to US$50, which goes a long way in Nairobi, he said. In a bad week, he would bring in a few dollars – and there were more bad weeks than good. His social life withered because he spent so much time in front of his computer, waiting for work. “You don’t leave the house. You don’t know when the tasks are coming, so you just have to be there,” he said.

Mr. Khaemba, who is 24, earned an economics degree, but with few employment opportunities, he stuck with Remotasks into 2021. Eventually, he signed up to two other platforms, hoping to find more reliable work, but he still spends a lot of time waiting. “Sometimes you can stay there all day and all night and nothing comes,” he said.

Carlos, who lives in the Philippines, found a steady flow of tasks with Telus International at first. (Carlos is a pseudonym.) Up until the pandemic, he worked as an IT and security consultant, but he started losing clients. He signed up to Telus International after seeing a Facebook ad and found himself annotating images – identifying people wearing T-shirts, or labelling earrings and bracelets.

Some tasks are supposed to be finished in just 10 to 20 seconds, and pay is allotted in fractions of a penny per task, according to a list obtained by The Globe. If he takes too long, he runs the risk of getting flagged for poor performance. He’s also been temporarily disqualified from tasks for zipping through them too quickly, so he’s forced to hew as closely as possible to the company’s time frame. When he started, his pay worked out to about US$6 an hour. In May, Telus International cut pay rates in his geography, citing “global economic conditions.” Now he earns about US$5.76 an hour, and he’s found there are also fewer tasks available. Carlos had to sign up with a competing platform to supplement his income.

Enda Cunnane, vice-president of operations with Telus International’s AI division, said in an e-mail that the company increased rates for other tasks, but did not provide further details. The company is clear that its rater roles are meant to be flexible and depend on task availability, which means hours cannot be guaranteed. “While task volumes may fluctuate, this is very much market specific and would be considered normal for the business that we are in,” he said. The estimated time to complete tasks, meanwhile, is an average of sorts, based on the actual time spent by workers, and is periodically reviewed.

As for Carlos, there’s an irony to his gig. One IT contract he lost in recent years was with a medical transcription company. The firm once had employees typing out voice-recorded medical reports, but later switched to AI transcription. With no need for typists, the company had no need for Carlos. I asked how he felt about losing a client to AI and, partly as a result, now completing microtasks to build better algorithms. He didn’t dwell on it. “People just have to keep on improving themselves,” he said.


From left, Nick Frosst, a founder of Toronto-based LLM developer Cohere, Martin Kon, the company’s president, and Aidan Gomez, the chief executive. Cohere turned to Surge AI to fine-tune one of its models, and saw “big lifts” in performance, according to a case study. NATHAN CYPRYS/The New York Times News Service

One of the most vexing problems in artificial intelligence is this: How do you get an AI agent to do exactly what you want? Difficult or abstract tasks are tricky to communicate to a computer, and an AI model might figure out how to achieve an objective in unpredictable, even dangerous, ways.

To deal with this, researchers from OpenAI and Google DeepMind developed an algorithm in 2017 that learned from human feedback, more closely aligning the AI model with the designer’s intent. The researchers trained an algorithm, represented as a worm-like creature in a simulation, to do a backflip. A human evaluator was shown two video clips of the worm attempting the feat, and chose the video that most closely resembled a flip. After 900 rounds, the little worm could flip, land and repeat.

The technique – called reinforcement learning from human feedback, or RLHF – wasn’t new, but the study showed that it could be applied to more complicated goals. AI companies now regularly use RLHF on large language models, which power ChatGPT and other chatbots.
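The first stage of RLHF, as applied to chatbots, is to fit a “reward model” to thousands of pairwise choices like Kristin’s. Here is a minimal sketch in Python, using PyTorch and a Bradley-Terry-style preference loss similar to the one in the 2017 study; the random feature vectors and tiny network are stand-ins for real model outputs, not any company’s actual pipeline.

```python
# Fit a reward model to pairwise preferences: the rater's chosen response
# should score higher than the rejected one.
import torch
import torch.nn as nn

reward_model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Toy data: feature vectors for chosen ("winner") and rejected ("loser")
# responses. In practice these would be derived from real text.
winners = torch.randn(256, 8)
losers = torch.randn(256, 8)

for _ in range(100):
    r_w = reward_model(winners)  # predicted reward of the chosen response
    r_l = reward_model(losers)   # predicted reward of the rejected response
    # Bradley-Terry loss: maximize the probability the winner outranks the loser.
    loss = -torch.nn.functional.logsigmoid(r_w - r_l).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The fitted reward model then serves as the objective for a reinforcement learning step on the main model, nudging it toward the kinds of responses raters preferred.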

LLMs are pumped full of text, much of it scraped from the internet, and learn to predict the next word in a sequence. But that’s only part of it.
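Next-word prediction can be illustrated in miniature with simple word counts. Real LLMs use neural networks trained on vast corpora, but the underlying task is the same: given the words so far, guess what comes next. This sketch is purely illustrative.

```python
# A bigram "language model": count which word follows which, then predict
# the most common continuation.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the cat slept".split()
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1  # tally each observed word pair

def predict_next(word):
    # Return the most frequent continuation seen in the training text.
    return bigrams[word].most_common(1)[0][0]

print(predict_next("the"))  # -> "cat" (seen twice after "the", vs. "mat" once)
```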

An LLM in this state is like an alien that has learned about humanity from reading all we’ve written without ever interacting with one of us. Without tweaking, LLMs are more prone to misinterpreting instructions, making things up, rambling and veering off-topic. Human feedback can also help steer LLMs away from exhibiting bias, and dispensing harmful information and conspiracy theories (though these problems still exist).

When Meta released a new LLM in July, the company said it had been trained on more than one million human annotations. Cohere, an LLM developer in Toronto, turned to U.S. company Surge AI to fine-tune one of its models, and saw “big lifts” in performance, according to a case study. Telus International, meanwhile, scaled up its RLHF services late last year.

The tasks devised to train LLMs go beyond selecting one response over another, though there are plenty of those. One person I spoke to judged AI-generated tweets and Instagram captions based on which sounded most like it was written by a human. Others were given a product – baby carriages – and would write up a fictitious conversation between an online shopper and a customer service chatbot. Another saw a prompt, a request by a teacher to compose a parent letter about a student refusing homework, and ranked responses on a scale of 1 to 5, over and over again.

Scale AI recently posted job openings for experts to train LLMs in subjects such as law, philosophy and marketing, with offers up to US$45 an hour. Ahrom Kim signed up as a biology expert to put her immunology PhD to work, though she was never asked to prove she had one.

Scale AI founder and CEO Alexandr Wang poses for photos at the company’s office in San Francisco in May, 2023. Part of the boom in generative AI, the company was valued at US$7.3-billion last year. Jeff Chiu/The Associated Press

Through the Remotasks platform, she asked a chatbot to summarize scientific papers or develop cell biology research plans, looking for factual errors. She soon started getting tasks related to other sciences, which she was not equipped to handle. “I was spending hours and hours in front of a computer, just skipping the task,” she said. “I wasn’t getting paid for it, which was really annoying.”

Finding consistent work on the platform is a frequent challenge, even for people brought on for specialized roles and paid hourly. A freelance writer who I’ll call Josh signed up with Scale AI earlier this year to help train chatbots, but he’s only paid when work is available. “I’ve been out of work for a good chunk of time,” he said of why he signed up. He was in a prolonged dry spell when we spoke in the summer.

Scale AI also sought workers fluent in various languages, including Polish, Bulgarian and Bangla, and the pay indicates a willingness to use cheap labour. A Bangla writing expert in the U.S. could earn US$22 an hour; one in India would get US$2 an hour. In August, The Washington Post reported many Remotasks workers in the Philippines earn far below the minimum wage, while Fairwork, a project at the Oxford Internet Institute in Britain, found that Scale AI met only one out of 10 principles of fair labour.

Scale AI said in a statement that it ensures fair and competitive compensation, but did not comment on minimum wage thresholds. A recent survey of 5,000 Remotasks workers found nearly three-quarters were satisfied working on the platform and more than 80 per cent appreciated the flexibility, according to the company.

For other workers, training language models became almost addictive. In July, while browsing Reddit, Liam Gallagher came across the website Data Annotation Tech, which promised “unlimited earnings.” He was soon making US$20 per hour conversing with a chatbot, asking it to write up pitches for movies and rating the responses.

Whenever he got an e-mail notifying him that new tasks were available, he immediately signed in, but found the jobs were gone within minutes. “People would just dogpile on these things,” Mr. Gallagher said. He stopped waiting for e-mails and constantly refreshed the site instead, noticing that new tasks tended to be posted at night.

The weird thing is that he has no idea who owns Data Annotation Tech. He was paid through PayPal by an entity called WFH Tasks, which has a defunct web page and a San Francisco phone number. When I called the number, the man who answered sounded very confused when I asked about this. According to reporting by The Verge, Data Annotation Tech and a couple of other sites “appear” to be owned by Surge AI, but the company would not confirm it. Surge AI did not respond to questions from The Globe.

Soon, Mr. Gallagher received a message saying his account was disabled because of quality issues. There was no further explanation, and his e-mails to user support (once he actually tracked down an address) went unanswered. He felt he put care into his work, and was never told what he’d done wrong. He’s not alone, either: Others have shared similar experiences on Reddit.

Some people I spoke to experienced niggling doubts. Laura, a pseudonym, spends about 20 hours a week on Data Annotation Tech, usually conversing with a chatbot. She sometimes runs out of ideas, and has resorted to asking the chatbot for tips on overcoming writer’s block. Laura, who is 58, derives close to 40 per cent of her income this way, and while the hourly rate is decent, she never quite knows when a project is going to end.

She also wonders about the bigger impact of helping train language models. “I know in my heart that AI is going to replace jobs, and that gives me pause,” she said. But she has to work, too. “There’s not a whole lot of skills I had that are marketable.”

Josh, the freelance writer, is even more conflicted. He’s made a livelihood as a writer, yet he’s helping AI become better at that very skill. “It’s depressing,” he said, “but I don’t feel like I have a choice.”


Illustration by KUBA FERENC

Over the past few months, I dabbled on two AI training platforms. My first task on one site was judging the relevance of images to search queries – in French, even though I hadn’t indicated any knowledge of the language. After a few tries, an e-mail arrived telling me that “due to low accuracy and speed issues,” I was disqualified. “This decision is not reversible.”

Other tasks were so baffling, I gave up. One required me to rank product attributes (size, appearance, quality) to train a chatbot for a retailer. My rankings were almost always wrong. It turns out when buying a “turtle wine bottle stopper,” sturdiness is the most important factor, not appearance. Later, I was banned from another task (low accuracy, again) where I had to give my opinion on which AI-generated image best matched a prompt. Both jobs wanted my judgment, but I had to conform to a consensus.

I had better luck on another platform and qualified to help pair smart glasses with a chatbot. I would be shown videos and come up with a command someone might give to an AI assistant if viewing the scene, and provide the answer. I read through a 45-page guide (“Do NOT ask the Assistant to ‘roast’ people”), watched a video that contradicted the guide at least once and got to work.

On my screen were stock video clips – women laughing with champagne, businessmen laughing with documents – and a pre-filled command for each one. Most asked me to write a haiku or a limerick about the scene. For a short video of concrete tumbling down a chute, I composed a wish-you-were-here text to a non-existent friend. “It’s fascinating to witness,” I wrote. “Hope to see you here next time.”

While these tasks required some creativity, annotating images broke my brain. I read through pages and pages of guidelines titled Instructions for Human Parsing. The project consisted of what looked like security camera snapshots of shoppers perusing grocery stores, and I had to trace and label every visible body part. There were detailed (and convoluted) tips for handling long hair, hats, jackets and shopping bags. The images were taken by a ceiling camera, so everyone was shrunken and unfamiliar. Quotidian matters turned existential. When is hair considered long? Where does a calf end and a foot begin?

In my unpaid training exercise, the system was exacting down to the pixel, and it was dispiriting to think of someone spending so much time trying to satisfy its demands without being paid to learn and without clarity on how much work would be available.

Scale AI, for one, offers boot camps for some projects. A former trainer I spoke to used to go to an office in Nairobi before the pandemic to instruct a roomful of recruits on labelling data for self-driving vehicles. When lockdowns hit, the sessions turned virtual. His classes were filled with people from the Philippines and Venezuela, and he worked 14-hour days. Each week, he had to bring in around 20 new people.

The former trainer previously worked as an annotator for another company, where one of his tasks was to train a refuse-sorting machine by looking at photos of garbage and labelling plastic and paper. He liked his training role, though, not necessarily because of anything to do with AI. He enjoyed meeting people from around the world in the boot camps, and figuring out how to motivate his trainees. Working in AI ironically helped him understand a little more about human nature.


Pumping AI models with more curated data has taken the technology far – but not far enough. Self-driving vehicles, for example, can respond incorrectly in unfamiliar situations. Mary Cummings, an engineering professor at George Mason University, wrote recently that autonomous vehicles “struggle to perform even basic operations when the world does not match their training data.”

Meanwhile, chatbots can still dispense harmful information and make factual errors, while more research has emerged about the challenges of RLHF. Paying people by the task, for example, could incentivize them to cut corners. One study published in June by researchers at the Swiss Federal Institute of Technology estimated that up to 46 per cent of workers on Amazon Mechanical Turk, a crowdsourcing platform owned by the online retailer, used an LLM to complete a summarization project instead of doing it themselves. People make mistakes, too, owing to time limits imposed on them and mental fatigue.

We also disagree with one another. That’s not a big deal when polling for what we most desire in a turtle wine bottle stopper, but it’s thornier with divisive topics. Feedback for a chatbot’s definition of “woke” could vary drastically, for example. (One Telus International contractor told me he did, in fact, have to choose the best explanation for “woke.”) Indeed, while one of the goals of RLHF is to reduce bias, people who provide feedback bring their own biases. Stephen Casper, a computer science PhD candidate at the Massachusetts Institute of Technology, wrote in a recent paper that, “When preferences differ, the majority wins, potentially disadvantaging under-represented groups.”
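A toy illustration of Mr. Casper’s point, with invented numbers:

```python
# Majority-vote preference aggregation: the minority view leaves no trace.
from collections import Counter

# Ten raters compare two candidate responses; seven prefer A, three prefer B.
votes = ["A"] * 7 + ["B"] * 3
winner = Counter(votes).most_common(1)[0][0]
print(winner)  # -> "A": the model is tuned toward A, and the 30 per cent
               # who preferred B vanish from the training signal.
```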

And then there are the employment conditions, which some groups are aiming to improve. The Alphabet Workers Union-Communications Workers of America is seeking to unionize workers and contractors for the Google parent company, including raters. Today, the union has more than 1,400 members in Canada and the U.S.

Toni Allen, a Telus International rater and union organizer in Oklahoma, told me the union is ultimately seeking basic improvements – the ability to bargain for paid time off, health benefits, better wages and more consistent work. While Google has stipulated that its suppliers in the U.S. are required to provide a US$15 minimum wage, Ms. Allen gets US$14 in her independent contractor role. “We are important,” she said. “It takes a lot of hard work, someone who can do research and has the critical thinking skills to fact-check these things. It’s not just sitting on a couch pushing buttons.”

Mr. Cunnane with Telus International said in an e-mail that the company maintains wage compliance where it operates and that the main draw of the role is the flexibility for workers to create their own schedules. That allows them to work two or more jobs, if they choose, he said. The company is also “uplevelling” the worker experience through wellness programs and learning resources.

It shouldn’t be surprising to learn that AI companies want to automate more of the process. Telus International has a goal to automate 90 per cent of the data annotation work for self-driving vehicles, according to a recent webinar on YouTube. The volumes of data Telus International handles – one recent project required millions of annotations – render manual labelling impractical. “More automation does not eliminate human involvement, rather, it shifts the focus to more complex work,” Mr. Cunnane said.

LLMs, he said, will continue to need feedback from real people, particularly to improve accuracy and eliminate bias. “Humanity-in-the-loop will be an important component as these models continue to improve.”

A smartphone operating Anthropic’s new Claude AI assistant, in July, 2023. The company is researching a technique in which one model trains another, reducing the need for human feedback. JACKIE MOLLOY/The New York Times News Service

Still, LLM developers are exploring alternate methods. Anthropic, a competitor of OpenAI, is researching a technique in which one model trains another, reducing the need for human feedback. Google researchers published a paper in September along similar lines, exploring a method called reinforcement learning from AI feedback, which they said could address the logistical challenges of collecting human input. Meanwhile, Cohere chief executive officer Aidan Gomez suggested in a VentureBeat interview that AI models may soon have gleaned all they can from us. “We’re starting to run up on the extent of human knowledge,” he said. “As you start to approach the performance of the best humans in a particular field, there are increasingly few people for you to turn to.”

Whichever way the technology goes, Kristin will be playing a smaller role with Telus International. The job had begun to wear on her, particularly the isolation of sitting at home and fact-checking material for an unseen AI model. She felt no connection to the company; her only interaction with other contractors came through unofficial Reddit forums.

When Kristin and I spoke again in August, she had started as a nanny and was limiting her Telus International work to evenings and weekends. The change has been like fresh air. “It’s nice to have actual interactions with people,” she said, “and getting to do something that feels far more valuable.”
