
A laboratory technician examines test tube samples, the type of hands-on experimental work that remains largely unaffected by current genAI capabilities, according to Microsoft’s study. [Adobe Stock]
While the analysis excludes other R&D-related applications of AI, ranging from, say, AlphaFold's protein structure prediction to custom AI-powered search tools from science publishers, the report finds that even off-the-shelf AI offerings could have a big R&D impact.
Despite occasional high-profile cases of AI hallucinations and corner-cutting by researchers, journalists, and other professionals, genAI achieves generally high user satisfaction. The Microsoft study reports over 70% positive feedback for its most common activities. GenAI excels at assisting with information gathering, research, and writing—core R&D workflows like literature reviews, hypothesis testing and report drafting. Yet it struggles with data analysis, visual design, and creative problem-solving, where satisfaction rates drop below 50%.
The core message from Microsoft's research: generative AI is not a universal solvent for R&D challenges. Tools like Bing Copilot could widen the gap between the productivity of desk work and bench work. R&D organizations that understand this distinction and adopt a tailored, two-track strategy for such systems will be best positioned to adapt.

A closer look at the study design and findings
The Microsoft Research paper analyzed 200,000 Bing Copilot conversations from 2024, focusing on how generative AI affects work by distinguishing user goals (tasks users seek help with) from AI actions (tasks the AI performs). User goals tended to center on knowledge work, like researching, writing, and analyzing, while AI actions focused on providing information, teaching, and advising. The study maps these to O*NET's Generalized Work Activities (GWAs) and Intermediate Work Activities (IWAs).
High-prevalence GWAs include "Getting Information" and "Thinking Creatively," accounting for over 20% of Copilot usage, far above their 10-15% share of U.S. workforce tasks (per O*NET/BLS data). The top IWAs are similarly concentrated; "Gather information from physical or electronic sources" alone accounts for nearly 20% of user-goal activity. These map neatly onto R&D needs, think literature reviews and drafting reports, and boast over 70% positive feedback along with high task-completion rates. Conversely, physical and monitoring GWAs, like "Handling and Moving Objects" or "Inspecting Equipment," are not exactly a forte of a large language model.
Success metrics highlight AI's strengths and limits. Thumbs-up feedback and task-completion rates correlate strongly (r = 0.80 for user goals, 0.76 for AI actions), and both run high for research and writing activities. Yet data analysis and visual design IWAs receive the lowest satisfaction scores among common tasks. The study's AI applicability score, which reflects task overlap and completion, tracks prior AI-exposure estimates such as Eloundou et al. closely (r = 0.91 at broad occupational levels) but correlates only weakly with wages and education requirements.
Most and least affected R&D-relevant roles
Microsoft’s study pegs AI applicability scores at 0 to 1, based on how much occupational work overlaps with Bing Copilot usage, adjusted for completion rates. For R&D, STEM roles feature prominently among high scorers, while hands-on lab work shows minimal AI overlap.
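The scoring idea described above can be sketched in a few lines: weight each of an occupation's O*NET work activities by importance, count an activity only if Copilot usage overlaps it, and scale by the completion rate. This is a simplified reading of the paper's method, and the activity names, weights, and rates below are invented for illustration.

```python
# Hypothetical sketch of an importance-weighted AI applicability score.
# This simplifies the paper's actual method; all inputs are invented.

def applicability_score(activities):
    """activities: list of (importance, ai_overlap, completion_rate) tuples."""
    total = sum(imp for imp, _, _ in activities)
    # Only activities with Copilot overlap contribute, scaled by completion.
    covered = sum(imp * comp for imp, overlap, comp in activities if overlap)
    return covered / total if total else 0.0

# Invented activity profile loosely shaped like a technical-writer role.
tech_writer = [
    (0.9, True, 0.75),   # edit written materials or documents
    (0.7, True, 0.70),   # explain technical details of products
    (0.5, False, 0.0),   # attend meetings (no Copilot overlap)
    (0.4, False, 0.0),   # coordinate schedules (no Copilot overlap)
]
print(round(applicability_score(tech_writer), 2))  # 0.47 for this toy profile
```

A role dominated by desk-work IWAs scores high; a bench-heavy role, where most importance sits in non-overlapping physical activities, stays near zero.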
Most affected
Among STEM-relevant occupations, CNC Tool Programmers lead with a 0.44 score, meaning 44% of their importance-weighted tasks align with AI-assisted activities. Mathematicians follow at 0.39, benefiting from AI’s strength in analyzing data and developing models. Technical writers score 0.38, driven by IWAs such as “Edit written materials or documents” and “Explain technical details of products/services.”
Data scientists achieve 0.36, with market research analysts and web developers both at 0.35. These scores stem from high-prevalence IWAs in Copilot chats: gathering information, writing, and analyzing data. Overall, these activities show positive feedback rates above 70%. For R&D, this could translate into faster literature reviews, data analysis, and technical documentation, though accuracy concerns persist.
Benefits include quicker prototyping and grant writing. Dell’Acqua et al. (2023) found AI lifts consulting output; similar gains apply to medical diagnosis (Goh et al., 2024). But limits exist: AI shows lower satisfaction scores for “Thinking Creatively” (especially visual design) and “Analyzing Data or Information” tasks.
Top R&D-relevant occupations from the paper with some other roles (CNC tooling and web dev) thrown in for context:
| Occupation | AI Applicability Score | Key Contributing IWAs | Employment (BLS) |
|---|---|---|---|
| CNC tool programmers | 0.44 | Program computer systems, Edit written materials | 28,030 |
| Mathematicians | 0.39 | Analyze data, Develop models, Research | 2,220 |
| Technical writers | 0.38 | Edit written materials, Explain technical details | 47,970 |
| Data scientists | 0.36 | Gather information, Analyze data, Write reports | 192,710 |
| Market research analysts | 0.35 | Research market conditions, Evaluate products | 846,370 |
| Web developers | 0.35 | Program computer systems, Create visual designs | 85,350 |
Least affected
The paper's bottom 40 occupations (see Table 4 in the paper for the full list) show scores under 0.03 and are dominated by physical tasks. While lab technicians specifically aren't listed, related roles like medical equipment preparers score just 0.02, and ophthalmic medical technicians score 0.01. These low scores reflect minimal Copilot overlap with IWAs like "Operate equipment," "Inspect materials," and "Handle specimens."
While AI-robotics integration is advancing rapidly, with several research groups developing genAI-powered robotic systems, such embodied AI applications were outside the scope of the Microsoft paper, which focused on LLM chatbot interactions.
Here, the paper identifies core gaps: AI shows near-zero usage for “Performing General Physical Activities,” “Handling and Moving Objects,” and “Monitoring Processes, Materials, or Surroundings.” These GWAs are underrepresented in Copilot data compared to their workforce prevalence. For R&D, this means AI can’t assist with hands-on experiments, equipment calibration, or sample collection.
Least affected R&D-adjacent occupations (sample):
| Occupation | AI Applicability Score | Key Non-AI Activities | Employment (BLS) |
|---|---|---|---|
| Medical equipment preparers | 0.02 | Operate equipment, Sterilize instruments | 66,790 |
| Machine feeders and offbearers | 0.02 | Feed materials, Monitor equipment | 44,500 |
| Ophthalmic medical technicians | 0.01 | Operate diagnostic equipment, Assist procedures | 73,390 |
| Chemical plant operators* | 0.00 | Monitor chemical processes, Adjust controls | 120,710 |
| Foundry mold and coremakers | 0.00 | Create molds, Handle materials | 11,780 |
*Listed as “Water Treatment Plant and System Operators” in the paper, representing similar process monitoring roles.
In R&D, large language models shine, for the most part, at desk work (literature synthesis, data analysis, technical writing) but have far less impact at the bench. The 40% of conversations in which user goals and AI actions are completely disjoint underscores this divide. Leaders should integrate AI for knowledge-work efficiency while recognizing its limits in physical experimentation and equipment operation.



