Research & Development World

Enron becomes unlikely data source for computer science researchers

By R&D Editors | April 29, 2015

Image credit: Roscoe Ellis, shared under a Creative Commons license via Flickr.

Computer science researchers have turned to unlikely sources – including Enron – to assemble huge collections of spreadsheets that can be used to study how people use this software. The goal is for the data to support research that makes spreadsheets more useful.

“We study spreadsheets because spreadsheet software is used to track everything from corporate earnings to employee benefits, and even simple errors can cost organizations millions of dollars,” says Emerson Murphy-Hill, an assistant professor of computer science at NC State and co-author of two new papers on the work.

However, there are relatively few public collections of spreadsheet data available for research purposes. For example, the collection currently used by most researchers consists of approximately 4,500 spreadsheets.

But researchers are now making two new collections available – one has 15,000 spreadsheets and the other has more than 249,000.

“In addition, we are publishing a technique that other researchers can use to collect additional spreadsheet data,” Murphy-Hill says.

The 15,000-spreadsheet collection consists entirely of spreadsheets taken from internal Enron emails, which were made public after the emails were subpoenaed by prosecutors.

“Our focus is on how users interact with spreadsheets,” Murphy-Hill says. “And these spreadsheets actually tell us a lot about how users represent and manipulate data.”

To assemble the second set of spreadsheets, called Fuse, the researchers developed their own technique to identify and extract spreadsheets from an online archive of over 5 billion webpages. Using their technique, the researchers collected 249,376 spreadsheets – including spreadsheets made as recently as 2014.
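The article does not describe how Fuse recognizes a spreadsheet inside a web archive, so the following is only a minimal sketch of one plausible approach, not the researchers' method. It assumes each archived record is available as a (URL, Content-Type header, raw bytes) tuple, and it flags a record as a spreadsheet candidate based on its declared MIME type or its file signature. The function names and the record layout are hypothetical.

```python
import io
import zipfile

# MIME types commonly declared for Excel files (.xls and .xlsx).
SPREADSHEET_MIME_TYPES = {
    "application/vnd.ms-excel",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}

# File signatures: OLE2 compound files (legacy .xls) and ZIP archives (.xlsx).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
ZIP_MAGIC = b"PK\x03\x04"


def looks_like_spreadsheet(content_type: str, payload: bytes) -> bool:
    """Return True if a record is a plausible spreadsheet candidate.

    Hypothetical helper for illustration; a real pipeline would still
    need to open each candidate to confirm it parses as a workbook.
    """
    mime = content_type.split(";")[0].strip().lower()
    if mime in SPREADSHEET_MIME_TYPES:
        return True
    if payload.startswith(OLE2_MAGIC):
        # OLE2 is also used by legacy Word and PowerPoint files,
        # so this is a candidate only, not a confirmed spreadsheet.
        return True
    if payload.startswith(ZIP_MAGIC):
        # An .xlsx file is a ZIP archive that contains xl/workbook.xml.
        try:
            with zipfile.ZipFile(io.BytesIO(payload)) as archive:
                return "xl/workbook.xml" in archive.namelist()
        except zipfile.BadZipFile:
            return False
    return False


def extract_spreadsheet_urls(records):
    """Filter (url, content_type, payload) tuples down to candidate URLs."""
    return [
        url
        for url, content_type, payload in records
        if looks_like_spreadsheet(content_type, payload)
    ]
```

Because each record is checked independently, a filter of this kind can be spread across many machines, which fits the cloud-based search the researchers describe.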

“Fuse used cloud infrastructure to search through billions of webpages to identify and extract the spreadsheets we write about in this paper,” says Titus Barik, a Ph.D. student at NC State, researcher at ABB Corporate Research, and lead author of the paper on Fuse. “Commodity cloud computing is incredibly exciting – searching those pages would take about seven years of continuous computation on a single computer, but the economies of scale with cloud computing allowed us to accomplish this with Fuse in only a few days.”

“And the fact that Fuse includes recent spreadsheets is a significant advantage over other spreadsheet collections, because the information is more up-to-date and reflects changes in Excel and other spreadsheet software,” Murphy-Hill says.

“Fuse is also more reproducible than other spreadsheet collections,” says Kevin Lubick, a Ph.D. student at NC State and co-author of a paper about Fuse. “Reproducibility is the cornerstone of good scientific research, but many existing spreadsheet collections are difficult to reproduce. Our technique can be used by anyone, and they’ll get the same results we get. But the results will also include any new spreadsheets made available since the last time the program was run.”

Source: NC State University
