Research & Development World

  • R&D World Home
  • Topics
    • Aerospace
    • Automotive
    • Biotech
    • Careers
    • Chemistry
    • Environment
    • Energy
    • Life Science
    • Material Science
    • R&D Management
    • Physics
  • Technology
    • 3D Printing
    • A.I./Robotics
    • Software
    • Battery Technology
    • Controlled Environments
      • Cleanrooms
      • Graphene
      • Lasers
      • Regulations/Standards
      • Sensors
    • Imaging
    • Nanotechnology
    • Scientific Computing
      • Big Data
      • HPC/Supercomputing
      • Informatics
      • Security
    • Semiconductors
  • R&D Market Pulse
  • R&D 100
    • Call for Nominations: The 2025 R&D 100 Awards
    • R&D 100 Awards Event
    • R&D 100 Submissions
    • Winner Archive
    • Explore the 2024 R&D 100 award winners and finalists
  • Resources
    • Research Reports
    • Digital Issues
    • Educational Assets
    • R&D Index
    • Subscribe
    • Video
    • Webinars
  • Global Funding Forecast
  • Top Labs
  • Advertise
  • SUBSCRIBE

Text Mining: The Next Data Frontier

By R&D Editors | January 6, 2014

By some estimates, 80 percent of available information occurs as free-form text

Figure 1: Text Mining and Related FieldsJosiah Stamp said: “The individual source of the statistics may easily be the weakest link.” Nowhere is this more true than in the new field of text mining, given the wide variety of textual information. By some estimates, 80 percent of the information available occurs as free-form text which, prior to the development of text mining, needed to be read in its entirety in order for information to be obtained from it. It has been applied to spam filters, fraud detection, sentiment analysis, identification of trends and authorship.

Text mining can be defined as the analysis of semi-structured or unstructured text data. The goal is to turn text information into numbers so that data mining algorithms can be applied. It arose from the related fields of data mining, artificial intelligence, statistics, databases, library science, and linguistics (Figure 1).

Figure 2: Text Mining Decision TreeThere are seven specialties within text mining that have different objectives. These can be decided by answers to the questions shown in the decision tree in Figure 2. These specialties are:
1.    Information retrieval: storage and retrieval of documents
2.    Document Clustering: group and categorize documents using data mining clustering algorithms
3. Document Classification: group and categorize documents based on labeled examples
4. Web mining: understand relationships of hyper linkages of documents on the web
5.    Information Extraction: identify specific facts and relationships of unstructured text
6.    Natural language processing: understanding language structure, such as parts of speech
7.    Concept extraction: group words into similar semantic groups

Figure 3: Text Mining MethodologyOnce the objective (specialty) has been determined, the basic methodology can be applied (Figure 3). It can be summarized in the following scheme:
1. Create the corpus by collecting the documents. This can be manual or automatic, such as via a Web crawler or database query. These are organized into a similar format.
2. Clean up the text according to the following steps:

a.    Tokenization: Fragment the text into items that can be counted.
b.    Stemming: Identify common core word fragments (e.g. “result,” “results,” “resulting,” “resulted” all become “result.”
c.    Eliminate stop words: Create a dictionary of low predictive value words (e.g. a, an, the).
d.    Eliminate frequent words: Create a dictionary of common, repetitive words often found as a group (e.g. “For information only”).
e.    Normalize spelling: Use or create a dictionary to correct misspellings using fuzzy matching.
f.    Case normalization: Convert all text to lower case to prevent counting words with different capitalization separately.
g.    Eliminate punctuation: Remove punctuation to prevent counting words with and without punctuation being counted separately.

3. Reduce dimensionality and select features.

a. Term Document Matrix (TDM): Create a two dimensional matrix where each document is one row and each column is a term from the abbreviated list generated by cleaning up the text. The relationship between the row and column is represented by indices. Singular Value Decomposition (SVD) is used to expose the underlying meaning and structure by reducing the dimensionality. It is related to principal components analysis.
b. Indices: At the simplest level, this can be the count, or number of times a term appears in a document. Log or binary frequencies can be used to dampen large number of occurrences. The most commonly used index is the inverse document frequency (ITF). It represents the relative importance of a term and reflects the relative frequency of occurrence of terms and their document frequencies.

4. Extract knowledge (examples of methods)

a. Classification: Assign terms into a predetermined set of categories. A training data set is used with documents and categories. This is used in genre detection, spam filtering and Web page categorization.
b. Clustering: Terms are placed into natural occurring but meaningful groups. This is used in document retrieval to enable improved Web searches and analysis of large text collections.
c. Association: Identify terms that are frequently found together. This is known as market basket analysis in the retail industry where items are bought together.
d. Trend analysis: Identify time dependent changes in a term. This is used in identifying rising popularity of technologies.

There are numerous vendors supplying sophisticated text mining packages, such as SAS Text Miner, IBM SPSS Modeler and Statistica Text Miner. There also are free and open-source tools available, such as R and RapidMiner.

Due to the rapidly expanding amount of text-based information, text mining has already been applied regularly in the areas of spam filters, fraud detection, sentiment analysis, identification of trends and authorship. Future challenges will be applications to large corpora, domain knowledge, personalization, multi-lingual capabilities, and maintenance of dictionaries as area-specific language evolves.

References
1. Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications, by John Elder, Gary Miner, and Bob Nisbet, Academic Press, 2012
2. Text Mining Handbook, Louise Francis, FCAS, MAAA, and Matt Flynn, PhD, Casualty Actuarial Society, E-Forum, Spring, 2010
3. An Introduction to Text Mining in R, Ingo Feirer, R News, Vol. 8/2, p. 19–22, Oct. 2008

Mark Anawis is a Principal Scientist and ASQ Six Sigma Black Belt at Abbott. He may be reached at [email protected].

Related Articles Read More >

Why IBM predicts quantum advantage within two years
Aardvark AI forecasts rival supercomputer simulations while using over 99.9% less compute
This week in AI research: Latest Insilico Medicine drug enters the clinic, a $0.55/M token model R1 rivals OpenAI’s $60 flagship, and more
How the startup ALAFIA Supercomputers is deploying on-prem AI for medical research and clinical care
rd newsletter
EXPAND YOUR KNOWLEDGE AND STAY CONNECTED
Get the latest info on technologies, trends, and strategies in Research & Development.
RD 25 Power Index

R&D World Digital Issues

Fall 2024 issue

Browse the most current issue of R&D World and back issues in an easy to use high quality format. Clip, share and download with the leading R&D magazine today.

Research & Development World
  • Subscribe to R&D World Magazine
  • Enews Sign Up
  • Contact Us
  • About Us
  • Drug Discovery & Development
  • Pharmaceutical Processing
  • Global Funding Forecast

Copyright © 2025 WTWH Media LLC. All Rights Reserved. The material on this site may not be reproduced, distributed, transmitted, cached or otherwise used, except with the prior written permission of WTWH Media
Privacy Policy | Advertising | About Us

Search R&D World

  • R&D World Home
  • Topics
    • Aerospace
    • Automotive
    • Biotech
    • Careers
    • Chemistry
    • Environment
    • Energy
    • Life Science
    • Material Science
    • R&D Management
    • Physics
  • Technology
    • 3D Printing
    • A.I./Robotics
    • Software
    • Battery Technology
    • Controlled Environments
      • Cleanrooms
      • Graphene
      • Lasers
      • Regulations/Standards
      • Sensors
    • Imaging
    • Nanotechnology
    • Scientific Computing
      • Big Data
      • HPC/Supercomputing
      • Informatics
      • Security
    • Semiconductors
  • R&D Market Pulse
  • R&D 100
    • Call for Nominations: The 2025 R&D 100 Awards
    • R&D 100 Awards Event
    • R&D 100 Submissions
    • Winner Archive
    • Explore the 2024 R&D 100 award winners and finalists
  • Resources
    • Research Reports
    • Digital Issues
    • Educational Assets
    • R&D Index
    • Subscribe
    • Video
    • Webinars
  • Global Funding Forecast
  • Top Labs
  • Advertise
  • SUBSCRIBE