Outline:

What Is Big Data?

Introduction

  • Big Data is often described as extremely large data sets that have grown beyond the ability to manage and analyze them with traditional data processing tools.
  • In other words, the data set has grown so large that it is difficult to manage and even harder to extract value from.
  • The primary difficulties are the acquisition, storage, searching, sharing, analytics, and visualization of data.
  • The concept has evolved to include not only the size of the data set but also the processes involved in leveraging the data.
  • Big Data has even become synonymous with other business concepts, such as business intelligence, analytics, and data mining.
  • Paradoxically, Big Data is not that new.

THE ARRIVAL OF ANALYTICS

As analytics and research were applied to large data sets,

  • Scientists came to the conclusion that more is better—in this case,
    • more data,
    • more analysis,
    • and more results.
  • Researchers started to incorporate more kinds of data into the process, which in turn gave birth to what we now call Big Data:
    • related data sets,
    • unstructured data,
    • archival data,
    • and real-time data.

In the business world, Big Data is all about opportunity.

  • According to IBM,
    • every day we create 2.5 quintillion (2.5 × 10^18) bytes of data,
    • so much that 90 percent of the data in the world today has been created in the last two years.
  • These data come from everywhere:
    • sensors used to gather climate information,
    • posts to social media sites,
    • digital pictures and videos posted online,
    • transaction records of online purchases,
    • and cell phone GPS signals, to name just a few.
  • That is the catalyst for Big Data, along with the more important fact that
    • all of these data have intrinsic value
    • that can be extracted using analytics, algorithms, and other techniques.
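
To put the quoted daily figure in more familiar units, the arithmetic can be sketched as below. This is only an illustration: the 2.5 × 10^18 bytes-per-day figure is the one attributed to IBM above, the unit sizes are the decimal (SI) definitions, and the function names and the 365-day year are assumptions made for the example.

```python
# Figure quoted in the outline (attributed to IBM): bytes created per day.
BYTES_PER_DAY = 2.5e18

# Decimal (SI) unit sizes in bytes.
UNITS = {
    "terabyte": 10**12,
    "petabyte": 10**15,
    "exabyte": 10**18,
}


def per_day_in(unit: str) -> float:
    """Express the daily volume in the given unit."""
    return BYTES_PER_DAY / UNITS[unit]


def per_year_in(unit: str, days: int = 365) -> float:
    """Express a year's volume in the given unit (assumes a constant daily rate)."""
    return per_day_in(unit) * days


if __name__ == "__main__":
    for name in UNITS:
        print(f"{per_day_in(name):,.1f} {name}s per day")
```

At a constant rate, 2.5 exabytes per day would come to roughly 900 exabytes per year; the real rate is not constant, so this is an order-of-magnitude illustration only.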

Big Data has already proved its importance and value in several areas.

  • NOAA uses Big Data approaches to aid in climate, ecosystem, weather, and commercial research,
  • while NASA uses Big Data for aeronautical and other research.
  • Pharmaceutical companies and energy companies have leveraged Big Data for more tangible results, such as drug testing and geophysical analysis.
  • The New York Times has used Big Data tools for text analysis and Web mining,
  • while the Walt Disney Company uses them to correlate and understand customer behavior in all of its stores, theme parks, and Web properties.

Big Data plays another role in today’s businesses:

  • Large organizations increasingly face the need to maintain massive amounts of structured and unstructured data to comply with government regulations:
    • from transaction information in data warehouses to employee tweets,
    • from supplier records to regulatory filings.
  • That need has been driven even more by recent court cases that have encouraged companies to keep material that may be required for e-discovery if they face litigation:
    • large quantities of documents,
    • e-mail messages,
    • and other electronic communications, such as instant messaging and Internet Protocol (IP) telephony.

WHERE IS THE VALUE?

Extracting value is much more easily said than done.

  • Big Data is full of challenges, ranging from the technical to the conceptual to the operational,
  • any of which can derail the ability to discover value and leverage what Big Data is all about.

Perhaps it is best to think of Big Data in multidimensional terms, in which four dimensions relate to the primary aspects of Big Data.

  1. Volume. Big Data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information.
  2. Variety. Big Data extends beyond structured data to include unstructured data of all varieties: text, audio, video, click streams, log files, and more.
  3. Veracity. The massive amounts of data collected for Big Data purposes can lead to statistical errors and misinterpretation of the collected information. Purity of the information is critical for value.
  4. Velocity. Often time sensitive, Big Data must be used as it streams into the enterprise in order to maximize its value to the business, but it must also remain available from archival sources.

The technologies and concepts used to extract value from Big Data are best defined as analysis categories, and include the following:

  • Traditional business intelligence (BI).
    • This consists of a broad category of applications and technologies for gathering, storing, analyzing, and providing access to data.
    • BI delivers actionable information, which helps enterprise users make better business decisions using fact-based support systems.
    • BI works by using an in-depth analysis of detailed business data, provided by databases, application data, and other tangible data sources.
    • In some circles, BI can provide historical, current, and predictive views of business operations.
  • Data mining.
    • This is a process in which data are analyzed from different perspectives and then turned into summary data that are deemed useful.
    • Data mining is normally used with data at rest or with archival data.
    • Data mining techniques focus on modeling and knowledge discovery for predictive, rather than purely descriptive, purposes—an ideal process for uncovering new patterns from large data sets.
  • Statistical applications.
    • These look at data using algorithms based on statistical principles and normally concentrate on static data sets, such as polls and censuses.
    • Statistical applications ideally deliver sample observations that can be used to study the wider population for the purposes of estimation, testing, and predictive analysis.
    • Empirical data, such as surveys and experimental reporting, are the primary sources for analyzable information.
  • Predictive analysis.
    • This is a subset of statistical applications in which data sets are examined to come up with predictions, based on trends and information gleaned from databases.
    • Predictive analysis is widely used in the financial and scientific worlds, where trends tend to drive predictions once external elements are added to the data set.
    • One of the main goals of predictive analysis is to identify the risks and opportunities for business processes, markets, and manufacturing.
  • Data modeling.
    • This is a conceptual application of analytics in which multiple “what-if” scenarios can be applied via algorithms to multiple data sets.
    • Ideally, the modeled information changes based on the information made available to the algorithms, which then provide insight into the effects of the change on the data sets.
    • Data modeling works hand in hand with data visualization, which helps surface information relevant to a particular business endeavor.
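
Three of the categories above (statistical applications, predictive analysis, and data modeling) can be illustrated on a toy scale. The sketch below is not how any particular BI or analytics product works; the sales figures are made up, and every function name is hypothetical. It summarizes a sample, fits a straight-line trend by least squares to forecast the next value, and then reruns the forecast under a "what-if" change to the growth rate.

```python
import statistics

# Hypothetical monthly sales figures, invented purely for illustration.
monthly_sales = [100.0, 110.0, 120.0, 130.0, 140.0, 150.0]


def describe(values):
    """Statistical application: summarize a sample of observations."""
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }


def fit_trend(values):
    """Predictive analysis: least-squares line through (0, v0), (1, v1), ..."""
    xs = list(range(len(values)))
    x_mean = statistics.mean(xs)
    y_mean = statistics.mean(values)
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
    denominator = sum((x - x_mean) ** 2 for x in xs)
    slope = numerator / denominator
    intercept = y_mean - slope * x_mean
    return slope, intercept


def forecast(values, steps_ahead=1):
    """Extend the fitted trend line past the last observation."""
    slope, intercept = fit_trend(values)
    return intercept + slope * (len(values) - 1 + steps_ahead)


def what_if_forecast(values, steps_ahead=1, slope_factor=1.0):
    """Data modeling: redo the forecast assuming growth is scaled by slope_factor."""
    slope, intercept = fit_trend(values)
    last_fitted = intercept + slope * (len(values) - 1)
    return last_fitted + slope * slope_factor * steps_ahead


if __name__ == "__main__":
    print("summary:", describe(monthly_sales))
    print("next month (trend):", forecast(monthly_sales))
    print("next month (growth halved):", what_if_forecast(monthly_sales, slope_factor=0.5))
```

With a `slope_factor` of 1.0 the what-if result matches the plain forecast; changing the factor shows how a different assumption about growth moves the prediction, which is the essence of the "what-if" scenarios described above.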