What is Data Science?

This document provides several perspectives on data science. To suggest an additional perspective, please email paturi@cs.ucsd.edu.

50 years of Data Science
David Donoho

This paper reviews some ingredients of the current “Data Science moment”, including recent commentary about data science in the popular media, and about how/whether Data Science is really different from Statistics.

The now-contemplated field of Data Science amounts to a superset of the fields of statistics and machine learning which adds some technology for ‘scaling up’ to ‘big data’. This chosen superset is motivated by commercial rather than intellectual developments. Choosing in this way is likely to miss out on the really important intellectual event of the next fifty years.

Because all of science itself will soon become data that can be mined, the imminent revolution in Data Science is not about mere ‘scaling up’, but instead the emergence of scientific studies of data analysis science-wide. In the future, we will be able to predict how a proposal to change data analysis workflows would impact the validity of data analysis across all of science, even predicting the impacts field-by-field.

Drawing on work by Tukey, Cleveland, Chambers and Breiman, I present a vision of data science based on the activities of people who are ‘learning from data’, and I describe an academic field dedicated to improving that activity in an evidence-based manner. This new field is a better academic enlargement of statistics and machine learning than today’s Data Science Initiatives, while being able to accommodate the same short-term goals.

Realizing the Potential of Data Science
Final Report from the CISE Data Science Committee, September 2016

Much has been made of the rise of digital data as a driver that is advancing virtually every intellectual endeavor. In the research community, data-driven discovery is extending fundamental approaches that started with observational discovery and theoretical discovery, and has embraced the quantitative assets of the Information Age with computational discovery. In commerce, digital data often serves as a disruptive technology – fundamentally changing our ability to create value from information and understand, interpret, and respond to the needs of customers and clients.

It is not too extreme to say that data is changing everything. As a result, we see the emergence of a new field – Data Science – that focuses on the processes and systems that enable us to extract knowledge or insights from data in various forms, either structured or unstructured. In practice, Data Science has evolved as an interdisciplinary field that integrates approaches from data analysis fields such as Statistics, Data Mining, and Predictive Analytics. Of particular interest for this report is the deep connection between Data Science and Computer Science; as noted recently in Forbes, “[Data Science is] the story of the coupling of the mature discipline of statistics with a very young one–computer science.”

A Very Short History of Data Science
G. Press

The story of how data scientists became sexy is mostly the story of the coupling of the mature discipline of statistics with a very young one--computer science. The term “Data Science” has emerged only recently to specifically designate a new profession that is expected to make sense of the vast stores of big data. But making sense of data has a long history and has been discussed by scientists, statisticians, librarians, computer scientists and others for years. The following timeline traces the evolution of the term “Data Science” and its use, attempts to define it, and related terms.

Frontiers of Massive Data Analysis
National Academies of Science Report, 2013

Experiments, observations, and numerical simulations in many areas of science and business are currently generating terabytes of data, and in some cases are on the verge of generating petabytes and beyond. Analyses of the information contained in these data sets have already led to major breakthroughs in fields ranging from genomics to astronomy and high energy physics and to the development of new information-based industries. Traditional methods of analysis have been based largely on the assumption that analysts can work with data within the confines of their own computing environment, but the growth of “big data” is changing that paradigm, especially in cases in which massive amounts of data are distributed across locations.

While the scientific community and the defense enterprise have long been leaders in generating and using large data sets, the emergence of e-commerce and massive search engines has led other sectors to confront the challenges of massive data. For example, Google, Yahoo!, Microsoft, and other Internet-based companies have data that is measured in exabytes (10^18 bytes). Social media (e.g., Facebook, YouTube, Twitter) have exploded beyond anyone’s wildest imagination, and today some of these companies have hundreds of millions of users. Data mining of these massive data sets is transforming the way we think about crisis response, marketing, entertainment, cybersecurity, and national intelligence. It is also transforming how we think about information storage and retrieval. Collections of documents, images, videos, and networks are being thought of not merely as bit strings to be stored, indexed, and retrieved, but also as potential sources of discovery and knowledge, requiring sophisticated analysis techniques that go far beyond classical indexing and keyword counting, aiming to find relational and semantic interpretations of the phenomena underlying the data.

Big Data and its Technical Challenges
H. V. Jagadish, J. Gehrke, A. Labrinidis, Y. Papakonstantinou, J. Patel, R. Ramakrishnan, C. Shahabi

In a broad range of application areas, data is being collected at an unprecedented scale. Decisions that previously were based on guesswork, or on painstakingly handcrafted models of reality, can now be made using data-driven mathematical models. Such Big Data analysis now drives nearly every aspect of society, including mobile services, retail, manufacturing, financial services, life sciences, and physical sciences.

Science in the Age of Selfies
D. Geman and S. Geman

These days, scientists spend much of their time taking “professional selfies”— effectively spending more time announcing ideas than formulating them.

Curriculum Guidelines for Undergraduate Programs in Data Science
Annual Review of Statistics and Its Application, 2017

Even though an exact definition of data science remains elusive, we have taken as our starting point a view that seems to have emerged as a consensus from the StatSNSF (National Science Foundation Directorate for Mathematical and Physical Sciences Support for the Statistical Sciences at NSF—a subcommittee of the Mathematical and Physical Sciences Advisory Committee) committee statement that data science comprises the “science of planning for, acquisition, management, analysis of, and inference from data” (NSF 2014, p. 4). At the undergraduate level, we conceive of data science as an applied field akin to engineering, with its emphasis on using data to describe the world. At present, the theoretical foundations are drawn primarily from established strains in statistics, computer science, and mathematics. The practical real-world meanings come from interpreting the data in the context of the domain in which the data arose. For an undergraduate program, we envision a case-based focus and hands-on approach, as is common in fields such as engineering and computer science.

Computational and Inferential Thinking (A textbook developed for an introductory undergraduate course in data science)
Ani Adhikari and John DeNero, UC Berkeley

Data Science is about drawing useful conclusions from large and diverse data sets through exploration, prediction, and inference. Exploration involves identifying patterns in information. Prediction involves using information we know to make informed guesses about values we wish we knew. Inference involves quantifying our degree of certainty: will those patterns we found also appear in new observations? How accurate are our predictions? Our primary tools for exploration are visualizations and descriptive statistics, for prediction are machine learning and optimization, and for inference are statistical tests and models.

Statistics is a central component of data science because statistics studies how to make robust conclusions with incomplete information. Computing is a central component because programming allows us to apply analysis techniques to the large and diverse data sets that arise in real-world applications: not just numbers, but text, images, videos, and sensor readings. Data science is all of these things, but it is more than the sum of its parts because of the applications. Through understanding a particular domain, data scientists learn to ask appropriate questions about their data and correctly interpret the answers provided by our inferential and computational tools.
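
As a rough illustration of the three activities described above (exploration, prediction, inference), here is a minimal Python sketch. It is not taken from the textbook: the data set is made up, and the helper names fit_line and bootstrap_slopes are hypothetical. Exploration is shown with descriptive statistics, prediction with a least-squares line, and inference with a bootstrap percentile interval for the slope.

```python
import random
import statistics

# Hypothetical toy data (an assumption for illustration only):
# hours studied and the resulting exam score.
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 55, 61, 64, 70, 73, 79, 82]

# Exploration: descriptive statistics summarize the data.
mean_score = statistics.mean(scores)
spread = statistics.stdev(scores)


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


# Prediction: use the fitted line to guess a value we have not observed.
slope, intercept = fit_line(hours, scores)
predicted_at_9 = slope * 9 + intercept


def bootstrap_slopes(xs, ys, reps=2000, seed=0):
    """Refit the line on resampled (x, y) pairs; returns sorted slopes."""
    rng = random.Random(seed)
    pairs = list(zip(xs, ys))
    slopes = []
    while len(slopes) < reps:
        sample = [rng.choice(pairs) for _ in pairs]
        sx, sy = zip(*sample)
        if len(set(sx)) < 2:
            continue  # slope is undefined when every x is identical
        slopes.append(fit_line(sx, sy)[0])
    return sorted(slopes)


# Inference: a 95% percentile interval quantifies uncertainty in the slope.
slopes = bootstrap_slopes(hours, scores)
low = slopes[int(0.025 * len(slopes))]
high = slopes[int(0.975 * len(slopes)) - 1]

print(f"mean score {mean_score:.1f}, standard deviation {spread:.1f}")
print(f"fitted slope {slope:.3f}, prediction at 9 hours {predicted_at_9:.1f}")
print(f"bootstrap 95% interval for the slope: ({low:.3f}, {high:.3f})")
```

The bootstrap is only one of several ways to express uncertainty; it is used here because it needs no formulas beyond the fit itself.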

Foundations of Data Science
A. Blum, J. Hopcroft, and R. Kannan

A textbook developed to provide mathematical foundations of data science at the undergraduate and graduate levels.

Computer science as an academic discipline began in the 1960's. Emphasis was on programming languages, compilers, operating systems, and the mathematical theory that supported these areas. Courses in theoretical computer science covered finite automata, regular expressions, context-free languages, and computability. In the 1970's, the study of algorithms was added as an important component of theory. The emphasis was on making computers useful. Today, a fundamental change is taking place and the focus is more on applications. There are many reasons for this change. The merging of computing and communications has played an important role. The enhanced ability to observe, collect and store data in the natural sciences, in commerce, and in other fields calls for a change in our understanding of data and how to handle it in the modern setting. The emergence of the web and social networks as central aspects of daily life presents both opportunities and challenges for theory.

While traditional areas of computer science remain highly important, increasingly researchers of the future will be involved with using computers to understand and extract usable information from massive data arising in applications, not just how to make computers useful on specific well-defined problems. With this in mind we have written this book to cover the theory likely to be useful in the next 40 years, just as an understanding of automata theory, algorithms and related topics gave students an advantage in the last 40 years. One of the major changes is the switch from discrete mathematics to more of an emphasis on probability, statistics, and numerical methods.

Early drafts of the book have been used for both undergraduate and graduate courses. Background material needed for an undergraduate course has been put in the appendix. For this reason, the appendix has homework problems.

Undergraduate Data Science at Berkeley (Video)
John DeNero. May 4, 2017 at UC San Diego

The new Foundations of Data Science course (data8.org) is the fastest growing elective course in UC Berkeley history, with over 1700 students completing the course during the first four semesters it was offered. This course teaches students the fundamentals of programming and statistics together, leveraging sampling and simulation in order to reduce the amount of manual calculation required to reach statistical conclusions.

The rapid growth of this course, along with a surge in student enthusiasm for data science, has inspired faculty from across campus to create a variety of new courses, ranging from small introductory seminars to upper-division project courses. This talk will first highlight some unique aspects of our foundations course, including recent improvements to its open-source software infrastructure. Then, we will discuss the design choices that have supported rapid development of a data science education program which combines new offerings with existing course resources.
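
To make concrete the idea mentioned in the abstract, that sampling and simulation can replace manual calculation, here is a minimal Python sketch of a permutation test. It is an illustration under stated assumptions, not material from the course: the measurements are invented and the function name permutation_p_value is hypothetical. Instead of deriving a test statistic's distribution by hand, the code shuffles the group labels many times and counts how often chance alone produces a difference at least as large as the observed one.

```python
import random
import statistics


def permutation_p_value(group_a, group_b, reps=5000, seed=0):
    """Estimate a two-sided p-value for a difference in means by simulation."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(reps):
        # Shuffling the pooled values simulates "labels do not matter".
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / reps


# Hypothetical measurements (an assumption for illustration only).
control = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3]
treated = [12.9, 13.1, 12.7, 13.4, 12.8, 13.0]

p_value = permutation_p_value(control, treated)
print(f"estimated p-value: {p_value:.4f}")
```

A small estimated p-value suggests the observed difference would rarely arise from random labeling alone; the estimate itself varies slightly with the seed and the number of repetitions.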