I received a free copy of this book from the O'Reilly Media publicity department. I have been an O'Reilly Media author since 1996.
Provost and Fawcett target three types of readers:
What this book is not is a quick introduction to data science in the style of the Essential Knowledge series from MIT Press. The Essential Knowledge series is meant for executives and senior managers who want an informative and well-written, though high-level, introduction to subjects such as intellectual property law or crowdsourcing. Data Science for Business is intended for individuals who have time and a bit of math on their side, but chapters 1-3 (and potentially 4-5) could serve as an executive summary. Because O'Reilly sells unrestricted digital versions of their books, Kindle or laptop readers could purchase that version to travel with and peruse the first few chapters from the comfort of seat 1A.
The first somewhat technical section of the book covers segmentation, which is the process of dividing entities into classes using information from a database table. The authors use mushroom classification as one segmentation example. Mushroom identification makes for a great example because it has a relatively small search space with a variety of attributes such as shape, color, and smell. You might also be familiar with the 20 Questions toy or the site http://www.20q.net/, which uses what used to be called an expert system to guess who or what you're thinking of. Provost and Fawcett present two main business cases as their motivating examples for segmentation: classifying credit card applicants as either likely or unlikely to default on their debts, and identifying mobile phone subscribers who might move to another provider when their contract is up. The examples support the narrative flawlessly.
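To make the idea of segmentation concrete, here is a minimal Python sketch. It is my own illustration, not code from the book: the six mushroom rows, the attribute names, and the "purity" score are all invented for the example. It splits a small table by one attribute at a time and checks how cleanly each split separates edible mushrooms from poisonous ones.

```python
from collections import Counter, defaultdict

# Hypothetical mushroom table; rows and attribute names are invented for illustration.
MUSHROOMS = [
    {"odor": "almond", "cap": "brown", "edible": True},
    {"odor": "almond", "cap": "white", "edible": True},
    {"odor": "foul", "cap": "brown", "edible": False},
    {"odor": "foul", "cap": "white", "edible": False},
    {"odor": "none", "cap": "brown", "edible": True},
    {"odor": "none", "cap": "white", "edible": False},
]


def segment(rows, attribute):
    """Group the edible/poisonous labels by the value of one attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row["edible"])
    return dict(groups)


def purity(segments):
    """Share of rows that match the majority label of their own segment."""
    total = sum(len(labels) for labels in segments.values())
    majority = sum(
        Counter(labels).most_common(1)[0][1] for labels in segments.values()
    )
    return majority / total


for attribute in ("odor", "cap"):
    print(attribute, round(purity(segment(MUSHROOMS, attribute)), 2))
```

On this toy table, splitting by odor puts five of the six mushrooms in a segment that agrees with their label, while splitting by cap color manages only four of six, so odor is the more informative attribute here.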
You can use many different techniques to analyze data, including linear regression, logistic regression, support vector machines, and clustering (by centroid analysis), among numerous others. These techniques have complementary strengths, but they share one weakness: overfitting. The authors devote an entire chapter to the problem and point out that it's trivially easy to develop a data model that predicts every member of a set accurately: just create a rule for each row in your table. The problem is that this approach has no predictive value for data not in the sample. If you loosen your rules a bit, you can create more general rules that model the training set (the portion of the data you use to derive the rules, perhaps 70% of the available data) and are reasonably accurate on the remaining 30% of the data and whatever new examples come in. The trick to effective modeling is finding the line between accuracy and overfitting; ultimately, it's a matter of experience and rigorous testing.
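The rule-per-row trap is easy to demonstrate. The Python sketch below is my own, not the authors': the ten credit applicants, their debt ratios, and the single-cutoff rule are all hypothetical and were hand-picked to make the point. It trains a "memorizer" that stores one rule per training row and a general rule that learns a single debt-ratio cutoff, then scores both on a 70/30 split.

```python
from collections import Counter

# Hypothetical applicants: (applicant_id, debt_ratio, defaulted). Invented data.
ROWS = [
    (1, 0.10, False),
    (2, 0.20, False),
    (3, 0.30, False),
    (4, 0.35, True),
    (5, 0.60, True),
    (6, 0.70, True),
    (7, 0.45, False),
    (8, 0.15, False),
    (9, 0.80, True),
    (10, 0.65, True),
]


def split(rows, train_percent=70):
    """Use the first train_percent of the rows for training, the rest as holdout."""
    cut = len(rows) * train_percent // 100
    return rows[:cut], rows[cut:]


def accuracy(predict, rows):
    """Fraction of rows whose predicted label matches the recorded label."""
    hits = sum(1 for row in rows if predict(row) == row[2])
    return hits / len(rows)


def fit_memorizer(train):
    """One rule per training row; unseen rows get the majority training label."""
    lookup = {row[0]: row[2] for row in train}
    majority = Counter(row[2] for row in train).most_common(1)[0][0]
    return lambda row: lookup.get(row[0], majority)


def fit_threshold(train):
    """One general rule: predict default when debt_ratio exceeds a learned cutoff."""
    def rule(cutoff):
        return lambda row: row[1] > cutoff

    candidates = sorted(row[1] for row in train)
    best = max(candidates, key=lambda c: accuracy(rule(c), train))
    return rule(best)


train, holdout = split(ROWS)
memorizer = fit_memorizer(train)
general = fit_threshold(train)

for name, model in (("memorizer", memorizer), ("general rule", general)):
    print(name, round(accuracy(model, train), 2), round(accuracy(model, holdout), 2))
```

On this toy data the memorizer is perfect on the seven training rows but gets only one of the three holdout rows right, while the general rule misses one training row and gets all three holdout rows. Real data will rarely be this tidy, which is why the holdout test matters.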
The authors conclude with other data science tasks such as text mining, profiling, and identifying co-occurrences (such as buying beer and lottery tickets on the same trip to the convenience store) before wrapping up with their thoughts on how data science fits into corporate decision-making. Text analysis is a vast and difficult topic, but Provost and Fawcett explain the concepts well. The other tasks are less common but important to know about, so the authors were right to include them.
I think Data Science for Business is the perfect overview for readers who might not have the rigorous technical background required to do data science on their own. The first three chapters are an excellent executive summary and, if you have the time and patience to read the rest of the book, you'll be well prepared to manage data science projects within your organization.
Curtis Frye is the editor of Technology and Society Book Reviews. He is the author of more than 30 books, including Improspectives, his look at applying the principles of improv comedy to business and life. His list includes more than 20 books for Microsoft Press and O'Reilly Media; he has also created over a dozen online training courses for lynda.com. In addition to his writing, Curt is a keynote speaker and entertainer. You can find more information about him at www.curtisfrye.com.