Technology and Society Book Reviews

Title: Data Science for Business

Authors: Foster Provost and Tom Fawcett

Publisher: O'Reilly Media

Copyright: 2013

ISBN13: 978-1-449-36132-7

Length: 384

Price: $34.99

Rating: 95%

I received a free copy of this book from the O'Reilly Media publicity department. I have been an O'Reilly Media author since 1996.


Data science is the hot topic for organizations of all sizes these days. With the promise of enhancing your marketing campaigns and improving sales, the potential return on investment is hard to ignore. The purpose of Data Science for Business is to introduce the fundamentals of data analysis. Given the book's 384 pages, you probably guessed that the authors go into significantly more depth than similar books on the topic. You would be correct.

Provost and Fawcett target three types of readers:

  • Business people who must work with or manage data scientists, and venture capitalists considering investments in data science companies.
  • Developers supporting data science applications.
  • Aspiring data scientists.

This book is not a quick introduction to data science in the style of the Essential Knowledge series from MIT Press. The Essential Knowledge series is meant for executives and senior managers who want an informative and well-written, though high-level, introduction to subjects such as intellectual property law or crowdsourcing. Data Science for Business is intended for individuals who have time and a bit of math on their side, but chapters 1-3 (and potentially 4-5) could serve as an executive summary. Because O'Reilly sells unrestricted digital versions of its books, Kindle or laptop readers could purchase that version to travel with and peruse the first few chapters from the comfort of seat 1A.

The first somewhat technical section of the book covers segmentation, which is the process of dividing entities into classes using information from a database table. The authors use mushroom classification as one segmentation example. Mushroom identification makes for a great example because it has a relatively small search space with a variety of attributes such as shape, color, and smell. You might also be familiar with the 20 Questions toy or the site http://www.20q.net/, which uses what used to be called an expert system to guess who or what you're thinking of. Provost and Fawcett present two main business cases as their motivating examples for segmentation: classifying credit card applicants as either likely or unlikely to default on their debts, and identifying mobile phone subscribers who might move to another provider when their contracts are up. The examples support the narrative flawlessly.

You can use many different techniques to analyze data, including linear regression, logistic regression, support vector machines, and clustering (by centroid analysis), among numerous others. These techniques have complementary strengths, but they share one weakness: overfitting. The authors devote an entire chapter to the problem and point out that it's trivially easy to develop a data model that predicts every member of a set accurately: just create a rule for each row in your table. The problem is that this approach has no predictive value for data not in the sample. If you generalize your rules a bit, you can build a model that fits the training set (the portion of the data you use to derive the rules, perhaps 70% of the available data) and is still reasonably accurate on the remaining 30% of the data and on whatever new examples come in. The trick to effective modeling is finding the line between accuracy and overfitting; ultimately, it's a matter of experience and rigorous testing.
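To make the overfitting point concrete, here is a minimal Python sketch of my own; it is not taken from the book. The data is invented: I assume a hypothetical table of credit applicants in which low income usually, but not always, goes with default. The names make_rows, memorizer, and general_rule are illustrative only. The sketch compares a "one rule per row" model with a single income threshold, using a 70/30 split.

```python
import random

def make_rows(n, seed=0):
    """Build hypothetical (row_id, income, defaulted) rows; the data is invented."""
    rng = random.Random(seed)
    rows = []
    for row_id in range(n):
        income = rng.randint(10, 100)      # income in thousands (assumed scale)
        defaulted = income < 40            # the underlying pattern
        if rng.random() < 0.1:             # 10% noise: some rows break the pattern
            defaulted = not defaulted
        rows.append((row_id, income, defaulted))
    return rows

def accuracy(predict, rows):
    """Fraction of rows whose predicted label matches the recorded label."""
    return sum(predict(r) == r[2] for r in rows) / len(rows)

rows = make_rows(1000)
split = int(len(rows) * 0.7)               # 70% training, 30% holdout
train, test = rows[:split], rows[split:]

# Overfit model: one rule per training row, keyed on the row's ID.
memorized = {row_id: defaulted for row_id, _, defaulted in train}

def memorizer(row):
    # Rows never seen in training have no rule, so fall back to a fixed guess.
    return memorized.get(row[0], False)

# General model: a single income threshold chosen to fit the training set.
def fit_threshold(rows):
    return max(range(10, 102),
               key=lambda t: accuracy(lambda r: r[1] < t, rows))

threshold = fit_threshold(train)

def general_rule(row):
    return row[1] < threshold

print("memorizer    train:", accuracy(memorizer, train))
print("memorizer    test: ", accuracy(memorizer, test))
print("general rule train:", accuracy(general_rule, train))
print("general rule test: ", accuracy(general_rule, test))
```

The memorizer is perfect on the rows it has seen and no better than a fixed guess on the holdout rows, while the threshold rule gives up some training accuracy in exchange for predictions that carry over to new data.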

The authors conclude with other data science tasks such as text mining, profiling, and identifying co-occurrences (such as buying beer and lottery tickets on the same trip to the convenience store) before wrapping up with their thoughts on how data science fits into corporate decision-making. Text analysis is a vast and difficult topic, but Provost and Fawcett explain the concepts well. The other tasks are less common but important to know about, so the authors were right to include them.

I think Data Science for Business is the perfect overview for readers who might not have the rigorous technical background required to do data science on their own. The first three chapters are an excellent executive summary and, if you have the time and patience to read the rest of the book, you'll be well-prepared to manage data science projects within your organization.

 

Curtis Frye is the editor of Technology and Society Book Reviews. He is the author of more than 30 books, including Improspectives, his look at applying the principles of improv comedy to business and life. His list includes more than 20 books for Microsoft Press and O'Reilly Media; he has also created over a dozen online training courses for lynda.com. In addition to his writing, Curt is a keynote speaker and entertainer. You can find more information about him at www.curtisfrye.com.

 
