Preface

Chapter 1 The Bazaar of Storytellers
  Data Science: The Sexiest Job in the 21st Century
  Storytelling at Google and Walmart
  Getting Started with Data Science
  Do We Need Another Book on Analytics?
  Repeat, Repeat, Repeat, and Simplify
  Chapters' Structure and Features
  Analytics Software Used
  What Makes Someone a Data Scientist?
  Existential Angst of a Data Scientist
  Data Scientists: Rarer Than Unicorns
  Beyond the Big Data Hype
  Big Data: Beyond Cheerleading
  Big Data Hubris
  Leading by Miles
  Predicting Pregnancies, Missing Abortions
  What's Beyond This Book?
  Summary
  Endnotes

Chapter 2 Data in the 24/7 Connected World
  The Liberated Data: The Open Data
  The Caged Data
  Big Data Is Big News
  It's Not the Size of Big Data; It's What You Do with It
  Free Data as in Free Lunch
  FRED
  Quandl
  U.S. Census Bureau and Other National Statistical Agencies
  Search-Based Internet Data
  Google Trends
  Google Correlate
  Survey Data
  PEW Surveys
  ICPSR
  Summary
  Endnotes

Chapter 3 The Deliverable
  The Final Deliverable
  What Is the Research Question?
  What Answers Are Needed?
  How Have Others Researched the Same Question in the Past?
  What Information Do You Need to Answer the Question?
  What Analytical Techniques/Methods Do You Need?
  The Narrative
  The Report Structure
  Have You Done Your Job as a Writer?
  Building Narratives with Data
  "Big Data, Big Analytics, Big Opportunity"
  Urban Transport and Housing Challenges
  Human Development in South Asia
  The Big Move
  Summary
  Endnotes

Chapter 4 Serving Tables
  2014: The Year of Soccer and Brazil
  Using Percentages Is Better Than Using Raw Numbers
  Data Cleaning
  Weighted Data
  Cross Tabulations
  Going Beyond the Basics in Tables
  Seeing Whether Beauty Pays
  Data Set
  What Determines Teaching Evaluations?
  Does Beauty Affect Teaching Evaluations?
  Putting It All on (in) a Table
  Generating Output with Stata
  Summary Statistics Using Built-In Stata
  Using Descriptive Statistics
  Weighted Statistics
  Correlation Matrix
  Reproducing the Results for the Hamermesh and Parker Paper
  Statistical Analysis Using Custom Tables
  Summary
  Endnotes

Chapter 5 Graphic Details
  Telling Stories with Figures
  Data Types
  Teaching Ratings
  The Congested Lives in Big Cities
  Summary
  Endnotes

Chapter 6 Hypothetically Speaking
  Random Numbers and Probability Distributions
  Casino Royale: Roll the Dice
  Normal Distribution
  The Student Who Taught Everyone Else
  Statistical Distributions in Action
  Z-Transformation
  Probability of Getting a High or Low Course Evaluation
  Probabilities with Standard Normal Table
  Hypothetically Yours
  Consistently Better or Happenstance
  Mean and Not So Mean Differences
  Handling Rejections
  The Mean and Kind Differences
  Comparing a Sample Mean When the Population SD Is Known
  Left Tail Between the Legs
  Comparing Means with Unknown Population SD
  Comparing Two Means with Unequal Variances
  Comparing Two Means with Equal Variances
  Worked-Out Examples of Hypothesis Testing
  Best Buy-Apple Store Comparison
  Assuming Equal Variances
  Exercises for Comparison of Means
  Regression for Hypothesis Testing
  Analysis of Variance
  Significantly Correlated
  Summary
  Endnotes

Chapter 7 Why Tall Parents Don't Have Even Taller Children
  The Department of Obvious Conclusions
  Why Regress?
  Introducing Regression Models
  All Else Being Equal
  Holding Other Factors Constant
  Spuriously Correlated
  A Step-By-Step Approach to Regression
  Learning to Speak Regression
  The Math Behind Regression
  Ordinary Least Squares Method
  Regression in Action
  This Just In: Bigger Homes Sell for More
  Does Beauty Pay? Ask the Students
  Survey Data, Weights, and Independence of Observations
  What Determines Household Spending on Alcohol and Food
  What Influences Household Spending on Food?
  Advanced Topics
  Homoskedasticity
  Multicollinearity
  Summary
  Endnotes

Chapter 8 To Be or Not to Be
  To Smoke or Not to Smoke: That Is the Question
  Binary Outcomes
  Binary Dependent Variables
  Let's Question the Decision to Smoke or Not
  Smoking Data Set
  Exploratory Data Analysis
  What Makes People Smoke: Asking Regression for Answers
  Ordinary Least Squares Regression
  Interpreting Models at the Margins
  The Logit Model
  Interpreting Odds in a Logit Model
  Probit Model
  Interpreting the Probit Model
  Using Zelig for Estimation and Post-Estimation Strategies
  Estimating Logit Models for Grouped Data
  Using SPSS to Explore the Smoking Data Set
  Regression Analysis in SPSS
  Estimating Logit and Probit Models in SPSS
  Summary
  Endnotes

Chapter 9 Categorically Speaking About Categorical Data
  What Is Categorical Data?
  Analyzing Categorical Data
  Econometric Models of Binomial Data
  Estimation of Binary Logit Models
  Odds Ratio
  Log of Odds Ratio
  Interpreting Binary Logit Models
  Statistical Inference of Binary Logit Models
  How I Met Your Mother? Analyzing Survey Data
  A Blind Date with the Pew Online Dating Data Set
  Demographics of Affection
  High-Techies
  Romancing the Internet
  Dating Models
  Multinomial Logit Models
  Interpreting Multinomial Logit Models
  Choosing an Online Dating Service
  Pew Phone Type Model
  Why Some Women Work Full-Time and Others Don't
  Conditional Logit Models
  Random Utility Model
  Independence From Irrelevant Alternatives
  Interpretation of Conditional Logit Models
  Estimating Logit Models in SPSS
  Summary
  Endnotes

Chapter 10 Spatial Data Analytics
  Fundamentals of GIS
  GIS Platforms
  Freeware GIS
  GIS Data Structure
  GIS Applications in Business Research
  Retail Research
  Hospitality and Tourism Research
  Lifestyle Data: Consumer Health Profiling
  Competitor Location Analysis
  Market Segmentation
  Spatial Analysis of Urban Challenges
  The Hard Truths About Public Transit in North America
  Toronto Is a City Divided into the Haves, Will Haves, and Have Nots
  Income Disparities in Urban Canada
  Where Is Toronto's Missing Middle Class? It Has Suburbanized Out of Toronto
  Adding Spatial Analytics to Data Science
  Race and Space in Chicago
  Developing Research Questions
  Race, Space, and Poverty
  Race, Space, and Commuting
  Regression with Spatial Lags
  Summary
  Endnotes

Chapter 11 Doing Serious Time with Time Series
  Introducing Time Series Data and How to Visualize It
  How Is Time Series Data Different?
  Starting with Basic Regression Models
  What Is Wrong with Using OLS Models for Time Series Data?
  Newey-West Standard Errors
  Regressing Prices with Robust Standard Errors
  Time Series Econometrics
  Stationary Time Series
  Autocorrelation Function (ACF)
  Partial Autocorrelation Function (PCF)
  White Noise Tests
  Augmented Dickey Fuller Test
  Econometric Models for Time Series Data
  Correlation Diagnostics
  Invertible Time Series and Lag Operators
  The ARMA Model
  ARIMA Models
  Distributed Lag and VAR Models
  Applying Time Series Tools to Housing Construction
  Macro-Economic and Socio-Demographic Variables Influencing Housing Starts
  Estimating Time Series Models to Forecast New Housing Construction
  OLS Models
  Distributed Lag Model
  Out-of-Sample Forecasting with Vector Autoregressive Models
  ARIMA Models
  Summary
  Endnotes

Chapter 12 Data Mining for Gold
  Can Cheating on Your Spouse Kill You?
  Are Cheating Men Alpha Males?
  UnFair Comments: New Evidence Critiques Fair's Research
  Data Mining: An Introduction
  Seven Steps Down the Data Mine
  Establishing Data Mining Goals
  Selecting Data
  Preprocessing Data
  Transforming Data
  Storing Data
  Mining Data
  Evaluating Mining Results
  Rattle Your Data
  What Does Religiosity Have to Do with Extramarital Affairs?
  The Principal Components of an Extramarital Affair
  Will It Rain Tomorrow? Using PCA For Weather Forecasting
  Do Men Have More Affairs Than Females?
  Two Kinds of People: Those Who Have Affairs, and Those Who Don't
  Models to Mine Data with Rattle
  Summary
  Endnotes

Index

Murtaza Haider, Ph.D., is an Associate Professor at the Ted Rogers School of Management, Ryerson University, and the Director of Regionomics Inc., a consulting firm. He is also a visiting research fellow at the Munk School of Global Affairs at the University of Toronto (2014-15), a senior research affiliate with the Canadian Network for Research on Terrorism, Security, and Society, and an adjunct professor of engineering at McGill University. Haider specializes in applying analytics and statistical methods to find solutions for socioeconomic challenges. His research interests include analytics; data science; housing market dynamics; infrastructure, transportation, and urban planning; and human development in North America and South Asia. An avid blogger and data journalist, he writes weekly for the Dawn newspaper and occasionally for the Huffington Post. Haider holds a master's degree in transport engineering and planning and a Ph.D. in Urban Systems Analysis from the University of Toronto.