Stanford MS&E 226 – Fundamentals of Data Science

Class description – Autumn 2023

This course is about understanding “small data”: datasets that allow interaction, visualization, exploration, and analysis on a local machine. The material provides an introduction to applied data analysis, with an emphasis on building a conceptual framework for thinking about data from both statistical and machine learning perspectives. Topics will be drawn from the following list, depending on time constraints and class interest: approaches to data analysis, both statistical (frequentist, Bayesian) and machine learning; binary classification; regression; bootstrapping; causal inference and experimental design; multiple hypothesis testing. Class lectures will be supplemented by data-driven problem sets and a project. Prerequisites: CME 100 or MATH 51; MS&E 120, MS&E 220, or STATS 116; experience with R at the level of CME/STATS 195 or equivalent.

Homework assignments will have a significant practical and computational component to help students apply the concepts discussed in class.

Outline

  1. Summarization (2 weeks). Given a single data set, how do we summarize it? Basic sample statistics. Using models to succinctly summarize data. The algebra of linear regression and logistic regression. In-sample measures of fit: R² and residuals.

  2. Prediction (2-3 weeks). How do we generalize our understanding of a data set to new samples? Formalizing the prediction problem. Binary classification. Linear regression and logistic regression as approaches to prediction. Model complexity and the bias-variance tradeoff. Training vs. test sets and cross-validation.

  3. Inference (2-3 weeks). How do we generalize our understanding of a data set to draw inferences about the population or system from which the data came? The basics of frequentist estimation and hypothesis testing. Application to linear regression. The bootstrap. The multiple hypothesis testing problem. Comparison to Bayesian estimation and hypothesis testing.

  4. Causality (2 weeks). How do we determine the effect that changing a system will have? The Rubin causal model, potential outcomes, and counterfactuals. The “gold standard”: randomized experiments. The basics of causal inference from observational data. From causal inference to data-driven decisions.
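
As an informal illustration of the first two outline topics (not part of the course materials), the sketch below fits a simple linear regression by least squares and compares the in-sample R² with the R² on a held-out test set. It is written in Python using only the standard library, although the course itself uses R. The simulated data, the 150/50 train/test split, and the helper names fit_line and r_squared are all assumptions made for this example.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """1 - RSS/TSS, with TSS measured around the mean of the ys passed in."""
    y_bar = sum(ys) / len(ys)
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    tss = sum((y - y_bar) ** 2 for y in ys)
    return 1 - rss / tss


# Simulated data (an assumption for illustration): y = 1 + 2x + Gaussian noise.
rng = random.Random(226)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 3) for x in xs]

# Hold out the last 50 points as a test set.
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

intercept, slope = fit_line(train_x, train_y)
r2_train = r_squared(train_x, train_y, intercept, slope)
r2_test = r_squared(test_x, test_y, intercept, slope)

print(f"intercept={intercept:.2f}, slope={slope:.2f}")
print(f"in-sample R^2={r2_train:.3f}, test-set R^2={r2_test:.3f}")
```

Because the test-set R² is computed on data the model never saw, it can fall below the in-sample value, and for a poorly fitting model it can even be negative.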

Course info

All logistical information about the course is available in the syllabus linked from the menu at left.

Enrolled students should use Ed Discussion via Canvas for course announcements.

Professor

Ramesh Johari