MACHINE LEARNING FOR SOCIAL SCIENTISTS

(August 23rd – August 27th, 2022)

PROF. DR. MARCO STEENBERGEN

 

Professor Marco Steenbergen is professor of political methodology at the University of Zurich. His methodological interests span choice models, machine learning, measurement, and multilevel analysis. His substantive interests cover voting behavior and digital democracy, in particular online deliberative processes. Originally from the Netherlands, he previously taught at Carnegie Mellon University, the University of North Carolina at Chapel Hill, and the University of Bern. He has published extensively and is co-author of the award-winning book The Ambivalent Partisan (OUP 2012).

COURSE DESCRIPTION:

Machine learning algorithms are deployed everywhere. The business, economic, behavioral and social sciences are no exception. Still, many scholars in those fields wonder what machine learning is and how they can put it to good use in their own research.

 

In this course, you will get answers to those questions. You will first learn how learning from data works compared to more traditional workflows in the business, economic, behavioral and social sciences. We are used to theory-driven models that are quite heavy on assumptions. Machine learning operates much more inductively. To get valid results from such an approach, it is important to follow a precise logic that you will learn.

 

Next, you will learn several ways in which machine learning can be useful for business, behavioral, economic, and social scientists. We focus on finding structure in data, detecting anomalies (crucial in cyber-security), generating new data, and developing new theoretical ideas.

 

The core of the course consists of learning many specific algorithms that serve the purposes described above. Algorithms and procedures you will learn include:

 

  1. Principal component analysis for identifying clusters of variables.
  2. Cluster analysis (hierarchical and k-means) for finding clusters of cases.
  3. Algorithms for classifying cases, including k-nearest neighbors, greedy algorithms, naïve and Gaussian Bayes methods, linear and quadratic discriminant analysis, logistic regression with regularization, artificial neural networks, and support vector machines.
  4. Algorithms for regression tasks, including regression, partial least squares, principal component regression, and classification and regression trees.
  5. Ensemble learners such as random forests, extreme gradient boosting, and stacked learners.
  6. Procedures for dealing with unbalanced data (e.g., SMOTE).
  7. Procedures for visualizing and interpreting complex algorithms such as DALEX, IML, and LIME.

 

The focus is on hands-on learning in R. Each algorithm is discussed with a minimum of mathematics and then applied to real-world data drawn from the behavioral, business, economic, and social sciences.

The Course

Through lectures and group exercises, the course shows applications in each area. After discussing the general principles of machine learning, the course spends three days on supervised machine learning techniques (relevant for application areas 1 and 2), one day on pattern recognition (relevant for application area 3), and one day on anomaly detection (application area 4). Each day, students will learn the intuition behind the techniques, how they can be implemented in R, how they should be interpreted, and how they can be applied in the social sciences. The course is designed to minimize the level of mathematical complexity, although students can always look up the details in vignettes made available for the course. Both classification and regression tasks are considered. In the former, we seek to predict class membership; in the latter, we predict a numeric score. Interpretation is key, and we spend a great deal of time on various metrics and their implementations.

The course covers the following algorithms/techniques:

  • (1) k-nearest neighbors;
  • (2) probabilistic learning (including naïve Bayes and linear and quadratic discriminant analysis);
  • (3) classification and regression trees, random forests, and model trees;
  • (4) regression with regularization;
  • (5) artificial neural network analysis;
  • (6) boosting;
  • (7) principal component analysis;
  • (8) cluster analysis;
  • (9) SMOTE; and
  • (10) support vector machines.

 

Prerequisites

The course assumes a basic familiarity with probability theory and with linear regression analysis. Prior familiarity with machine learning or related fields (e.g., NLP) is not required. On the other hand, a good knowledge of R is essential for the successful completion of the course. Students should know how to read in data, how to transform variables, how to work with model objects, and how to create graphs.