(August 23rd – August 27th, 2022)



Professor Marco Steenbergen is professor of political methodology at the University of Zurich. His methodological interests span choice models, machine learning, measurement, and multilevel analysis. His substantive interests cover voting behavior and digital democracy, in particular online deliberative processes. Originally from the Netherlands, he previously taught at Carnegie Mellon University, the University of North Carolina at Chapel Hill, and the University of Bern. He has published extensively and is co-author of the award-winning book The Ambivalent Partisan (OUP 2012).


Machine learning is fashionable. But what is it and how can it be put to good use in the social sciences? This introductory course provides an overview of some of the most important machine learning techniques and their social science applications. Those applications can be grouped into several sub-categories:

  1. Preparing data for statistical analysis: Sometimes data are so voluminous that hand-coding them is near-impossible. We can leverage clever computer algorithms to do the coding for us. For example, we could use an artificial neural network to detect if tweets, of which there are millions, come from a social bot or from a legitimate source.
  2. Doing statistical analysis: As social scientists, we are used to building models with numerous parametric assumptions. What if we let algorithms leverage the data to obtain the model for us? That way, we may detect complex contingencies not previously theorized.
  3. Pattern recognition: How do variables hang together and what groups do our cases form in terms of those variables? For example, political parties take positions on numerous issues. Can we group those issues into ideologies? Based on the issues, can we place the parties into clusters?
  4. Anomaly detection: Some phenomena, such as war, are fortunately rare. However, this makes analyzing them challenging. A whole subfield of machine learning is dedicated to the detection of such rare events or anomalies.

The Course

Through lectures and group exercises, the course shows applications in each area. After discussing the general principles of machine learning, the course spends three days on supervised machine learning techniques (relevant for application areas 1 and 2), one day on pattern recognition (relevant for application area 3), and one day on anomaly detection (relevant for application area 4). Each day, students will learn the intuition behind the techniques, how they can be implemented in R, how they should be interpreted, and how they can be applied in the social sciences. The course is designed to minimize the level of mathematical complexity, although students can always look up the details in vignettes made available for the course. Classification as well as regression tasks are considered. In the former, we seek to predict class membership; in the latter, we predict a numeric score. Interpretation is key, and we spend a great deal of time on various metrics and their implementations.
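To make the distinction between classification and regression concrete, here is a minimal sketch using k-nearest neighbors. It is an illustration only, not course material: the course itself works in R, whereas this sketch uses plain Python with the standard library, and the toy data, the function names, and the choice of accuracy and RMSE as metrics are all assumptions made for the example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(train_X, train_y, query, k):
    """Return the targets of the k training cases closest to the query."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: euclidean(pair[0], query))
    return [target for _, target in ranked[:k]]


def knn_classify(train_X, train_y, query, k=3):
    """Classification: predict the majority class among the neighbors."""
    votes = Counter(k_nearest(train_X, train_y, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train_X, train_y, query, k=3):
    """Regression: predict the mean numeric score of the neighbors."""
    neighbors = k_nearest(train_X, train_y, query, k)
    return sum(neighbors) / len(neighbors)


def accuracy(y_true, y_pred):
    """Share of cases whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted scores."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical toy data: two features per case (e.g., two issue positions).
X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
labels = ["left", "left", "left", "right", "right", "right"]  # class target
scores = [2.0, 2.5, 1.5, 8.0, 8.5, 7.5]                       # numeric target

queries = [(1.1, 1.0), (4.0, 4.0)]
predicted_labels = [knn_classify(X, labels, q) for q in queries]
predicted_scores = [knn_regress(X, scores, q) for q in queries]

print(predicted_labels)
print(predicted_scores)
```

The same neighbors drive both predictions; only the aggregation step (majority vote versus mean) and the evaluation metric (accuracy versus RMSE) differ between the two tasks.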

The course covers the following algorithms/techniques:

  • (1) k-nearest neighbors;
  • (2) probabilistic learning (including naïve Bayes as well as linear and quadratic discriminant analysis);
  • (3) classification and regression trees, random forests, and model trees;
  • (4) regression with regularization;
  • (5) artificial neural network analysis;
  • (6) boosting;
  • (7) principal component analysis;
  • (8) cluster analysis;
  • (9) SMOTE; and
  • (10) support vector machines.


The course assumes a basic familiarity with probability theory and with linear regression analysis. Prior familiarity with machine learning or related fields (e.g., NLP) is not required. On the other hand, a good knowledge of R is essential for the successful completion of the course. Students should know how to read in data, how to transform variables, how to work with model objects, and how to create graphs.