Data Science using R

1. Course Overview

The goal of this course is to provide hands-on experience with key data science methods
and procedures using one particular tool – the R language and environment.

R is a fast-growing technology that has gained widespread acceptance in both academia and
industry. Together with Python, it is one of the key tools for any aspiring data scientist. Many
factors contribute to the widespread acceptance of R, but they clearly include the price
(free), being open source (trustworthy software that can be easily inspected/checked for flaws),
the breadth of available methods (a rapidly growing set of methods
for different application areas), and the support available from the community (an extremely large
community of knowledgeable experts providing top-notch support for free).

In this short course we will illustrate the use of R for several key data science processes. This
illustration will be driven by concrete case studies that we will “solve” using R.

After taking this course you should be able to use R for:

  1. Understanding your data. Exploratory analysis of data frequently provides key insights into data
    properties and problems that can have a big impact on subsequent mining steps and may help in
    mapping business problems into data mining tasks. We will provide practical illustrations of
    methods for summarizing, visualizing and preparing your data for model construction.
  2. Mastering frequently used modeling techniques. Data can be modeled in many different
    ways. The outcomes of these models can provide useful information for decision makers. We will
    address several concrete modeling tasks with frequently used techniques. We will learn how to
    obtain and apply these models in R.
  3. Correctly assessing the performance of models. Performance assessment is a key step for taking
    advantage of the results of data mining models. Being able to carry out this task in a
    reliable way is essential to make sure future deployment of data mining pays off.
  4. Easily reporting and delivering results of data mining. The outcome of data mining often needs to
    be reported, communicated and made available to different types of people (e.g. decision
    makers, key personnel of other departments lacking knowledge of data mining, etc.). We will learn
    how specific tools available in R can boost our productivity in this type of task.

2. Focus and Interaction

The course will illustrate the use of R for several key data science tasks. The main focus will be
on how to carry out these tasks in R and not on the principles and theory behind
these approaches. This means that this is a practical course that will illustrate several of the
techniques you may have learned in previous courses. We will cover the full data
science cycle, from importing data into R to deploying the results. The course is driven by
concrete case studies and full solutions will be provided to allow you to replicate and re-use the
solutions, in the spirit of open source software like R.

You are expected to attend every class session in full, to arrive before the starting time, and to
follow basic classroom etiquette, including having all electronic devices turned off and put away
for the duration of the class and refraining from chatting or doing other work or reading during
class.

If you have questions about class material that you do not want to ask in class, or that would take
us well off topic, please see me after class or send me an email with your questions.

3. Readings

Book: An (optional) textbook that can help you in the class is

Data Mining with R: learning with case studies (2nd edition), by Luis Torgo (2017), CRC Press.
This book illustrates the use of R for data science through the use of concrete case studies, some
of which we will cover in this course.

On the course web page (see URL above) you may find other references.

4. Software

This course is about using R for data science. In this context, you are required to have an
up-to-date installation of R on your laptop. R can be freely obtained from http://www.r-project.org .
RStudio is another free software tool that provides an integrated development environment for
R. I strongly recommend that you use RStudio as your tool for interacting with R. RStudio can be
freely downloaded from http://www.rstudio.com .

R comes with an extensive set of tools pre-installed. Still, it can be easily extended through the
(free) installation of extra packages. We will use several of these. You can easily install any of
these extra packages in RStudio. On the course web page you may find information on all the
packages that will be used throughout the course. You are advised to install them in
your R installation before the course starts.
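
As a minimal sketch, the following could be run in the R console to check which packages still need to be installed. The package names are illustrative assumptions (only performanceEstimation is named in this syllabus); the list on the course web page is the one to follow. The helper missing_packages is hypothetical and not part of R.

```r
# Hypothetical helper: returns the names in `pkgs` that are not yet installed.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Illustrative list only; replace it with the packages listed on the course web page.
course_pkgs <- c("performanceEstimation", "rmarkdown")

to_install <- missing_packages(course_pkgs)
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # Uncomment to download and install them from CRAN (needs internet access):
  # install.packages(to_install)
} else {
  message("All listed packages are already installed.")
}
```

In RStudio the same installation can also be done through the Packages pane.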

5. Class Topics and Schedule

Day 1: Introduction to Data Mining and R

  • A brief introduction to Data Science
    • Main tasks and objectives
    • Illustrative case studies
  • A brief introduction to R and RStudio
  • Basic concepts of the R language
    • Hands-on practice

Day 2: Data Munging

  • Presentation of the first case study
  • Importing data into R
    • Hands-on practice
  • Data summarization
  • Data pre-processing
    • Examples and exercises in R

Day 3: Data Visualization and Reporting

  • Data visualization
    • Examples and exercises in R
  • Reporting
    • Dynamic reports and presentations in R using R Markdown
    • Examples and exercises in R

Day 4: Predictive Analytics

  • Introduction to predictive modelling
    • Classification and regression tasks
  • Evaluation metrics
  • Linear discriminants and linear regression
    • Hands-on exercises in R
  • Classification and regression trees
    • Hands-on exercises in R

Day 5: Predictive Analytics (cont.)

  • Support vector machines
    • Hands-on exercises in R
  • Ensembles and random forests
    • Hands-on exercises in R
  • Model evaluation strategies
    • Reliability of estimates

Day 6: Model Evaluation and Selection

  • Experimental methods for performance estimation
    • Cross validation
    • Holdout
    • Bootstrap
  • The performanceEstimation package
    • Illustrations in R
  • Hands-on project
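
To give a flavour of the Day 6 material, cross validation can be sketched in base R as shown below. This is only an illustration under assumptions that do not come from the course material: the built-in mtcars data set, a linear regression of mpg on wt and hp, 5 folds, and mean squared error as the metric. It does not use or reproduce the performanceEstimation package.

```r
# 5-fold cross validation of a linear regression on the built-in mtcars data.
set.seed(1234)                              # make the random fold assignment reproducible
k <- 5
n <- nrow(mtcars)
folds <- sample(rep(1:k, length.out = n))   # random fold label for every row

mse <- numeric(k)
for (i in 1:k) {
  train <- mtcars[folds != i, ]             # rows outside fold i are used to fit the model
  test  <- mtcars[folds == i, ]             # rows in fold i are held out for testing
  model <- lm(mpg ~ wt + hp, data = train)
  preds <- predict(model, newdata = test)
  mse[i] <- mean((test$mpg - preds)^2)      # mean squared error on the held-out fold
}

cv_mse <- mean(mse)                         # average of the k fold errors
cat("Per-fold MSE:", round(mse, 2), "\n")
cat("Cross-validation estimate of MSE:", round(cv_mse, 2), "\n")
```

Every row is used for testing exactly once, so the averaged error is typically less dependent on one particular split than a single holdout estimate.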