Information Complexity And Multivariate Modeling In High-Dimensions With Applications


Prof. Dr. Hamparsum Bozdogan
McKenzie Professor
The University of Tennessee
Knoxville, TN 37996 U.S.A.

JULY 9-14, 2018
Course hours are 9 am to 1 pm on all workshop days.

Lecture Theme & Topics

In general statistical modeling and model evaluation problems, the concept of model complexity plays an important role. At the philosophical level, complexity involves notions such as connectivity patterns and the interactions of model components. Without a measure of overall model complexity, it is difficult to predict model behavior and assess model quality. Doing so requires detailed statistical analysis and computation to choose the best fitting model among a portfolio of competing models for a given finite sample. Since the introduction of the celebrated Akaike’s Information Criterion in 1973, statistical modeling and model complexity have become a cutting-edge research domain in cross-disciplinary research and scholarly activity in many fields, including the social and behavioral sciences.
Today, in the information age, with increasingly sophisticated technology for gathering and storing data, many organizations collect massive amounts of data at accelerated rates and in ever-increasing detail. Such data structures are multivariate in nature, and they are challenging to model, analyze, and interpret. They are very high-dimensional, and the number of observations can be large or small. The data are often skinny and wide. They are categorical, discrete, or quantitative, and are often of mixed types. This high dimensionality and variety of data types and structures have outstripped the capability of traditional statistical methods and of data analysis, graphical, and data visualization tools. For this reason, multivariate analysis and modeling is one of the important areas of Statistics, Computer Science, and Machine Learning. Multivariate techniques and models have been widely applied to problems in business, healthcare and medicine, industry, science, engineering, government, and the social and behavioral sciences. These techniques have a profound impact on our society.
The growing consensus that multivariate methods can bring real value has led to an explosion in demand for novel data analysis procedures. Students who are trained in this area and understand multivariate techniques are in high demand in the job market, both at companies and businesses that apply these methods to real-life problems and in research and development on high-dimensional data.

To address the challenges in multivariate modeling in high dimensions, the Istanbul Quantitative Lectures, held July 9-14, 2018, will present modular lectures on selected topics, with special emphasis on innovative interdisciplinary research and on the interaction between the theory and practice of information complexity and multivariate modeling in high dimensions. The learning process will be reinforced using real benchmark data sets related to cross-disciplinary problems. The workings of the computational tools and algorithms will be illustrated in Matlab, an open-architecture, high-level computational language. Hands-on computations will be shown on real benchmark data sets.
The Istanbul Quantitative Lectures will be held in English and delivered by Prof. Dr. Hamparsum Bozdogan, a doyen and renowned expert in statistical modeling and model selection, of the Department of Business Analytics and Statistics at the University of Tennessee in Knoxville, Tennessee, U.S.A.

Course Modular Topics:

Day 1: 

Multivariate Data Types and Structures:

This module will cover different data types and structures and give an overview of multivariate data. It will present:

>> The general structure of multivariate data sets
>> Multi-sample Data
>> Multivariate Linear or Nonlinear Data Structure
>> Multivariate Mixed data types, i.e., both qualitative and quantitative
>> Multivariate Data Repositories for Business Analytics

Visual Data Mining of Multivariate Data:

This module will cover multivariate graphical techniques for visual data mining:

>> Pairwise and Matrix plots
>> Parallel coordinate plots
>> Andrews’ curves
>> Chernoff faces
>> Correlation mapping
>> 3D Convex hull plots
>> Computational tools with examples

Multivariate Gaussian Model:

This module will cover multivariate Gaussians, which are needed to understand optimal classifiers, regression, neural nets, mixture model cluster analysis, and more.
>> Univariate and Multivariate Gaussians
>> Bayes Rule and Gaussians
>> Maximum Likelihood and MAP using Gaussians
>> Other important multivariate distributions

Day 2: 

Information Criteria for Model Selection: An Overview

This module will present an overview of the most popular information criteria for model selection, which are used to choose the best fitting model for the data among a portfolio of competing models.

>> Introduction and motivation
>> What is model selection?
>> Akaike’s Information Criterion (AIC)
>> Takeuchi’s TIC, or AICT
>> Schwarz’s SBC / Rissanen’s Minimum Description Length (MDL)
>> Bozdogan’s Consistent AIC, CAIC
>> Consistent AIC with Fisher Information: CAICF
>> Information Complexity Criterion: ICOMP
>> ICOMP for misspecified models
>> Examples of applications
>> Weights of model selection criteria
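
For orientation, the textbook forms of three of the criteria listed above are sketched below. The notation is an assumption made here, not taken from the course materials: L(θ̂) is the maximized likelihood, k the number of estimated parameters, and n the sample size. ICOMP is shown in one commonly cited form based on the estimated inverse Fisher information matrix, with C_1 as the complexity measure and s the rank of its argument. The lectures may use different variants or scalings. In each case the model with the smallest criterion value is preferred.

```latex
\begin{align*}
\mathrm{AIC}   &= -2\log L(\hat{\theta}) + 2k \\
\mathrm{SBC}   &= -2\log L(\hat{\theta}) + k\log n \\
\mathrm{CAIC}  &= -2\log L(\hat{\theta}) + k(\log n + 1) \\
\mathrm{ICOMP} &= -2\log L(\hat{\theta}) + 2\,C_1\!\bigl(\hat{F}^{-1}\bigr), \qquad
C_1(\Sigma) = \frac{s}{2}\log\!\left[\frac{\operatorname{tr}(\Sigma)}{s}\right] - \frac{1}{2}\log\lvert\Sigma\rvert
\end{align*}
```

All four share the lack-of-fit term -2 log L(θ̂) and differ only in the penalty: AIC penalizes the parameter count alone, SBC and CAIC let the penalty grow with the sample size, and ICOMP penalizes the interdependence among the parameter estimates through the covariance structure.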

New Advances in Predictive Models:

This module will cover multivariate predictive models for several responses with many predictors:
>> Review of usual regression model
>> Advances in multiple and multivariate regression models
>> Subset selection of best predictor variables using the genetic algorithm (GA)
>> Subset selection of best predictors using LASSO and Adaptive Elastic Net
>> Subset selection of best predictors in multivariate regression model with the GA
>> Subset selection of best predictors in logistic regression using the GA
>> Illustration of computational tools

Hybridized Smoothed Covariances and Computation:

This module will show how to estimate the covariance matrix of high-dimensional data when the sample size is small relative to the dimension of the data, using several hybridized smoothed covariance estimators.
>> Computation of covariance and correlation matrices in high-dimensions
>> Color mapping of the correlations

Day 3:  

Dimension Reduction Techniques and Continuous Latent Variable Models:

This module will cover the concepts and techniques for dimension reduction of high dimensional data.

>> Classical Principal Component Analysis
>> Probabilistic Principal Component Analysis (PPCA) for Dimension Reduction
>> Sparse PCA (SPCA)
>> Factor Analysis (FA)
>> Learning Factor Patterns in Exploratory FA using GA
>> Structural Equation Models (SEM)
>> Bayesian Factor Analysis (BAYFA)
>> Independent Component Analysis (ICA)
>> Partial Least Squares (PLS)

Day 4:

Computational Lab for Dimension Reduction on Real and Simulated Data Sets:

This module will show the computational tools for PCA, SPCA, PPCA, FA and the computation of model selection criteria in these models for real benchmark and simulated data sets.

>> Computation of Principal Components
>> Computation of Sparse PCA
>> Computation of PPCA
>> Computation of Factor Analysis
>> Model Selection in PPCA and FA

Day 5:

Supervised and Unsupervised Classification and Clustering:

This module will cover the concepts and techniques for supervised and unsupervised classification and cluster analysis.

>> Linear and Quadratic Discriminant Analysis (LDA and QDA)
>> Sparse Discriminant Analysis (SDA)
>> Logistic Discriminant Analysis (LOGDA)
>> Support Vector Machine (SVM) classification
>> Model-based clustering:
(1) Mixture Model Cluster Analysis and Choosing the Number of Clusters
(2) Flexible Kernel Density Estimation to Mixture Model Cluster Analysis
(3) Robust Mixture Model Cluster Analysis Using Elliptically Contoured (EC) Distributions

Day 6:

Computational Lab for Supervised and Unsupervised Classification:

This module will show the computational tools for supervised and unsupervised classification and clustering on real benchmark and simulated data sets using the model selection criteria.

>> Computational tools for Linear and Quadratic Discriminant Analysis (LDA and QDA)
>> Computation of Sparse Discriminant Analysis (SDA)
>> Computational tools for model-based clustering in mixture model cluster analysis and choosing the number of clusters