Schedule of Module 2: Machine Learning for Classification and Regression
Day 1, Tuesday, 1 February
16:00 – 18:00  Decision trees in classification and regression (Part I), by TRUFIN Julien
Day 2, Tuesday, 8 February
16:00 – 18:00  Programming: Basics of regression and classification trees in Python
Day 3, Tuesday, 15 February
16:00 – 18:00  Decision trees in classification and regression (Part II), by TRUFIN Julien
Day 4, Tuesday, 22 February
16:00 – 18:00  Programming: From simple regression and classification trees to ensembles of trees (bagging and random forests)
Day 5, Tuesday, 8 March
16:00 – 18:00  Theory: Boosted and bagged ensembles, by TRUFIN Julien
Day 6, Tuesday, 15 March
16:00 – 18:00  Programming: Stochastic gradient boosting machines and XGBoost
Day 7, Tuesday, 22 March
16:00 – 18:00  Clustering methods, by HAINAUT Donatien
Day 8, Tuesday, 29 March
16:00 – 18:00  Programming: Clustering
18:00 – 23:59  Assignment after Module 2

This module gradually introduces classification and regression trees, building up to competition-winning ensemble methods for regression problems.
Participants will explore how ensembles of decision trees achieve superior performance and learn how to calibrate them in practice.
Part 1: Classification and regression trees:
- Binary regression trees (the standard split criterion is recalled after this list)
- Right-sized trees
- Measures of performance
- Relative importance of features
- Interactions
- Limitations of trees
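For reference, a standard way to write the split search in a binary regression tree (squared-error loss; the lectures may use different notation): at each node, find the feature j and cut point s solving

    \min_{j,s} \Big[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \Big],

where R_1(j,s) = \{x : x_j \le s\}, R_2(j,s) = \{x : x_j > s\}, and each inner minimum is attained by the mean response in the corresponding half.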

From 16:00 to 18:00
Programming: Basics of regression and classification trees in Python
- Fit a first example of a regression tree using Python’s [scikit-learn: sklearn.tree] on a simulated (toy) data set for regression. The goal is to understand the control parameters, the loss function, and the output. We’ll work from a stump to a very deep tree (a minimal sketch follows this list).
- Tuning of the parameters: k-fold cross-validation, minimal CV error and the one-standard-error rule in Python’s [scikit-learn: sklearn.model_selection].
- Acquire an (empirical) understanding of the (high) variability of a constructed tree by repeating the above steps on alternative simulated input data.
- Repeat the above steps on a classification problem with a simulated (toy) data set. Discuss loss functions and appropriate performance measures.
- Build a regression tree for the MTPL claim frequency and severity data: focus on the loss functions available in Python’s [scikit-learn: e.g. Poisson available, gamma not], discuss limitations and sketch possible alternatives.
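A minimal sketch of the first exercises, assuming a simple simulated data set (the toy data generator and all parameter values below are illustrative, not the course material):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.model_selection import GridSearchCV

    # Simulated (toy) regression data
    rng = np.random.default_rng(42)
    X = rng.uniform(0, 10, size=(500, 2))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=500)

    # From a stump (max_depth=1) to a very deep tree, tuned by 10-fold CV
    grid = GridSearchCV(
        DecisionTreeRegressor(random_state=0),
        param_grid={"max_depth": list(range(1, 16))},
        cv=10,
        scoring="neg_mean_squared_error",
    )
    grid.fit(X, y)
    print(grid.best_params_)

    # For the MTPL frequency data, scikit-learn offers a Poisson splitting
    # criterion (there is no gamma criterion for severity)
    freq_tree = DecisionTreeRegressor(criterion="poisson", max_depth=4)

The per-fold scores stored in grid.cv_results_ are what is needed to apply the one-standard-error rule by hand.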

Part 2: Bagging trees and random forests:
- Bagging trees: bias, variance and expected generalization error (the variance-reduction formula is recalled after this list)
- Random forests
- Interpretability: relative importances and partial dependence plots
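For reference, the variance-reduction argument behind bagging (a standard result; notation may differ from the lectures): for B identically distributed trees T_b(x), each with variance \sigma^2 and pairwise correlation \rho,

    \mathrm{Var}\Big( \frac{1}{B} \sum_{b=1}^{B} T_b(x) \Big) = \rho \sigma^2 + \frac{1 - \rho}{B} \sigma^2,

so averaging drives the second term to zero while the first remains. Random forests reduce \rho by restricting each split to a random subset of the features.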

From 16:00 to 18:00
Programming: From simple regression and classification trees to ensembles of trees (bagging and random forests)
- First discussion of some basic interpretation tools: variable importance plots and partial dependence plots using Python’s [scikit-learn: sklearn.inspection].
- Demonstrate bagging on bootstrapped toy data sets (for regression and classification): fit deep trees using Python’s [scikit-learn: sklearn.tree] and average their predictions. Then continue with bagging done properly using Python’s [scikit-learn: sklearn.ensemble, BaggingRegressor/BaggingClassifier] on the toy data sets introduced in the previous session. Tuning of parameters, construction of predictions (a minimal sketch follows this list).
- Same for random forests using Python’s [scikit-learn: sklearn.ensemble, RandomForestRegressor/RandomForestClassifier]. Tuning of parameters, construction of predictions.
- Build a random forest on the Ames Housing data set and extract first insights.
- Discuss the loss functions available for bagging and random forests on claim frequency and claim severity; comparison of tools [e.g. H2O].
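A minimal sketch of the bagging and random forest steps, reusing the toy generator from the previous sketch (names and parameter values are illustrative):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
    from sklearn.inspection import permutation_importance

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(500, 2))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=500)

    # Bagging "by hand": deep trees on bootstrap samples, predictions averaged
    preds = []
    for b in range(100):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample
        preds.append(DecisionTreeRegressor().fit(X[idx], y[idx]).predict(X))
    y_bag = np.mean(preds, axis=0)

    # Bagging done properly, then a random forest
    bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100).fit(X, y)
    rf = RandomForestRegressor(n_estimators=100, max_features="sqrt").fit(X, y)

    # One of the interpretation tools from sklearn.inspection
    imp = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
    print(imp.importances_mean)

PartialDependenceDisplay.from_estimator (also in sklearn.inspection) produces the partial dependence plots discussed in the first bullet.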

Part 3: Boosting trees:
- Boosting trees
- Gradient boosting trees (the basic update rule is recalled after this list)
- Regularization and randomness
- Interpretability: relative importances, partial dependence plots and Friedman’s H-statistic
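For reference, the basic gradient boosting update (standard notation; the shrinkage parameter \nu is one of the regularization controls listed above): at stage m, fit a tree h_m to the pseudo-residuals

    r_{im} = - \left. \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right|_{F = F_{m-1}},

and update F_m(x) = F_{m-1}(x) + \nu h_m(x). Stochastic variants fit each h_m on a random subsample of the training data.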

From 16:00 to 18:00
Programming: Stochastic gradient boosting machines and XGBoost
- Basics of fitting (stochastic) gradient boosting machines with Python’s [scikit-learn: sklearn.ensemble, GradientBoostingRegressor/GradientBoostingClassifier] and XGBoost: discussion of the control parameters, outline of the tuning process (a minimal sketch follows this list).
- To illustrate first principles, we work again with the toy data sets for regression and classification introduced in previous sessions.
- Claim frequency and severity modelling with GBMs in Python’s [scikit-learn: sklearn.ensemble, GradientBoostingRegressor]: tuning, variable importance, PDPs, predictions and construction of a technical tariff.
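A minimal sketch of the boosting session (parameter values are illustrative; xgboost is assumed to be installed as a separate package):

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from xgboost import XGBRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(500, 2))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=500)

    # Stochastic GBM: subsample < 1 gives the "stochastic" part
    gbm = GradientBoostingRegressor(
        n_estimators=500, learning_rate=0.05, max_depth=3, subsample=0.75
    ).fit(X, y)

    # XGBoost with comparable controls; for claim counts,
    # objective="count:poisson" is available
    booster = XGBRegressor(
        n_estimators=500, learning_rate=0.05, max_depth=3, subsample=0.75
    ).fit(X, y)

Note that GradientBoostingRegressor has no Poisson loss; for frequency modelling one can switch to scikit-learn’s HistGradientBoostingRegressor(loss="poisson") or to XGBoost.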

The aim of this session is to cover popular unsupervised learning techniques for analysing a data set.
- Principal components in a nutshell
- Multiple correspondence analysis
- K-means, K-means++, batch K-means (the K-means objective is recalled after this list)
- Fuzzy clustering
- Spectral clustering
- The DBSCAN clustering algorithm
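For reference, the K-means objective (standard notation): partition the observations into clusters C_1, ..., C_K so as to minimize the within-cluster sum of squares

    \min_{C_1, \dots, C_K} \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,

where \mu_k is the centroid of cluster C_k. K-means++ changes only the initialization, spreading the starting centroids out to make the subsequent local search more reliable.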

From 16:00 to 18:00
Programming: Clustering
- Focus on a case study provided by the data science working group of the Swiss Actuarial Association, with several characteristics of cars available (e.g. brand, type, weight, …); start with data exploration and preprocessing steps.
- Demonstrate dimension reduction methods (e.g. PCA) using Python’s [scikit-learn: sklearn.decomposition] and clustering methods (e.g. K-means clustering, hierarchical clustering) using Python’s [scikit-learn: sklearn.cluster] (a minimal sketch follows this list).
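A minimal sketch of the clustering session on a generic numeric feature matrix (the actual car data set is not reproduced here; all names are illustrative):

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans, AgglomerativeClustering

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 6))              # stand-in for the car features

    # Standardize, then reduce dimension before clustering
    X_std = StandardScaler().fit_transform(X)
    X_pca = PCA(n_components=2).fit_transform(X_std)

    # K-means and hierarchical clustering on the reduced data
    labels_km = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_pca)
    labels_hc = AgglomerativeClustering(n_clusters=4).fit_predict(X_pca)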

From 18:00 to 23:59
Assignment after Module 2
Important comment: The assignment is not a strict examination. Its purpose is to apply the concepts learned during the previous sessions.
Continue Module 1’s assignment, but now extend the analysis with tree-based models on the claim frequency and severity data (or on the classification problem). Compare the models constructed in the first assignment with the tree-based alternatives.