Schedule of Module 2: Machine Learning for Classification and Regressions
Day 1, Tuesday, 19 April | |
16:00 - 18:00 | Decision Trees in classification and regression (Part I) By TRUFIN Julien |
Day 2, Thursday, 28 April | |
16:00 - 18:00 | Programming : Basics of regression and classification trees in Python |
Day 3, Thursday, 5 May | |
16:00 - 18:00 | Decision Trees in classification and regression (Part II) By TRUFIN Julien |
Day 4, Thursday, 12 May | |
16:00 - 18:00 | Programming : From simple regression and classification trees to ensembles of trees (bagging and random forests) |
Day 5, Wednesday, 18 May | |
16:00 - 18:00 | Theory: Boosted and bagged ensembles By TRUFIN Julien |
Day 6, Thursday, 2 June | |
16:00 - 18:00 | Programming: Stochastic gradient boosting machines and XGBoost |
Day 7, Thursday, 9 June | |
16:00 - 18:00 | Clustering methods By HAINAUT Donatien |
Day 8, Wednesday, 15 June | |
16:00 - 18:00 | Programming : Clustering |
18:00 - 23:59 | Assignment after Module 2 |
-
This module moves step by step from classification and regression trees to competition-winning ensemble methods for regression problems.
Participants will explore how ensembles of decision trees achieve superior performance and learn how to calibrate them in practice.
Part 1 : Classification and regression trees :
- Binary regression trees
- Right-sized trees
- Measures of performance
- Relative importance of features
- Interactions
- Limitations of trees
-
From 16:00 to 18:00
Programming : Basics of regression and classification trees in Python
- Fit a first example of a regression tree using Python’s [scikit-learn: sklearn.tree] on a simulated (toy) data set for regression. The goal here is to gain understanding of the control parameters, the loss function, and the output. We’ll work from a stump to a very deep tree.
- Tuning of the parameters: k-fold cross-validation, minimal CV error and the one-standard-error rule in Python’s [scikit-learn: sklearn.model_selection].
- Acquire an (empirical) understanding of the (high) variability of a constructed tree by repeating the above steps on alternative simulated input data.
- Repeat the above steps on a classification problem with a simulated (toy) data set. Discuss loss functions and appropriate performance measures.
- Build a regression tree for the motor third-party liability (MTPL) claim frequency and severity data: focus on the loss functions available in Python’s [scikit-learn: e.g. Poisson available, gamma not], discuss limitations and sketch possible alternatives.
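The steps of this session can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the course material: the simulated data, the grid of tree sizes and all variable names are hypothetical, and tree size is controlled here through `max_leaf_nodes` (cost-complexity pruning via `ccp_alpha` would be an alternative). scikit-learn has no built-in one-standard-error rule, so it is computed by hand from the cross-validation scores.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Hypothetical toy data: a noisy sine curve (not the course's data set).
X = rng.uniform(0.0, 6.0, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

# From a stump (a single split) to a fully grown tree.
stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, y)
deep = DecisionTreeRegressor(random_state=0).fit(X, y)
n_leaves_stump = stump.get_n_leaves()
n_leaves_deep = deep.get_n_leaves()

# Score candidate tree sizes by 5-fold cross-validated mean squared error.
leaf_grid = [2, 4, 8, 16, 32, 64]
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_mean, cv_se = [], []
for n_leaves in leaf_grid:
    tree = DecisionTreeRegressor(max_leaf_nodes=n_leaves, random_state=0)
    mse = -cross_val_score(tree, X, y, cv=cv, scoring="neg_mean_squared_error")
    cv_mean.append(mse.mean())
    cv_se.append(mse.std(ddof=1) / np.sqrt(len(mse)))
cv_mean, cv_se = np.array(cv_mean), np.array(cv_se)

# Minimal CV error versus the one-standard-error rule: take the smallest
# tree whose CV error lies within one standard error of the minimum.
i_min = int(np.argmin(cv_mean))
threshold = cv_mean[i_min] + cv_se[i_min]
i_1se = int(np.argmax(cv_mean <= threshold))  # first index under the threshold
best_min, best_1se = leaf_grid[i_min], leaf_grid[i_1se]

# Poisson deviance as splitting criterion (scikit-learn 0.24 or later),
# fitted to hypothetical simulated claim counts.
counts = rng.poisson(lam=np.exp(0.2 * X[:, 0] - 1.0))
pois_tree = DecisionTreeRegressor(
    criterion="poisson", max_leaf_nodes=4, random_state=0
).fit(X, counts)
```

Re-running the sketch with a different seed in `default_rng` gives a feel for how strongly the selected tree size depends on the simulated sample.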
-
Part 2 : Bagging trees and random forests :
- Bagging trees : bias, variance and expected generalization error
- Random forests
- Interpretability : relative importances and partial dependence plots
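As a pointer for the bias and variance item above, one standard way to see why averaging helps is the variance of an average of correlated predictors. The notation is chosen here and is not taken from the lecture: the $B$ tree predictions $\hat f_b(x)$ are assumed identically distributed with variance $\sigma^2(x)$ and pairwise correlation $\rho(x)$.

```latex
\[
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B}\hat f_b(x)\right)
  = \rho(x)\,\sigma^2(x) + \frac{1-\rho(x)}{B}\,\sigma^2(x)
\]
```

The second term vanishes as $B$ grows, while the first does not; random forests try to lower $\rho(x)$ by randomising the candidate features at each split.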
-
From 16:00 to 18:00
Programming : From simple regression and classification trees to ensembles of trees (bagging and random forests)
- First discussion of some basic interpretation tools: variable importance plot and partial dependence plot using Python’s [scikit-learn: sklearn.inspection].
- Demonstrate bagging on bootstrapped toy data sets (for regression and classification): fit deep trees using Python’s [scikit-learn: sklearn.tree] and average predictions. Then continue with bagging done properly using Python’s [scikit-learn: sklearn.ensemble Bagging] (on the toy data sets introduced in the previous session). Tuning of parameters, construction of predictions.
- Same for random forests using Python’s [scikit-learn: sklearn.ensemble RandomForest]. Tuning of parameters, construction of predictions.
- Build a random forest on the Ames Housing data set and extract first insights.
- Discuss the loss functions available for bagging and random forests on claim frequency and claim severity; comparison of tools [e.g. H2O].
-
Part 3 : Boosting trees :
- Boosting trees
- Gradient boosting trees
- Regularization and randomness
- Interpretability : relative importances, partial dependence plots and Friedman's H-statistics
-
From 16:00 to 18:00
Programming: Stochastic gradient boosting machines and XGBoost
- Basics of fitting (stochastic) gradient boosting machines with Python’s [scikit-learn: sklearn.ensemble GradientBoosting] and XGBoost: discussion of control parameters, outline of tuning process.
- To illustrate first principles, we again work on the toy data sets for regression and classification introduced in previous sessions.
- Claim frequency and severity modelling with GBMs in Python’s [scikit-learn: sklearn.ensemble GradientBoosting]: tuning, variable importance, PDPs, predictions and construction of a technical tariff.
-
The aim of this session is to cover popular unsupervised learning techniques for analysing a dataset.
- Principal components in a nutshell
- Multiple correspondence analysis
- K-means, k-means++, batch k-means
- Fuzzy clustering
- Spectral clustering
- DBSCAN clustering algorithm
-
From 16:00 to 18:00
Programming : Clustering
- Focus on the case study provided by the data science working group of the Swiss Actuarial Association, with several characteristics of cars available (e.g. brand, type, weight, …); start with data exploration and preprocessing steps.
- Demonstrate dimension reduction methods (e.g. PCA) using Python’s [scikit-learn: sklearn.decomposition] and clustering methods (e.g. K-means clustering, hierarchical clustering) using Python’s [scikit-learn: sklearn.cluster].
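The dimension reduction and clustering steps above could be sketched as follows. The data are a hypothetical stand-in for numeric car characteristics (weight, power, length), simulated as two well-separated groups; they are not the case study data, and all variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Hypothetical car characteristics: weight (kg), power (kW), length (m).
small = rng.normal(loc=[1000.0, 60.0, 3.8], scale=[80.0, 8.0, 0.15], size=(100, 3))
large = rng.normal(loc=[1900.0, 180.0, 4.8], scale=[120.0, 20.0, 0.2], size=(100, 3))
X = np.vstack([small, large])
truth = np.repeat([0, 1], 100)  # known only because the data are simulated

# Preprocessing: standardise so that no feature dominates through its units.
X_std = StandardScaler().fit_transform(X)

# Dimension reduction: keep the first two principal components.
pca = PCA(n_components=2).fit(X_std)
scores = pca.transform(X_std)
explained = pca.explained_variance_ratio_

# K-means with k-means++ initialisation on the component scores.
kmeans = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(scores)

# Hierarchical (agglomerative) clustering with Ward linkage.
hier = AgglomerativeClustering(n_clusters=2, linkage="ward").fit(scores)

def agreement(labels, reference):
    """Share of matching labels, allowing for swapped cluster numbering."""
    match = np.mean(labels == reference)
    return max(match, 1.0 - match)

kmeans_agreement = agreement(kmeans.labels_, truth)
hier_agreement = agreement(hier.labels_, truth)
```

With real data there is no `truth` vector, so the choice of the number of clusters would instead rely on diagnostics such as the within-cluster sum of squares (`kmeans.inertia_`).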
-
From 18:00 to 23:59
Assignment after Module 2
Important comment: The assignment is not a strict examination. Its purpose is to apply the concepts learned during the previous sessions.
Continue Module 1’s assignment, now extending the analysis with tree-based models on the claim frequency and severity data or on a classification problem. Compare the models constructed in the first assignment with the tree-based alternatives.