Schedule of Module 2: Machine Learning for Classification and Regression
Day 1, Thursday, 6 February
16:00 - 18:00 | Decision Trees in classification and regression (Part I) | By TRUFIN Julien
Day 2, Thursday, 13 February
16:00 - 18:00 | Programming: Basics of regression and classification trees in Python | By VAN DAM Daniel, VAN ES Raymond
Day 3, Thursday, 20 February
16:00 - 18:00 | Decision Trees in classification and regression (Part II) | By TRUFIN Julien
Day 4, Thursday, 13 March
16:00 - 18:00 | Programming: From simple regression and classification trees to ensembles of trees (bagging and random forests) | By VAN DAM Daniel, VAN ES Raymond
Day 5, Monday, 24 March
16:00 - 18:00 | Theory: Boosted and bagged ensembles | By TRUFIN Julien
Day 6, Thursday, 27 March
16:00 - 18:00 | Programming: Stochastic gradient boosting machines and XGBoost | By VAN DAM Daniel, VAN ES Raymond
Day 7, Monday, 31 March
16:00 - 18:00 | Neural networks and deep learning overview | By ANTONIO Katrien
Day 8, Thursday, 10 April
16:00 - 18:00 | Programming: Deep learning | By VAN DAM Daniel, VAN ES Raymond
Day 9, Wednesday, 30 April
18:00 - 23:59 | Assignment after Module 2
-
This module gradually introduces classification and regression trees, building up to the competition-winning ensemble methods used for regression problems.
Participants will explore how ensembles of decision trees achieve superior performance and learn how to calibrate them in practice.
Part 1: Classification and regression trees:
- Binary regression trees
- Right-sized trees (see the pruning sketch after this list)
- Measures of performance
- Relative importance of features
- Interactions
- Limitations of trees
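A minimal, illustrative sketch of the "right-sized tree" idea (not course material; the toy data-generating process is an assumption): grow a deep regression tree on simulated data and select the pruning strength by k-fold cross-validation, using scikit-learn's cost-complexity pruning path.

```python
# Illustrative sketch only (toy data): grow a deep regression tree and pick a
# "right-sized" tree via cost-complexity pruning, scoring each candidate
# pruning strength (ccp_alpha) with 5-fold cross-validation.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))            # one toy feature
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, 500)    # noisy signal

# Candidate pruning strengths for the fully grown tree.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# Cross-validated score (negative MSE) for each pruned tree.
cv_scores = [
    cross_val_score(
        DecisionTreeRegressor(ccp_alpha=alpha, random_state=0),
        X, y, cv=5, scoring="neg_mean_squared_error",
    ).mean()
    for alpha in path.ccp_alphas
]
best_alpha = path.ccp_alphas[int(np.argmax(cv_scores))]
print("selected ccp_alpha:", best_alpha)
```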
-
From 16:00 to 18:00
Programming: Basics of regression and classification trees in Python
By VAN DAM Daniel, VAN ES Raymond
- Fit a first example of a regression tree using Python's [scikit-learn: sklearn.tree] on a simulated (toy) data set for regression. The goal is to gain an understanding of the control parameters, the loss function, and the output. We'll work from a stump to a very deep tree.
- Tuning of the parameters: k-fold cross-validation, minimal CV error and the one-standard-error rule in Python's [scikit-learn: sklearn.model_selection].
- Acquire an (empirical) understanding of the (high) variability of a constructed tree by repeating the above steps on newly simulated input data.
- Repeat the above steps on a classification problem with a simulated (toy) data set. Discuss loss functions and appropriate performance measures.
- Build a regression tree for the MTPL claim frequency and severity data: focus on the loss functions available in Python's [scikit-learn: e.g. Poisson is available, gamma is not], discuss limitations and sketch possible alternatives (a brief illustrative sketch follows after this list).
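A hedged sketch of two of these steps on simulated data only (the actual course notebooks may differ): tuning the depth of a classification tree by 5-fold cross-validation, and fitting a frequency tree with the Poisson splitting criterion that scikit-learn offers.

```python
# Hedged sketch on simulated data (not the course notebooks): depth tuning by
# k-fold CV, then a Poisson-criterion tree for toy claim counts.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

# Toy classification problem: grid search over tree depth with 5-fold CV.
X_clf, y_clf = make_classification(n_samples=1000, n_features=5, random_state=0)
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": list(range(1, 11))},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_clf, y_clf)
print("best depth:", grid.best_params_, "CV accuracy:", grid.best_score_)

# Toy claim-count example: Poisson deviance as splitting criterion
# (gamma is not available as a tree criterion in scikit-learn).
rng = np.random.default_rng(0)
X_freq = rng.uniform(size=(1000, 3))
y_freq = rng.poisson(lam=np.exp(X_freq[:, 0]))   # simulated Poisson counts
freq_tree = DecisionTreeRegressor(criterion="poisson", max_depth=3).fit(X_freq, y_freq)
```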
-
Part 2: Bagging trees and random forests:
- Bagging trees: bias, variance and expected generalization error (see the variance-reduction sketch after this list)
- Random forests
- Interpretability: relative importances and partial dependence plots
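An illustrative toy simulation (not part of the course slides) of the variance-reduction argument behind bagging: across repeatedly simulated training sets, the prediction of a single deep tree at a fixed point varies much more than the prediction of a bagged ensemble of deep trees.

```python
# Toy simulation: variance of a single deep tree vs a bagged ensemble,
# measured at one fixed test point across 50 simulated training sets.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

rng = np.random.default_rng(0)
x_test = np.array([[5.0]])
single_preds, bagged_preds = [], []

for _ in range(50):                               # 50 simulated training sets
    X = rng.uniform(0, 10, size=(300, 1))
    y = np.sin(X[:, 0]) + rng.normal(0, 0.3, 300)
    single_preds.append(DecisionTreeRegressor().fit(X, y).predict(x_test)[0])
    bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100, random_state=0)
    bagged_preds.append(bag.fit(X, y).predict(x_test)[0])

print("variance of single tree  :", np.var(single_preds))
print("variance of bagged trees :", np.var(bagged_preds))
```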
-
From 16:00 to 18:00
Programming: From simple regression and classification trees to ensembles of trees (bagging and random forests)
By VAN DAM Daniel, VAN ES Raymond
- First discussion of some basic interpretation tools: the variable importance plot and the partial dependence plot using Python's [scikit-learn: sklearn.inspection].
- Demonstrate bagging on bootstrapped toy data sets (for regression and classification): fit deep trees using Python's [scikit-learn: sklearn.tree] and average their predictions. Then continue with bagging done properly using Python's [scikit-learn: sklearn.ensemble Bagging] (on the toy data sets introduced in the previous session). Tuning of parameters, construction of predictions.
- Same for random forests using Python's [scikit-learn: sklearn.ensemble RandomForest]. Tuning of parameters, construction of predictions.
- Build a random forest on the Ames Housing data set and extract first insights (a hedged sketch follows after this list).
- Discuss the loss functions available for bagging and random forests on claim frequency and claim severity; comparison of tools [e.g. H2O].
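A hedged sketch of the random-forest and interpretation steps above. Assumptions: the Ames Housing data is taken from the OpenML "house_prices" table, "GrLivArea" is an assumed column name, and the preprocessing (numeric columns only, zero imputation) is deliberately simplified compared with the course notebooks.

```python
# Hedged sketch: random forest on (a simplified version of) the Ames Housing
# data plus the sklearn.inspection interpretation tools.
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance, PartialDependenceDisplay
from sklearn.model_selection import train_test_split

ames = fetch_openml(name="house_prices", as_frame=True)   # assumed data source
X = ames.data.select_dtypes("number").fillna(0)           # naive preprocessing
y = ames.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=300, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)

# Variable importance (permutation based) and a partial dependence plot.
imp = permutation_importance(rf, X_test, y_test, n_repeats=5, random_state=0)
top5 = X.columns[imp.importances_mean.argsort()[::-1][:5]]
print("top-5 features by permutation importance:", list(top5))
PartialDependenceDisplay.from_estimator(rf, X_test, ["GrLivArea"])  # assumed column
```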
-
Part 3: Boosting trees:
- Boosting trees
- Gradient boosting trees (see the residual-fitting sketch after this list)
- Regularization and randomness
- Interpretability: relative importances, partial dependence plots and Friedman's H-statistic
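A toy illustration (squared-error loss only, not the course implementation) of the core gradient-boosting idea: repeatedly fit a shallow tree to the current residuals, which are the negative gradient of the squared loss, and add a shrunken copy of it to the ensemble.

```python
# Toy illustration of gradient boosting for squared-error loss: each round fits
# a shallow tree to the current residuals and adds a shrunken version of it.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, 500)

learning_rate, n_rounds = 0.1, 200
pred = np.full_like(y, y.mean())                  # start from a constant fit
trees = []
for _ in range(n_rounds):
    residuals = y - pred                          # negative gradient of 0.5*(y - f)^2
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    pred = pred + learning_rate * tree.predict(X)
    trees.append(tree)

print("training MSE after boosting:", np.mean((y - pred) ** 2))
```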
-
From 16:00 to 18:00
Programming: Stochastic gradient boosting machines and XGBoost
By VAN DAM Daniel, VAN ES Raymond
- Basics of fitting (stochastic) gradient boosting machines with Python's [scikit-learn: sklearn.ensemble GradientBoosting] and XGBoost: discussion of the control parameters and an outline of the tuning process (a minimal sketch follows after this list).
- To illustrate first principles, we work again with the toy data sets for regression and classification introduced in previous sessions.
- Claim frequency and severity modelling with GBMs in Python's [scikit-learn: sklearn.ensemble GradientBoosting]: tuning, variable importance, PDPs, predictions and construction of the technical tariff.
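A minimal sketch of the control parameters mentioned above, on simulated claim counts (the data-generating process and coefficients are assumptions). Note that a Poisson loss is available in XGBoost (objective="count:poisson") and in scikit-learn's HistGradientBoostingRegressor, but not in GradientBoostingRegressor itself.

```python
# Minimal sketch on simulated claim counts: main tuning knobs of a stochastic
# GBM in scikit-learn and of XGBoost with a Poisson objective.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.uniform(size=(2000, 4))
y_freq = rng.poisson(lam=np.exp(0.5 * X[:, 0] - 0.2 * X[:, 1]))  # toy claim counts

# Stochastic GBM in scikit-learn: the usual control parameters.
gbm = GradientBoostingRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=3,
    subsample=0.75,        # row subsampling per tree = the "stochastic" part
    random_state=0,
).fit(X, y_freq)

# XGBoost with a Poisson objective, as would be used for claim frequency.
xgb_freq = xgb.XGBRegressor(
    objective="count:poisson",
    n_estimators=500,
    learning_rate=0.05,
    max_depth=3,
    subsample=0.75,
    colsample_bytree=0.75,
).fit(X, y_freq)
print("mean predicted frequency (GBM):", gbm.predict(X).mean())
```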
-
From 18:00 to 23:59
Assignment after Module 2
Important comment: the assignment is not a strict examination; its purpose is to apply the concepts learned during the previous sessions.
Deadline for handing in Assignment 2: 29 April 2023.