Internship to explore model interpretation in scikit-learn
If you are interested in research and empirical analysis, this internship is for you. You'll explore model interpretation techniques and contribute to the open-source ecosystem around scikit-learn.
Context of the internship
The internship will be carried out at Probabl. The company has offices in Paris at the Montparnasse Tower and in Palaiseau at the Inria Paris-Saclay research center. Probabl is a startup building AI/ML solutions that help data scientists carry out their projects. Within Probabl, the open-source team is responsible for maintaining and developing the scikit-learn library and related open-source projects, and the internship will take place within this team.
Internship problem statement
Model inspection is crucial to understanding the behavior of machine learning models. For tree-based models, one popular way to quantify feature importance is the mean decrease in impurity (MDI). However, research has shown that, depending on how it is computed, this measure can be biased. The goal of the internship is to characterize the sources of these biases and to develop techniques that alleviate them, comparing them with state-of-the-art alternatives. We are particularly interested in studying these issues for the family of gradient boosting trees.
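To make the problem concrete, here is a minimal, self-contained sketch of the kind of bias at stake: a purely random, high-cardinality feature that carries no signal about the target can still receive a sizeable share of the impurity-based importance. The dataset, sizes, and parameters are illustrative choices, not part of the internship deliverables.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_samples = 1_000

# One genuinely informative binary feature and one purely random feature
# with many distinct values (high cardinality).
informative = rng.integers(0, 2, size=n_samples)
noise_high_cardinality = rng.integers(0, 500, size=n_samples)
X = np.column_stack([informative, noise_high_cardinality])

# The target follows the informative feature, up to 30% label noise.
y = np.where(rng.random(n_samples) < 0.3, 1 - informative, informative)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# `feature_importances_` is the impurity-based MDI, computed on the training
# data; the noise column typically receives a large share of it.
print(dict(zip(["informative", "noise_high_cardinality"],
               forest.feature_importances_)))
```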
State-of-the-art and related work
The mean decrease in impurity (MDI) has already been studied theoretically [1]. In practice, however, depending on how it is computed, this measure can be biased [2]. For ensembles of bagged trees, several approaches leverage the out-of-bag (OOB) samples to obtain an unbiased estimate of the feature importance [3, 4, 5]. These approaches are, however, only suitable for bagged trees (e.g., random forests, extremely randomized trees, totally random trees). While gradient boosting trees are probably the most popular and effective models for tabular machine learning, out-of-bag samples are no longer available for them, and to our knowledge no study in the literature characterizes the sources of bias in their MDI or proposes ways to debias it.
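For reference, the snippet below sketches how the situation currently looks in scikit-learn (synthetic data, illustrative parameters): the classic GradientBoostingClassifier exposes a training-set MDI through feature_importances_, whereas the histogram-based HistGradientBoostingClassifier exposes no MDI attribute at all, which is part of the gap described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, HistGradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# "Classic" gradient boosting: MDI computed on the training data is exposed.
gbdt = GradientBoostingClassifier(random_state=0).fit(X, y)
print(gbdt.feature_importances_)

# Histogram-based gradient boosting: no impurity-based importance is exposed,
# so practitioners usually fall back on permutation importance instead.
hist_gbdt = HistGradientBoostingClassifier(random_state=0).fit(X, y)
print(hasattr(hist_gbdt, "feature_importances_"))  # False
```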
Internship proposal
The internship will focus on a comprehensive study of MDI bias in tree-based models, with particular emphasis on gradient boosting trees. The work will be structured in three main phases:
First, we will conduct a thorough characterization of MDI biases across different tree-based models. This includes analyzing how categorical features with varying cardinality affect the bias [2], studying the impact of model overfitting on MDI estimates [2], and investigating how these biases manifest differently in various tree ensemble methods (random forests, gradient boosting, etc.) [1].
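As an illustration of the overfitting effect mentioned above (a rough sketch on synthetic data, with arbitrary sizes and depths), fully grown random forests tend to attribute a noticeably larger total MDI to pure-noise features than depth-limited ones:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(
    n_samples=1_000, n_features=10, n_informative=3, n_redundant=0, random_state=0
)
# Append five pure-noise columns that carry no information about y.
noise = np.random.default_rng(0).normal(size=(X.shape[0], 5))
X = np.hstack([X, noise])

for max_depth in (None, 3):
    forest = RandomForestClassifier(
        n_estimators=200, max_depth=max_depth, random_state=0
    ).fit(X, y)
    # Total training-set MDI attributed to the five noise columns.
    print(f"max_depth={max_depth}: {forest.feature_importances_[-5:].sum():.3f}")
```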
Second, we will implement and evaluate existing debiasing techniques that leverage out-of-bag (OOB) samples, specifically for bagged tree models. This will involve reproducing state-of-the-art methods from the literature [3, 4, 5] and conducting comparative analyses of their effectiveness on both synthetic and real-world datasets.
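The snippet below sketches the OOB machinery these methods build on. For each tree of a bagged ensemble it computes a simple accuracy-drop importance on that tree's out-of-bag samples; this is closer to permutation importance than to the exact estimators of [3, 4, 5], which the internship will reproduce faithfully, but it shows where the OOB indices come from in scikit-learn. All names and parameters are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1_000, n_features=8, random_state=0)

bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=0),
    n_estimators=50,
    bootstrap=True,
    random_state=0,
).fit(X, y)

rng = np.random.default_rng(0)
n_features = X.shape[1]
importances = np.zeros(n_features)

for tree, in_bag in zip(bagging.estimators_, bagging.estimators_samples_):
    # Out-of-bag samples: those never drawn in this tree's bootstrap.
    oob_mask = np.ones(len(X), dtype=bool)
    oob_mask[in_bag] = False
    X_oob, y_oob = X[oob_mask], y[oob_mask]

    baseline = tree.score(X_oob, y_oob)
    for j in range(n_features):
        # Accuracy drop when column j is shuffled on the OOB samples.
        X_perm = X_oob.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        importances[j] += baseline - tree.score(X_perm, y_oob)

importances /= bagging.n_estimators
print(importances)
```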
Third, we will propose a novel approach to compute MDI using external test sets, particularly addressing the challenge of gradient boosting trees where OOB samples are not available. This method will aim to provide more reliable feature importance estimates while maintaining computational efficiency.
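As a rough, non-committal sketch of this direction (the actual method for gradient boosting is to be designed during the internship), the impurity decrease of each split of a single fitted decision tree can be re-evaluated on held-out samples by routing them through the tree with decision_path, instead of reusing the training-time statistics stored in the tree. The dataset and the normalization choice are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def gini(labels):
    """Gini impurity of a label array (0.0 for an empty node)."""
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / labels.size
    return 1.0 - np.sum(p**2)


def held_out_mdi(tree, X_held_out, y_held_out):
    """Per-feature sum of impurity decreases, re-evaluated on held-out samples."""
    t = tree.tree_
    # Boolean matrix: which held-out samples reach which node of the tree.
    node_indicator = tree.decision_path(X_held_out).toarray().astype(bool)
    n_total = X_held_out.shape[0]
    importances = np.zeros(X_held_out.shape[1])
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf node: no split, no impurity decrease
            continue
        y_node = y_held_out[node_indicator[:, node]]
        y_left = y_held_out[node_indicator[:, left]]
        y_right = y_held_out[node_indicator[:, right]]
        decrease = (
            y_node.size * gini(y_node)
            - y_left.size * gini(y_left)
            - y_right.size * gini(y_right)
        ) / n_total
        importances[t.feature[node]] += decrease
    if importances.sum() > 0:
        importances /= importances.sum()
    return importances


X, y = make_classification(n_samples=2_000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print("training-set MDI:", tree.feature_importances_)
print("held-out MDI    :", held_out_mdi(tree, X_test, y_test))
```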
Throughout the project, we will benchmark these different MDI computation and debiasing techniques against other popular feature importance methods, including permutation importance and Shapley values. The benchmarking will consider both the quality of the importance estimates and the computational requirements. Special attention will be given to gradient boosting trees, as they represent a crucial gap in the current literature regarding unbiased feature importance estimation.
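A minimal example of such a benchmark, assuming a synthetic dataset and a random forest: training-set MDI versus permutation importance computed on a held-out set with sklearn.inspection.permutation_importance. A Shapley-value baseline would typically rely on an external package such as shap (optional dependency, not shown here).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=2_000, n_features=10, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Impurity-based importance, computed on the training data at fit time.
print("MDI (train)       :", forest.feature_importances_)

# Model-agnostic alternative, evaluated on held-out data.
result = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
print("Permutation (test):", result.importances_mean)
```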
Expected results
We expect to provide a better understanding of the sources of biases in the MDI measure for all available tree-based models and, potentially, to provide debiased alternatives. These methods will be compared to existing state-of-the-art alternatives such as permutation importance and Shapley values. We will also investigate the link between these MDI estimators and Shapley values, as proposed in [6]. These analyses will be carried out empirically on synthetic and real-world datasets. In addition, we want to provide an open-source implementation of the proposed methods so that it can serve as a starting point for consolidation into the scikit-learn library [7].
Recruitment process
Here's what to expect:
- Send your resume and a cover letter detailing your interest in the position and your understanding of the topic (a couple of paragraphs).
- Shortlisted candidates will be invited to a 1-hour interview with members of the open-source team. During the interview, we will discuss your background, your technical skills, and your motivations for the internship. You will also have the opportunity to showcase one of your technical projects and to ask questions about the internship, the team, and Probabl.
References
[1] Louppe, G., Wehenkel, L., Sutera, A., & Geurts, P. (2013). Understanding variable importances in forests of randomized trees. Advances in Neural Information Processing Systems, 26.
[2] Strobl, C., Boulesteix, A. L., Zeileis, A., & Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8, 1-21.
[3] Loecher, M. (2022). Unbiased variable importance for random forests. Communications in Statistics-Theory and Methods, 51(5), 1413-1425.
[4] Zhou, Z., & Hooker, G. (2021). Unbiased measurement of feature importance in tree-based methods. ACM Transactions on Knowledge Discovery from Data (TKDD), 15(2), 1-21.
[5] Li, X., Wang, Y., Basu, S., Kumbier, K., & Yu, B. (2019). A debiased MDI feature importance measure for random forests. Advances in Neural Information Processing Systems, 32.
[6] Sutera, A., Louppe, G., Huynh-Thu, V. A., Wehenkel, L., & Geurts, P. (2021). From global to local MDI variable importances for random forests and when they are Shapley values. Advances in Neural Information Processing Systems, 34, 3533-3543.
[7] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825-2830.
- Department: Open Source
- Locations: Paris / Montparnasse - Office, Saclay / Palaiseau - Office
- Remote status: Hybrid
- Open to freelancing: No
About Probabl
We develop, maintain at the state of the art, and sustain a complete suite of open-source tools for data science.
For more info, check probabl.ai