Much of my work in these areas has been motivated by collaborations involving citizen science data in ecology, mathematical and laboratory-based ecology, vehicular emissions, and medical decision support. But I have come to these through chance and welcome further serendipitous collaborations.
I have also done some work in robust statistics and in item response theory. I give a brief overview of each here; a research statement from my promotion materials in 2020 gives fuller, if somewhat outdated, details.
My interests in machine learning lie in (1) adapting ML models for frequentist statistical inference and (2) developing methods to understand ML models or explain their predictions.
In recent work, my lab developed central limit theorems for predictions from random forests, and much of my recent effort has gone into using and extending these results. This has included improving methods to estimate the variance of predictions, formal statistical tests for variable importance and for interactions between features, ways to compare and combine random forests with other models, and extensions to boosting, where I would like to use boosted models as components of partially linear models. I am also interested in corresponding results and methods for deep learning.
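As a very rough illustration of the target of this work (and not the estimators developed in those papers), the sampling variability of a forest's prediction can be gauged by brute force, refitting on bootstrap resamples of the training data; the data, model sizes, and test point below are all just placeholders:

```python
# Brute-force bootstrap sketch: refit a forest on bootstrap resamples and use
# the spread of predictions at a test point as a rough sampling-variance estimate.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy training data; in practice X, y come from your application.
n = 500
X = rng.uniform(-1, 1, size=(n, 5))
y = X[:, 0] ** 2 + np.sin(3 * X[:, 1]) + rng.normal(scale=0.3, size=n)
x_test = np.zeros((1, 5))           # point at which we want an interval

B = 50                              # number of bootstrap refits
preds = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)                 # bootstrap resample
    rf = RandomForestRegressor(n_estimators=200, random_state=b)
    rf.fit(X[idx], y[idx])
    preds[b] = rf.predict(x_test)[0]

estimate = preds.mean()
se = preds.std(ddof=1)              # bootstrap standard error of the prediction
print(f"prediction {estimate:.3f} +/- {1.96 * se:.3f} (approx. 95% interval)")
```

The appeal of the CLT-based methods is that they aim to deliver comparable uncertainty statements from a single fitted ensemble, without this kind of repeated refitting.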
I have a long-term interest (going all the way back to my thesis) in the interpretation and explanation of machine learning methods. We think of the result of machine learning as an algebraically complex black-box model, and there has been much recent interest in intelligibility and explainability, both as a way of increasing confidence in these models and as a way to help ensure fairness and allow redress. My early work looked at ANOVA decompositions as a formal means of representing functions in terms of low-dimensional components. More recently, I have been interested in the statistical stability of explanations, both with respect to the randomness inherent in generating them and with respect to the uncertainty in the underlying model.
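For concreteness, the ANOVA decomposition referred to here writes a prediction function as a sum of low-dimensional effects (this is the generic form; the constraints that make the components identifiable vary by setting):

$$
f(x_1, \dots, x_d) \;=\; f_0 \;+\; \sum_{j} f_j(x_j) \;+\; \sum_{j < k} f_{jk}(x_j, x_k) \;+\; \cdots,
$$

where the higher-order terms capture interactions, and interpretability rests on the hope that a few low-order components account for most of the variation in the fitted function.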
These concerns ("What does it mean to interpret a black box?", "Does interpreting a function provide any value?"), along with more general questions in the philosophy of statistics, are explored in my blog Of Models and Meanings.
My work in dynamical systems started with attempts to do practical statistical inference with Ordinary Differential Equation (ODE) models. While applied mathematicians have used ODEs since Newton, they have received minimal attention from the statistics community. My work sought to extend standard statistical procedures to these models: estimate parameters, test hypotheses, assess goodness of fit.
These tasks are made challenging by the highly nonlinear nature of ODEs, which results in complex likelihood surfaces; by the simplifications that go into developing ODEs, which lead to poor fit to data; and by the complexity of the choices involved in specifying alternative models. The profiling methods that I developed with Jim Ramsay help account for these issues by allowing the ODE to be inexact without specifying how; this turns out to alleviate both the complex likelihood surfaces and the lack of fit.
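In rough outline (a sketch of the profiling idea, omitting many details), the trajectory is represented by a basis expansion x(t) with coefficient vector c, and for fixed ODE parameters theta and smoothing parameter lambda the coefficients minimize a trade-off between fitting the data and staying faithful to the ODE:

$$
J(\mathbf{c} \mid \theta, \lambda) \;=\; \sum_{i} \big( y_i - x(t_i) \big)^2 \;+\; \lambda \int \Big( \frac{dx}{dt}(t) - f\big(x(t), t; \theta\big) \Big)^2 \, dt,
$$

after which theta is estimated by profiling out c. Large values of lambda force the fitted trajectory toward an exact solution of the ODE, while smaller values let it deviate, which is exactly the allowed inexactness described above.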
More recently, I have been interested in explicitly stochastic models, where I am particularly interested in designing experiments in ecology and pharmacodynamics, in how to model random perturbations, and in interpreting inexact methods for ODEs, such as profiling, as accounting for stochastic effects.
Functional data analysis deals with high-frequency measurements of repeated smooth processes (motion capture technologies are a canonical example), but it extends somewhat more broadly to anywhere we think there is a process over space or time. I have been involved in applications of FDA including modeling vehicle exhaust, plant responses to weather, and high-performance sport.
Methodologically and theoretically, I have been interested in the things that make functional data different from multivariate data. These include choosing which derivative of an observed function to use in a model, choosing which parts of a function to use, and formally incorporating shifts in the timing of events ("time warping") into models.
Importantly, functional data gives us access to derivatives, and therefore to relationships between derivatives, connecting back to my interest in dynamical systems above. An ongoing area of interest is exploring and understanding these relationships.
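As a minimal sketch of what "access to derivatives" means in practice (using scipy smoothing splines rather than the penalized basis expansions more common in the FDA literature, and with entirely synthetic data):

```python
# Estimate a curve and its first derivative from noisy functional observations
# using a smoothing spline; the derivative estimate should track cos(t),
# the true derivative of sin(t).
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(1)

t = np.linspace(0, 2 * np.pi, 200)                   # observation times
y = np.sin(t) + rng.normal(scale=0.1, size=t.size)   # noisy observations of sin(t)

spl = UnivariateSpline(t, y, k=5, s=len(t) * 0.1**2)  # smooth the raw curve
dspl = spl.derivative(n=1)                            # spline for the first derivative

t_grid = np.linspace(0, 2 * np.pi, 9)
print(np.round(dspl(t_grid), 2))       # estimated derivative
print(np.round(np.cos(t_grid), 2))     # true derivative for comparison
```

Once derivatives can be estimated in this way, relationships between a process and its rates of change can be modeled directly, which is where the connection to ODE models arises.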
Item response theory deals with estimating students' abilities from test results: both finding the difficulty level of individual questions and using those questions to estimate each student's abilities.
Paradoxical results occur when we attempt to measure multiple abilities at once: it is possible that flipping an answer from "incorrect" to "correct" causes the estimate of one of your abilities to decrease. I formulated a mathematical description of this behavior and investigated it in a number of contexts, although the description is not yet complete.
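For reference, the multidimensional models in question are typically of the compensatory logistic form (one standard parameterization, not the only one):

$$
P(X_j = 1 \mid \theta) \;=\; \frac{1}{1 + \exp\{-(a_j^{\top}\theta - b_j)\}},
\qquad \theta \in \mathbb{R}^K,
$$

where X_j is the response to item j, a_j its vector of discriminations, and b_j its difficulty. The paradox concerns how the estimate of theta moves when a single response is flipped from incorrect to correct: with more than one ability (K > 1), some coordinates of the estimate can decrease.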