Projects

This page is devoted to projects for which I am actively looking for collaborators, along with some (sometimes fairly old) ideas that I wouldn't mind taking a second look at. Many of these would make suitable student projects, and they are partly here to remind me of things I've thought of.

As a reminder, I really can't support outside visitors. That said, if you are interested in collaborating on these, or have relevant data, please get in touch. Of course, I'm also more than happy to be told that something has already been done and I no longer need to worry about doing it myself!

Active Projects:

These are projects for which I am actively looking for students, either because I have funding or because the work is developed enough that I would like to see it through:

Model Selection in Mathematical Ecology a large component of mathematical ecology involves extrapolating models, parts of which are obtained from data. For example, we might examine three species' responses to environmental variability and use this to quantify how stable their coexistence is. This involves a complex nonlinear functional of the original model that describes each species' ability to rebound from near extinction. What can be said about the best way to choose the statistical models that go into this analysis? And how do we account for the variability in that choice? Methods like Bayesian model averaging or efficient influence functions are superficially appealing, but can more be said?
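To fix ideas, here is a toy sketch: invented data, an invented stand-in for the ecological functional, and crude AIC weights in place of a real model-averaging scheme. The point is only that two candidate response models give different answers once pushed through a nonlinear functional.

```r
# Toy sketch: the quantity of interest is a nonlinear functional of a fitted
# response curve, and its value depends on which model we select.
set.seed(1)
env  <- runif(100, 0, 3)                            # environmental covariate
rate <- 0.8 * env / (1 + env) + rnorm(100, 0, 0.1)  # true saturating response

m1 <- lm(rate ~ env)                                # candidate 1: linear
m2 <- nls(rate ~ a * env / (1 + b * env), start = list(a = 1, b = 1))

# An invented stand-in for the ecological functional: a nonlinear summary of
# predicted growth over a range of environments.
functional <- function(fit) {
  e <- seq(0, 3, length.out = 200)
  log(mean(exp(predict(fit, newdata = data.frame(env = e)))))
}

aics <- c(AIC(m1), AIC(m2))
w <- exp(-0.5 * (aics - min(aics))); w <- w / sum(w)  # crude AIC weights
c(linear = functional(m1), saturating = functional(m2),
  averaged = sum(w * sapply(list(m1, m2), functional)))
```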

Variance Estimates for U-Statistics most distributional results for random forests come from relating them to U- or V-statistics (doing the same thing -- build a tree -- to subsamples of the data, then averaging). This has led to central limit theorems and a variety of ways to estimate the resulting variance, with some success. However, while we have results on the consistency of these estimates, we know very little about their distribution or convergence properties. Simulations suggest that these may be important, and incorporating them into inference could improve its statistical properties.

A recent direction my lab has explored is V-statistics (subsampling with replacement). These have been successful because we can represent the process as simulating from the empirical distribution. This ought to allow us to gain greater insight into convergence properties by relating them directly to the convergence of the empirical distribution; it is also a way to start understanding how convergence scales with dimension.
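For concreteness, here is a minimal sketch of the two sampling schemes. The between-subsample standard deviation at the end is only a naive proxy, not one of the published variance estimators.

```r
# The two subsampling schemes behind U- and V-statistic views of an ensemble.
library(rpart)
set.seed(1)
n <- 500
x <- matrix(runif(n * 5), n, 5)
y <- x[, 1] + sin(2 * pi * x[, 2]) + rnorm(n, 0, 0.3)
dat <- data.frame(y = y, x)
x0  <- data.frame(t(rep(0.5, 5))); names(x0) <- names(dat)[-1]  # query point

ensemble <- function(k, B, replace) {
  preds <- replicate(B, {
    idx <- sample(n, k, replace = replace)  # without: U-statistic; with: V
    predict(rpart(y ~ ., data = dat[idx, ]), newdata = x0)
  })
  c(mean = mean(preds), between_tree_sd = sd(preds))
}
ensemble(k = 100, B = 200, replace = FALSE)  # U-statistic flavour
ensemble(k = 100, B = 200, replace = TRUE)   # V-statistic flavour
```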

Inference and Boosting while a lot of progress has been made on theoretical developments for random forests and associated methods, much less progress has been made for boosting. This is significantly more challenging; some progress was made in this paper, but I think much more can be done, both in modifying the boosting process and in developing distributional results. One of the advantages of boosting is its ability to impose structure on models (additive models, partially linear models, etc.). Producing distributional results for boosting processes would potentially allow us to incorporate tree-based methods as part of a larger statistical model and still produce valid inference.
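For anyone unfamiliar with the process in question, here is a bare-bones L2 boosting loop with stumps; the open problems concern distributional results for fits like this, not the loop itself.

```r
# L2 boosting: each stage fits a depth-one tree to the current residuals
# and takes a small step in that direction.
library(rpart)
set.seed(1)
x <- runif(300)
y <- sin(2 * pi * x) + rnorm(300, 0, 0.2)
dat <- data.frame(x = x, y = y)

nu <- 0.1                          # learning rate
M  <- 200                          # number of boosting stages
fit <- rep(mean(y), 300)           # initialize at the mean
for (m in 1:M) {
  dat$r <- y - fit                 # residuals = negative L2 gradient
  tr  <- rpart(r ~ x, data = dat, control = rpart.control(maxdepth = 1))
  fit <- fit + nu * predict(tr, newdata = dat)
}
plot(x, y, col = "grey"); lines(sort(x), fit[order(x)], lwd = 2)
```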

Automatic Hyper-parameter Tuning and Random Forests much more experimental, this focuses on something like a boosting process within random forests. Although random forests are relatively insensitive to hyper-parameters, choices like subsample size and mtry can make a difference to performance. The nice thing about random forests is that we can explore some of these values as we grow trees and either discard bad ones or just average them out.
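A hypothetical sketch of the idea, with per-tree rather than per-split variable subsets to keep it short: draw hyper-parameters for each tree, score the tree on its held-out points, and discard (or down-weight) the poor ones.

```r
# Tuning-while-growing, in miniature. (A real random forest redraws the
# variable subset at every split, not once per tree.)
library(rpart)
set.seed(1)
n <- 400; p <- 6
x <- matrix(runif(n * p), n, p)
y <- x[, 1] + 2 * x[, 2] + rnorm(n, 0, 0.3)
dat <- data.frame(y = y, x)

grow_one <- function() {
  m    <- sample(2:p, 1)                    # candidate 'mtry' for this tree
  k    <- sample(c(n / 4, n / 2), 1)        # candidate subsample size
  idx  <- sample(n, k)
  vars <- sample(names(dat)[-1], m)
  tr   <- rpart(reformulate(vars, "y"), data = dat[idx, ])
  oob  <- mean((y[-idx] - predict(tr, newdata = dat[-idx, ]))^2)
  list(tree = tr, oob = oob)
}
forest <- replicate(200, grow_one(), simplify = FALSE)
oob    <- sapply(forest, `[[`, "oob")
keep   <- forest[oob < median(oob)]         # discard the worse half
length(keep)
```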

Principal Differential Analysis in functional data analysis, PDA involves relating derivatives of a function to each other. There are a number of problems to be addressed, including sparsity, inference, and selecting the order of derivatives to use. I have a number of motion capture data sets in which, for example, we want to disentangle motor control processes from the physical forces on limbs. Extensions of these problems to spatio-temporal processes (so PDEs) are also of interest.
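The simplest instance, to fix notation: posit a constant-coefficient second-order equation, smooth the data, and regress the second derivative on the function and its first derivative. (The fda package's pda.fd estimates time-varying versions of the coefficients.)

```r
# Constant-coefficient PDA: assume x''(t) = -b0 x(t) - b1 x'(t).
# Data: a noisy damped oscillator, which satisfies this with b0 = 4.01, b1 = 0.2.
set.seed(1)
t <- seq(0, 10, length.out = 200)
x <- exp(-0.1 * t) * cos(2 * t) + rnorm(200, 0, 0.02)

sfit <- smooth.spline(t, x)
x0 <- predict(sfit, t)$y             # smoothed x(t)
x1 <- predict(sfit, t, deriv = 1)$y  # x'(t)
x2 <- predict(sfit, t, deriv = 2)$y  # x''(t)

coef(lm(x2 ~ x0 + x1 - 1))           # should land near (-4.01, -0.2)
```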

fda package for a keen programmer; the fda package for functional data analysis, while providing the underlying structure for a number of other packages, is sorely out of date. We really need help writing new functions for GLMs, GAMs, methods for uniform confidence bands, and a host of other standard models.

Miscellaneous Ideas:

These are projects that I either never had time for, or for which I don't have immediate applications, but that I still think have real potential. Some of them date back quite a while, but I will gladly spend time looking at them again if you are interested in helping.

"A Tale of Two ANOVAs": Machine Learning cook-offs have a tendency to declare a winner and stop there. But can we learn something from what the winner did differently from its competitors? There are lots of ways in which we could compare the behavior of functions -- the title is a play on decomposing differences between ML models in terms of additive and interaction effects of each -- but there are lots of tools that could be brought to bear.

Bayesian Robustness via Hierarchical Models: Some time ago, I got interested in disparity-based methods, which involve estimating parameters p by minimizing a distance D(f(x;p), g(x)) between a parametric model f(x;p) and a non-parametric density estimate g. The reason to do this is that you can define a class of disparities D that render your estimate efficient if your model is correct, but also robust to outliers.
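A minimal sketch of the classical version, with a squared Hellinger disparity, a normal family, and a kernel density estimate for g:

```r
# Minimum (squared) Hellinger distance for a normal family against a kernel
# density estimate, with 5% gross contamination.
set.seed(1)
x <- c(rnorm(95, mean = 2), rnorm(5, mean = 15))
g <- density(x, n = 1024)                       # nonparametric estimate of g

hellinger <- function(p)                        # p = (mu, log sigma)
  sum((sqrt(dnorm(g$x, p[1], exp(p[2]))) - sqrt(g$y))^2) * diff(g$x[1:2])

p_hat <- optim(c(median(x), 0), hellinger)$par
c(mu_disparity = p_hat[1], sigma_disparity = exp(p_hat[2]), mu_mle = mean(x))
# The disparity estimate of mu should sit near 2; the MLE is dragged upward.
```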

I did some work incorporating Bayesian methods into these ideas: replacing the likelihood with the disparity. I also did some work on using Bayesian density estimates for g. What I wanted to get to, but never did, was to do both: use the parametric family to modify a nonparametric Bayes prior, so that p becomes a hyperparameter that we can estimate. The idea is that this gives a principled Bayesian way of developing robustness -- D says that the truth should be "like" your parametric family, but allows departures from it. We ought to be able to demonstrate that the estimated parameters remain efficient if the model is correct.

Multidimensional Disparity Methods: as in Bayesian robustness above, but in a different direction. A severe limitation of disparity-based methods is that they really only apply to univariate distributions g. This is because the bias-variance trade-off in the non-parametric estimation of g only allows appropriate smoothing in one dimension: either bias or variance blows up otherwise. One way to get around this is to transform the data so that, under the assumed model, univariate density estimation is possible: using the residuals of a linear regression, for example, or rotating the data onto the principal components of a covariance matrix. That means that g now depends on p (and f is often parameter-free), requiring much more analysis of the properties of the nonparametric smooth as p varies.
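Here is a sketch of the residual version under a simple linear model (invented data; the optimizer is started from the OLS fit):

```r
# If y = a + b*x + N(0, s) is right, the residuals r(a, b) are a univariate
# N(0, s) sample, so a univariate disparity can drive the whole fit.
set.seed(1)
x <- runif(200, -1, 1)
y <- 1 + 2 * x + rnorm(200, 0, 0.5)
y[1:10] <- y[1:10] + 8                          # contaminate 5% of responses

disparity <- function(p) {                      # p = (a, b, log s)
  r <- y - p[1] - p[2] * x                      # residuals depend on p
  g <- density(r, n = 512)
  sum((sqrt(dnorm(g$x, 0, exp(p[3]))) - sqrt(g$y))^2) * diff(g$x[1:2])
}
ols   <- lm(y ~ x)
start <- c(coef(ols), log(sd(residuals(ols))))
rbind(disparity = optim(start, disparity)$par[1:2], ols = coef(ols))
# The disparity fit should stay near (1, 2); OLS is pulled by the outliers.
```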

This interacts with the Bayesian methods above, allowing a much wider variety of models to be explored. It could also be extended to robustness for nonparametric smoothing methods, where we would also need to characterize efficiency.

Selecting Smoothing Parameters at Edge Cases: there is a lot of work on nonparametric smoothing and on the asymptotics of criteria for choosing smoothing parameters, but usually with a view to ensuring that they decay at the right rate. But what if you are doing local linear regression when the truth really is linear? In that case you want your bandwidth to grow!

Some ideas about an analysis at the implied parametric model appear in Crainiceanu and Ruppert, 2003, but it would be useful to analyze other methods: GCV, or the forward prediction error that I suggested for inexact differential equations.
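A quick toy illustration with loess and a hand-rolled GCV score, on data where the truth really is linear:

```r
# When the truth is exactly linear, GCV for a local linear smoother keeps
# preferring the largest span on offer: the bandwidth 'wants' to grow.
set.seed(1)
x <- runif(200); y <- 1 + 2 * x + rnorm(200, 0, 0.3)

gcv <- sapply(seq(0.2, 1, by = 0.1), function(span) {
  fit <- loess(y ~ x, span = span, degree = 1)
  n <- length(y)
  n * sum(residuals(fit)^2) / (n - fit$trace.hat)^2
})
round(gcv, 5)   # typically monotone decreasing in span
```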

Nonparadoxical Item Response Theory One approach to the phenomenon of paradoxical results in multidimensional item response theory was to make sure that they were not observable. That is, impose the following constraint: for any two subjects where the first did at least as well as the second on every item, the first must also be given at least as high an estimated ability in every dimension. This has minimal effect in asymptotics with a growing number of questions, but can we also examine a growing number of subjects?

On a more basic level, the results in the original paradoxical-results paper have only been proven for two abilities. I conjecture that there will always be at least one paradoxical result in any test of more than two abilities, but this needs a proof.
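A small numerical sandbox for hunting such results, under an assumed compensatory two-ability 2PL model with a standard normal prior (all numbers invented):

```r
# Flip one response from wrong to right and check whether any coordinate of
# the MAP ability estimate goes *down* -- that is a paradoxical result.
a <- rbind(c(1.5, 0.2), c(0.2, 1.5), c(1.0, 1.0))  # invented discriminations
b <- c(0, 0, 0.5)                                   # invented difficulties

map_est <- function(resp) {
  nll <- function(th) {
    p <- plogis(as.vector(a %*% th) - b)
    -sum(resp * log(p) + (1 - resp) * log(1 - p)) + sum(th^2) / 2  # N(0,I) prior
  }
  optim(c(0, 0), nll)$par
}
theta0 <- map_est(c(1, 0, 0))   # misses item 3
theta1 <- map_est(c(1, 0, 1))   # identical, but now gets item 3 right
rbind(theta0, theta1)           # compare coordinate by coordinate
```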