# Learning and Model Fitting

# 1. Classification

Classification is the problem of assigning the correct class label to an observation based only on a set of input variables. Machine learning algorithms learn to perform this task from a training set of observations with known class labels (outputs), each described by a set of (input) attributes. Our studies have focused on building an instance space for benchmark classification problems (from the UCI and OpenML repositories), using a comprehensive set of features to characterise the variations between instances that may present challenges for a set of 10 classification algorithms. Our instance space shows that the existing benchmarks are not as diverse as desired for insightful analysis of algorithm strengths and weaknesses, and we have proposed some new ideas for generating more diverse test instances.
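To make the idea of characterising an instance by features more concrete, the following is a minimal sketch of a few simple meta-features computed for one classification dataset. The function names and the choice of features are illustrative assumptions, not the feature set used in the study.

```python
import math
from collections import Counter


def normalised_class_entropy(labels):
    """Shannon entropy of the class distribution, scaled to [0, 1].

    1.0 means perfectly balanced classes; values near 0 mean one class dominates.
    """
    counts = Counter(labels)
    if len(counts) < 2:
        return 0.0
    n = len(labels)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(len(counts))


def meta_features(rows, labels):
    """Hypothetical meta-feature vector for one dataset (illustrative only)."""
    return {
        "n_instances": len(rows),
        "n_attributes": len(rows[0]),
        "n_classes": len(set(labels)),
        "attributes_per_instance": len(rows[0]) / len(rows),
        "normalised_class_entropy": normalised_class_entropy(labels),
    }


# Toy dataset: four observations, two attributes, imbalanced labels.
toy_rows = [(5.1, 3.5), (4.9, 3.0), (6.2, 2.9), (5.9, 3.2)]
toy_labels = ["a", "a", "a", "b"]
print(meta_features(toy_rows, toy_labels))
```

Each benchmark dataset would yield one such vector, and a collection of these vectors is the kind of input from which an instance space can be constructed.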

Research Publications | Downloads | Instance Space Analysis |
---|---|---|
Muñoz, M. A., Villanova, L., Baatar, D. and Smith-Miles, K. A., "Instance Spaces for Machine Learning Classification", Machine Learning, vol. 107, no. 1, pp. 109-147, 2018. | | |

# 2. Regression

Regression is a supervised learning approach that aims to predict a continuous-valued target (dependent) variable from a set of independent input variables or attributes. A variety of statistical, mathematical and computer science methods are available, each of which makes different assumptions about the underlying relationship between the dependent and independent variables. Our studies have focused on building an instance space to show whether the existing benchmarks can adequately explain variation in the performance of these approaches, and on converting problems from other fields into regression problems to augment the diversity of the instance space.

Research Publications | Downloads | Instance Space Analysis |
---|---|---|
Muñoz, M. A., Yan, T., Leal, M. R., Smith-Miles, K. A., Lorena, A. C., Pappa, G. L., Rodrigues, R. M., "An Instance Space Analysis of Regression Problems", ACM Transactions on Knowledge Discovery from Data | | |

# 3. Anomaly Detection

Anomaly detection methods are used to identify unusual observations, called outliers, that do not conform to expected behaviour. Our studies have focused on characterising the benchmark instances using novel features, and on exploring the impact of normalisation schemes on the success of various outlier detection methods.
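The effect of normalisation can be seen even on a tiny example. The sketch below is an illustrative assumption rather than the method used in the studies: it scores each point by the distance to its nearest neighbour and compares two common schemes, min-max scaling and z-score standardisation, on a hand-made dataset (`POINTS` and all function names are hypothetical).

```python
import math
import statistics


def min_max(column):
    """Scale a column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]


def z_score(column):
    """Centre a column on its mean and divide by its population std. deviation."""
    mean, std = statistics.fmean(column), statistics.pstdev(column)
    return [(v - mean) / std for v in column]


def normalise(points, scheme):
    """Apply a per-attribute normalisation scheme to a list of points."""
    columns = [scheme(list(col)) for col in zip(*points)]
    return list(zip(*columns))


def nearest_neighbour_scores(points):
    """Outlier score of each point: Euclidean distance to its nearest neighbour."""
    return [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]


def top_outlier(points, scheme):
    """Index of the point with the highest outlier score after normalisation."""
    scores = nearest_neighbour_scores(normalise(points, scheme))
    return max(range(len(points)), key=scores.__getitem__)


# Hand-made data: index 6 is isolated in the second attribute,
# index 7 is isolated in the first attribute.
POINTS = [
    (0.0, 0.0), (1.0, 0.0), (0.0, 4.0), (1.0, 4.0),
    (0.5, 0.0), (0.5, 4.0), (0.5, 2.0), (1.8, 0.0),
]

print("min-max top outlier index:", top_outlier(POINTS, min_max))
print("z-score top outlier index:", top_outlier(POINTS, z_score))
```

On this data the two schemes nominate different points as the most anomalous, because each scheme weights the two attributes differently. The same detector can therefore succeed or fail on the same instance depending on the normalisation applied beforehand.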

# 4. Time Series Forecasting

A time series is a sequence of discrete-time data. Time series forecasting builds a model to predict future values based on previously observed values. Our studies of time series forecasting have focused on developing useful features to globally characterise time series, and then using these features to construct an instance space of the well-studied M3 competition time series. We have filled the instance space with 10,000 new time series exhibiting a wide range of characteristics, enabling the strengths and weaknesses of forecasting methods to be better described.
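As a rough illustration of what globally characterising a time series can mean, the sketch below summarises a whole series with a handful of numbers. The specific features (mean, spread, lag-1 autocorrelation and linear trend slope) and the function names are assumptions chosen for simplicity, not the feature set developed in the study.

```python
import statistics


def lag1_autocorrelation(series):
    """Correlation between the series and itself shifted by one time step."""
    mean = statistics.fmean(series)
    numerator = sum((a - mean) * (b - mean) for a, b in zip(series, series[1:]))
    denominator = sum((v - mean) ** 2 for v in series)
    return numerator / denominator


def trend_slope(series):
    """Least-squares slope of the series against the time index 0..n-1."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = statistics.fmean(series)
    numerator = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    denominator = sum((t - t_mean) ** 2 for t in range(n))
    return numerator / denominator


def global_features(series):
    """Hypothetical global feature vector for one time series."""
    return {
        "mean": statistics.fmean(series),
        "std": statistics.pstdev(series),
        "lag1_autocorrelation": lag1_autocorrelation(series),
        "trend_slope": trend_slope(series),
    }


trending = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
print(global_features(trending))
print(global_features(alternating))
```

Series with very different behaviour, such as the steadily rising and the alternating examples here, end up far apart in this feature space, which is what allows an instance space to separate them.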

# 5. Facial Age Estimation

Facial images contain much information about an individual, including their identity, gender, mood and age. Various methods have been proposed for estimating the age of a person based on their face, using databases with known age labels such as FG-NET, MORPH and MORPH2. Our early study in 2007 focused on developing new facial age estimation methods and comparing them with state-of-the-art approaches, including both tailored methods and generic machine learning approaches. We are currently revisiting this study in light of instance space analysis to understand how the performance of algorithms depends on the characteristics of the face.

# 6. Clustering

The definition of a cluster in the literature is not unique, and each algorithm may adopt a different clustering criterion. These criteria create biases for different algorithms, affecting their suitability for identifying clustering structures in datasets, depending on the dataset properties. The challenge for Instance Space Analysis (ISA) is to explain how an algorithm's clustering criterion affects performance on a variety of datasets with various cluster structures. Here we have developed a set of 20 meta-features aiming to reveal different types of structures within a dataset. Since there are multiple cluster definitions, we have selected 10 popular partitional and hierarchical clustering algorithms employing different clustering criteria. To evaluate algorithm performance in the absence of ground truth for clustering results, we have combined 12 validation indexes into a ranking to score each algorithm's success. In this work, two ISA experiments were carried out. In the first, 380 artificial datasets were tested, while in the second, 219 real datasets were added to the meta-data.
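One simple way to combine several validation indexes into a single ranking is to rank the algorithms under each index and then average those ranks. The sketch below assumes this mean-rank aggregation purely for illustration; the source does not state which aggregation was used, and the three indexes, algorithm names and score values shown are made up.

```python
def ranks(values, higher_is_better=True):
    """Rank values so that 1 is best; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=higher_is_better)
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def mean_rank(scores, higher_is_better):
    """Average each algorithm's rank over all validation indexes."""
    per_index = [ranks(vals, higher_is_better[name]) for name, vals in scores.items()]
    return [sum(column) / len(per_index) for column in zip(*per_index)]


# Made-up scores for three hypothetical algorithms on one dataset.
algorithms = ["k-means", "single-linkage", "average-linkage"]
scores = {
    "silhouette": [0.62, 0.35, 0.62],          # higher is better
    "davies_bouldin": [0.80, 1.40, 0.95],      # lower is better
    "calinski_harabasz": [210.0, 90.0, 180.0], # higher is better
}
direction = {"silhouette": True, "davies_bouldin": False, "calinski_harabasz": True}

for name, rank in zip(algorithms, mean_rank(scores, direction)):
    print(f"{name}: mean rank {rank:.2f}")
```

Working with ranks rather than raw index values avoids having to put indexes with different scales and directions on a common footing, at the cost of discarding how large the differences between algorithms are.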