Put simply, a model is a mathematical tool used to describe or predict properties of a phenomenon. Such tools usually need to be fitted to a set of data before they can represent reality reliably. We wanted a deeper understanding of the study sample obtained in our first survey, so we decided to model it with a clustering algorithm combined with dimensionality reduction.
A survey is composed of several questions whose answers may be numbers or text. Normally, there are so many questions that they cannot all be studied visually in a single graph.
This is a problem for two reasons. On the one hand, the sheer number of variables makes it difficult to study the relationships in the data. On the other hand, running algorithms on spaces with many variables usually requires a lot of time and computing resources, which are not always available. For this reason, methods have been developed that reduce the set of variables to a smaller representation of the original set.
Principal Component Analysis
Principal Component Analysis (PCA) is a technique that describes a dataset in terms of new variables that are orthogonal, uncorrelated and fewer in number than the original ones. It is used to reduce the dimensionality of the dataset. The algorithm calculates the best projection of the data onto a lower-dimensional space, that is, the projection that retains as much of the variance of the data as possible.
A dataset of observations on p variables is represented with fewer variables, each calculated as a linear combination of the original ones. This method gives an optimal representation, in a lower-dimensional space, of the original p-dimensional space. PCA is a first step towards identifying possible latent variables that generate the variability of the data. It transforms the original variables, which are generally correlated, into a new set of uncorrelated variables, which eases the interpretation of the data.
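As a rough illustration of the idea, the following sketch computes only the first principal component of a small, made-up dataset using the Python standard library. The data, the function names and the use of power iteration are assumptions chosen for brevity, not the procedure used in the project; in practice a statistics library would normally be used, and the sketch does not handle degenerate cases such as data with zero variance.

```python
import math

def center(data):
    """Subtract each variable's mean so every column has mean zero."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in data]

def covariance_matrix(centered):
    """Sample covariance matrix (p x p) of already-centered data."""
    n, p = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(p)] for i in range(p)]

def first_principal_component(cov, iterations=200):
    """Power iteration: converges to the unit eigenvector of the largest
    eigenvalue, i.e. the direction of maximum variance."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p  # arbitrary starting direction
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def project(centered, direction):
    """Score of each observation on the component (a linear combination)."""
    return [sum(x * d for x, d in zip(row, direction)) for row in centered]

# Hypothetical toy data: two strongly correlated variables (y is roughly 2x).
data = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2], [5.0, 9.9]]
centered = center(data)
direction = first_principal_component(covariance_matrix(centered))
scores = project(centered, direction)  # one value per observation instead of two
print(direction, scores)
```

Here the two correlated variables are summarised by a single score per observation, and the component direction shows how much each original variable contributes to it.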
Cluster analysis is a technique whose objective is to sort observations into homogeneous groups according to how similar they are to each other. It looks for existing patterns in the data without prior knowledge of their structure, which is why it is called unsupervised classification. The aim is for the individuals in each group to be as similar as possible to each other while the groups themselves are as different as possible from one another. To do this, it is necessary to define a criterion for measuring similarity, i.e., a distance or similarity measure between observations.
Partitive clustering
This type of clustering is used to group individuals, not variables. It is common to run several analyses with different numbers of groups in order to identify the most appropriate number. Intuitively, a good classification is one in which the dispersion within each group is as small as possible. This condition is called the variance criterion: for a given number of groups, it leads to selecting the configuration in which the sum of the variances within each group, i.e. the residual variance, is minimal.
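The text does not name a specific partitive algorithm. As one common example that follows the variance criterion, the sketch below uses k-means (Lloyd's algorithm) and reports the within-group sum of squares, which plays the role of the residual variance, for several numbers of groups. The toy points, the function names and the naive initialisation (the first k points as starting centers) are assumptions made for illustration only.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, max_iter=100):
    """Lloyd's algorithm with a naive initialisation (first k points)."""
    centers = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable: converged
        labels = new_labels
        # Move every center to the mean of its group.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty group keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

def within_group_sum_of_squares(points, labels, centers):
    """Total squared distance of every point to the center of its group."""
    return sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))

# Hypothetical toy data: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
for k in (1, 2, 3):
    labels, centers = k_means(points, k)
    print(k, round(within_group_sum_of_squares(points, labels, centers), 2))
```

On this toy data the criterion drops sharply from one group to two and gains nothing from a third, which is the kind of pattern used to pick the number of groups.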
Validation of clustering
Before applying clustering to the data, it is advisable to assess whether there are indications that there really is some kind of grouping in the data. This process is known as assessing clustering tendency and can be carried out by means of statistical tests or visually.
Silhouette
The silhouette coefficient quantifies how well an observation has been assigned by comparing its similarity to the rest of the observations in the same cluster versus those in the other clusters.
To obtain this coefficient, first calculate the mean distance between the observation and the rest of the observations in the same cluster; the lower this mean, the greater the similarity. Then calculate the mean distance between the observation and the observations of each of the other clusters, and keep the lowest of these values, which corresponds to the neighbouring cluster. The coefficient is the difference between the second value and the first, divided by the larger of the two.
The silhouette coefficient lies between -1 and 1. Values close to 1 indicate that the observation is well matched to its cluster. A value close to zero means that the observation lies at an intermediate point between two clusters. Negative values point to a possibly incorrect assignment of the observation.
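A minimal sketch of this calculation for a single observation is shown below, using the usual definition s = (b - a) / max(a, b), where a is the mean distance to the observation's own cluster and b is the mean distance to the neighbouring cluster. The toy points, the labels and the function name are hypothetical, Euclidean distance is assumed, and at least two clusters are required.

```python
import math

def silhouette(points, labels, i):
    """Silhouette coefficient of observation i (needs at least two clusters)."""
    def mean_distance_to(cluster):
        d = [math.dist(points[i], points[j])
             for j in range(len(points)) if labels[j] == cluster and j != i]
        return sum(d) / len(d) if d else None

    a = mean_distance_to(labels[i])  # cohesion: mean distance within own cluster
    if a is None:
        return 0.0  # common convention for a cluster with a single observation
    # Separation: mean distance to the nearest other (neighbouring) cluster.
    b = min(mean_distance_to(c) for c in set(labels) if c != labels[i])
    return (b - a) / max(a, b)

# Hypothetical toy data: two compact groups on a line.
points = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1), (11, 0)]
good = [0, 0, 0, 1, 1, 1]  # every point is in the group it sits next to
bad = [0, 0, 1, 1, 1, 1]   # the third point is assigned to the far group

print(silhouette(points, good, 2))  # close to 1: well assigned
print(silhouette(points, bad, 2))   # negative: probably misassigned
```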
Density-based Spatial Clustering of Applications with Noise
Density-based spatial clustering of applications with noise (DBSCAN) is a clustering algorithm that mimics the way human beings identify clusters by eye: as dense regions of points. It uses a topological approach to cluster data. It has two main parameters: the radius of the ball centered at a given point, which defines the neighborhood of that point, and the minimum number of neighbors that must be found within that ball for the points to be grouped into the same cluster.
This algorithm tries to avoid some of the limitations of classical methods by following two ideas: for an observation to be part of a cluster, there must be a minimum number of neighboring observations within a given proximity radius, and clusters are separated by regions that are empty or contain few observations.
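The two parameters can be seen at work in the following compact sketch of the algorithm. The names `eps` and `min_pts`, the toy points and the choice to count a point as its own neighbor are assumptions of this example; it is meant to show the logic and is not an optimised or reference implementation.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within distance eps of point i (itself included)."""
    return [j for j in range(len(points))
            if math.dist(points[i], points[j]) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already visited
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # previously noise: becomes a border point
                continue
            if labels[j] is not None:
                continue  # already belongs to a cluster
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep growing
    return labels

# Hypothetical toy data: two dense groups and one isolated observation.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
print(dbscan(points, eps=1.5, min_pts=3))
```

Unlike the partitive approach, the number of clusters is not fixed in advance, and the isolated observation is reported as noise instead of being forced into a group.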
We developed models based on these techniques in order to gain insight into our project. The modelling helped us identify the best way to develop the final solutions for our product.
A large part of our models was based on unsupervised techniques, since we were looking for new information in the surveys without any response variable. The conclusions drawn from the models are presented in the results section.