Dario Greco

One of the main goals of the NANOSOLUTIONS consortium is to create an ENM classifier – a model that can predict the effect of any given nanomaterial. This is being done through what is known as machine learning; we are essentially training a computer programme to recognise what is dangerous and what is not. The more data we input into the computer, the better the programme becomes at recognising underlying patterns; it can then extrapolate these patterns into general rules that allow it to predict how unknown nanomaterials will behave.
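As an illustration of the idea (and only an illustration – the feature set and the learning method below are placeholders, not those used in the project), a classifier can be trained on materials that have already been characterised and then asked about a new one:

```python
# Minimal sketch of the classifier idea, not the consortium's actual model.
# Feature names, the random-forest learner and the toy data are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Rows = nanomaterials already tested, columns = measured features
# (e.g. particle size, surface charge, gene-expression responses).
X_known = rng.normal(size=(40, 200))      # ~40 characterised materials
y_known = rng.integers(0, 2, size=40)     # 1 = hazardous, 0 = safe (toy labels)

model = RandomForestClassifier(n_estimators=500, random_state=0)
model.fit(X_known, y_known)

# A new, uncharacterised nanomaterial described by the same features:
x_new = rng.normal(size=(1, 200))
print("Predicted hazard probability:", model.predict_proba(x_new)[0, 1])
```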

A number of the work packages in the NANOSOLUTIONS consortium are dedicated to gathering data on the biological effects of around 40 nanomaterials when they come into contact with living cells and organisms. This data is then collated in work package 11 and fed into the computer, which uses the information so that, when shown a new, unknown material, it can make an accurate prediction as to whether or not it is safe.

The real challenge lies in identifying the features of the animals, cells and nanomaterials that are useful in predicting the behaviour of the nanomaterials. There are hundreds of thousands, even millions, of chemical and physical features in living organisms that could potentially be useful to us, but we also know that the vast majority of them will not be. Our challenge is to mine these useful pieces of information out of a huge mass of data. In technical terms, this is a feature selection problem; we need to select the smallest number of features that can give us the most powerful information for accurate classification.

The problem lies in the fact that, with such a huge amount of data, it is currently impossible to explore all the possible combinations. With just 10,000 features, a computer that could evaluate each combination in one second would need around 10⁷⁹ years to go through them all. No computer powerful enough exists yet to carry out the task in a feasible time, and so a heuristic approach is needed to explore the vast solution space.
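A back-of-the-envelope count makes the scale of the problem clear (the figures below simply enumerate feature subsets and are illustrative, not the project's exact estimate):

```python
# Counting every possible subset of 10,000 candidate features.
from math import comb

n_features = 10_000
total_subsets = 2 ** n_features                       # every possible combination
print(f"Total subsets: ~10^{len(str(total_subsets)) - 1}")

# Even restricting the search to small panels, the numbers explode:
for k in (5, 10, 20):
    print(f"Subsets of exactly {k} features: {comb(n_features, k):.3e}")
```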

A genetic algorithm is a special type of solution-space search algorithm that mimics natural selection, the key mechanism behind Charles Darwin’s theory of evolution. In the context of NANOSOLUTIONS, this involves grouping features into, for example, 1,000 random subsets, each of which represents a candidate solution. The predictive power of each of these solutions is then calculated, after which the least effective solutions are discarded. The more effective solutions are then “mated” with each other to produce a new population of daughter solutions (much like the genetic recombination that occurs in natural populations), this time drawing from a narrower and more powerful pool of features. The process is repeated over many generations until only the most relevant features remain.
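To make the mechanics concrete, the following is a small, self-contained sketch of a genetic algorithm selecting features on toy data. The population size, mutation rate and the cross-validated classifier used as the fitness score are illustrative choices, not the settings used in NANOSOLUTIONS.

```python
# Toy genetic algorithm for feature selection; all settings are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 300))            # toy data: 60 samples, 300 candidate features
informative = [3, 42, 117]                # only a few features actually carry signal
y = (X[:, informative].sum(axis=1) > 0).astype(int)

def fitness(mask):
    """Predictive power of one candidate solution (a subset of features)."""
    if mask.sum() == 0:
        return 0.0
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X[:, mask], y, cv=3).mean()

# Initial population: random feature subsets encoded as boolean masks.
pop_size, n_features = 40, X.shape[1]
population = rng.random((pop_size, n_features)) < 0.05

for generation in range(15):
    scores = np.array([fitness(ind) for ind in population])
    order = np.argsort(scores)[::-1]
    survivors = population[order[: pop_size // 2]]      # discard the weakest half

    children = []
    while len(children) < pop_size - len(survivors):
        a, b = survivors[rng.integers(len(survivors), size=2)]
        cut = rng.integers(1, n_features)               # single-point crossover ("mating")
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(n_features) < 0.01            # small random mutations
        children.append(child ^ flip)

    population = np.vstack([survivors, children])

best = population[np.argmax([fitness(ind) for ind in population])]
print("Selected features:", np.flatnonzero(best))
```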

We know that we won’t be able to explore all solutions using our genetic algorithm, but we also know that this is probably the most effective way of getting as close as we can to an ideal solution. In addition to mating the best solutions, each generation will include other events such as substitutions, deletions or insertions of features, parallel evolving populations of smaller individual solutions mimicking viral infections, as well as local search operators that optimise the best solution of each generation. Mimicking the evolution of natural populations in this way allows a very large number of solutions to be evaluated, helping to explore the data as thoroughly as possible.
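These extra operators can likewise be sketched as small functions acting on a candidate solution (here a boolean feature mask, as in the sketch above). Again, these are illustrative rather than the consortium's implementation:

```python
# Illustrative mutation and local-search operators on a boolean feature mask.
import numpy as np

rng = np.random.default_rng(2)

def substitute(mask, rng):
    """Swap one selected feature for one that is currently unselected."""
    new = mask.copy()
    on, off = np.flatnonzero(new), np.flatnonzero(~new)
    if len(on) and len(off):
        new[rng.choice(on)] = False
        new[rng.choice(off)] = True
    return new

def delete(mask, rng):
    """Drop one selected feature."""
    new = mask.copy()
    on = np.flatnonzero(new)
    if len(on):
        new[rng.choice(on)] = False
    return new

def insert(mask, rng):
    """Add one currently unselected feature."""
    new = mask.copy()
    off = np.flatnonzero(~new)
    if len(off):
        new[rng.choice(off)] = True
    return new

def local_search(mask, fitness, rng, tries=10):
    """Greedy hill-climbing on the best solution of each generation."""
    best, best_score = mask, fitness(mask)
    for _ in range(tries):
        candidate = substitute(best, rng)
        score = fitness(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best

# Example: mutate a mask that currently selects three features.
mask = np.zeros(300, dtype=bool)
mask[[3, 42, 117]] = True
print("After insertion:", np.flatnonzero(insert(mask, rng)))
```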

The algorithm will need only a couple of weeks’ computing time on a high-end desktop computer to produce an accurate prediction tool. However, the way the algorithm is designed will allow it to continue to evolve and become more accurate as new data becomes available. From a computational point of view, we would like the ENM Safety Classifier also to be a software package that can be used by future projects and data generators. They will be able to input new data so that the Classifier can continue to evolve. That is the beauty of using a dynamic system – it can always carry on learning when presented with new data.
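The “carry on learning” behaviour can be illustrated with an incremental learner that is updated whenever newly characterised materials arrive; the learner and data below are purely an example, not the Classifier's actual software design.

```python
# Sketch of incremental updating: fold new data in as it becomes available.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(3)
model = SGDClassifier(random_state=0)

# Initial training on the materials characterised so far.
X0, y0 = rng.normal(size=(40, 50)), rng.integers(0, 2, size=40)
model.partial_fit(X0, y0, classes=[0, 1])

# Later, a future project contributes freshly characterised materials:
X_new, y_new = rng.normal(size=(10, 50)), rng.integers(0, 2, size=10)
model.partial_fit(X_new, y_new)          # the classifier keeps evolving
```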