Imagine teaching a person a new magic trick. The formula is simple enough: you show them how it's done, let them have a go, and correct them when they make a mistake. Practice makes perfect. Now imagine that, instead of being able to show them the trick, you could only show them low-quality photos of it being performed. These photos come with no description of how the trick is done and barely show enough detail to see the trick in action.
This is the problem that is plaguing data science. We all know that data science is all the rage - the hottest job on the block. People train in complex, mathematical disciplines to apply cutting-edge algorithms and glean new insights from data. The problem is that before any learning can take place, the low-quality input has to be improved. This is an immense frustration to business stakeholders, data scientists and engineers, who spend much of their expert time on monotonous data cleansing work.
The new open-source project Salute attempts to address some of this challenge and give data scientists some of their time back to focus on what's really important.
Salute's goal is to take any type of file (video, audio, image or text) and to recognise, analyse and transform it so that it is ready for consumption by Machine Learning routines. This involves recognising delimiters, data types, data distributions, categorical variables and much more. The Salute framework is built so that new derivations from the data can be added easily, allowing experts to contribute in their own area of expertise.
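To make "recognising delimiters" a little more concrete, here is a minimal sketch of one common heuristic: pick the candidate character that appears the same non-zero number of times on every sample line. This is not Salute's actual code. The class name `DelimiterSniffer`, the method `detect` and the list of candidate characters are all hypothetical, chosen only for illustration, and the sketch ignores quoted fields.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical helper for illustration only; not part of Salute's API.
class DelimiterSniffer {
    // Assumed set of candidate delimiters; a real tool would likely test more.
    private static final char[] CANDIDATES = {',', '\t', ';', '|'};

    /**
     * Returns the candidate that occurs the same non-zero number of times on
     * every non-blank sample line, preferring the one with the most occurrences.
     * Returns an empty Optional when no candidate is consistent.
     */
    static Optional<Character> detect(List<String> sampleLines) {
        Character best = null;
        int bestCount = 0;
        for (char candidate : CANDIDATES) {
            int expected = -1;          // count seen on the first non-blank line
            boolean consistent = true;
            for (String line : sampleLines) {
                if (line.trim().isEmpty()) {
                    continue;           // skip blank lines
                }
                int count = countOccurrences(line, candidate);
                if (expected == -1) {
                    expected = count;
                } else if (count != expected) {
                    consistent = false; // column count differs between lines
                    break;
                }
            }
            if (consistent && expected > bestCount) {
                best = candidate;
                bestCount = expected;
            }
        }
        return Optional.ofNullable(best);
    }

    // Counts how many times target appears in line.
    private static int countOccurrences(String line, char target) {
        int count = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == target) {
                count++;
            }
        }
        return count;
    }
}
```

Calling `DelimiterSniffer.detect` on the first few lines of a file gives a guess that later steps, such as data type inference, could build on. A fuller version would need to handle quoted fields and header rows.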
To achieve all this on even the largest files, Salute is based on Apache Spark, so it can integrate with existing clusters and use their processing power. It is written in Java, which keeps contributing straightforward and open to many coders.
Salute is a new project, and much of it is still immature. If you feel you can contribute, please do here.