Data Analytics and Machine Learning
Competences
- Python
- NumPy, Pandas
- Matplotlib, Seaborn
- scikit-learn
- TensorFlow, Keras, Theano
- Dash, Plotly
- MongoDB
- Amazon Web Services, Linux/Ubuntu
Approach
Data Analytics is the process of examining data in order to draw conclusions about the information it contains. The insight provided by this process is in many cases crucial to making informed, important decisions supported by the data available.
Machine Learning is a class of algorithms that enable machines to learn from data and make predictions or decisions based on such data in an autonomous way.
The driving factor in the fast development of this technology is the very large amount of data available. Thanks to the steadily decreasing cost of data storage and the increasing number of connected devices, every day more and more data is collected from a variety of sources, in relation to the most diverse processes: from people’s activities, to machines’ operation, to environmental, economic or scientific phenomena.
This data can contain very valuable information, but its extraction often requires specific know-how and algorithms and, in the case of very large datasets (big data), also special hardware and software tools.
Python is our preferred programming language for Data Analytics and Machine Learning: it is free and open source, and it is used successfully in thousands of real-world business applications around the world, including many large and mission-critical systems for organizations such as Google, CERN, NASA or Facebook. To date, Python is the fourth most popular programming language, behind Java, C and C++, and its user base keeps growing.
Many libraries are also freely available for the most diverse applications, including Data Analysis, Machine Learning and Deep Learning.
NumPy and Pandas are the main libraries we use for Data Analysis and for preprocessing the data to be fed to Machine Learning algorithms.
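As a minimal sketch of what such preprocessing can look like, the snippet below fills missing values, standardizes the numeric columns and encodes a text label. The table, its column names and the `preprocess` helper are hypothetical and only serve as an illustration; real pipelines depend on the dataset at hand.

```python
import numpy as np
import pandas as pd

# Hypothetical raw measurements; column names and values are made up.
raw = pd.DataFrame({
    "temperature": [21.0, 22.5, np.nan, 24.0],
    "pressure": [1.01, 0.99, 1.02, 1.00],
    "status": ["ok", "ok", "fault", "ok"],
})

def preprocess(df):
    """Impute missing numbers, standardize them and encode the label column."""
    out = df.copy()
    numeric_cols = ["temperature", "pressure"]
    # Fill missing numeric values with the column mean.
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].mean())
    # Standardize to zero mean and unit variance (population std, ddof=0).
    means = out[numeric_cols].mean()
    stds = out[numeric_cols].std(ddof=0)
    out[numeric_cols] = (out[numeric_cols] - means) / stds
    # Encode the text label as 0 (ok) or 1 (fault).
    out["status"] = (out["status"] == "fault").astype(int)
    return out

features = preprocess(raw)
X = features[["temperature", "pressure"]].to_numpy()  # feature matrix
y = features["status"].to_numpy()                     # target vector
```

The resulting `X` and `y` arrays are in the form most Machine Learning libraries expect as input.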
Visualization is done using Matplotlib, Seaborn (especially for Data Analysis) and Plotly (for interactive plots).
Machine Learning classification, regression or clustering algorithms, such as k-Nearest Neighbors, Support Vector Machines, Decision Trees and k-means Clustering, are implemented using the scikit-learn library.
Deep Learning algorithms are implemented using TensorFlow, an open source library developed by Google which has quickly become one of the most popular tools for deep learning applications.
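To illustrate two of the algorithms named above, the following sketch trains a k-Nearest Neighbors classifier and runs k-means clustering on the Iris dataset that ships with scikit-learn. The dataset, the split ratio and the hyperparameters (`n_neighbors=5`, `n_clusters=3`) are assumptions chosen for the example, not tuned recommendations.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Small built-in dataset: 150 flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Supervised learning: classify each test sample by its 5 nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)  # fraction of correct test predictions

# Unsupervised learning: group all samples into 3 clusters, ignoring labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)

print(f"k-NN test accuracy: {accuracy:.2f}")
```

Both estimators follow the same fit and predict pattern, which makes it straightforward to swap in a Support Vector Machine or a Decision Tree for comparison.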
A web user interface is often a powerful tool for humans to interact with data and algorithms.
Such websites can be quickly implemented using the Python libraries Dash and Plotly and then rapidly and cost-effectively deployed on cloud services such as Amazon Web Services.
Applications
Data Analytics is applicable almost everywhere, thanks to the ubiquity of recorded data and the universal benefit of gaining knowledge from the information it contains.
In the form of Exploratory Data Analysis, it is usually also the first step towards designing a Machine Learning algorithm.
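A first exploratory pass often consists of a few summary statistics. The sketch below uses a hypothetical sales table (all names and figures are made up) to show typical Pandas calls for this step.

```python
import pandas as pd

# Hypothetical sales records, made up for illustration.
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south", "north"],
    "units": [10, 15, 12, 30, 11],
    "price": [2.0, 1.8, 2.1, 1.5, 2.0],
})

# Count, mean, spread and quartiles of the numeric columns.
summary = sales[["units", "price"]].describe()
# Number of missing values per column.
missing = sales.isna().sum()
# Average units sold per region.
by_region = sales.groupby("region")["units"].mean()
# Pearson correlation between units sold and price.
correlation = sales["units"].corr(sales["price"])

print(summary)
print(by_region)
```

Findings of this kind, such as a clear correlation between two columns, help decide which features and which type of model are worth trying next.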
Machine Learning is a broad category of algorithms and therefore covers a wide range of applications. Some typical examples include:
Recommender Systems: well-known applications are found in online advertising (online shops, video streaming services) where an algorithm selects items that could interest the customer based on past choices.
Anomaly Detection algorithms: credit card or internet fraud detection, intrusion detection in cyber and network security, where a strictly rule-based algorithm cannot provide the generality and robustness required. Also applicable to preventive maintenance and system or structural health monitoring.
Search Engine Result Refining progressively improves search results by analyzing how a user responds to the search results provided to infer which ones are better matches for each query.
Personal Assistants (e.g. Siri, Cortana, Alexa) also rely heavily on Machine Learning to react to user requests.
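To make the first example in the list above more concrete, here is a minimal sketch of an item-based recommender that uses cosine similarity between items. The interaction matrix and the helper functions `item_similarity` and `recommend` are hypothetical; production systems work on far larger and sparser data and usually rely on more elaborate models.

```python
import numpy as np

# Hypothetical user-item matrix: rows are users, columns are items,
# 1 means the user chose the item in the past. Made up for illustration.
interactions = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
], dtype=float)

def item_similarity(matrix):
    """Cosine similarity between the item columns of the matrix."""
    norms = np.linalg.norm(matrix, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for unused items
    normalized = matrix / norms
    return normalized.T @ normalized

def recommend(user_index, matrix, top_n=1):
    """Rank items the user has not chosen by similarity to past choices."""
    scores = item_similarity(matrix) @ matrix[user_index]
    scores[matrix[user_index] > 0] = -np.inf  # hide items already chosen
    ranked = np.argsort(scores)[::-1]
    return [int(i) for i in ranked[:top_n]]

suggestion = recommend(0, interactions)
print(suggestion)
```

In this toy data, user 0 chose items 0 and 1, and item 2 is the unseen item most often chosen together with them, so it is ranked first.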
As a subfield of Machine Learning, Deep Learning has more specific applications, among which:
Feature Extraction and Image or Signal Classification, usually using Convolutional Neural Networks or Autoencoders. Given the universality of images, this is one of the applications with the highest and widest potential, from face recognition to precision farming and crop health monitoring, from handwriting and speech recognition to autonomous vehicle navigation and obstacle avoidance.
Closely related is Automated Medical Diagnosis, where an algorithm uses medical imaging as well as other clinical information about the patient to automatically formulate a diagnosis.
Robotic systems can learn how to perform complex tasks in the most efficient way (e.g. in the fields of robotic locomotion and dexterity) using Reinforcement Learning.
Resource Management Optimization using Deep Learning algorithms is applicable to a variety of large and complex systems, from computing clusters to a fleet of vehicles or the supply chain of a manufacturing company.
Deep Learning can also be used to tackle problems where the lack of knowledge about the system itself makes a conventional approach infeasible (black box systems).
Deep Learning is also extremely useful in fields where the concepts analyzed are not easily mapped onto variables. Sentiment Analysis or Opinion Mining, Machine Perception, Natural Language Processing and Natural Language Understanding are all part of this class of applications.