Python
NumPy, Pandas, SciPy, plotting, and machine learning topics grouped into one practical track.
NumPy is a Python library.
Matplotlib is a low-level graph plotting library in Python that serves as a visualization utility.
Machine Learning is making the computer learn from studying data and statistics.
Pandas is a Python library.
If you have Python and PIP already installed on a system, then installation of Matplotlib is very easy.
What can we learn from looking at a group of numbers?
Most of the Matplotlib utilities lie under the pyplot submodule, which is usually imported under the plt alias:
Standard deviation is a number that describes how spread out the values are.
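As a small illustration of the idea above, the sketch below computes the mean and standard deviation of a made-up list of car speeds with NumPy. The values and the variable names (speeds, mean, std) are assumptions for the example only.

```python
import numpy as np

# Hypothetical sample: speeds of seven cars (illustrative values only).
speeds = np.array([86, 87, 88, 86, 87, 85, 86])

mean = speeds.mean()
# ndarray.std() returns the population standard deviation by default (ddof=0).
std = speeds.std()

print(f"mean = {mean:.2f}, standard deviation = {std:.2f}")
```

A small standard deviation, as here, means most values lie close to the mean.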
The plot() function is used to draw points (markers) in a diagram.
Percentiles are used in statistics to give you a number below which a given percentage of the values fall.
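A minimal sketch of a percentile calculation, assuming a made-up list of ages. It uses np.percentile with its default (linear interpolation) method; the data and variable names are illustrative.

```python
import numpy as np

# Hypothetical ages of 21 people (illustrative values only).
ages = np.array([5, 31, 43, 48, 50, 41, 7, 11, 15, 39, 80,
                 82, 32, 2, 8, 6, 25, 36, 27, 61, 31])

# The 75th percentile: roughly three quarters of the values are at or below it.
p75 = np.percentile(ages, 75)

print(p75)  # 43.0
```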
You can use the keyword argument marker to emphasize each point with a specified marker:
In the real world, the data sets are much bigger, but it can be difficult to gather real world data, at least at an early stage of a project.
You can use the keyword argument linestyle, or the shorter ls, to change the style of the plotted line:
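The marker and linestyle arguments can be combined in one plot() call. The sketch below assumes four made-up y-values; plot() returns a list of line objects, which is kept here only so the settings can be inspected.

```python
import matplotlib.pyplot as plt
import numpy as np

ypoints = np.array([3, 8, 1, 10])  # illustrative values

# marker="o" draws a circle at each point; linestyle=":" (the same style as
# "dotted") changes the connecting line. ls=":" is the shorter spelling.
lines = plt.plot(ypoints, marker="o", linestyle=":")

# In an interactive session, plt.show() would display the figure.
```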
In probability theory this kind of data distribution is known as the normal data distribution, or the Gaussian data distribution, after the mathematician Carl Friedrich Gauss who came up with the f…
With Pyplot, you can use the xlabel() and ylabel() functions to set a label for the x- and y-axis.
A scatter plot is a diagram where each value in the data set is represented by a dot.
With Pyplot, you can use the grid() function to add grid lines to the plot.
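A short sketch that adds grid lines to a labeled plot, assuming made-up pulse and calorie values. The label strings are illustrative; grid(True) is used rather than a bare grid() so the result does not depend on the current grid setting.

```python
import matplotlib.pyplot as plt
import numpy as np

# Illustrative data only.
x = np.array([80, 85, 90, 95, 100])
y = np.array([240, 250, 260, 270, 280])

plt.figure()
plt.plot(x, y)
plt.xlabel("Average pulse")      # label for the x-axis
plt.ylabel("Calorie burnage")    # label for the y-axis
plt.grid(True)                   # turn grid lines on explicitly

ax = plt.gca()  # current axes, kept so the labels can be inspected
```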
The term regression is used when you try to find the relationship between variables.
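For the simplest case, a straight-line relationship between two variables, a least-squares fit can be sketched with np.polyfit. This is one possible tool, chosen here as an assumption (the lesson itself may use a different function), and the data points are made up.

```python
import numpy as np

# Hypothetical measurements that roughly follow a straight line.
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Degree-1 polynomial fit = least-squares straight line y = slope * x + intercept.
slope, intercept = np.polyfit(x, y, 1)

# Use the fitted relationship to predict y for a new x value.
predicted = slope * 6 + intercept

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, prediction at x=6: {predicted:.2f}")
```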
With the subplot() function you can draw multiple plots in one figure:
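A minimal sketch of two plots side by side. subplot(rows, columns, index) selects a cell in a grid of plots; the data here is made up.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.array([0, 1, 2, 3])
y1 = np.array([3, 8, 1, 10])    # illustrative values
y2 = np.array([10, 20, 30, 40])

plt.figure()

# The figure has 1 row and 2 columns; this is the first plot.
plt.subplot(1, 2, 1)
plt.plot(x, y1)

# Same grid, second plot.
plt.subplot(1, 2, 2)
plt.plot(x, y2)

fig = plt.gcf()  # current figure, kept so its axes can be inspected
```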
If your data points clearly will not fit a linear regression (a straight line through all data points), polynomial regression might be a better choice.
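A sketch of a polynomial fit using np.polyfit and np.poly1d, under the assumption that the data follows a curve rather than a line. The points are made up and lie exactly on a parabola, so the fit recovers its coefficients.

```python
import numpy as np

# Hypothetical data lying on the parabola y = x^2 - 6x + 10.
x = np.arange(0, 7)
y = np.array([10, 5, 2, 1, 2, 5, 10])

# Fit a degree-2 polynomial; coefficients are returned highest power first.
coeffs = np.polyfit(x, y, 2)

# poly1d wraps the coefficients in a callable model.
model = np.poly1d(coeffs)

print(model(8))  # predicted y for x = 8, approximately 26
```

With real, noisy data the degree is a choice: a higher degree follows the points more closely but can overfit.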
With Pyplot, you can use the scatter() function to draw a scatter plot.
Multiple regression is like linear regression, but with more than one independent variable, meaning that we try to predict a value based on two or more variables.
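One way to sketch multiple regression without extra libraries is an ordinary least-squares solve with np.linalg.lstsq. The two predictors (x1, x2) and the target values are hypothetical and constructed so that y = 1 + 2*x1 + 3*x2 exactly.

```python
import numpy as np

# Hypothetical predictors (two independent variables per row).
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]], dtype=float)
# Target built as y = 1 + 2*x1 + 3*x2 (illustrative only).
y = np.array([9, 8, 19, 18, 26], dtype=float)

# Add a column of ones so the model can learn an intercept.
design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution: [intercept, coefficient for x1, coefficient for x2].
coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)

# Predict the target for a new observation (x1=6, x2=2).
prediction = coeffs @ np.array([1.0, 6.0, 2.0])
```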
With Pyplot, you can use the bar() function to draw bar graphs:
When your data has different values, and even different measurement units, it can be difficult to compare them. How do kilograms compare to meters? Or altitude to time?
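One common way to make such columns comparable is standardization, z = (x - mean) / std, which is assumed here as the scaling method. The sketch below applies it column by column to made-up weight and volume values.

```python
import numpy as np

# Hypothetical features in different units: weight (kg) and volume (liters).
data = np.array([
    [790.0, 1.0],
    [1160.0, 1.2],
    [929.0, 1.0],
    [1105.0, 1.5],
    [1365.0, 2.0],
])

# Standardize each column: subtract its mean and divide by its standard deviation.
scaled = (data - data.mean(axis=0)) / data.std(axis=0)

print(scaled.round(2))
```

After scaling, each column has mean 0 and standard deviation 1, so the values can be compared regardless of their original units.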
A histogram is a graph showing frequency distributions.
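To see what a histogram actually counts, the sketch below uses np.histogram to compute the number of values in each interval; these counts are what a plotted histogram would draw as bars. The heights and bin edges are made up.

```python
import numpy as np

# Hypothetical heights in cm (illustrative values only).
heights = np.array([150, 152, 158, 160, 161, 165, 167,
                    170, 171, 172, 178, 180, 185, 190])

# Count how many values fall into each interval; the last bin includes its right edge.
counts, edges = np.histogram(heights, bins=[150, 160, 170, 180, 190])

print(counts)  # [3 4 4 3]
```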
In Machine Learning we create models to predict the outcome of certain events, like in the previous chapter where we predicted the CO2 emission of a car when we knew the weight and engine size.
With Pyplot, you can use the pie() function to draw pie charts:
In the example, a person will try to decide whether or not to go to a comedy show.
A confusion matrix is a table that is used in classification problems to assess where errors in the model were made.
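Such a table, usually called a confusion matrix, can be built by counting each (actual, predicted) pair. The sketch below assumes binary labels and made-up predictions; rows are actual classes and columns are predicted classes.

```python
import numpy as np

# Hypothetical true labels and model predictions (0 or 1).
actual = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
predicted = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

# matrix[a, p] counts the samples with actual class a and predicted class p.
matrix = np.zeros((2, 2), dtype=int)
for a, p in zip(actual, predicted):
    matrix[a, p] += 1

print(matrix)
# [[3 1]
#  [2 4]]
```

The diagonal holds the correct predictions; the off-diagonal cells show which kind of error was made (here 1 false positive and 2 false negatives).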
Hierarchical clustering is an unsupervised learning method for clustering data points. The algorithm builds clusters by measuring the dissimilarities between data. Unsupervised learning means that a…
Logistic regression aims to solve classification problems. It does this by predicting categorical outcomes, unlike linear regression that predicts a continuous outcome.
The majority of machine learning models contain parameters that can be adjusted to vary how the model learns. For example, the logistic regression model, from sklearn, has a parameter C that control…
When your data has categories represented by strings, it will be difficult to use them to train machine learning models, which often accept only numeric data.
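One common way to turn string categories into numbers is one-hot encoding, assumed here as the technique. The sketch uses pandas get_dummies on a made-up table; the column names and values are illustrative.

```python
import pandas as pd

# Hypothetical data with one string-valued category column.
cars = pd.DataFrame({
    "Car": ["Toyota", "BMW", "Toyota", "Volvo"],
    "Weight": [790, 1365, 929, 1415],
})

# Replace the "Car" column with one indicator column per category.
encoded = pd.get_dummies(cars, columns=["Car"])

print(encoded)
```

Each row now has a true/1 value in exactly one of the Car_ columns, which numeric-only models can use directly.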
K-means is an unsupervised learning method for clustering data points. The algorithm iteratively divides data points into K clusters by minimizing the variance in each cluster.
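The iterative procedure can be sketched in a few lines of NumPy. This is a minimal illustration under simple assumptions (random initial centroids taken from the data, a fixed number of iterations), not a production implementation; the function name kmeans and the sample points are hypothetical.

```python
import numpy as np

def kmeans(points, k, n_iter=10, seed=0):
    """Minimal K-means sketch: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        # Assign each point to its nearest centroid.
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Two visibly separate groups of made-up points.
points = np.array([[1, 1], [1.5, 2], [1, 0.5],
                   [8, 8], [9, 9.5], [8.5, 9]], dtype=float)
labels, centroids = kmeans(points, k=2)
print(labels)
```

Moving each centroid to the mean of its members is what reduces the within-cluster variance at every step.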
Methods such as Decision Trees can be prone to overfitting on the training set, which can lead to wrong predictions on new data.
When adjusting models we are aiming to increase overall model performance on unseen data. Hyperparameter tuning can lead to much better performance on test sets. However, optimizing parameters to the…
In classification, there are many different evaluation metrics. The most popular is accuracy, which measures how often the model is correct. This is a great metric because it is easy to understand a…
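Accuracy is the number of correct predictions divided by the total number of predictions. A minimal sketch with made-up labels:

```python
import numpy as np

# Hypothetical true labels and model predictions.
actual = np.array([1, 1, 0, 0, 1, 0, 1, 0])
predicted = np.array([1, 0, 0, 0, 1, 1, 1, 0])

# Fraction of positions where the prediction matches the true label.
accuracy = np.mean(actual == predicted)

print(accuracy)  # 0.75
```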
KNN is a simple, supervised machine learning (ML) algorithm that can be used for classification or regression tasks - and is also frequently used in missing value imputation. It is based on the idea…
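For classification, KNN is commonly implemented as a majority vote among the k training points closest to the query. The sketch below assumes Euclidean distance and made-up 2-D data; the function name knn_predict is hypothetical.

```python
from collections import Counter

import numpy as np

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = np.linalg.norm(train_X - query, axis=1)
    # Indices of the k closest training points.
    nearest = np.argsort(distances)[:k]
    # The most frequent label among those neighbors wins.
    votes = Counter(int(train_y[i]) for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up training data: class 0 near (1, 1), class 1 near (6, 6).
train_X = np.array([[1, 1], [2, 1], [1, 2], [6, 6], [7, 6], [6, 7]], dtype=float)
train_y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(train_X, train_y, np.array([2.0, 2.0])))  # 0
print(knn_predict(train_X, train_y, np.array([6.0, 5.0])))  # 1
```

The choice of k matters: small values follow local noise, while larger values smooth the decision boundary.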