Python continues to lead the way in data science with its ever-growing ecosystem of libraries and frameworks. In a data-centric world, where consumers expect relevant information at every step of their buying journey, companies need data scientists who can extract valuable insights from massive data sets. So it should come as no surprise that Python is used by more than 55% of companies, including some top MNCs, for their data-related work. Python libraries simplify complex jobs and make data integration much easier, with less code and in less time.
In my previous article, I wrote about the Top 10 Programming Languages to Learn in 2019. In this one, I'll focus on libraries and packages that don't ship with Python 3 by default. At the end of the article, I'll also show you how to get (download, install and import) them.
Python libraries and packages for Data Science
These are the five most popular data science libraries:
- Pandas
- Matplotlib
- Numpy
- Scikit-learn (sklearn)
- Theano
Let’s review them one by one.
Pandas
Pandas is the data analysis library of Python. Its core feature is the DataFrame: a two-dimensional, table-like data structure for working with structured data.
Originally, Python didn't have anything like this built in. Weird, isn't it? But that's why Pandas is so important! Some people even call Pandas the "SQL of Python."
With a DataFrame you can store and manage tabular data, manipulating rows and columns. Convenient access methods, such as square-bracket notation, reduce the effort of everyday data analysis tasks. You also get tools for moving data between in-memory data structures and multiple formats, such as CSV, SQL, HDF5 and Excel, with simple read and write calls.
With pandas, you can load your data into DataFrames; select columns; filter for specific values; group by values; run functions (sum, mean, median, min, max, etc.); merge DataFrames; and so on. You can also create multi-dimensional tables using hierarchical (multi-level) indexing.
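To make this concrete, here is a small sketch of those operations (the table, column names and values are invented just for illustration):

import pandas as pd

# Build a small DataFrame from scratch (made-up sales data)
df = pd.DataFrame({
    'product': ['apple', 'banana', 'apple', 'banana'],
    'region': ['north', 'north', 'south', 'south'],
    'units': [10, 25, 7, 18]
})

print(df['units'])                           # select a column
print(df[df['product'] == 'apple'])          # filter for specific values
print(df.groupby('product')['units'].sum())  # group by values and run a function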
That's a common misunderstanding, so let me clarify: Pandas is not a predictive analytics or machine learning library. It was created for data analysis, data cleaning, data handling and data discovery. These happen to be the necessary steps before you run machine learning projects, which is why you will need pandas for every scientific project, too.
Matplotlib
I hope I don’t have to detail why data visualization is important. Data visualization helps you to better understand your data, discover things that you wouldn’t discover in raw format and communicate your findings more efficiently to others.
Matplotlib is a Python 2D plotting library, capable of producing publication-quality figures in a wide variety of hardcopy formats and interactive environments across platforms. It can be used in Python scripts, the Python and IPython shells, Jupyter notebooks, web application servers, and several graphical user interface toolkits.
Matplotlib is easy to get started with. Here's a minimal example that plots the cubes of 0 through 10:
import matplotlib.pyplot as plt

# Compute the cubes of 0..10
y = [x**3 for x in range(11)]

plt.plot(y)  # line plot; x values default to 0, 1, 2, ...
plt.show()
Output: a line chart of the cubic values, curving upward from 0 to 1000.
This 2D plotting library is very popular among data scientists for producing a wide variety of figures in multiple formats, compatible across platforms. You can use it in Python scripts, IPython shells, Jupyter notebooks and web application servers. With Matplotlib, you can make histograms, line plots, bar charts, scatter plots and more.
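As a quick illustration of two of those chart types (the data below is randomly generated just for the example):

import matplotlib.pyplot as plt
import numpy as np

# Made-up data: 50 random points
x = np.random.rand(50)
y = np.random.rand(50)

plt.scatter(x, y)     # scatter plot
plt.show()

plt.hist(x, bins=10)  # histogram of the x values
plt.show()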
Numpy
Numpy will help you manage multi-dimensional arrays very efficiently. Maybe you won't do that directly, but since the concept is a crucial part of data science, many other libraries (well, almost all of them) are built on NumPy. Simply put: without NumPy you wouldn't be able to use Pandas, Matplotlib, SciPy or Scikit-learn. That's why you need it first.
>>> import numpy as np
>>> i = np.arange(16).reshape(4, 2, 2)
>>> i
array([[[ 0,  1],
        [ 2,  3]],

       [[ 4,  5],
        [ 6,  7]],

       [[ 8,  9],
        [10, 11]],

       [[12, 13],
        [14, 15]]])
NumPy is the first choice among developers and data scientists working with data-oriented technologies. It is a Python package for performing scientific computations, released under the BSD license.
Through NumPy, you get n-dimensional array objects, tools for integrating C, C++ and Fortran code, and functions for complex mathematical operations such as Fourier transforms, linear algebra and random number generation.
One can also use NumPy as an efficient multi-dimensional container for generic data, which lets it integrate smoothly with a wide variety of databases and data processing operations.
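Here is a small sketch of those capabilities (the arrays below are made up for illustration):

import numpy as np

# Linear algebra: solve the system Ax = b
A = np.array([[3, 1], [1, 2]])
b = np.array([9, 8])
print(np.linalg.solve(A, b))  # [2. 3.]

# Fourier transform of a simple sine signal
signal = np.sin(np.linspace(0, 2 * np.pi, 8))
print(np.fft.fft(signal))

# Random number generation
print(np.random.normal(loc=0, scale=1, size=3))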
Scikit-learn
Machine learning and data analysis are among the fanciest things you can do with Python, and the Scikit-learn library immensely upgrades your arsenal with different algorithms and functions. Scikit-learn covers basically everything you might need in the first few years of your data science career: regression methods, classification methods and clustering, as well as model validation and model selection. You can also use it for dimensionality reduction and feature extraction.
Let’s see a basic example of sklearn by implementing Simple Linear Regression on a data set.
# Simple Linear Regression

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Salary_Data.csv')
X = dataset.iloc[:, :-1].values  # every column except the last (the feature)
y = dataset.iloc[:, 1].values    # the second column (the target)

# Splitting the dataset into the Training set and Test set
# (sklearn.cross_validation has been removed; use sklearn.model_selection instead)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

# Feature Scaling (not needed for simple linear regression, so it stays commented out)
"""from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train.reshape(-1, 1))"""

# Fitting Simple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predicting the Test set results
y_pred = regressor.predict(X_test)
As you can see, the sklearn library is built on top of NumPy, SciPy and Matplotlib. It is used for classification, regression and clustering tasks such as spam detection, image recognition, drug response modelling, stock pricing and customer segmentation.
Theano
Do you have multiple gigabytes of data that you want processed quickly? Then Theano should be your first choice. Thanks to its GPU-based infrastructure, it can run operations much faster than on a CPU alone, and it is built for speed and stability optimizations that deliver the results you expect.
You can use Theano for distributed and parallel computing tasks. With it, you can define, optimize and evaluate mathematical expressions involving multi-dimensional arrays. It is tightly coupled with NumPy and works with the numpy.ndarray type internally.
Be aware, though, that Theano has a steep learning curve for most Python users, as its framework of declaring variables and building functions differs greatly from ordinary Python code.
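Here is a minimal sketch of that workflow, based on the classic add-two-scalars example from the Theano tutorials: you declare typed symbolic variables, build an expression from them, and compile the expression into a callable function.

import theano
import theano.tensor as T

# Declare two typed symbolic scalar variables
x = T.dscalar('x')
y = T.dscalar('y')

# Build a symbolic expression...
z = x + y

# ...and compile it into a callable Python function
f = theano.function([x, y], z)

print(f(2, 3))  # 5.0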
Though I have tried to cover the big, popular libraries, this list may still miss other great and useful ones that deserve a look. So share your favourites in the comment section below, along with any thoughts about the packages I mentioned.
In the next article, I will write a basic tutorial about Theano. Subscribe to our newsletter to stay up to date with the latest content.
Thank you for your attention!