Data Processing in Python

Generally speaking, data processing consists of gathering and manipulating data elements to produce useful, potentially valuable information. Data comes in different formats, and each format is processed differently. Among the best-known formats are XML, CSV, JSON, and HTML.

Python can handle all of these formats, and it is often considered better suited to data processing than many other languages thanks to its simple, clean syntax and its scalability, which let you solve complex problems in multiple ways. All you need are some libraries or modules for working with those formats, for example, Pandas.

Why is Data processing essential?

Data processing is a vital part of data science. Having inaccurate and bad-quality data can be damaging to processes and analysis. Good clean data will boost productivity and provide great quality information for your decision-making.

What is Pandas?

When people hear “Pandas,” most associate the name with the black-and-white bear from Asia. In the tech world, however, it is a well-known open-source Python library built on top of NumPy. It is used for data analysis, processing, and manipulation, and offers data structures and operations for working with numerical tables and time series.

With this said, Pandas is a powerful and essential tool for anyone interested in the Machine Learning field.

Processing CSV Data

Most data scientists rely on CSV files (CSV stands for “Comma Separated Values”) in their day-to-day work. The format stores tabular data as plain text, which makes it simple to read and understand.

CSV files are easy to create. We can use Notepad or another text editor to make one by typing the values of each record on its own line, separated by commas.

Then save the file with the .csv extension (example.csv), choosing the All Files (*.*) option in the save dialog. Now you have a CSV data file.

In the Python environment, you will use the Pandas library to work with this file. The most basic function is reading the CSV data.
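As a minimal sketch of the steps above, the snippet below writes a small CSV file and reads it back with pandas. The file contents and column names are made-up placeholders, and a temporary directory stands in for wherever you saved example.csv.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up file contents: a header row followed by two records.
csv_text = "name,score\nAlice,90\nBob,85\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "example.csv"
    csv_path.write_text(csv_text)

    # read_csv parses the text file into a DataFrame.
    example = pd.read_csv(csv_path)

print(example)
```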

Processing Data using Pandas

We will use a simple dataset for this tutorial: the highest-grossing movies dataset. You can download this and other datasets from kaggle.com.

To start working with pandas, we import the library into our Jupyter notebook.

Pandas is one of the more notable libraries in the data science workflow because it gives you the means to process and wrangle data. This matters because data pre-processing is often said to occupy as much as 80% of a data scientist’s time.

Import dataset

The next step is to import the dataset. For this we use read_csv(), a pandas function. Since the dataset is in a tabular format, pandas loads it into a DataFrame, which we call data. A DataFrame is a two-dimensional, mutable data structure made up of rows and columns, like an Excel sheet.
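The import step might look like the sketch below. Because the downloaded file is not available here, a few placeholder rows are read from an in-memory string; the movie titles and numbers are invented, and the "DECADE" column is a hypothetical addition that only illustrates mutability.

```python
import io

import pandas as pd

# Placeholder rows standing in for the downloaded file (values are invented).
csv_text = """YEAR,MOVIE,TOTAL IN 2019 DOLLARS
2017,Movie A,500
2018,Movie B,700
2019,Movie C,600
"""

# With the real file you would pass its path to read_csv instead of a StringIO.
data = pd.read_csv(io.StringIO(csv_text))

print(type(data).__name__)  # the class name of the loaded object
print(data.shape)           # (number of rows, number of columns)

# DataFrames are mutable: a new column can be added after loading.
data["DECADE"] = (data["YEAR"] // 10) * 10
```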

This dataset contains data on the highest-grossing movies of each year. When working with datasets it is important to consider: where did the data come from? Some will be machine-generated data. Some of them will be data that’s been collected via surveys. Some could be data that are recorded from human observations. Some may be data that’s been scraped from websites or pulled via APIs. Don’t jump right into the analysis; take the time to first understand the data you are working with.

Exploring the data

head() is a DataFrame method that displays the first rows of the dataset; by default, it displays the first five. We can request a different number of rows by passing it inside the parentheses.
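A small sketch of head(), using an invented ten-row DataFrame in place of the movies dataset:

```python
import pandas as pd

# Placeholder data: ten rows with made-up titles.
data = pd.DataFrame({
    "YEAR": list(range(2010, 2020)),
    "MOVIE": [f"Movie {i}" for i in range(10)],
})

first_five = data.head()    # no argument: the first 5 rows
first_three = data.head(3)  # explicit argument: the first 3 rows

print(first_three)
```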

Here we also get to see what data is in the dataset we are working with. As we can see there are not a lot of columns which makes the data easier to work with and explore.

We can also see how the last five rows look using the tail() function.
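tail() mirrors head(). The sketch below again uses an invented ten-row DataFrame:

```python
import pandas as pd

# Placeholder data: ten rows with made-up titles.
data = pd.DataFrame({
    "YEAR": list(range(2010, 2020)),
    "MOVIE": [f"Movie {i}" for i in range(10)],
})

last_five = data.tail()   # no argument: the last 5 rows
last_two = data.tail(2)   # explicit argument: the last 2 rows

print(last_two)
```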

The memory_usage() method returns a pandas Series containing the memory usage (in bytes) of each column of a DataFrame. Knowing the memory usage of a DataFrame helps when tackling errors such as MemoryError in Python.

In datasets, information is presented in tabular form, so the data is organized in rows and columns. Each column has a name, a data type, and other properties, and knowing how to manipulate the data in the columns is quite useful. We can continue by checking which columns we have.
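The following sketch shows memory_usage() on a small invented DataFrame. The deep=True option is not mentioned above; it is included here because it also counts the contents of string columns, which usually gives a more realistic total.

```python
import pandas as pd

# Placeholder data with one numeric and one text column.
data = pd.DataFrame({
    "YEAR": [2017, 2018, 2019],
    "MOVIE": ["Movie A", "Movie B", "Movie C"],
})

usage = data.memory_usage()                # bytes per column, plus an "Index" entry
deep_usage = data.memory_usage(deep=True)  # also inspects the stored strings
total_bytes = deep_usage.sum()

print(usage)
print(total_bytes)
```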
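One way to check the columns and their data types is sketched below, with an invented DataFrame that reuses the column names from this tutorial:

```python
import pandas as pd

# Placeholder data; the values are invented.
data = pd.DataFrame({
    "YEAR": [2017, 2018, 2019],
    "MOVIE": ["Movie A", "Movie B", "Movie C"],
    "TOTAL IN 2019 DOLLARS": [500, 700, 600],
})

column_names = data.columns.tolist()  # the column labels as a plain list
column_types = data.dtypes            # a Series mapping each column to its dtype

print(column_names)
print(column_types)
```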

Keep in mind, because this is a simple dataset there are not a lot of columns.

loc[] can be used to access specific rows and columns as required. It is an indexer rather than a function, and it selects by label; with the default index, the row labels happen to be the integers 0, 1, 2, and so on. To select purely by row and column position, for instance the first 2 columns and the last 3 rows, use the companion indexer iloc[].

For example, data.loc[0:4, ["YEAR", "MOVIE", "TOTAL IN 2019 DOLLARS"]] returns the “YEAR”, “MOVIE”, and “TOTAL IN 2019 DOLLARS” columns for the first 5 movies. Keep in mind that the index starts from 0 and that loc[] is inclusive of both endpoints, so 0:4 means labels 0 to 4, both included.
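The difference between label-based and position-based selection can be sketched as follows. The rows are invented, and the "DISTRIBUTOR" column is a hypothetical extra column added so that selecting a subset of columns is visible.

```python
import pandas as pd

# Placeholder data: seven rows with made-up values.
data = pd.DataFrame({
    "YEAR": list(range(2013, 2020)),
    "MOVIE": [f"Movie {i}" for i in range(7)],
    "DISTRIBUTOR": ["Studio X"] * 7,
    "TOTAL IN 2019 DOLLARS": [500, 700, 600, 650, 800, 750, 900],
})

# loc is label-based and includes both endpoints: labels 0, 1, 2, 3 and 4.
subset = data.loc[0:4, ["YEAR", "MOVIE", "TOTAL IN 2019 DOLLARS"]]

# iloc is position-based and excludes the end of a slice:
# the last 3 rows and the first 2 columns.
tail_block = data.iloc[-3:, :2]

print(subset)
print(tail_block)
```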

sort_values() is used to sort values in a column in ascending or descending order.

The inplace parameter defaults to False, so sort_values() returns a new, sorted DataFrame; setting it to True modifies the original DataFrame instead.

You can look at basic statistics for your data using the describe() method, which helps you better understand your data.
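Here is a short sketch of both sorting styles on an invented three-row DataFrame:

```python
import pandas as pd

# Placeholder data; the values are invented.
data = pd.DataFrame({
    "MOVIE": ["Movie A", "Movie B", "Movie C"],
    "TOTAL IN 2019 DOLLARS": [500, 700, 600],
})

# Default (inplace=False): a new sorted DataFrame is returned, data is unchanged.
sorted_desc = data.sort_values("TOTAL IN 2019 DOLLARS", ascending=False)
order_before = data["MOVIE"].tolist()

# inplace=True: data itself is reordered and the call returns None.
data.sort_values("TOTAL IN 2019 DOLLARS", inplace=True)

print(sorted_desc)
print(data)
```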
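As an illustration, describe() on a small invented DataFrame returns one row per statistic (count, mean, std, min, quartiles, max) and one column per numeric column:

```python
import pandas as pd

# Placeholder data; the values are invented.
data = pd.DataFrame({
    "YEAR": [2017, 2018, 2019],
    "TOTAL IN 2019 DOLLARS": [500, 700, 600],
})

summary = data.describe()
mean_total = summary.loc["mean", "TOTAL IN 2019 DOLLARS"]

print(summary)
```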

value_counts() returns a pandas Series containing the number of occurrences of each unique value in a Series. It can be applied to any column of a DataFrame.

The result of value_counts() can also be plotted as a bar graph, which is useful for categorical and ordinal data.
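A minimal sketch of value_counts() follows. The "DISTRIBUTOR" column and its values are hypothetical; the plotting call is left as a comment because it needs matplotlib installed.

```python
import pandas as pd

# Placeholder categorical column; the studio names are invented.
data = pd.DataFrame({
    "DISTRIBUTOR": ["Studio X", "Studio Y", "Studio X", "Studio Z", "Studio X"],
})

# Counts per unique value, sorted from most to least frequent.
counts = data["DISTRIBUTOR"].value_counts()
print(counts)

# With matplotlib available, the counts can be drawn as a bar chart:
# counts.plot(kind="bar")
```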

Finding and Rebuilding Missing Data

Pandas has functions for finding null values in your data, if there are any. There are several ways to find missing values, and we will look at each of them.

isnull() function: This function returns a boolean value for every cell in the dataset, showing whether that cell is null.

isna() function: This is the same as the isnull() function.

isna().any() function: This also reports whether null values are present, but it gives one boolean per column rather than one per cell.

isna().sum() function: This gives the number of null values present in each column of the dataset.

isna().any().sum() function: This gives a single number, the count of columns that contain at least one null value, so a result of 0 means there are no nulls at all. In this case there is no null value.
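To make the differences concrete, the sketch below runs each check on an invented DataFrame that, unlike the movies dataset, deliberately contains missing values:

```python
import numpy as np
import pandas as pd

# Placeholder data with missing values introduced on purpose.
data = pd.DataFrame({
    "YEAR": [2017, 2018, 2019],
    "MOVIE": ["Movie A", None, "Movie C"],
    "TOTAL IN 2019 DOLLARS": [500.0, np.nan, np.nan],
})

cell_mask = data.isna()                     # one boolean per cell (same as isnull())
has_null_by_column = data.isna().any()      # one boolean per column
null_count_by_column = data.isna().sum()    # number of nulls per column
columns_with_nulls = data.isna().any().sum()  # how many columns contain a null

print(null_count_by_column)
print(columns_with_nulls)
```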

When null values are present in the dataset, the fillna() function replaces the missing (NA/NaN) values with a value you specify, such as 0.
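A hedged sketch of fillna() on invented data is shown below. Filling with the column mean is only one possible choice, added here for illustration; whether 0, a mean, or something else is appropriate depends on the dataset.

```python
import numpy as np
import pandas as pd

# Placeholder data with one missing value introduced on purpose.
data = pd.DataFrame({
    "YEAR": [2017, 2018, 2019],
    "TOTAL IN 2019 DOLLARS": [500.0, np.nan, 700.0],
})

# Replace every missing value with 0; a new DataFrame is returned.
filled_zero = data.fillna(0)

# Replace missing values per column, here with that column's mean.
column = "TOTAL IN 2019 DOLLARS"
filled_mean = data.fillna({column: data[column].mean()})

print(filled_zero)
print(filled_mean)
```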

De-Duplicate

De-duplication means removing duplicate rows. When analyzing data, duplicate values affect the accuracy and efficiency of the results. To find duplicate rows, use the duplicated() function.

This dataset does not contain any duplicate values, but if a dataset does, they can be removed using the drop_duplicates() function.
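As a quick illustration, duplicated() on an invented DataFrame with one repeated row marks only the repeat, not the first occurrence:

```python
import pandas as pd

# Placeholder data: the third row repeats the second on purpose.
data = pd.DataFrame({
    "YEAR": [2017, 2018, 2018],
    "MOVIE": ["Movie A", "Movie B", "Movie B"],
})

duplicate_mask = data.duplicated()      # True where a row repeats an earlier row
duplicate_count = duplicate_mask.sum()  # number of repeated rows

print(duplicate_mask)
print(duplicate_count)
```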

Below is the syntax of this function:
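A minimal sketch on invented data follows. The subset and keep arguments are optional extras shown for illustration: subset limits the comparison to the listed columns, and keep chooses which occurrence survives.

```python
import pandas as pd

# Placeholder data: row 2 repeats row 1, and "Movie A" appears in two years.
data = pd.DataFrame({
    "YEAR": [2017, 2018, 2018, 2019],
    "MOVIE": ["Movie A", "Movie B", "Movie B", "Movie A"],
})

# Drop rows that are identical across all columns, keeping the first occurrence.
deduped = data.drop_duplicates()

# Compare on the MOVIE column only and keep the last occurrence of each title.
deduped_by_movie = data.drop_duplicates(subset=["MOVIE"], keep="last")

print(deduped)
print(deduped_by_movie)
```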

As we have seen, we can already conduct fairly interesting data analysis with Pandas, which provides a variety of useful functions that are straightforward and easy to use. These approaches can be applied to many different kinds of datasets to find patterns and trends, and to prepare the data for more advanced machine learning techniques in the future.

