Process of Data Cleaning in Python

There is a lot of data on this planet: roughly 2.5 quintillion bytes are produced in a single day. Think about how much that adds up to over a week, a year, and beyond. This newly generated data contains good data as well as bad, and data analysts are the people who take care of this huge amount of data.

Data analysts collect data, clean it, and then interpret the resulting data sets. They work in a wide variety of fields: business of any kind, the finance sector, law enforcement, science, medicine, and government.

So, today we are going to learn how to clean data for data analytics using the Python programming language.

What is data cleaning?

As said earlier, among trillions of data points there is always both good and bad data. Bad data means all kinds of incorrect, corrupted, duplicate, or incomplete records. In layman's language, cleaning refers to removing unwanted things; in the language of data analytics, data cleaning refers to the process of detecting and correcting all kinds of bad, unorganized data. Bad data always leads to faulty and incorrect outcomes, even when the results look plausible.

Why is data cleaning important?

Data cleaning is important because it flushes the clutter out of your data, and bad data is toxic to a growing business. Clean, well-understood data yields reliable insights about your company and supports its growth. Bad data, on the other hand, causes failed migration projects, hurts the efficiency of the organization, and drives up costs, so cleaning up data is mandatory.

Data Cleaning Process in Python

Importing and studying the data

I am going to walk through each piece of Python code and show how to clean data in Python using the necessary libraries, a dataset, graph plotting, and so on.

I am going to use Google Colaboratory to run all my code.

  • Importing Libraries 

First, we are going to import the required Python libraries. The libraries we are using are pandas, matplotlib, NumPy, seaborn, missingno, and datetime. For easy execution of the code, we declare an abbreviation (alias) for each library.

Pandas: It is used as a data analysis and manipulation tool. It is free and open-source.

Matplotlib: It is used to represent data in the form of graphs and figures.

NumPy: It is used for scientific computation and enables all kinds of mathematical functions, linear algebra, algebraic routines, statistics, etc. 

Seaborn: It is used for making statistical-based graphs.

Missingno: This library helps with visualizing missing data. It helps in studying the presence and distribution of missing values in a pandas DataFrame.

Datetime: It provides classes for manipulating dates and times.

  • Reading the dataset

We are now going to read a dataset hosted on GitHub, so we will add the GitHub link here. We are using a .csv file.

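Reading the file boils down to a single `pd.read_csv` call, which accepts a local path, a URL (such as a raw GitHub link), or any file-like object. Since the article's exact link is not reproduced here, this sketch reads an in-memory sample with hypothetical columns standing in for the Airbnb CSV:

```python
import io
import pandas as pd

# In the article: airbnb = pd.read_csv("<raw GitHub URL of the .csv file>")
# Here, an in-memory stand-in with illustrative columns:
csv_data = io.StringIO(
    'listing_id,coordinates,price,room_type\n'
    '1001,"(40.71, -74.00)",$120,Private room\n'
    '1002,"(40.73, -73.99)",$95,Entire place\n'
)
airbnb = pd.read_csv(csv_data)
print(airbnb.shape)  # (2, 4)
```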

  • Print the header of DataFrame

We have now imported the data from GitHub, and we are going to analyze everything about it. First, we print the header of the DataFrame. A DataFrame in Python is a well-labeled data structure with rows and columns.
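Printing the header is done with `DataFrame.head()`, which returns the first five rows by default. A sketch on a tiny synthetic sample (the column values are illustrative, not the article's actual dataset):

```python
import pandas as pd

airbnb = pd.DataFrame({
    "listing_id": [1001, 1002, 1003],
    "price": ["$120", "$95", "$250"],
    "room_type": ["Private room", "Entire place", "Shared room"],
})
print(airbnb.head())  # first five rows (all three here)
```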

From this data, we can analyze all kinds of incorrect, corrupted, duplicate, or incomplete data, basically extracting bad data. 

  1. Here in this data, there are three data type problems.

1. Column 6, “coordinates”, is likely to be a string (str). As most mapping functions require separate latitude and longitude inputs, we are going to break this column into two and convert the values into float.

2. Column 8, “price”, is probably a string with a $ (dollar) symbol attached to each price point. To avoid problems later, we are going to strip the symbol and convert this column into float.

3. Columns 10 and 18, “last_review” and “listing_added”, should be in the correct datetime format for easy insight into the data.

  2. We can see there are missing data problems in some of the columns, so we are going to solve this too.
  3. There are some textual and category-based problems.

1. We need to look at the number of listings per borough. For this, we are going to separate the borough names from the neighbourhood names stored in the neighbourhood_full column.

2. To make our data more detailed and organized, we will replace the values of the “room_type” column with “Shared Room”, “Private Home/Apartment”, “Private Room”, and “Hotel Room”.

So we have extracted a total of six problems from this dataframe, and now we are going to clean it.

  • Printing the data types of DataFrame

We print all the data types so that we get to know the data types of the “coordinates”, “price”, “last_review”, and “listing_added” columns and can solve the data type problems mentioned above.
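The `dtypes` attribute reveals the problem: string columns show up as `object`. A sketch on a synthetic sample (column values are illustrative):

```python
import pandas as pd

airbnb = pd.DataFrame({
    "coordinates": ["(40.71, -74.00)"],
    "price": ["$120"],
    "last_review": ["2019-06-21"],
})
# All three columns print as "object" (i.e. strings), not float or datetime
print(airbnb.dtypes)
```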

  • Printing the data info of DataFrame

From the data info, we can see exactly which columns contain missing data: price, last_review, reviews_per_month, rating, number_of_stays, and 5_stars. We are going to use the missingno library to explore them, and with that our missing data problems will be solved.
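`DataFrame.info()` lists each column with its non-null count, which is how the gaps are spotted. A sketch with a synthetic sample containing deliberate gaps:

```python
import numpy as np
import pandas as pd

airbnb = pd.DataFrame({
    "price": [120.0, np.nan, 95.0],
    "rating": [4.5, 4.9, np.nan],
})
# The "Non-Null Count" column reveals the missing entries
airbnb.info()
```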

  • Printing the missing values from the data

There are many ways to deal with missing data. We can drop all the rows with missing values, we can fill them in using statistical methods, or we can use more complex algorithmic or machine learning-based methods.
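A quick per-column count of missing values comes from chaining `isnull()` with `sum()`, sketched here on a synthetic sample:

```python
import numpy as np
import pandas as pd

airbnb = pd.DataFrame({
    "price": [120.0, np.nan, 95.0, np.nan],
    "rating": [4.5, np.nan, 4.9, 4.2],
})
print(airbnb.isnull().sum())
# price     2
# rating    1
```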

  • Printing a detailed description of the data

From this, we get a detailed statistical summary of each numeric column. We notice that the maximum of the rating column lies outside its valid range, so we are going to fix the range of this particular column.
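`DataFrame.describe()` reports count, mean, min, max, and quartiles per numeric column, which is what exposes the out-of-range rating. A sketch with illustrative values:

```python
import pandas as pd

airbnb = pd.DataFrame({"rating": [4.5, 3.8, 6.2, 5.0, 7.1]})
# The "max" row shows 7.1, outside the valid 0-5 rating range
print(airbnb["rating"].describe())
```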

  • Having a look at the distribution of the rating column

This will help us look at how the values in the rating column are distributed. Plotting with the help of matplotlib gives us better insights.
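A histogram of the column makes out-of-range values visible at a glance; here is a sketch with illustrative ratings (the `Agg` backend line is only needed outside Colab):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in Colab
import matplotlib.pyplot as plt
import pandas as pd

ratings = pd.Series([4.5, 3.8, 6.2, 5.0, 7.1, 4.9, 4.2])
ratings.plot(kind="hist", bins=10, title="Distribution of rating")
plt.xlabel("rating")
plt.show()
```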

  • Finding the number of unique values in the room_type column
  • A detailed description of the different room types

We will find out the total number of values for each of the different room types we have.
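`nunique()` and `value_counts()` do both jobs; this sketch uses illustrative values to show how capitalization and trailing spaces inflate the category count:

```python
import pandas as pd

room_type = pd.Series(["Private room", "private room", "Entire place",
                       "Shared room ", "Shared room", "Entire place"])
print(room_type.nunique())       # 5 — case and trailing spaces inflate the count
print(room_type.value_counts())  # frequency of each raw string
```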

Data Cleaning for Data Type Problems

Problem 1: We have to replace the “coordinates” column with separate latitude and longitude columns.

Step 1: We have a look at the data using the header function.

Step 2: We remove “(” and “)” from the coordinates and print the header of the column.

Step 3: We split the column into two parts.

Step 4: Now we look closely at the data and assign the correct parts to new latitude and longitude columns in the Airbnb DataFrame. After assigning them, we print the header to confirm the new columns were created.

Step 5: We print the data types again.

Step 6: We convert latitude and longitude into float and then print the data types once more.

Step 7: At the end, we drop the coordinates column.
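The steps above can be sketched with pandas string methods on a tiny synthetic sample (the coordinate values are illustrative):

```python
import pandas as pd

airbnb = pd.DataFrame({"coordinates": ["(40.712, -74.006)", "(40.730, -73.993)"]})

# Step 2: strip the parentheses
airbnb["coordinates"] = airbnb["coordinates"].str.replace("(", "", regex=False)
airbnb["coordinates"] = airbnb["coordinates"].str.replace(")", "", regex=False)

# Step 3: split into two parts on the comma
lat_lng = airbnb["coordinates"].str.split(",", expand=True)

# Step 4: assign the correct parts to new columns
airbnb["latitude"] = lat_lng[0]
airbnb["longitude"] = lat_lng[1]

# Step 6: convert both to float
airbnb["latitude"] = airbnb["latitude"].astype(float)
airbnb["longitude"] = airbnb["longitude"].astype(float)

# Step 7: drop the original column
airbnb = airbnb.drop("coordinates", axis=1)
print(airbnb.dtypes)  # latitude and longitude are now float64
```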

Problem 2: Here, we have to remove the $ (dollar) symbol from the price and then convert it to float.

Step 1: Remove $ from the price before the conversion to float, and then print the head.

Step 2: Convert the price into float and then calculate the mean of the price after conversion.

Step 3: With the help of matplotlib and seaborn, visualize the distribution of prices.
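Steps 1 and 2 can be sketched like this, with illustrative prices standing in for the dataset:

```python
import pandas as pd

airbnb = pd.DataFrame({"price": ["$120", "$95", "$250"]})

# Step 1: remove the leading dollar sign
airbnb["price"] = airbnb["price"].str.strip("$")

# Step 2: convert to float and compute the mean
airbnb["price"] = airbnb["price"].astype(float)
print(airbnb["price"].mean())  # 155.0
```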

Problem 3: Now, we have to convert the listing_added and last_review columns to datetime.

Step 1: Print the header of the listing_added and last_review columns.

Step 2: Convert both columns to datetime.

Step 3: Print the header and data types of both columns again to confirm the conversion.
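The conversion is a pair of `pd.to_datetime` calls, sketched here on illustrative date strings (an ISO format is assumed):

```python
import pandas as pd

airbnb = pd.DataFrame({
    "last_review": ["2019-06-21", "2019-07-05"],
    "listing_added": ["2018-05-10", "2018-11-23"],
})

# Step 2: convert both columns to datetime
airbnb["last_review"] = pd.to_datetime(airbnb["last_review"])
airbnb["listing_added"] = pd.to_datetime(airbnb["listing_added"])

# Step 3: confirm the new data types
print(airbnb.dtypes)
```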

Data Cleaning for Textual and Category-based Problems

Problem 4: We need to consolidate the “room_type” column into correct categories.

Step 1: Print out the unique values of room_type.

Step 2: Normalize the inconsistently capitalized values.

Step 3: Remove the trailing spaces.

Step 4: Replace the values with “Shared room”, “Entire place”, “Private room”, and “Hotel room” (if applicable), collapsing the variants into these categories.
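The steps above can be sketched as lowercasing, stripping whitespace, then mapping each normalized value to its target category; the sample values and mapping here are illustrative (the target labels follow the categories listed earlier in the article):

```python
import pandas as pd

room = pd.Series(["Private room", "private room", "PRIVATE ROOM",
                  "Entire place ", "Shared room", "home/apt"])

# Steps 2-3: normalize capitalization and strip trailing spaces
room = room.str.lower().str.strip()

# Step 4: collapse variants into the target categories (mapping assumed)
mapping = {
    "private room": "Private Room",
    "entire place": "Private Home/Apartment",
    "home/apt": "Private Home/Apartment",
    "shared room": "Shared Room",
    "hotel room": "Hotel Room",
}
room = room.replace(mapping)
print(room.unique())
```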

Problem 5: We will now divide neighbourhood_full into two columns.

Step 1: Print the header of the neighbourhood_full column.

Step 2: Split the column.

Step 3: Create the borough and neighbourhood columns, and then print the header to confirm.

Step 4: Drop the neighbourhood_full column.

Step 5: Print out the unique values of borough and neighbourhood.

Step 6: Remove all the white space from the “neighbourhood” column and then print the unique values again.
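These steps can be sketched as follows, assuming a "Borough, Neighbourhood" layout in the column (the sample values are illustrative):

```python
import pandas as pd

airbnb = pd.DataFrame({
    "neighbourhood_full": ["Manhattan, Upper West Side",
                           "Brooklyn, Williamsburg "],
})

# Step 2: split borough from neighbourhood on the comma
parts = airbnb["neighbourhood_full"].str.split(",", expand=True)

# Step 3: create the two new columns
airbnb["borough"] = parts[0]
airbnb["neighbourhood"] = parts[1]

# Step 4: drop the original column
airbnb = airbnb.drop("neighbourhood_full", axis=1)

# Step 6: remove surrounding white space
airbnb["neighbourhood"] = airbnb["neighbourhood"].str.strip()
print(airbnb["borough"].unique(), airbnb["neighbourhood"].unique())
```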

Problem 6: We need to make sure the rating column contains no values outside its valid range, so that its maximum is correct.

Step 1: Re-visualize the rating column.

Step 2: Isolate the rows with rating > 5.0.

Step 3: Drop these rows and make sure the changes took effect.

Step 4: Visualize the rating column again.

Step 5: Get the maximum.
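Steps 2 through 5 can be sketched with boolean filtering, assuming 5.0 is the valid maximum (the sample ratings are illustrative):

```python
import pandas as pd

airbnb = pd.DataFrame({"rating": [4.5, 3.8, 6.2, 5.0, 7.1, 4.9]})

# Step 2: isolate rows where rating exceeds the valid maximum of 5.0
out_of_range = airbnb[airbnb["rating"] > 5.0]
print(len(out_of_range))  # 2

# Step 3: drop those rows
airbnb = airbnb[airbnb["rating"] <= 5.0]

# Step 5: confirm the new maximum
print(airbnb["rating"].max())  # 5.0
```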

Data Cleaning for Missing Data

The missingno package (imported as msno) is great for visualizing missing data. We will be using:

  • msno.matrix() to visualize a missingness matrix.
  • msno.bar() to visualize a missingness bar plot.
  • plt.show() to show the plot.

Step 1: Visualize the missingness.

From the data, we can see that the missing values are almost identical across last_review, reviews_per_month, rating, number_of_stays, and 5_stars.

Step 2: Visualize the missingness on sorted values.

Step 3: Draw a bar graph of the missingness.

Step 4: Understand the DataFrame rows with missing values in rating, number_of_stays, 5_stars, and reviews_per_month.

Step 5: Impute the missing data.

Step 6: Investigate the DataFrame rows with missing values in price.

Step 7: Investigate the DataFrame rows with missing values in price again (this will give you a different value, as the data has been filtered).

Step 8: Visualize the relationship between price and room_type.

Step 9: Get the median price per room_type.

Step 10: Impute the price based on these conditions.

Step 11: Confirm the price has been imputed.
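Steps 9 through 11 can be sketched as a group-wise median imputation, on a synthetic sample with illustrative prices:

```python
import numpy as np
import pandas as pd

airbnb = pd.DataFrame({
    "room_type": ["Private Room", "Private Room", "Shared Room",
                  "Shared Room", "Private Room"],
    "price": [120.0, np.nan, 45.0, np.nan, 130.0],
})

# Step 9: median price per room_type
medians = airbnb.groupby("room_type")["price"].median()
print(medians)

# Step 10: impute each missing price with the median of its room_type
airbnb["price"] = airbnb["price"].fillna(airbnb["room_type"].map(medians))

# Step 11: confirm no missing prices remain
print(airbnb["price"].isnull().sum())  # 0
```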

These are all the steps used for cleaning data in Python with Google Colaboratory.
