There is an enormous amount of data on this planet: an estimated 2.5 quintillion bytes are produced every single day. Think about how much that adds up to over a week, a year, and beyond. This newly generated data contains good data as well as bad data, and it is the data analyst who takes care of this huge volume.
Data analysts collect data, clean it, and then interpret the resulting data sets. They work in a wide variety of fields: business of any kind, the finance sector, law enforcement, science, medicine, and government.
So, today we are going to learn how to clean data for data analytics using the Python programming language.
What is data cleaning?
As mentioned earlier, among trillions of data points there is always both good and bad data, and to remove the bad data we need to clean it. In layman's terms, cleaning refers to removing unwanted things and being free of harmful ones. Bad data means any incorrect, corrupted, duplicate, or incomplete data, so in the language of data analytics, data cleaning refers to the process of correcting or removing all kinds of bad, unorganized data. Bad data always leads to faulty and incorrect outcomes, even when the results look plausible.
Why is data cleaning important?
Data cleaning is important because it flushes the clutter out of your data, and bad data is toxic to a growing business. Understanding and creating good data yields reliable insights about your company and supports its growth. Bad data, on the other hand, leads to failed migration projects, reduced organizational efficiency, and costly expenses, so cleaning up your data is mandatory.
Data Cleaning Process in Python
Importing and studying the data
I am going to walk through each piece of Python code and help you clean data in Python using the required libraries, a dataset, graph plotting, and more.
I am going to use Google Colaboratory to run all my code.
- Importing Libraries
First, we are going to import the required Python libraries. The libraries we are using are pandas, Matplotlib, NumPy, seaborn, missingno, and datetime. For convenience, we declare a short alias for each library.
Pandas: It is used as a data analysis and manipulation tool. It is free and open-source.
Matplotlib: It is used to represent data in the form of graphs and figures.
NumPy: It is used for scientific computation and enables all kinds of mathematical functions, linear algebra, algebraic routines, statistics, etc.
Seaborn: It is used for making statistical-based graphs.
Missingno: This library helps with visualization. It helps in studying the presence and distribution of missing data in a pandas DataFrame.
Datetime: It provides classes for manipulating dates and times.
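The imports described above might look like the following sketch. The aliases are the conventional ones; seaborn and missingno are third-party packages that may need installing first.

```python
import datetime as dt

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# seaborn and missingno are optional extras here; install them with
# `pip install seaborn missingno` if the imports fail.
try:
    import seaborn as sns
    import missingno as msno
except ImportError:
    sns = msno = None
```

In Google Colab, seaborn comes preinstalled, so usually only missingno needs a `!pip install missingno` cell.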
- Reading the dataset
We are now going to read a dataset hosted on GitHub, so we will add its link here. The dataset is a .csv file.
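Loading the file could look like the sketch below. The GitHub URL is a placeholder, since the article does not give the real link, and the inline sample only mimics a few of the columns discussed later.

```python
import io
import pandas as pd

# In the article the CSV lives on GitHub; substitute the real raw-file URL:
# airbnb = pd.read_csv("https://raw.githubusercontent.com/<user>/<repo>/main/airbnb.csv")

# A tiny inline stand-in with a similar column layout, for illustration only.
csv_data = io.StringIO(
    "coordinates,room_type,price,last_review,listing_added\n"
    '"(40.71, -74.00)",Private room,$120,2021-05-01,2020-01-15\n'
    '"(40.73, -73.95)",Entire place,$250,2021-06-10,2019-11-02\n'
)
airbnb = pd.read_csv(csv_data)
print(airbnb.head())
```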
- Print the header of DataFrame
Now that we have imported the data from GitHub, we are going to analyze it. First, we print the header of the DataFrame. A DataFrame in Python is a labeled, two-dimensional data structure with rows and columns.
From this data, we can analyze all kinds of incorrect, corrupted, duplicate, or incomplete data, basically extracting bad data.
- So, here in this data, there are three data type problems:

| SNo. | Problem Description |
| --- | --- |
| 1. | Column 6, "coordinates", is likely a string (str). Since almost all mapping functions require separate latitude and longitude inputs, we are going to split this column into two and convert the values to float. |
| 2. | Column 8, "price", is probably a string with a $ (dollar) symbol attached to each price point. To avoid problems downstream, we are going to convert this column to float. |
| 3. | Column 10, "last_review", and column 18, "listing_added", should be in the correct datetime format for easy insight into the data. |
- We can see there is missing data in some of the columns. We are going to solve this problem too.
- There are some textual and category-based problems here as well:

| SNo. | Problem Description |
| --- | --- |
| 1. | We need to look at the number of listings by borough. For this, we are going to separate the neighborhood names from the borough names currently combined in the neighbourhood_full column. |
| 2. | To make our data more detailed and organized, we will collapse the values of the "room_type" column into "Shared Room", "Private Home/Apartment", "Private Room", and "Hotel Room". |
So we have extracted a total of six problems from this dataframe, and now we are going to clean it.
- Printing the data types of DataFrame
We are printing all the data types so that we can get to know the datatype of “coordinates”, “price”, “last_review” and “listing_added” columns and solve our data type problems as mentioned above.
- Printing the data info of DataFrame
From the data's info, we can now identify all the columns with missing data: price, last_review, reviews_per_month, rating, number_of_stays, and 5_stars. We are going to use the missingno library to visualize them, and with that our missing data problems will be solved.
- Printing the missing values from the data
There are many ways to deal with missing data. We can drop the rows with missing values, impute them with a statistical method, or use more complex algorithmic or machine learning based methods.
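The inspection steps above can be sketched as follows; the DataFrame here is a small hypothetical stand-in for the Airbnb data.

```python
import numpy as np
import pandas as pd

# Small stand-in for the Airbnb DataFrame (hypothetical values).
airbnb = pd.DataFrame({
    "price": ["$120", "$250", None],
    "rating": [4.5, np.nan, 6.2],
    "last_review": ["2021-05-01", None, "2021-06-10"],
})

print(airbnb.dtypes)        # data type of each column
airbnb.info()               # non-null counts per column
print(airbnb.isna().sum())  # number of missing values per column
print(airbnb.describe())    # summary statistics for numeric columns
```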
- Printing a detailed description of the data
From this, we get detailed summary statistics for each numeric column. We notice that the rating column contains values above its expected maximum, so we are going to fix the range of this particular column.
- Having a look at the distribution of the rating column
This will help us see how the values in the rating column are distributed. Plotting it with matplotlib gives us better insight.
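A minimal plotting sketch, using hypothetical ratings that include a few out-of-range values above 5.0:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in Colab
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical ratings; a valid rating should not exceed 5.0.
rating = pd.Series([4.2, 4.8, 3.9, 5.0, 6.3, 4.5, 7.1])

fig, ax = plt.subplots()
ax.hist(rating.dropna(), bins=10)
ax.set_xlabel("rating")
ax.set_ylabel("count")
ax.set_title("Distribution of the rating column")
plt.show()
```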
- Finding the number of unique values in the room_type column
- A detailed description of the different room types
We will find out how many listings there are for each of the different room_type values we have.
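Both steps can be sketched with `nunique()` and `value_counts()`; the values below are hypothetical but show the kind of inconsistency the later cleaning steps fix.

```python
import pandas as pd

# Hypothetical raw room_type values with inconsistent casing and spacing.
room_type = pd.Series([
    "Private room", "private room", "Entire place",
    "Shared room", "PRIVATE ROOM", "Entire place ",
])

print(room_type.nunique())       # number of distinct raw values
print(room_type.value_counts())  # how often each raw value occurs
```

Note that "Private room", "private room", and "PRIVATE ROOM" all count as separate values at this stage, which is exactly the textual problem we clean up later.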
Data Cleaning for Data Type Problems
Problem 1: In this, we have to replace the “coordinates” column with latitude and longitude.
Step 1: We are going to have a look at the data using the header function.
Step 2: We are going to remove “(” and “)” from the coordinates and print the header of the column.
Step 3: Splitting the column into two parts.
Step 4: Now we will precisely look at the data and then assign the correct columns to the latitude and longitude columns in Airbnb. After assigning the correct columns, we are going to print the header and confirm the new column creation.
Step 5: Print the datatypes again.
Step 6: Now we will convert latitude and longitude into float and then print datatypes again.
Step 7: At the end, drop the coordinates column.
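The seven steps above can be sketched like this, on hypothetical coordinate strings of the form "(lat, lon)":

```python
import pandas as pd

# Hypothetical coordinates stored as "(lat, lon)" strings.
airbnb = pd.DataFrame({"coordinates": ["(40.71, -74.00)", "(40.73, -73.95)"]})

# Steps 2-3: strip the parentheses, then split on the comma.
coords = (airbnb["coordinates"]
          .str.replace("(", "", regex=False)
          .str.replace(")", "", regex=False)
          .str.split(",", expand=True))

# Steps 4-6: assign the pieces to new columns and convert them to float.
airbnb["latitude"] = coords[0].astype(float)
airbnb["longitude"] = coords[1].astype(float)

# Step 7: drop the original coordinates column.
airbnb = airbnb.drop("coordinates", axis=1)
print(airbnb.dtypes)
```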
Problem 2: Here, we have to remove $(dollar) from the price and then convert it to float.
Step 1: Remove $ from the price before the conversion to float and then print the head.
Step 2: Now we will convert the price into float and then calculate the mean of the price after conversion.
Step 3: Now, with the help of matplotlib and seaborn, we will visualize the distribution of prices.
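A sketch of these three steps, on hypothetical price strings:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in Colab
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical prices stored as strings with a leading dollar sign.
airbnb = pd.DataFrame({"price": ["$120", "$250", "$95"]})

# Steps 1-2: strip the dollar sign, convert to float, check the mean.
airbnb["price"] = airbnb["price"].str.strip("$").astype(float)
print(airbnb["price"].mean())

# Step 3: visualize the price distribution (seaborn's histplot works too).
fig, ax = plt.subplots()
ax.hist(airbnb["price"], bins=5)
ax.set_xlabel("price")
plt.show()
```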
Problem 3: Now, we have to convert the listing_added and last_review columns to DateTime.
Step 1: Print the header of the listing_added and last_review columns
Step 2: Now we will convert both columns to DateTime
Step 3: Now, with the help of the DateTime library, print the header and datatypes of both columns again.
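These steps reduce to a couple of `pd.to_datetime` calls; the date strings below are hypothetical.

```python
import pandas as pd

# Hypothetical date columns stored as plain strings.
airbnb = pd.DataFrame({
    "last_review": ["2021-05-01", "2021-06-10"],
    "listing_added": ["2020-01-15", "2019-11-02"],
})

# Step 2: convert both columns to datetime.
airbnb["last_review"] = pd.to_datetime(airbnb["last_review"])
airbnb["listing_added"] = pd.to_datetime(airbnb["listing_added"])

# Step 3: confirm the new data types.
print(airbnb.dtypes)
```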
Data Cleaning for Textual and Category-based Problems
Problem 4: We need to collapse the values of the "room_type" column into the correct categories.
Step 1: Print out the unique values of room_type.
Step 2: Analyze all the capitalized values
Step 3: Analyze all the trailing spaces
Step 4: Replace values with ‘Shared room’, ‘Entire place’, ‘Private room’, and ‘Hotel room’ (if applicable), then collapse the data.
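A sketch of the category cleanup, on hypothetical messy values; lowercasing and stripping handle the capitalization and trailing-space problems, and a mapping then collapses everything onto the canonical labels.

```python
import pandas as pd

# Hypothetical messy room_type values with mixed case and trailing spaces.
airbnb = pd.DataFrame({"room_type": [
    "Private room", "private room ", "ENTIRE PLACE",
    "Shared room", "entire place", "Hotel room ",
]})

# Steps 2-3: lowercase everything and strip surrounding whitespace.
airbnb["room_type"] = airbnb["room_type"].str.lower().str.strip()

# Step 4: map each cleaned value onto its canonical category.
mapping = {
    "private room": "Private room",
    "entire place": "Entire place",
    "shared room": "Shared room",
    "hotel room": "Hotel room",
}
airbnb["room_type"] = airbnb["room_type"].replace(mapping)
print(airbnb["room_type"].unique())
```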
Problem 5: We will now divide neighbourhood_full into 2 columns.
Step 1: Print the header of the neighbourhood_full column.
Step 2: We are going to split the column.
Step 3: We will now create borough and neighbourhood columns and then print the header of the column.
Step 4: Drop neighbourhood_full column.
Step 5: Print out the unique values of the borough and neighbourhood.
Step 6: Remove all the white space from the “neighborhood” column and then print the unique data again.
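The six steps above can be sketched as follows, on hypothetical "borough, neighbourhood" strings:

```python
import pandas as pd

# Hypothetical combined values, one with stray extra whitespace.
airbnb = pd.DataFrame({"neighbourhood_full": [
    "Manhattan, Upper West Side", "Brooklyn,  Williamsburg",
]})

# Step 2: split on the comma into two pieces.
parts = airbnb["neighbourhood_full"].str.split(",", expand=True)

# Step 3: create the borough and neighbourhood columns.
airbnb["borough"] = parts[0]
airbnb["neighbourhood"] = parts[1]

# Step 4: drop the combined column.
airbnb = airbnb.drop("neighbourhood_full", axis=1)

# Step 6: strip stray whitespace from the neighbourhood values.
airbnb["neighbourhood"] = airbnb["neighbourhood"].str.strip()
print(airbnb["borough"].unique())
print(airbnb["neighbourhood"].unique())
```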
Problem 6: Make sure the rating column stays within its valid range by removing values that fall outside of it.
Step 1: Re-visualize the rating column again.
Step 2: Isolate rows of rating > 5.0
Step 3: Drop these rows and make sure we have effected changes
Step 4: Visualize the rating column again
Step 5: Get the maximum
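The filtering steps can be sketched like this, assuming a valid rating range of 0 to 5.0 and hypothetical values:

```python
import pandas as pd

# Hypothetical ratings, two of which exceed the valid maximum of 5.0.
airbnb = pd.DataFrame({"rating": [4.2, 4.8, 6.3, 5.0, 7.1]})

# Step 2: isolate the rows where rating exceeds 5.0.
out_of_range = airbnb[airbnb["rating"] > 5.0]
print(out_of_range)

# Step 3: drop those rows.
airbnb = airbnb[airbnb["rating"] <= 5.0]

# Step 5: confirm the new maximum.
print(airbnb["rating"].max())
```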
Data Cleaning for missing data
The missingno package (imported as msno) is great for visualizing missing data; we will be using:
- msno.matrix() to visualize a missingness matrix.
- msno.bar() to visualize a missingness bar plot.
- plt.show() to show the plot.
Step 1: Visualize the missingness
From the data we can analyze, the missing values are almost identical between last_review, reviews_per_month, rating, number_of_stays, and 5_stars.
Step 2: Visualize the missingness on sorted values
Step 3: Bar graph of the missingness data.
Step 4: Investigate the DataFrame rows with missing values in rating, number_of_stays, 5_stars, and reviews_per_month.
Step 5: Impute all the missing data.
Step 6: Investigate DataFrame with missing values in price
Step 7: Investigate the DataFrame rows with missing values in price again (this will give a different result, as the data has been filtered).
Step 8: Visualize the relationship between price and room_type
Step 9: Get median price per room_type
Step 10: Impute price based on conditions
Step 11: Confirm the price has been imputed
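Steps 9 through 11 can be sketched with a group-wise median imputation; the prices and room types below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing prices.
airbnb = pd.DataFrame({
    "room_type": ["Private room", "Private room", "Shared room", "Shared room"],
    "price": [100.0, np.nan, 60.0, np.nan],
})

# Step 9: median price per room_type.
medians = airbnb.groupby("room_type")["price"].median()
print(medians)

# Step 10: impute each missing price with the median for its room type.
airbnb["price"] = airbnb["price"].fillna(airbnb["room_type"].map(medians))

# Step 11: confirm no missing prices remain.
print(airbnb["price"].isna().sum())
```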
Hence, these are all the processes used for cleaning data in Python with Google Colaboratory.