Data Cleaning Using Python: 5 Principal Points to Prepare Data (Telco Dataset)

Azizha Zeinita
9 min read · Jan 11, 2021


Photo by Priscilla Du Preez on Unsplash

Cleaning a dataset is an essential task before using it for further processing, such as applying machine learning. This work is part of data preparation, right after collecting the data we need. Since data cleaning tends to consume most of our time during a project, we can optimize it by following a checklist of cleaning activities. Before we go any further, the source code and dataset can be downloaded from this Github.

Now let’s begin! First, I would like to load the dataset. Since I use Google Colab to explore the data, the code looks like this:

import pandas as pd
import io
from google.colab import files
data_to_load = files.upload()
df = pd.read_csv(io.BytesIO(data_to_load['data.csv']))

But if you use Jupyter notebook, you can type this code to load the dataset from your local computer:

import pandas as pd
df = pd.read_csv("data.csv")

After running the code above, we can run some other commands, such as df.info() or df.describe(), to start wrangling and understand the data better.
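For a quick first look, these two calls print the structure and the summary statistics:

#Column names, data types, and non-null counts
df.info()
#Summary statistics for the numerical columns
df.describe()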

In this article, I will check and clean the data using these five principal points:

Point 1: Follow The Data Description

Make sure that we already have a clear description of the data we want to process. For instance, when the data has an attribute “ID”, we have to know the constraints on the ID: whether it should be a combination of numbers and letters, whether it has a maximum of 10 characters, or some other criteria. This way, we can easily clean the data with a clear picture of what it should be.

In my case, following the description of the data I have, the ID value must be 11–12 characters in length, consist of numbers only, and begin with “45” as its first two digits.

First, we need to convert any initial data type into a string data type to filter it.

#Step 1: Filtering customerID with character length 11-12
#Part a: convert data type into string
df["customerID"] = df["customerID"].astype(str)
df.info()
Image 1. The information about data after converting the data type

After converting the data type, the next thing we should do is filter for values that meet the 11–12 character length requirement.

#Part b: filtering the column customerID for values with 11-12 characters
long = df[(df["customerID"].str.len() == 11) | (df["customerID"].str.len() == 12)]
long.info()
Image 2. The information about data after filtering the 11–12 character length.

If we compare the information shown in Image 2 with Image 1, we can quickly spot the different counts in some columns. Also check the row counts of the other columns: if they are the same as or lower than the count for customerID, the filter has worked well.

The next step is to keep only customerID values in number format; we should dismiss any other format, such as alphabetic characters.

#Step 2: Filtering data with numeric values only
number = long[long["customerID"].str.isnumeric()]
number.info()

Then we have to do the same thing as in Step 1 after running the code: compare the output of this step with that of the previous one, in this case Step 2 against Step 1. Again, we have to make sure the row counts of the other columns are the same as or lower than customerID’s. We keep repeating this check after each step, but we can also use other methods to verify the result.
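For example, here is a small sketch that reports how many rows the numeric filter removed and peeks at the dropped values:

#Counting the rows removed by the numeric filter
print(len(long) - len(number), "rows removed")
#Peeking at the non-numeric customerID values that were dropped
print(long[~long["customerID"].str.isnumeric()]["customerID"].head())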

Next, we will filter the “customerID” column for values whose first two digits are “45”, using the code below:

#Step 3: Filtering "customerID" whose first two digits are 45
first45 = number[number["customerID"].str[:2] == "45"]
first45.info()
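As a side note, the three steps above can also be combined into a single boolean mask. This is only a sketch of an equivalent, more compact form (ids and first45_alt are helper names introduced here), not what the step-by-step approach requires:

#Combining Steps 1-3 into one filter (equivalent result)
ids = df["customerID"].astype(str)
first45_alt = df[ids.str.len().isin([11, 12]) & ids.str.isnumeric() & ids.str.startswith("45")]
first45_alt.info()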

Point 2: Filtering Any Duplicate Values

Duplication may happen because the data was inserted twice or inserted in different periods. Duplicate data plays a significant role in shifting the data pattern away from reality, which can mislead any downstream result. The way this method works is to compare the existing rows on the specified attributes and then remove the duplicates so that only one copy of each duplicated row is kept.

To erase the duplicate data, we can use “drop_duplicates” to drop any duplicated rows. Remember to always check the result by comparing the row count of the current result to the previous one.

#Dropping rows with a duplicate customerID, keeping the first occurrence
df_load = first45.drop_duplicates(subset="customerID")
df_load.info()
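To double-check, here is a quick sketch that counts any customerID values still appearing more than once; it should print 0 if the deduplication worked:

#Verifying that no duplicate customerID remains
print(df_load["customerID"].duplicated().sum())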

Point 3: Dealing with Missing Values

There are two kinds of missing values in Python: None and NaN. That is why we never meet a NULL status while using Python. None (NoneType) is the Python object that represents a missing value, while NaN (Not a Number) is a float defined by the IEEE floating-point standard.
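A minimal sketch of the difference, using numpy’s nan; pandas treats both as missing:

import numpy as np
import pandas as pd
#None is a Python object, NaN is an IEEE floating-point value
print(type(None), type(np.nan))
#pandas counts both as missing
print(pd.Series([None, np.nan, 1.0]).isnull())  # True, True, False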

We can check missing values using “isnull()”, or notnull() to find values that exist; for a NaN, isnull() returns True while notnull() returns False. This time we will use “isnull()” to find the missing values.

#Step 1: Checking how many missing values there are
df_load.isnull().sum()
Image 3. The result showing how many missing values there are per column in the dataframe

As we can see from the result above, the columns “tenure”, “MonthlyCharges”, and “TotalCharges” have some null values. Next, we will fill in the null values with a specific value using fillna(), which applies to both NaN and None.

For this case, we will fill the missing values with each column’s median, because it is a safer value to use. If we fill the NA (Not Available) data with the mean, it can be misleading. Let us take a closer look: if 99 staff have an income below $100 and only 1 staff earns above $1,000, then the mean ends up higher than the income of most staff. This issue does not arise if we use the median, which takes the middle of the data, separating the higher half from the lower half.
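A quick numerical sketch of that income example (the salary figures are hypothetical):

#99 staff earning $90 and a single staff earning $1,000
incomes = pd.Series([90] * 99 + [1000])
print(incomes.mean())    # 99.1, pulled up by the single high earner
print(incomes.median())  # 90.0, unaffected by the outlier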

While using the fillna() method, we need to add “inplace = True” inside the parentheses to change the values in the data frame itself.

#Step 2: Fixing the missing values with the median of each column
df_load["tenure"].fillna(value=df_load["tenure"].median(), inplace=True)
df_load["MonthlyCharges"].fillna(value=df_load["MonthlyCharges"].median(), inplace=True)
df_load["TotalCharges"].fillna(value=df_load["TotalCharges"].median(), inplace=True)
df_load.isnull().sum()
Image 4. The result showing there are no more missing values in the dataframe

The picture above shows that there are no null values left in the data frame; every missing entry has been filled with its column’s median.

Point 4: Eliminating Categories with Similar Meaning

The data frame I use here has an issue of different values sharing the same meaning. To know whether this issue exists, we need to know the correct form of the data (as mentioned at the beginning) and show the unique values of the column. For example, we want to know the unique values of the column “gender” and fix them if there are values with similar meanings.

df_load.groupby('gender')['customerID'].nunique()
Image 5. The result of unique value by gender

As we can see, the result indicates four values with overlapping meanings: “Laki-laki” is the same as “Male”, and “Wanita” has the same meaning as “Female”; “Laki-laki” and “Wanita” are the Indonesian words for male and female.

Therefore, we will transform every “Laki-laki” value into “Male” and every “Wanita” into “Female” using the code below.

df_load = df_load.replace(['Laki-Laki'], 'Male')
df_load = df_load.replace(['Wanita'], 'Female')
df_load.groupby('gender')['customerID'].nunique()

In the last row of code, we find the same code as before. We need to re-run it to ensure that the unique values have been standardized to only “Male” and “Female”. The result will look like this:

Image 6. The result of unique value by gender after transforming process
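As a design note, the same standardization can be written as one dictionary-based replace restricted to the gender column, so other columns cannot be touched by accident; an equivalent sketch:

#Standardizing gender with a single dict-based replace
df_load['gender'] = df_load['gender'].replace({'Laki-Laki': 'Male', 'Wanita': 'Female'})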

Point 5: Tackling The Outlier or Irregular Data

In this case, we can choose to find the outliers by visualizing the data with a boxplot (box and whisker plot), a graphical summary of a sample distribution that shows the median (Q2), the lower quartile (Q1), the upper quartile (Q3), and the maximum and minimum of the data. With a boxplot, we can easily recognize indications of outliers just by looking at the visualization: any value beyond the whisker limits is considered an abnormal value, or outlier. We can learn the parts of a boxplot in detail from Image 7 below.

Image 7. Parts of a boxplot (source: https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51)

Now let us put this into practice. First, transform the data in the relevant columns into a numerical data type, either float or int; as an example, let us use the column “tenure”. This is essential since a boxplot only works with numerical data types.

#Part a: Changing data type to int or float
df_load["tenure"] = df_load["tenure"].astype(float)
df_load["tenure"].dtypes
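The boxplot in Image 8 below was presumably produced by a call like the following (a minimal sketch; pandas draws boxplots through matplotlib):

#Visualizing "tenure" to expose the outlier
df_load[['tenure']].boxplot()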
Image 8. Outlier exists in the column “tenure”

From the result above, there is an outlier, shown as a circle around 500 on the y-axis. Hence, we need to replace it with an appropriate value. Matching the boxplot visualization, we will use the Interquartile Range (IQR), which divides the data into quartiles (Q1, Q2, Q3). We can then replace the outliers with the minimum and maximum values, setting the maximum whisker at Q3 + (1.5 x IQR) and the minimum whisker at Q1 - (1.5 x IQR).

According to Image 7, the outliers here only occur above the maximum, but I still include the below-minimum code in case you need it for your data. The result is shown in Image 9: the maximum value is 124.0 and the minimum is -64.0.

#Part b: Changing the value of the outlier using IQR
Q1 = (df_load[['tenure']]).quantile(0.25)
Q3 = (df_load[['tenure']]).quantile(0.75)
##Calculating IQR
IQR = Q3 - Q1
##Declaring the minimum and maximum values
maximum = Q3 + (1.5 * IQR)
print('The maximum value of each variable: ')
print(maximum)
minimum = Q1 - (1.5 * IQR)
print('\nThe minimum value of each variable: ')
print(minimum)
Image 9. The values of the maximum and minimum data

After we have the minimum and maximum values, we filter the data against those boundaries. Then we replace the outliers in the data frame: values above the maximum constraint get the maximum value, and values below the minimum constraint get the minimum value.

##Implementing the constraints on the numerical column only
##(comparing the whole dataframe would also touch the text columns)
more_than = (df_load[['tenure']] > maximum)
lower_than = (df_load[['tenure']] < minimum)
##Replacing the values in the dataframe
df_load[['tenure']] = df_load[['tenure']].mask(more_than, maximum, axis=1)
df_load[['tenure']] = df_load[['tenure']].mask(lower_than, minimum, axis=1)
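As a design note, pandas’ clip() caps values at both boundaries in a single call; an equivalent sketch for the tenure column:

#Capping "tenure" at the IQR boundaries in one call
df_load['tenure'] = df_load['tenure'].clip(lower=minimum['tenure'], upper=maximum['tenure'])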

Next, we can finally visualize the new data frame with a boxplot again, using the code below. The final result is shown in Image 10.

#Part c: Starting to visualize the numerical data
df_box = pd.DataFrame(df_load, columns=['tenure'])
df_box[['tenure']].boxplot()
Image 10. The final result of fixing outliers

Those are the five major points to follow when we want to clean a dataset before using it for anything in data science. These steps look easy, but it takes accuracy and patience to deal with millions of records. If we want to download the final result, we can use the code below to get a csv output.

#Downloading the final result as a csv file (files.download is Colab-only)
df_load.to_csv('data.csv', encoding='utf-8', index=False)
files.download('data.csv')
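If an actual Excel file is needed instead of csv, pandas’ to_excel can produce one; a sketch assuming the openpyxl package is available:

#Exporting to a real Excel workbook
df_load.to_excel('data.xlsx', index=False)
files.download('data.xlsx')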

Thank you for reading this article; I hope it gives you a glimpse of how exciting it is to play with data. To continue the project all the way to Machine Learning modeling, follow the series by clicking this link.
