Steps Before Classification: Data Encoding and Feature Selection with Python
After finishing the data cleaning step in this link, we will build on the clean dataset. This project leads up to Machine Learning classification modeling for churn prediction on a telecommunication dataset, which will be conducted in the next article. The source code and dataset can be downloaded from this Github.
First and foremost, we will encode the data before choosing variables with the Feature Selection method. To proceed with this method, all data types must be numerical, which means the non-numerical data have to be transformed (encoded). Before that, we need to identify which columns to work with by checking the data types with df.info(). The result for my dataset is shown below:
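As a minimal sketch of this check (the file name here is only an assumption, not from the original project; use the path of your own cleaned file):

#Loading the clean dataset and checking the column data types
#(the file name below is an assumption; adjust it to your own file)
import pandas as pd

df = pd.read_csv("data_cleaned.csv")
df.info()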
As the picture above illustrates, there are sixteen columns with a non-numerical data type (object, i.e. string). Accordingly, we will transform those sixteen columns into a numerical type (integer). This action will not remove the existing non-numerical columns; instead, it creates transformed (numerical) duplicates of them.
After loading the clean dataset downloaded from the data cleaning step, we will import the Scikit-Learn library to begin the encoding. This library supports data processing and many machine-learning needs, such as clustering, regression, classification, model selection, and dimensionality reduction.
#Scikit-Learn library to Encode
from sklearn.preprocessing import LabelEncoder
After that, we will encode each object-type column using "fit_transform" after instantiating a sklearn.preprocessing "LabelEncoder" for it. The code below is only an example of encoding the column "Partner" into a new column "Partnercode". Apply this code to every column that needs to be transformed, then verify that it worked using df.info(). Image 2 shows a successful process with an integer (int64) data type.
lb_Partner = LabelEncoder()
df["Partnercode"] = lb_Partner.fit_transform(df["Partner"])
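Rather than repeating that snippet sixteen times, you could encode all object columns in one loop. This is only a sketch following the "…code" naming pattern above, not the code from the original notebook:

#Sketch: encoding every object-type column in one pass
for col in df.select_dtypes(include="object").columns:
    df[col + "code"] = LabelEncoder().fit_transform(df[col])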
After finishing the encoding, we will begin choosing suitable variables for the machine learning model. The method we will implement is Feature Selection, which selects influential features and discards features that do not affect the modeling or analysis, without creating new features derived from the original ones the way the other method, Feature Extraction, does.
Speaking of other ways to reduce the number of variables, there is also Dimensionality Reduction, a process that reduces the dataset's dimensions while aiming to retain the essential information. In other words, this method transforms high-dimensional data into a low-dimensional representation, which is better suited to datasets with a large number of features, such as 100 or 1,000.
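For illustration only (we will not use it in this project), a Dimensionality Reduction step with Scikit-Learn's PCA might look like the sketch below, assuming the numerical columns contain no missing values:

#Illustrative sketch: compressing the numerical columns into 2 components
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
reduced = pca.fit_transform(df.select_dtypes(include="number"))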
Back to Feature Selection, there are several feature selection techniques to choose from, but first we need to decide what kind of machine learning we will do. This time we will do supervised learning, which uses sample input data and the desired output to approximate the mapping function. This is quite different from the other type, unsupervised learning, which does not need an expected output, meaning the algorithm identifies the structure of the data automatically.
To select a Feature Selection method, we need to determine what input variables we have and what type of output variable we want. This case uses numerical input — numerical output and categorical input — numerical output, for which, based on Image 4, we select Pearson's Correlation and Kendall's Rank Correlation Coefficient respectively.
1. Numerical Input — Numerical Output
First, we will tackle the numerical input — numerical output case using Pearson's Correlation. Pearson is commonly used for the Filter Method (shown in Image 3), which chooses features according to their scores in statistical tests of correlation with the outcome variable. We determine the numerical variables we want to proceed with and store them in a new variable. In my dataset, the four variables below are the numerical ones.
#Determining numerical variables
num = df[["tenure","MonthlyCharges","TotalCharges","Churncode"]]
After that, we will create a heatmap to visualize the correlations among the variables. Set the size of the heatmap using plt.figure(figsize=(a,b)); this time we will set it to 12 x 10 inches. The result is shown in Image 5.
#Filtering using Pearson Correlation
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12,10))
cor = num.corr()
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()
The numbers inside the boxes are the correlation coefficients, which range from -1 to 1. A value of 0 means the variables have no linear relationship. The closer the value is to -1, the stronger the negative correlation between the variables, while the closer it is to 1, the stronger the positive correlation. In other words, values near both -1 and 1 mean that the variables are strongly correlated with each other.
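As a quick toy illustration of these extremes (made-up numbers, not from our dataset):

#Toy example of the correlation range
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [8, 6, 4, 2]})
print(toy.corr())   #a-b is exactly 1.0, a-c is exactly -1.0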
Next, we will filter the correlations against the threshold we have determined. We could easily pick the highest correlations straight from the heatmap, given the small number of input variables we have; however, I will still write the code below so you can run the same check when dealing with many variables. Now we will see which variables have a correlation value above 0.1 with the target.
#Correlation with output variable
cor_target = abs(cor["Churncode"])
#Selecting highly correlated features
relevant_features = cor_target[cor_target>0.1]
relevant_features
Since more than one variable passed the filter, we need to ensure that the selected variables are not strongly related to each other. If two variables are highly correlated, we need to drop one of them. Write the code below to check.
print(df[["tenure","MonthlyCharges"]].corr())print(df[["tenure","TotalCharges"]].corr())print(df[["MonthlyCharges","TotalCharges"]].corr())
As we can see from Image 7, the tenure-TotalCharges and MonthlyCharges-TotalCharges correlations are higher than those variables' correlations with "Churncode" (based on Image 6). This means "TotalCharges" needs to be removed from the variables, since it is strongly correlated with both of the others while the tenure-MonthlyCharges correlation is low. This way, we end up with two independent variables: tenure and MonthlyCharges.
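In code, removing the redundant column is a one-liner (a sketch reusing the "num" variable defined earlier):

#Dropping the redundant feature from the numerical subset
num = num.drop(columns=["TotalCharges"])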
2. Categorical Input — Numerical Output
Second, we will work on the categorical input — numerical output case with Kendall's Rank Correlation Coefficient, or Kendall's Tau (τ) Coefficient. This analysis is used to find relationships and test hypotheses between two or more variables, with the constraint that the data are ordinal or ranked. The method is best suited to samples of more than ten observations and can be extended to find the partial correlation coefficient.
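For reference, Kendall's Tau for a single pair of columns can also be computed with SciPy, which additionally returns a p-value for hypothesis testing. This is only a side note, not a step from the original project:

#Kendall's Tau and p-value for one pair of columns
from scipy.stats import kendalltau

tau, p_value = kendalltau(df["PaperlessBillingcode"], df["Churncode"])
print(tau, p_value)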
To start processing the variables, we first determine which ones are categorical, just as in the early step of finding correlations for the numerical input. Put all the categorical columns of your dataset into a new variable; this time, we name it "cat".
#Determining categorical variables
cat = df[["UpdatedAt","customerID","gendercode"]]
#those are only some examples of the actual variables
Then, we write the code for Kendall's Rank Correlation Coefficient as shown below. Next, we filter the result with a threshold of 0.1.
import numpy as np
kendalls = cat.corr(method='kendall')
relevant_features = kendalls[kendalls>0.1]
relevant_features
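Since we ultimately care about the relationship with "Churncode", a variant that mirrors the Pearson step and filters only the target column would look like this sketch (assuming "Churncode" is included in the "cat" subset):

#Sketch: filtering Kendall correlations against the target column only
cor_target = abs(kendalls["Churncode"])
relevant_features = cor_target[cor_target > 0.1]
relevant_features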
Looking at the results in Images 8 and 9, we can say that, based on the filtering, there are two variables related to "Churncode": "SeniorCitizen" and "PaperlessBillingcode". As we did before in the numerical input — numerical output process, we also need to check whether these two variables are correlated with each other.
print(df[["SeniorCitizen","PaperlessBillingcode"]].corr())
Image 10 illustrates that the correlation between those two variables is lower than the "PaperlessBillingcode"-"Churncode" correlation, yet higher than the "SeniorCitizen"-"Churncode" correlation (according to Images 8 and 9), so we need to drop "SeniorCitizen" from the result list.
CONCLUSION
Based on the results of Pearson's Correlation and Kendall's Rank Correlation Coefficient, we can conclude that we have three variables that are not related to each other: "tenure" and "MonthlyCharges" from the numerical-numerical case and "PaperlessBillingcode" from the categorical-numerical case.
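Assembled in code, the final feature matrix and target for the upcoming modeling step would look like this sketch:

#Final selected features and target for the classification model
X = df[["tenure", "MonthlyCharges", "PaperlessBillingcode"]]
y = df["Churncode"]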
Now everything is set up to bring these variables into Machine Learning modeling. Thank you for reading this article; if you want to continue this project, you can click this link. See you in the next article.