Data Science in Telco: Data Cleansing

Fawwaz
4 min read · Jul 5, 2021

So, this is basically a case study that I learned from DQLab (obviously they are a great source for learning Data Science! Haha)

So, what is DQLab Telco? It is a telecommunication company, a.k.a. a Telco, and a growing one, made by DQLab in 2019. DQLab Telco consistently pays attention to its customer experience so that its loyal customers won’t turn their heads to another brand.

Even though the company is only a year old, its customers are starting to move to other brands (not a good sign for DQLab Telco, right?). So the management wants to reduce the number of switching customers.

Have you guessed it? Yep, that’s right. With Machine Learning! So, here I am, preparing the data on the way to a prediction model that determines which customers are going to stop subscribing and which ones are not.

First, the libraries that we are going to use are as follows:
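Just as a rough sketch, these are the imports I would expect for the steps in this post. It assumes pandas for the data handling (that is where .head(), .unique() and drop_duplicates() come from) plus matplotlib and seaborn for the plots; the aliases are only the usual conventions.

```python
# Assumed imports for the steps in this post: pandas for the data handling,
# matplotlib and seaborn for the boxplots used in the outlier check.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```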

After the dataset is imported into the workspace, I show the total number of rows and columns, print the first 5 rows with the .head() function, and then count the unique customerID values with the .unique() function. And this is the result, voila!
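To give an idea of what that first look could be like, here is a minimal sketch. The tiny DataFrame below is made-up stand-in data (the real dataset would be loaded with pd.read_csv from wherever you keep the file), and only the customerID column name comes from this post.

```python
import pandas as pd

# Made-up stand-in data, NOT the real DQLab Telco dataset.
# The real file would be loaded with pd.read_csv(...) instead.
df = pd.DataFrame({
    "customerID": [45759018157, 45557574145, 45366876421, 45759018157],
    "tenure": [1, 34, 2, 1],
    "MonthlyCharges": [29.85, 56.95, 53.85, 29.85],
})

# Total rows and columns of the dataset
n_rows, n_cols = df.shape
print(f"rows: {n_rows}, columns: {n_cols}")

# First 5 rows
print(df.head())

# Number of distinct customer IDs
n_unique_ids = len(df["customerID"].unique())
print(f"unique customerID values: {n_unique_ids}")
```

Comparing the row count with the number of unique IDs is a quick hint that some IDs appear more than once.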

Based on the pictures, we can see that there are 7017 unique customerID values (that’s quite a lot, obviously!) and that the dataset has 7113 rows and 22 columns.

Then, the next step is filtering the customer ID numbers. After a heavy rain of calculation (HAHAHA!), I finally found that 7006 ID numbers passed the filter.
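I did not write down the exact filter rule in this post, so the sketch below uses a purely hypothetical one (an ID made of 11 to 12 digits and nothing else) just to show the shape of the filtering step. The pattern and the sample IDs are assumptions for illustration only.

```python
import pandas as pd

# Made-up sample IDs, some of them deliberately malformed.
df = pd.DataFrame({
    "customerID": ["45759018157", "4575", "45A59018157", None, "455575741450"],
})

# HYPOTHETICAL validity rule, for illustration only: 11 to 12 digits.
ID_PATTERN = r"\d{11,12}"

# Cast to string, then keep only rows whose whole ID matches the pattern.
is_valid = df["customerID"].astype(str).str.fullmatch(ID_PATTERN, na=False)
filtered = df[is_valid].reset_index(drop=True)

print(f"rows before: {len(df)}, rows after: {len(filtered)}")
```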

After filtering the customer ID numbers, I make sure that no ID number is duplicated, using the drop_duplicates() function.
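A minimal sketch of that duplicate check with pandas could look like this. The sample rows are made up, and keeping the first occurrence of each ID is an assumption on my side; the right row to keep depends on the data.

```python
import pandas as pd

# Made-up sample with one exact duplicate row and one repeated ID.
df = pd.DataFrame({
    "customerID": ["45759018157", "45759018157", "45557574145", "45557574145"],
    "tenure": [1, 1, 33, 34],
})

# Step 1: drop rows that are identical in every column.
df = df.drop_duplicates()

# Step 2: drop rows that repeat a customerID, keeping the first occurrence
# (an assumption; another rule may suit the real data better).
df = df.drop_duplicates(subset="customerID", keep="first").reset_index(drop=True)

print(df)
print("all IDs unique:", df["customerID"].is_unique)
```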

The image above is the formula to filter out the ID numbers that have been duplicated. And here is the result:

The validity of the customer ID numbers is very important to make sure that the data I use is right. Based on that result, there is a clear difference between the number of ID numbers when I first took the data and at the end. The data started with 7113 rows and 22 columns, with 7017 unique ID numbers. Then, after the validity check on the customer IDs, the data shrank to 6993 rows.

Next, I am going to detect whether there are any outliers in the data. I use matplotlib and seaborn as a starting point. Here’s the result:
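As a rough sketch of the detection step, boxplots for the three numeric columns could be drawn like this. The values are made up, with one obviously extreme point per column, and the figure layout is just my assumption.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values with one extreme point per column.
df = pd.DataFrame({
    "tenure": [1, 5, 12, 24, 36, 48, 60, 500],
    "MonthlyCharges": [20.0, 35.5, 50.0, 65.0, 80.0, 95.0, 110.0, 2000.0],
    "TotalCharges": [20.0, 180.0, 600.0, 1500.0, 2900.0, 4500.0, 6600.0, 90000.0],
})

columns = ["tenure", "MonthlyCharges", "TotalCharges"]

# One boxplot per column, side by side.
fig, axes = plt.subplots(1, len(columns), figsize=(12, 3))
for ax, column in zip(axes, columns):
    sns.boxplot(x=df[column], ax=ax)
    ax.set_title(column)
fig.tight_layout()
# In a notebook the figure shows up on its own; in a script, call plt.show().
```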

After figuring out which variables have outliers, I handle them using the Interquartile Range (IQR). Here are the results:
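Here is a minimal sketch of one common way to handle outliers with the IQR: anything below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR gets capped at that boundary. Capping (instead of dropping the rows), the 1.5 multiplier, the helper name cap_outliers_iqr and the sample values are all assumptions for illustration.

```python
import pandas as pd

# Made-up values with one extreme point per column.
df = pd.DataFrame({
    "tenure": [1, 2, 3, 4, 5, 6, 7, 8, 9, 500],
    "MonthlyCharges": [20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 55.0, 60.0, 2000.0],
})


def cap_outliers_iqr(frame, columns, k=1.5):
    """Return a copy where values outside [Q1 - k*IQR, Q3 + k*IQR] are capped."""
    capped = frame.copy()
    for column in columns:
        q1 = capped[column].quantile(0.25)
        q3 = capped[column].quantile(0.75)
        iqr = q3 - q1
        lower = q1 - k * iqr
        upper = q3 + k * iqr
        # Values beyond the bounds are replaced by the bounds themselves.
        capped[column] = capped[column].clip(lower=lower, upper=upper)
    return capped


capped = cap_outliers_iqr(df, ["tenure", "MonthlyCharges"])

# Compare the max row before and after capping.
print(df.describe())
print(capped.describe())
```

Looking at the max row of .describe() before and after is an easy way to confirm the extreme values are gone.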

From the three boxplots of the ‘tenure’, ‘MonthlyCharges’ & ‘TotalCharges’ variables, we can see that there are outliers. They can be identified from the dots that sit far away from the box of each boxplot. And if we look at the data distribution in the max column, we can see a very high value there.

So, I think that’s it, guys, for the Data Cleansing in DQLab Telco! Hope you got some enlightenment after reading this article (Should I say article? Yeah, who cares, I don’t even know what this is called haha). Bye and see you guys in a bit!


Fawwaz

Entrepreneur, Data Enthusiast and whatever you wanna call me!