Week 8 Assignment: Machine Learning Model Building
Background information: Customer Churn Prediction in the Telecom Sector
Customer churn, also known as customer attrition, customer turnover, or customer defection, is the loss of clients or customers.
Telephone service companies, Internet service providers, pay TV companies, insurance firms, and alarm monitoring services often use customer attrition analysis and customer attrition rates as key business metrics, because the cost of retaining an existing customer is far less than the cost of acquiring a new one. Companies in these sectors often have customer service branches that attempt to win back defecting clients, because recovered long-term customers can be worth much more to a company than newly recruited clients.
Companies usually distinguish between voluntary and involuntary churn. Voluntary churn occurs when a customer decides to switch to another company or service provider; involuntary churn occurs due to circumstances such as a customer’s relocation to a long-term care facility, death, or relocation to a distant location. In most applications, involuntary reasons for churn are excluded from the analytical models. Analysts tend to concentrate on voluntary churn because it typically results from aspects of the company-customer relationship that companies control, such as how billing interactions are handled or how after-sales help is provided.
Churn prediction models, a form of predictive analytics, estimate each customer’s propensity to churn. Because these models generate a small, prioritized list of potential defectors, they are effective at focusing customer retention marketing programs on the subset of the customer base that is most vulnerable to churn.
In this assignment, we will be applying the K Nearest Neighbors algorithm to see how well our model can classify new data as either positive (TRUE) for churn or negative (FALSE) based on certain characteristics of our dataset.
The dataset contains various data about individual customers for a monthly billing cycle, including
• average day, evening, night and international minutes, number of calls and cost data
• amount of time the customer has had a contract with the company (in months)
• average number of calls to customer service (the time frame is not given, but we can assume it is the same for all customers, so the values are directly comparable).
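A first look at a dataset like this can be sketched as follows. This is only an illustration: the DataFrame is a small synthetic stand-in, and the column names (AccountLength, DayMins, IntlMins, CustServCalls, Churn) are hypothetical, so check them against the actual columns in Churn.csv before adapting anything.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for Churn.csv. The column names are hypothetical;
# with the real file you would use: dataset = pd.read_csv("Churn.csv")
rng = np.random.default_rng(0)
n = 200
dataset = pd.DataFrame({
    "AccountLength": rng.integers(1, 240, size=n),     # months with the company
    "DayMins": rng.normal(180, 50, size=n).round(1),   # average day minutes
    "IntlMins": rng.normal(10, 3, size=n).round(1),    # average international minutes
    "CustServCalls": rng.poisson(1.5, size=n),         # calls to customer service
    "Churn": rng.random(n) < 0.15,                     # True = customer churned
})

print(dataset.shape)         # (rows, columns)
print(dataset.dtypes)        # data type of each column
print(dataset.isna().sum())  # number of missing values per column
print(dataset.describe())    # basic statistics for the numeric columns

# Share of churned vs. retained customers (the two target classes)
class_share = dataset["Churn"].value_counts(normalize=True)
print(class_share)
```

Checking the class shares early is worthwhile because, if far fewer customers churn than stay, a model can reach a high accuracy simply by predicting FALSE for everyone.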
Let’s recall the Data Science Pipeline from Module 1, Week 1. We will be working through each of these stages in this project.
You should be sure that you have studied all the Module 4 resources, readings, videos, and tutorials before beginning this final assignment.
If you have any questions, do not hesitate to contact your instructor.
While this final assignment relies on knowledge and skills acquired throughout the course, the assignment is particularly anchored on the Week 8 K Nearest Neighbors tutorial (Iris dataset).
While you are in possession of a perfectly working Jupyter Notebook with Python code applicable to a different dataset, it is a fact of ‘data science life’ that code is rarely usable without modification from one dataset or project to another. It is therefore critical that you study not only the code but also the dataset carefully.
Previous tutorials and assignments in the course have given a step-by-step breakdown of what to do, what to type, what to execute, when and where. This final assignment of the course will deviate from that approach.
Your task in this assignment is to execute and document an end-to-end machine learning workflow using the K Nearest Neighbors (KNN) algorithm on the business problem and Churn.csv dataset presented earlier. You will be evaluated less on identifying the combination of criteria that leads to the highest ‘accuracy’ of the model than on your ability to explain the process at each stage of your investigation (your Data Science Pipeline) in layman’s terms, for non-data scientist colleagues and management.
You must include the code, output and comments/analysis in your submission. You may create your submission in a Jupyter Notebook or Word.
In this assignment, you will follow the basic processes covered in the KNN Iris dataset tutorial (with modifications that you will implement).
You should provide:
• A comprehensive, multi-step Summary of the Dataset
• Data Preparation, including Label Encoding and Feature Scaling
• Data Visualization of the dataset, a minimum of 3 visualizations (Note: a single code cell that generates multiple visualizations, such as sns.pairplot or matplotlib boxplot counts as one visualization)
• Complete Prediction set, including results, evaluations and at least one form of cross-validation, with at least one graphic of optimal number of neighbors for the KNN algorithm
• Each of the above stages should be accompanied with commentary/analysis. Nearly every cell will need comments.
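The preparation and cross-validation stages listed above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the tutorial's code: the data is synthetic, the column names (CustServCalls, DayMins, Churn) are hypothetical, and the choice of 10 folds and odd values of K from 1 to 31 is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Synthetic stand-in data; column names and the churn rule are hypothetical
rng = np.random.default_rng(1)
n = 300
calls = rng.poisson(1.5, size=n)
day_mins = rng.normal(180, 50, size=n)
churn_prob = 1 / (1 + np.exp(-(calls - 3.0)))  # more service calls, more churn
dataset = pd.DataFrame({
    "CustServCalls": calls,
    "DayMins": day_mins,
    "Churn": np.where(rng.random(n) < churn_prob, "TRUE", "FALSE"),
})

X = dataset[["CustServCalls", "DayMins"]].values

# Label Encoding: turn the text labels into integers (FALSE -> 0, TRUE -> 1)
y = LabelEncoder().fit_transform(dataset["Churn"])

# Try several odd values of K and record the mean cross-validated accuracy
k_values = list(range(1, 32, 2))
cv_scores = []
for k in k_values:
    # Feature Scaling happens inside the pipeline, so the scaler is fitted
    # on the training folds only and never sees the held-out fold
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    cv_scores.append(scores.mean())

best_k = k_values[int(np.argmax(cv_scores))]
print("Best K:", best_k, "with mean accuracy", round(max(cv_scores), 3))
```

Plotting k_values against cv_scores (for example with matplotlib's plt.plot) would give the graphic of the optimal number of neighbors. Scaling matters for KNN because the algorithm measures distances: without it, a feature measured in hundreds of minutes would outweigh one measured in single-digit call counts.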
Suggested first steps:
• There are 15 potential features (X values) in the machine learning model you will create. You will want to at least try one iteration of the model with all 15 features.
• However, before doing so, be sure to run the Step 3.6 Correlation Heat Map and see if that gives you some indication of which features to concentrate on.
• You will run several iterations of this process on the dataset (4 iterations minimum). Do not ‘cut to the chase’ and choose your final features too quickly; take the time to experiment. You will need to take notes during and after each iteration for the required deliverables (below).
o Ultimately you should choose to concentrate on 3-4 features (columns). By the final iteration, you will want to remove/delete certain features (columns) from the dataset.
This can be done manually in Excel (then reloading a new, more focused dataset).
Or, per Step 3 Data Visualization in the KNN Iris dataset tutorial, use the pandas filter method in Jupyter Notebook to create another DataFrame:
dataset2 = dataset.filter(['SepalLengthCm', 'SepalWidthCm', 'Species'], axis=1)
o Note that when you are testing which features to include/exclude you can modify the following code (from the KNN Iris dataset tutorial), then run the succession of steps to get to the KNN results to compare.
feature_columns = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']
X = dataset[feature_columns].values
y = dataset['Species'].values
• In terms of complete code, output (including visualizations on your final 3-4 features only) and ‘in code’ comments/analysis, you will submit only your final iteration. Ideally this will be the KNN where you achieved the best results, but as mentioned previously, the commentary/analysis is more important than the results.
• The full Python code, output/visualizations, and ‘in code’ comments/description of the process should preferably be submitted as a Jupyter Notebook. A Word document with the relevant code, output, and commentary pasted in is also acceptable.
• In addition to your Jupyter Notebook (or code/output/comments document), you should provide a one-page overall summary of the process, including a brief description of the problem, what KNN should accomplish, and descriptions of your various iterations (see below for a sample table format). This will be a separate Word document. Include any significant interim observations/milestones and any conclusions about the resulting model: in this case, which value of K proved most successful in your K Nearest Neighbors algorithm and what accuracy it achieved.
• Thus you will submit two files:
o Jupyter Notebook or Word file (full Python code + output/visualizations + ‘in code’ comments/description)
o Word file (one-page overall summary of the process)
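The iteration process described in the steps above (check correlations, then compare different feature subsets) might look something like the sketch below. It is an illustration only: the data is synthetic, the column names and the helper function evaluate are hypothetical, and K=5 with 5-fold cross-validation is an arbitrary starting point rather than a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; column names and the churn rule are hypothetical
rng = np.random.default_rng(2)
n = 300
dataset = pd.DataFrame({
    "CustServCalls": rng.poisson(1.5, size=n),
    "DayMins": rng.normal(180, 50, size=n),
    "NightMins": rng.normal(200, 50, size=n),  # carries no signal by design
})
signal = 0.8 * (dataset["CustServCalls"] - 1.5) + 0.02 * (dataset["DayMins"] - 180)
dataset["Churn"] = (rng.random(n) < 1 / (1 + np.exp(-(signal - 1.5)))).astype(int)

# Absolute correlation of each feature with the target, strongest first.
# The same dataset.corr() matrix is what a correlation heat map displays.
corr_with_target = (
    dataset.corr()["Churn"].drop("Churn").abs().sort_values(ascending=False)
)
print(corr_with_target)


def evaluate(feature_columns, k=5):
    """Hypothetical helper: mean 5-fold CV accuracy for one feature subset."""
    X = dataset[feature_columns].values
    y = dataset["Churn"].values
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()


# Each entry is one iteration to record in your summary table
iterations = {
    "all features": ["CustServCalls", "DayMins", "NightMins"],
    "top two by correlation": list(corr_with_target.index[:2]),
    "service calls only": ["CustServCalls"],
}
results = {name: evaluate(cols) for name, cols in iterations.items()}
for name, accuracy in results.items():
    print(f"{name}: mean CV accuracy = {accuracy:.3f}")
```

Keeping each iteration as a named entry makes it easy to copy the feature list and the result straight into your notes. Correlation only captures linear relationships, so treat the ranking as a hint about where to start, not as proof that a feature is useless.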
This is an open-ended, leverage-what-you’ve-observed-and-learned assignment, very much like adjusting to the ever-evolving knowledge and skill base required in any data science career path. Do not hesitate to consult documentation or demos for Python or the various Python libraries (pandas, Matplotlib, seaborn, etc.).
There is a lot of detail in these instructions, but the bottom line is that we’re looking for a deliverable very similar to the knn_classification_tutorial_iris.ipynb Jupyter Notebook, with additional comments/analysis. As mentioned elsewhere, when adapting code from one dataset to another, adjustments to code and/or parameters will be required.
Of course, if you have any questions, do not hesitate to contact your instructor.
Sample summary table for reporting iterations (adapt as needed):
Iteration # | Features Included | Primary Results | Observations/Analysis | Next Steps