K Nearest Neighbors: Predicting Airline Passenger Satisfaction

Data Preparation
The dataset contains about 130,000 survey records and passenger details from a US airline. Passengers rate individual aspects of the flight experience on a scale of 1 to 5 (0 where a service was not applicable). The features are listed below; a sketch of one plausible way to load and encode them follows the list.
Gender: Gender of the passengers (Female, Male)
Customer Type: The customer type (Loyal customer, disloyal customer)
Age: The actual age of the passengers
Type of Travel: Purpose of the flight of the passengers (Personal Travel, Business Travel)
Class: Travel class in the plane of the passengers (Business, Eco, Eco Plus)
Flight distance: The flight distance of this journey
Inflight wifi service: Satisfaction level of the inflight wifi service (0: Not Applicable; 1-5)
Departure/Arrival time convenient: Satisfaction level with departure/arrival time convenience
Ease of Online booking: Satisfaction level of online booking
Gate location: Satisfaction level of Gate location
Food and drink: Satisfaction level of Food and drink
Online boarding: Satisfaction level of online boarding
Seat comfort: Satisfaction level of Seat comfort
Inflight entertainment: Satisfaction level of inflight entertainment
On-board service: Satisfaction level of On-board service
Leg room service: Satisfaction level of Leg room service
Baggage handling: Satisfaction level of baggage handling
Check-in service: Satisfaction level of Check-in service
Inflight service: Satisfaction level of inflight service
Cleanliness: Satisfaction level of Cleanliness
Departure Delay in Minutes: Minutes of delay at departure
Arrival Delay in Minutes: Minutes of delay at arrival
Satisfaction: Overall airline satisfaction level (satisfied, neutral or dissatisfied)
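
The original write-up does not show its preprocessing code, so here is a minimal sketch of one plausible pipeline. The file name train.csv is an assumption (whatever the Kaggle download is called), the column names are taken from the listing above, and the scaling step is added because KNN is distance-based and an unscaled column such as Flight distance would otherwise dominate the 1-5 ratings.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical file name for the Kaggle download.
df = pd.read_csv('train.csv')

# Drop rows with missing values and encode the categorical
# columns listed above as integer codes.
df = df.dropna()
for col in ['Gender', 'Customer Type', 'Type of Travel', 'Class', 'Satisfaction']:
    df[col] = df[col].astype('category').cat.codes

X = df.drop(columns=['Satisfaction'])
y = df['Satisfaction']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Standardize features so distances are not dominated by large-scale columns.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)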

K Nearest Neighbors — Classification
K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions).
The KNN model (K Nearest Neighbors) predicts the label of a new point from the labels of the closest points in the training set. If K = 1, the prediction is simply the label of the single nearest neighbor; if K > 1, the label is decided by a vote among the K nearest neighbors. The weights parameter controls that vote: with 'uniform' every neighbor counts equally, while with 'distance' closer neighbors carry more weight than distant ones. A toy illustration of the difference follows.
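
As a small illustration (a toy example, not from the original analysis) of how the weights setting can flip a prediction:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two class-0 points sit close to the query at x = 5; three class-1 points sit farther away.
X_toy = np.array([[4.0], [4.5], [7.0], [7.5], [8.0]])
y_toy = np.array([0, 0, 1, 1, 1])

# With uniform weights all five neighbors vote equally, so class 1 wins 3-2.
uniform = KNeighborsClassifier(n_neighbors=5, weights='uniform').fit(X_toy, y_toy)
print(uniform.predict([[5.0]]))   # [1]

# With distance weights the two nearby class-0 points outweigh the three
# distant class-1 points (1/1 + 1/0.5 = 3.0 vs 1/2 + 1/2.5 + 1/3 ≈ 1.23).
weighted = KNeighborsClassifier(n_neighbors=5, weights='distance').fit(X_toy, y_toy)
print(weighted.predict([[5.0]]))  # [0]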
93% Accuracy Rate
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

def identifyKNN():
    best_k = 1
    best_score = 0
    # Try every k from 1 to 19 and keep the one with the best test accuracy.
    for i in range(1, 20):
        modelKNN = KNeighborsClassifier(n_neighbors=i, weights='distance')
        modelKNN.fit(X_train, y_train)
        accuracy = modelKNN.score(X_test, y_test)
        if accuracy > best_score:
            best_k = i
            best_score = accuracy
    print(best_k, " ", best_score)

modelKNN = KNeighborsClassifier(n_neighbors=9, weights='distance')
modelKNN.fit(X_train, y_train)
predictionsKNN = modelKNN.predict(X_test)
accuracyKNN = metrics.accuracy_score(y_test, predictionsKNN)  # accuracy: 0.9335540498922081
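
The search call itself is not shown in the original; picking n_neighbors = 9 for the final model suggests it printed 9 as the best k, with the same accuracy reported above:

identifyKNN()
# Presumed output of that run: 9   0.9335540498922081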
Choosing the number of folds for K-fold cross-validation: we try fold counts from 2 to 19. The cv argument of cross_val_score is the number of "folds" the training set is split into; the model is trained on all but one fold and scored on the held-out fold, once per fold, and the scores are averaged.
from sklearn.model_selection import cross_val_score

def identifyKFold(model):
    best_cv = 2
    best_score = 0
    # Try every fold count from 2 to 19 and keep the best mean score.
    for i in range(2, 20):
        accuracy = cross_val_score(model, X_train, y_train, cv=i).mean()
        if accuracy > best_score:
            best_cv = i
            best_score = accuracy
    print('Best index:', best_cv, "\ncross_val_score of index", best_cv, ':', best_score)

# Best index returned: 19
# cross_val_score for 19 folds: 0.9317832165238031
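
Presumably the fitted KNN model was the argument here. With the best fold count found, the reported mean score can be reproduced directly (a usage sketch):

# Re-run cross-validation for the 9-neighbor model with the chosen fold count.
scores = cross_val_score(modelKNN, X_train, y_train, cv=19)
print(scores.mean())  # about 0.9318 in the original run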

ROC curve (receiver operating characteristic curve)
The ROC curve plots the true positive rate against the false positive rate as the decision threshold varies. The code above never stores predicted probabilities, so we compute them first; with the 0/1 label encoding from the preparation sketch, column 1 of predict_proba is the positive class.

import matplotlib.pyplot as plt

# Probability assigned to the positive class for each test sample.
probsKNN = modelKNN.predict_proba(X_test)[:, 1]

fprKNN, tprKNN, thresholdsKNN = metrics.roc_curve(y_test, probsKNN)
fig = plt.figure()
axes = fig.add_axes([0, 0, 1, 1])
axes.plot(fprKNN, tprKNN, label="KNN")
axes.set_xlabel("False positive rate")
axes.set_ylabel("True positive rate")
axes.set_title("ROC Curve for KNN")
axes.legend()
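
The curve is easier to compare across models as a single number; the area under it (not reported in the original) comes straight from the same probabilities:

# Area under the ROC curve: 0.5 is chance level, 1.0 is a perfect ranking.
aucKNN = metrics.roc_auc_score(y_test, probsKNN)
print('KNN ROC AUC:', aucKNN)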

Dataset: flight satisfaction survey data from Kaggle.