K Nearest Neighbors: Predicting Airline Passenger Satisfaction

Data Preparation
The dataset contains about 130,000 survey records and passenger details from a US airline. Passengers rate individual aspects of the flight experience on a scale of 1 to 5 (0 where a service was not applicable). The features are listed below; a sketch of one plausible way to load and encode them follows the list.
Gender: Gender of the passengers (Female, Male)
Customer Type: The customer type (Loyal customer, disloyal customer)
Age: The actual age of the passengers
Type of Travel: Purpose of the flight of the passengers (Personal Travel, Business Travel)
Class: Travel class in the plane of the passengers (Business, Eco, Eco Plus)
Flight distance: The flight distance of this journey
Inflight wifi service: Satisfaction level of the inflight wifi service (0: Not Applicable; 1-5)
Departure/Arrival time convenient: Satisfaction level with departure/arrival time convenience
Ease of Online booking: Satisfaction level of online booking
Gate location: Satisfaction level of Gate location
Food and drink: Satisfaction level of Food and drink
Online boarding: Satisfaction level of online boarding
Seat comfort: Satisfaction level of Seat comfort
Inflight entertainment: Satisfaction level of inflight entertainment
On-board service: Satisfaction level of On-board service
Leg room service: Satisfaction level of Leg room service
Baggage handling: Satisfaction level of baggage handling
Check-in service: Satisfaction level of Check-in service
Inflight service: Satisfaction level of inflight service
Cleanliness: Satisfaction level of Cleanliness
Departure Delay in Minutes: Minutes of delay at departure
Arrival Delay in Minutes: Minutes of delay at arrival
Satisfaction: Overall airline satisfaction level (satisfied, neutral or dissatisfied)
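
The original write-up does not show its preprocessing code, so here is a minimal sketch of one plausible pipeline. The file name train.csv is an assumption (whatever the Kaggle download is called), the column names are taken from the listing above, and the scaling step is added because KNN is distance-based and an unscaled column such as Flight distance would otherwise dominate the 1-5 ratings.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical file name for the Kaggle download.
df = pd.read_csv('train.csv')

# Drop rows with missing values and encode the categorical
# columns listed above as integer codes.
df = df.dropna()
for col in ['Gender', 'Customer Type', 'Type of Travel', 'Class', 'Satisfaction']:
    df[col] = df[col].astype('category').cat.codes

X = df.drop(columns=['Satisfaction'])
y = df['Satisfaction']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Standardize features so distances are not dominated by large-scale columns.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)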

K Nearest Neighbors — Classification
K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions).
The KNN model (K Nearest Neighbors) predicts the label of a new point from the labels of the closest points in the training set. If K = 1, the prediction is simply the label of the single nearest neighbor; if K > 1, the label is decided by a vote among the K nearest neighbors. The weights parameter controls that vote: with 'uniform' every neighbor counts equally, while with 'distance' closer neighbors carry more weight than distant ones. A toy illustration of the difference follows.
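
As a small illustration (a toy example, not from the original analysis) of how the weights setting can flip a prediction:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two class-0 points sit close to the query at x = 5; three class-1 points sit farther away.
X_toy = np.array([[4.0], [4.5], [7.0], [7.5], [8.0]])
y_toy = np.array([0, 0, 1, 1, 1])

# With uniform weights all five neighbors vote equally, so class 1 wins 3-2.
uniform = KNeighborsClassifier(n_neighbors=5, weights='uniform').fit(X_toy, y_toy)
print(uniform.predict([[5.0]]))   # [1]

# With distance weights the two nearby class-0 points outweigh the three
# distant class-1 points (1/1 + 1/0.5 = 3.0 vs 1/2 + 1/2.5 + 1/3 ≈ 1.23).
weighted = KNeighborsClassifier(n_neighbors=5, weights='distance').fit(X_toy, y_toy)
print(weighted.predict([[5.0]]))  # [0]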
93% Accuracy Rate
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

def identifyKNN():
    best_k = 1
    best_score = 0
    # Try every k from 1 to 19 and keep the one with the best test accuracy.
    for i in range(1, 20):
        modelKNN = KNeighborsClassifier(n_neighbors=i, weights='distance')
        modelKNN.fit(X_train, y_train)
        accuracy = modelKNN.score(X_test, y_test)
        if accuracy > best_score:
            best_k = i
            best_score = accuracy
    print(best_k, " ", best_score)

modelKNN = KNeighborsClassifier(n_neighbors=9, weights='distance')
modelKNN.fit(X_train, y_train)
predictionsKNN = modelKNN.predict(X_test)
accuracyKNN = metrics.accuracy_score(y_test, predictionsKNN)  # accuracy: 0.9335540498922081
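
The search call itself is not shown in the original; picking n_neighbors = 9 for the final model suggests it printed 9 as the best k, with the same accuracy reported above:

identifyKNN()
# Presumed output of that run: 9   0.9335540498922081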
Choosing the number of folds for K-fold cross-validation: we try fold counts from 2 to 19. The cv argument of cross_val_score is the number of "folds" the training set is split into; the model is trained on all but one fold and scored on the held-out fold, once per fold, and the scores are averaged.
from sklearn.model_selection import cross_val_score

def identifyKFold(model):
    best_cv = 2
    best_score = 0
    # Try every fold count from 2 to 19 and keep the best mean score.
    for i in range(2, 20):
        accuracy = cross_val_score(model, X_train, y_train, cv=i).mean()
        if accuracy > best_score:
            best_cv = i
            best_score = accuracy
    print('Best index:', best_cv, "\ncross_val_score of index", best_cv, ':', best_score)

# Best index returned: 19
# cross_val_score for 19 folds: 0.9317832165238031
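
Presumably the fitted KNN model was the argument here. With the best fold count found, the reported mean score can be reproduced directly (a usage sketch):

# Re-run cross-validation for the 9-neighbor model with the chosen fold count.
scores = cross_val_score(modelKNN, X_train, y_train, cv=19)
print(scores.mean())  # about 0.9318 in the original run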

ROC curve (receiver operating characteristic curve)
The ROC curve plots the true positive rate against the false positive rate as the decision threshold varies. The code above never stores predicted probabilities, so we compute them first; with the 0/1 label encoding from the preparation sketch, column 1 of predict_proba is the positive class.

import matplotlib.pyplot as plt

# Probability assigned to the positive class for each test sample.
probsKNN = modelKNN.predict_proba(X_test)[:, 1]

fprKNN, tprKNN, thresholdsKNN = metrics.roc_curve(y_test, probsKNN)
fig = plt.figure()
axes = fig.add_axes([0, 0, 1, 1])
axes.plot(fprKNN, tprKNN, label="KNN")
axes.set_xlabel("False positive rate")
axes.set_ylabel("True positive rate")
axes.set_title("ROC Curve for KNN")
axes.legend()
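
The curve is easier to compare across models as a single number; the area under it (not reported in the original) comes straight from the same probabilities:

# Area under the ROC curve: 0.5 is chance level, 1.0 is a perfect ranking.
aucKNN = metrics.roc_auc_score(y_test, probsKNN)
print('KNN ROC AUC:', aucKNN)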

Dataset: flight satisfaction survey data from Kaggle.