This project processes the Airline dataset. The dataset contains much valuable information, but it also has problems such as missing data. The project uses statistical methods to impute the missing values and then creates statistical graphs to present the analysis results.
2. Project Milestones - Metadata Extraction and Imputation
Step 1: JSON to CSV file (RStudio)
#### RStudio ####
library(RJSONIO)
# from the website
AirlineRaw<-fromJSON("http://ist.gmu.edu/~hpurohit/courses/ait582-proj-data-spring16.json")
length(AirlineRaw)
# if downloaded, just upload
# AirlineRaw<-fromJSON("ait582_proj_data_spring16.json")
# Bind the records row-wise into one table
Airline_data <- do.call(rbind, AirlineRaw)
# Reorder the columns
Airline_data <- Airline_data[,c("CUSTOMERID","SUCCESS","DESCRIPTION","SEATCLASS","GUESTS","FARE")]
# Drop the first row
Airline_data <- Airline_data[-1,]
# Then write it to a flat csv file
write.csv(Airline_data, "Airline_data.csv")
#################################################################################
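The conversion in Step 1 can be sketched without a network connection. The example below is only an illustration, not the project's code: the two records and the helper `record_to_row` are hypothetical, and it assumes each parsed JSON object arrives as a named list in which a missing field is `NULL`.

```r
# Hypothetical records shaped like parsed JSON objects (values are made up)
records <- list(
  list(CUSTOMERID = 1, SUCCESS = 0, DESCRIPTION = "Doe, Mr. John;22",
       SEATCLASS = 3, GUESTS = 1, FARE = 7.25),
  list(CUSTOMERID = 2, SUCCESS = 1, DESCRIPTION = "Roe, Mrs. Jane;",
       SEATCLASS = 1, GUESTS = 0, FARE = NULL)
)

# Turn one record into a one-row data frame, replacing NULL fields with NA
record_to_row <- function(rec) {
  rec[vapply(rec, is.null, logical(1))] <- NA
  as.data.frame(rec, stringsAsFactors = FALSE)
}

# Stack the one-row data frames and put the columns in a fixed order
airline_df <- do.call(rbind, lapply(records, record_to_row))
airline_df <- airline_df[, c("CUSTOMERID", "SUCCESS", "DESCRIPTION",
                             "SEATCLASS", "GUESTS", "FARE")]
print(airline_df)
```

Building one data frame per record keeps each column's type, so the missing FARE stays a numeric `NA` instead of turning the whole column into text.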
Step 2.1: Split the column “description”
#### RStudio ####
# Step 2 Metadata Extraction and Imputation
# Split a column from csv file
library(tidyr)
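# Read Data (assumed: the file written in Step 1)
myData <- read.csv("Airline_data.csv")
# Duplicate the column "DESCRIPTION"
DESCRIPTION2 <- myData$DESCRIPTION
myData <- cbind(myData,DESCRIPTION2)
View(myData)
# separate data
# Note: sep is a regular expression, so a literal period must be escaped as "\\.";
# extra = "merge" keeps any further separators inside the last piece
myData1 <- separate(myData, DESCRIPTION2, c("DESCRIPTION2","Age"), sep = ";")
myData2 <- separate(myData1, DESCRIPTION2, c("First Name","DESCRIPTION2"), sep = ",", extra = "merge")
# View(myData2)
myData3 <- separate(myData2, DESCRIPTION2, c("Title","Last Name"), sep = "\\. ", extra = "merge")
View(myData3)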
# Save as CSV file
write.csv(myData3, "Airline_data_Split.csv")
####################################################################################################
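The same split can be sketched in base R with a single regular expression. This is an illustration under assumptions: the sample strings are invented, the layout "part1, Title. part2;Age" is inferred from the separators used above, and the variable names are hypothetical.

```r
# Hypothetical description strings in the assumed "part1, Title. part2;Age" layout
description <- c("Doe, Mr. John;22", "Roe, Mrs. Jane Ann;", "Poe, Dr. Sam;41")

# One pattern with four capture groups; "\\." matches a literal period
pattern <- "^([^,]*), *([^.]*)\\. *([^;]*);(.*)$"

before_comma <- sub(pattern, "\\1", description)
title        <- sub(pattern, "\\2", description)
after_title  <- sub(pattern, "\\3", description)
# An empty age string becomes NA
age          <- as.numeric(sub(pattern, "\\4", description))

split_df <- data.frame(before_comma, title, after_title, age,
                       stringsAsFactors = FALSE)
print(split_df)
```

Because the period is escaped, a name such as "Jane Ann" is not cut at the space, and the extracted title carries no leading blank.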
Step 2.2: Impute missing data
#### RStudio ####
# Impute Missing Values
# import csv
myData <- read.csv("Airline_data_Split2.csv")
View(myData)
aggregate(Age~Title,myData,mean)
# Plot graph: original data
hist(myData$Age, freq=NULL, main='Age: Original Data',
col='darkgreen', ylim=c(0,500),xlab = "Age", ylab = "Population")
# Simple mean imputation
myData$Age <- ifelse(is.na(myData$Age), mean(myData$Age, na.rm=TRUE), myData$Age)
# Convert NUM to INT
myData$Age <- as.integer(myData$Age)
View(myData)
# Plot graph: new data
hist(myData$Age, freq=NULL, main='Age: New Data',
col='red', ylim=c(0,500),xlab = "Age", ylab = "Population")
write.csv(myData, "Airline_data_ImputeMisssingValue2.csv")
####################################################################################################
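The script computes the mean age per Title with aggregate() but imputes with the overall mean. As a possible alternative, the sketch below fills each missing age with the mean of its own Title group. It is not the method used in this project; the toy data frame `passengers` and the column `Age_imputed` are hypothetical.

```r
# Hypothetical toy data: two ages are missing
passengers <- data.frame(
  Title = c("Mr", "Mr", "Mr", "Miss", "Miss"),
  Age   = c(30, NA, 40, 20, NA),
  stringsAsFactors = FALSE
)

# Mean age of each Title group, repeated for every row of that group
group_mean <- ave(passengers$Age, passengers$Title,
                  FUN = function(x) mean(x, na.rm = TRUE))

# Keep observed ages; use the group mean only where Age is missing
passengers$Age_imputed <- ifelse(is.na(passengers$Age), group_mean, passengers$Age)
print(passengers)
```

Group-wise means keep the age difference between titles, which a single overall mean would flatten. A group with no observed age at all would still yield NaN and would need a fallback.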
Step 2.3: Add a column “Gender”
#### RStudio ####
# Step 2.3 Add a column "Gender"
#Import csv file
myData <- read.csv("Airline_data_ImputeMisssingValue2.csv")
View(myData)
# aggregate(): mean of each Title
aggregate(Age~Title,myData,mean)
# Add a new column"Gender" by copying Title
Gender <- myData$Title
myData <- cbind(myData,Gender)
View(myData)
# Replace values: map each title to a gender by exact match.
# A lookup avoids substring clashes (gsub('Mr', ...) also rewrites 'Mrs').
# 'Mme' (Madame) is mapped to Female; 'Dr' is assumed Male as in the original mapping;
# 'th' is kept as it appears in the Title column.
title_gender <- c(Capt='Male', Col='Male', Don='Male', Dr='Male',
Jonkheer='Male', Lady='Female', Major='Male', Master='Male',
Miss='Female', Mlle='Female', Mme='Female', Mr='Male',
Mrs='Female', Ms='Female', Rev='Male', Sir='Male', th='Female')
# Titles that are not in the lookup become NA
myData$Gender <- unname(title_gender[trimws(as.character(myData$Gender))])
View(myData)
# Save as csv file
write.csv(myData,"Airline_data_Gender.csv")
# Calculate num of Male and Female
table(myData$Gender)
# Pie Chart from data frame with Appended Sample Sizes
mytable <- table(myData$Gender)
lbls <- paste(names(mytable), "\n", mytable, sep="")
# Pass the labels so the sample sizes appear on the chart
pie(mytable, labels = lbls, edges = 300, col = rainbow(2), radius = 0.9,
main="Gender")
####################################################################################################
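The pie chart labels can also carry percentages next to the counts. The sketch below shows one way to build such labels; the vector `gender` holds made-up values and stands in for the Gender column.

```r
# Hypothetical gender values standing in for myData$Gender
gender <- c("Male", "Female", "Male", "Male")

counts <- table(gender)                       # counts per category, sorted by name
pct    <- round(100 * prop.table(counts), 1)  # share of each category in percent

# Label = category name, count, and percentage
lbls <- paste0(names(counts), "\n", counts, " (", pct, "%)")
print(lbls)
```

The labels can then be passed to the chart with `pie(counts, labels = lbls)`.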
II. Summarization and Visualization
1. Preprocessing dataset (WEKA)
2. Attribute Selection (WEKA)
3. The section “Visualize” (WEKA)
4. Scatter plot (R)
#### RStudio ####
# import csv
myData <- read.csv("Airline_data_Gender.csv")
View(myData)
# scatterplot #1
plot(myData[,9:14])
# scatterplot #2
comp <- data.frame(myData[,10:14])
plot(comp, pch=16, col=rgb(0,0,0,0.5))
####################################################################################################
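The scatter plots above select columns by position (9:14). Positions shift whenever a column is added, for example the row-name column that write.csv() adds by default on every save. A sketch of selecting the plotted columns by name instead is given below; the toy data frame `toy` and the `wanted` list are assumptions for illustration.

```r
# Hypothetical toy data with one text column and several numeric columns
toy <- data.frame(
  Title     = c("Mr", "Mrs", "Miss", "Mr"),
  SUCCESS   = c(0, 1, 1, 0),
  Age       = c(22, 38, 26, 35),
  SEATCLASS = c(3, 1, 3, 1),
  FARE      = c(7.25, 71.28, 7.93, 53.10),
  stringsAsFactors = FALSE
)

# Columns we would like to plot; GUESTS is absent from the toy data on purpose
wanted  <- c("SUCCESS", "Age", "SEATCLASS", "GUESTS", "FARE")
present <- intersect(wanted, names(toy))

# Keep only the columns that exist and are numeric
numeric_cols <- present[vapply(toy[present], is.numeric, logical(1))]
print(numeric_cols)

# Draw the scatter plot matrix only in an interactive session
if (interactive()) pairs(toy[numeric_cols], pch = 16, col = rgb(0, 0, 0, 0.5))
```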
5.1 First clustering graph (WEKA)
5.2 Second Clustering graph (WEKA)
III. Analytics and Conclusion
1. Classifier Model – (WEKA)
7 attributes
4 attributes
2.1 Model - Decision Tree with 4 Attributes – J48 (WEKA)
2.2 Model – Decision Tree with 5 Attributes - (R)
< origin >
< Simple >
#### RStudio ####
# 3.2 Decision Tree
# source: http://www.rdatamining.com/examples/decision-tree
library(grid)
library(mvtnorm)
library(modeltools)
library(stats4)
library(party)
# Read CSV file
myData <- read.csv("Airline_data_Gender.csv")
View(myData)
# ctree() needs a factor response for classification
myData$Gender <- as.factor(myData$Gender)
# ------------------------------------------------------------------------------ #
# 4 Attributes: "SUCCESS","Age","SEATCLASS","Gender"
# Preprocess data (myData is left intact so that GUESTS is still available below)
myData4 <- myData[,c("SUCCESS","Age","SEATCLASS","Gender")]
View(myData4)
# Implementation for Dataframe
str(myData4)
myData4_ctree <- ctree(Gender ~ SUCCESS + Age + SEATCLASS, data = myData4)
print(myData4_ctree)
plot(myData4_ctree, type="simple")
# ------------------------------------------------------------------------------ #
# 5 Attributes: "SUCCESS","Age","SEATCLASS","GUESTS","Gender"
# Preprocess data
myData5 <- myData[,c("SUCCESS","Age","SEATCLASS","GUESTS","Gender")]
View(myData5)
# Implementation for Dataframe
str(myData5)
myData5_ctree <- ctree(Gender ~ SUCCESS + Age + SEATCLASS + GUESTS, data = myData5)
print(myData5_ctree)
plot(myData5_ctree)
plot(myData5_ctree, type="simple")
# ------------------------------------------------------------------------------ #
####################################################################################################
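To compare the 4-attribute and 5-attribute trees, the predictions can be summarized in a confusion matrix and an accuracy rate. The sketch below shows the calculation on made-up labels; the vectors `actual` and `predicted` are hypothetical and do not come from the fitted trees.

```r
# Hypothetical true and predicted labels (five made-up cases)
lv        <- c("Female", "Male")
actual    <- factor(c("Male", "Female", "Male", "Female", "Male"), levels = lv)
predicted <- factor(c("Male", "Male", "Male", "Female", "Female"), levels = lv)

# Rows are the true labels, columns the predicted labels
confusion <- table(Actual = actual, Predicted = predicted)
print(confusion)

# Accuracy = correctly classified cases / all cases
accuracy <- sum(diag(confusion)) / sum(confusion)
print(accuracy)
```

With a tree fitted by ctree() from the party package, the predicted labels could be obtained with predict() on the fitted object and compared with the Gender column in the same way.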
3. K-means clustering (k=3)
#### RStudio ####
# 3.3 k-means Clustering
# source: http://www.rdatamining.com/examples/kmeans-clustering
# Read CSV file
myData <- read.csv("Airline_data_Gender.csv")
View(myData)
# 4 Attributes: "SUCCESS","Age","SEATCLASS","Gender"
# Preprocess data
myData <- myData[,c("SUCCESS","Age","SEATCLASS","Gender")]
View(myData)
# Implementation for Dataframe
str(myData)
new_myData <-myData
new_myData$Gender <- NULL
kc <- kmeans(new_myData, 3, nstart = 1)
# Compare the Gender label with the clustering result
table(myData$Gender, kc$cluster)
plot(new_myData[c("SUCCESS", "Age")], col=kc$cluster)
points(kc$centers[,c("SUCCESS", "Age")], col=1:3, pch=8, cex=2)
####################################################################################################
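k-means uses Euclidean distance, so a column with a wide range such as Age can dominate columns such as SUCCESS or SEATCLASS. The sketch below standardizes the columns first and uses several random starts. It is a variation under assumptions, not the run reported above: the toy data, the seed, and `nstart = 25` are illustrative choices.

```r
# Hypothetical toy data with the three clustering attributes
toy <- data.frame(
  SUCCESS   = c(0, 1, 0, 1, 0, 1),
  Age       = c(5, 8, 30, 35, 62, 70),
  SEATCLASS = c(3, 2, 3, 1, 2, 1)
)

# Standardize each column to mean 0 and standard deviation 1
scaled <- scale(toy)

# Fix the seed so the random starts are reproducible
set.seed(42)
kc <- kmeans(scaled, centers = 3, nstart = 25)

# Cluster sizes and the cluster assigned to each row
print(kc$size)
print(kc$cluster)
```

Using more than one start lets kmeans() keep the best of several random initializations, which makes the result less dependent on a single random draw.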
5. ROC Curve in WEKA
IV. Discussion and Summary
The airline dataset contains many missing values, which makes it hard to interpret the data and draw conclusions with spreadsheet applications such as Excel or Google Sheets. This project therefore uses metadata extraction to preprocess the dataset. In the second milestone, imputing the missing values is an essential step. The project also derives a Gender column so that the data can be compared across groups. Finally, the project uses R and WEKA to create visualizations that show the relationships between attributes. The summary reports provide further information: they show that using more attributes lowers the accuracy rate. The decision tree is the model of this project; although it is a simple graph, it still explains several details of the airline dataset. The k-means clustering graphs show three clusters. From the results, middle-aged people appear to have a higher success rate than younger and older people. Finally, the ROC curves show that the binary classifier models perform well. [1][2][3]
References:
[1] O. source, "Introduction and regression", Ibm.com, 2017. [Online]. Available: https://www.ibm.com/developerworks/library/os-weka1/. [Accessed: 03-May-2017].
[2] "Data", Cs.cornell.edu, 2017. [Online]. Available: http://www.cs.cornell.edu/People/pabo/movie-review-data/. [Accessed: 03-May-2017].
[3] "Examples - RDataMining.com: R and Data Mining", Rdatamining.com, 2017. [Online]. Available: http://www.rdatamining.com/examples. [Accessed: 07-May-2017].