Friday, May 26, 2017

Airline Dataset: Preprocessing and clustering by RStudio & WEKA

I. Introduction and Problem Formulation
1. Abstract
       This project processes the Airline dataset. The dataset contains much valuable information, but it also has problems such as missing values. This project uses statistical methods to impute the missing data and then creates statistical graphs to present the analysis results.
2. Project Milestones - Metadata Extraction and Imputation

Step 1: JSON to CSV file (RStudio)






####  RStudio ####
library(RJSONIO)
# from the website
AirlineRaw<-fromJSON("http://ist.gmu.edu/~hpurohit/courses/ait582-proj-data-spring16.json")

# or, if the file has already been downloaded, read the local copy instead
# AirlineRaw<-fromJSON("ait582_proj_data_spring16.json")
length(AirlineRaw)

# We can coerce this to a data.frame
Airline_data <- do.call(rbind, AirlineRaw)

# Keep the six columns of interest, in output order
Airline_data <- Airline_data[,c("CUSTOMERID","SUCCESS","DESCRIPTION","SEATCLASS","GUESTS","FARE")]
# Drop the first row
Airline_data <- Airline_data[-1,]
# Then write it to a flat csv file
write.csv(Airline_data, "Airline_data.csv")
#################################################################################
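To illustrate what do.call(rbind, ...) does in Step 1, here is a minimal sketch on a hand-made list that stands in for the parsed JSON. The two records and their values are made up for illustration, and the assumption that each record arrives as a named character vector may not match the real file; only the column names come from the script above.

```r
# Hypothetical stand-in for the list returned by fromJSON():
# one named character vector per record (all values are made up).
records <- list(
  c(CUSTOMERID = "1", SUCCESS = "0", SEATCLASS = "3", GUESTS = "1", FARE = "7.25"),
  c(CUSTOMERID = "2", SUCCESS = "1", SEATCLASS = "1", GUESTS = "0", FARE = "71.28")
)

# rbind() stacks the records into a character matrix, one row per record
record_matrix <- do.call(rbind, records)

# Convert to a data frame and restore numeric types column by column
airline_df <- as.data.frame(record_matrix, stringsAsFactors = FALSE)
airline_df[] <- lapply(airline_df, type.convert, as.is = TRUE)

str(airline_df)
```

Converting the types explicitly keeps FARE and the other numeric fields from being written out as text.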
Step 2.1: Split the column “description”





####  RStudio #### 
# Step 2 Metadata Extraction and Imputation
# Split a column from csv file
library(tidyr)

# Read Data
myData = read.csv("Airline_data.csv")
 View(myData)

# Duplicate the column "DESCRIPTION"
DESCRIPTION2 <- myData$DESCRIPTION
myData <- cbind(myData,DESCRIPTION2)
View(myData)

# separate data

myData1 <- separate(myData, DESCRIPTION2, c("DESCRIPTION2","Age"), sep = ";")
myData2 <- separate(myData1, DESCRIPTION2, c("First Name","DESCRIPTION2"), sep = ",")
# View(myData2)
myData3 <- separate(myData2, DESCRIPTION2, c("Title","Last Name"), sep = ". ")
View(myData3)

# Save as CSV file
write.csv(myData3, "Airline_data_Split.csv")

####################################################################################################
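Because a character sep in separate() is interpreted as a regular expression, the pattern ". " matches any character followed by a space, not only a literal period. The sketch below shows the same kind of split in base R with explicit patterns. The two DESCRIPTION strings are hypothetical, and the "Surname, Title. Given names;Age" layout is an assumption about the data, not something confirmed by the dataset itself.

```r
# Hypothetical DESCRIPTION values (layout assumed: "Surname, Title. Given names;Age")
desc <- c("Braund, Mr. Owen Harris;22", "Heikkinen, Miss. Laina;26")

# Split on the literal semicolon: name part first, age second
parts <- strsplit(desc, ";", fixed = TRUE)
name  <- sapply(parts, `[`, 1)
age   <- as.integer(sapply(parts, `[`, 2))

# Surname: everything before the first comma
surname <- sub(",.*$", "", name)

# Title: text between the comma and the first literal period
title <- sub("^[^,]*, *([^.]*)\\..*$", "\\1", name)

data.frame(surname, title, age)
```

Escaping the period as "\\." is what restricts the match to a literal dot.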
Step 2.2: Impute missing data
 

####  RStudio  ####
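# Impute Missing Values
# import csv
# Note: Step 2.1 saves "Airline_data_Split.csv"; the "Split2" file read here
# is presumably a separately saved copy of that output

myData <- read.csv("Airline_data_Split2.csv")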

View(myData)

aggregate(Age~Title,myData,mean)

# Plot graph: original data
hist(myData$Age, freq=NULL, main='Age: Original Data',
     col='darkgreen', ylim=c(0,500),xlab = "Age", ylab = "Population")

# Simple mean imputation
myData$Age <- ifelse(is.na(myData$Age), mean(myData$Age, na.rm=TRUE), myData$Age)

# Convert NUM to INT
myData$Age <- as.integer(myData$Age)
View(myData)

# Plot graph: new data
hist(myData$Age, freq=NULL, main='Age: New Data',
     col='red', ylim=c(0,500),xlab = "Age", ylab = "Population")

write.csv(myData, "Airline_data_ImputeMisssingValue2.csv")
####################################################################################################
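The per-title means printed by aggregate(Age~Title, ...) are not used by the simple mean imputation in Step 2.2. As a possible refinement, and not the method used in this project, the following minimal sketch fills each missing age with the mean age of the same title. The toy data frame is made up; only the column names Title and Age follow the script above.

```r
# Toy data (made up); Title and Age mirror the column names used above
toy <- data.frame(
  Title = c("Mr", "Mr", "Mr", "Miss", "Miss"),
  Age   = c(30, NA, 40, 20, NA)
)

# Mean age within each title, repeated for every row of that title
title_mean <- ave(toy$Age, toy$Title,
                  FUN = function(x) mean(x, na.rm = TRUE))

# Fill only the missing ages with the mean of their own title group
toy$Age <- ifelse(is.na(toy$Age), title_mean, toy$Age)
print(toy)
```

If every age within a title is missing, the group mean is NaN, so such rows would still need a fallback such as the overall mean.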

Step 2.3: Create a new column “Gender”









#### RStudio ####
# Step 2.3 Add a column "Gender"
# Import csv file
myData <- read.csv("Airline_data_ImputeMisssingValue2.csv")
View(myData)
# aggregate(): mean of each Title

aggregate(Age~Title,myData,mean)

# Add a new column "Gender" by copying Title
Gender <- myData$Title
myData <- cbind(myData,Gender)
View(myData)

# Replace values
# Longer titles are replaced before their substrings ('Mrs' before 'Mr'),
# so no follow-up fix for 'Males' is needed
myData$Gender <- gsub('Capt','Male',myData$Gender)
myData$Gender <- gsub('Col','Male',myData$Gender)
myData$Gender <- gsub('Don','Male',myData$Gender)
# assumes every 'Dr' is male
myData$Gender <- gsub('Dr','Male',myData$Gender)
myData$Gender <- gsub('Jonkheer','Male',myData$Gender)
myData$Gender <- gsub('Lady','Female',myData$Gender)

myData$Gender <- gsub('Major','Male',myData$Gender)
myData$Gender <- gsub('Master','Male',myData$Gender)
myData$Gender <- gsub('Miss','Female',myData$Gender)
myData$Gender <- gsub('Mlle','Female',myData$Gender)
# 'Mme' (Madame) is a female title
myData$Gender <- gsub('Mme','Female',myData$Gender)
myData$Gender <- gsub('Mrs','Female',myData$Gender)
myData$Gender <- gsub('Mr','Male',myData$Gender)
myData$Gender <- gsub('Ms','Female',myData$Gender)
myData$Gender <- gsub('Rev','Male',myData$Gender)
myData$Gender <- gsub('Sir','Male',myData$Gender)
myData$Gender <- gsub('th','Female',myData$Gender)

View(myData)

# Save as csv file

write.csv(myData,"Airline_data_Gender.csv")

# Calculate num of Male and Female
table(myData$Gender)

# Pie Chart from data frame with Appended Sample Sizes
mytable <- table(myData$Gender)
lbls <- paste(names(mytable), "\n", mytable, sep="")
pie(mytable, labels = lbls, edges = 300, col = rainbow(2), radius = 0.9, 
    main="Gender")

####################################################################################################
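A chain of gsub() calls matches substrings, so the result depends on the order of the calls. As an alternative sketch, a named lookup vector maps each title by exact match. The example titles are made up, the table covers only some of the titles, and the gender assigned to ambiguous titles such as "Dr" is an assumption, as it is in the gsub() version.

```r
# Lookup table from title to gender (partial; "Dr" as male is an assumption)
title_to_gender <- c(
  Mr = "Male", Master = "Male", Dr = "Male", Rev = "Male",
  Mrs = "Female", Miss = "Female", Ms = "Female", Mme = "Female", Mlle = "Female"
)

# Example titles (made up), including a leading space and an unknown value
titles <- c(" Mr", "Mrs", "Miss", "Master", "Mme", "Unknown")

# Exact-match lookup after trimming whitespace; unmatched titles become NA
gender <- unname(title_to_gender[trimws(titles)])
table(gender, useNA = "ifany")
```

Titles that are missing from the table show up as NA instead of being silently left as their original text, which makes gaps in the mapping easy to spot.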
II. Summarization and Visualization

1. Preprocessing dataset (WEKA)










2. Attribute Selection (WEKA)












3. The section “Visualize” (WEKA)










4. Scatter plot (R)











#### RStudio ####
# import csv
myData <- read.csv("Airline_data_Gender.csv")
View(myData)

# scatterplot #1
plot(myData[,9:14])

# scatterplot #2
comp <- data.frame(myData[,10:14])
plot(comp, pch=16, col=rgb(0,0,0,0.5))
####################################################################################################
5.1 First clustering graph (WEKA)










5.2 Second Clustering graph (WEKA)












III. Analytics and Conclusion
1. Classifier Model – (WEKA)


7 attributes





4 attributes







2.1 Model - Decision Tree with 4 Attributes – J48 (WEKA)














2.2 Model – Decision Tree with 5 Attributes - (R)


< origin >









< Simple >










####  RStudio  ####
# 3.2 Decision Tree
# source: http://www.rdatamining.com/examples/decision-tree
library(grid)
library(mvtnorm)
library(modeltools)
library(stats4)
library(party)
# Read CSV file
myData <- read.csv("Airline_data_Gender.csv")
View(myData)
# ------------------------------------------------------------------------------ #
# 4 Attributes: "SUCCESS","Age","SEATCLASS","Gender"
# Preprocess data

myData <- myData[,c("SUCCESS","Age","SEATCLASS","Gender")]
View(myData)

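# Implementation for Dataframe
# ctree() needs a factor response to build a classification tree
myData$Gender <- factor(myData$Gender)
str(myData)

myData_ctree <- ctree(Gender ~ SUCCESS + Age + SEATCLASS, data = myData)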
print(myData_ctree)
plot(myData_ctree)
plot(myData_ctree, type="simple")
# ------------------------------------------------------------------------------ #
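# 5 Attributes: "SUCCESS","Age","SEATCLASS","GUESTS","Gender"
# Preprocess data
# Re-read the CSV file: myData was reduced to 4 columns above and no longer has GUESTS
myData <- read.csv("Airline_data_Gender.csv")
myData <- myData[,c("SUCCESS","Age","SEATCLASS","GUESTS","Gender")]
View(myData)

# Implementation for Dataframe
# ctree() needs a factor response to build a classification tree
myData$Gender <- factor(myData$Gender)
str(myData)
myData_ctree <- ctree(Gender ~ SUCCESS + Age + SEATCLASS + GUESTS, data = myData)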
print(myData_ctree)

plot(myData_ctree)
plot(myData_ctree, type="simple")
# ------------------------------------------------------------------------------ #
####################################################################################################
3. K-means clustering (k=3) 




####  RStudio  ####
# 3.3 k-means Clustering
# source: http://www.rdatamining.com/examples/kmeans-clustering
# Read CSV file
myData <- read.csv("Airline_data_Gender.csv")
View(myData)

# 4 Attributes: "SUCCESS","Age","SEATCLASS","Gender"
# Preprocess data
myData <- myData[,c("SUCCESS","Age","SEATCLASS","Gender")]
View(myData)

# Implementation for Dataframe
str(myData)
new_myData <-myData 
new_myData$Gender <- NULL
kc <- kmeans(new_myData, 3, nstart = 1)

# Compare the Gender label with the clustering result
table(myData$Gender, kc$cluster)
plot(new_myData[c("SUCCESS","Age")], col=kc$cluster)
points(kc$centers[,c("SUCCESS","Age")], col=1:3, pch=8, cex=2)
####################################################################################################
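The kmeans() call above runs on unscaled columns with a single random start, so Age, which likely has the largest numeric range, can dominate the distances, and the clusters can differ between runs. The following is a minimal sketch of one common variation, not the configuration used for the graphs in this project: standardize the columns, fix the random seed, and use several starts. The toy data is made up; only the column names follow the script above.

```r
# Toy data (made up) with the three numeric columns used above
set.seed(42)  # k-means starts from random centers; fix the seed to reproduce
toy <- data.frame(
  SUCCESS   = rep(c(0, 1), 15),
  Age       = c(rnorm(10, 10, 2), rnorm(10, 35, 3), rnorm(10, 65, 4)),
  SEATCLASS = rep(1:3, 10)
)

# Standardize so that a column with a large range does not dominate the distances
toy_scaled <- scale(toy)

# Several random starts make the result less dependent on initialization
kc <- kmeans(toy_scaled, centers = 3, nstart = 25)

table(kc$cluster)
plot(toy[c("Age", "SUCCESS")], col = kc$cluster)
```

Whether scaling is appropriate depends on how much weight each attribute should carry, so the clusters from this sketch need not match the unscaled ones.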
4. ROC Curve in WEKA












IV. Discussion and Summary
The airline dataset contains many missing values, which makes it hard to interpret the information or draw conclusions with spreadsheet applications such as Excel or Google Sheets. This project therefore uses metadata extraction to preprocess the dataset. In the second milestone, imputing the missing values is an important step. The project also derives a Gender column so that the data can be compared by gender. Finally, R and WEKA are used to create visualizations that show the relationships between attributes. The classifier summary reports show that using more attributes lowers the accuracy rate. The decision tree is the model of this project; although it is a simple graph, it still explains several details of the airline dataset. The k-means clustering graphs show three clusters. From the results, middle-aged customers have a higher success rate than younger and older customers. Finally, the ROC curves show that the binary classifier models perform well. [1][2][3]

Reference:
[1] O. source, "Introduction and regression", Ibm.com, 2017. [Online]. Available: https://www.ibm.com/developerworks/library/os-weka1/. [Accessed: 03- May- 2017].
[2] "Data", Cs.cornell.edu, 2017. [Online]. Available: http://www.cs.cornell.edu/People/pabo/movie-review-data/. [Accessed: 03- May- 2017].

[3] "Examples - RDataMining.com: R and Data Mining", Rdatamining.com, 2017. [Online]. Available: http://www.rdatamining.com/examples. [Accessed: 07- May- 2017].
