KUAN-HUNG LIN (Gary) M.S. Data Analytics in GMU

I. Introduction and Problem Formulation

1. Abstract

This project processes the Airline dataset. The dataset contains a lot of valuable information, but it also has problems such as missing data. This project uses statistical methods to impute the missing data and then creates statistical graphs to present the results of the data analysis.

2. Project Milestones - Metadata Extraction and Imputation

Step 1: JSON to CSV file (RStudio)

#### RStudio ####
library(RJSONIO)

# Read the JSON directly from the course website

AirlineRaw<-fromJSON("http://ist.gmu.edu/~hpurohit/courses/ait582-proj-data-spring16.json")

# If the file has already been downloaded, read the local copy instead

# AirlineRaw<-fromJSON("ait582_proj_data_spring16.json")

length(AirlineRaw)

# Stack the records row by row (rbind returns a matrix, one row per record)

Airline_data <- do.call(rbind, AirlineRaw)

# Keep the six columns of interest, in output order

Airline_data <- Airline_data[,c("CUSTOMERID","SUCCESS","DESCRIPTION","SEATCLASS","GUESTS","FARE")]

# Drop the first record

Airline_data <- Airline_data[-1,]

# Then write it to a flat csv file

write.csv(Airline_data, "Airline_data.csv")
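
The conversion relies on do.call(rbind, ...) stacking one record per row. A minimal sketch of that step on made-up records, assuming each parsed JSON record is a named vector with the same fields (the `records` list and its values are hypothetical, not taken from the dataset):

```r
# Hypothetical records standing in for the parsed JSON list
records <- list(
  c(CUSTOMERID = "1", SUCCESS = "0", SEATCLASS = "3", GUESTS = "1", FARE = "7.25"),
  c(CUSTOMERID = "2", SUCCESS = "1", SEATCLASS = "1", GUESTS = "1", FARE = "71.28")
)

# rbind stacks the vectors into a character matrix, one row per record
mat <- do.call(rbind, records)

# Convert to a data frame and restore numeric types
df <- as.data.frame(mat, stringsAsFactors = FALSE)
df[] <- lapply(df, as.numeric)

str(df)
```

Converting to a data frame before writing is optional here, but it makes the column types explicit instead of leaving everything as text.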

#################################################################################
Step 2.1: Split the column “DESCRIPTION”

#### RStudio ####

# Step 2 Metadata Extraction and Imputation

# Split a column from csv file

library(tidyr)

# Read Data

myData = read.csv("Airline_data.csv")
View(myData)

# Duplicate the column "DESCRIPTION"

DESCRIPTION2 <- myData$DESCRIPTION

myData <- cbind(myData,DESCRIPTION2)

View(myData)

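# Separate data: split DESCRIPTION2 at ";" (age), at "," and at the period after the title

myData1 <- separate(myData, DESCRIPTION2, c("DESCRIPTION2","Age"), sep = ";")

myData2 <- separate(myData1, DESCRIPTION2, c("First Name","DESCRIPTION2"), sep = ",\\s*")

# View(myData2)

# sep is a regular expression, so the period must be escaped; extra = "merge" keeps the rest of the name in one piece

myData3 <- separate(myData2, DESCRIPTION2, c("Title","Last Name"), sep = "\\.\\s*", extra = "merge")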

View(myData3)

# Save as CSV file

write.csv(myData3, "Airline_data_Split.csv")
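
The same split can be sketched in base R. The sample strings below are hypothetical and assume the layout implied by the separators used above (text, a comma, a title ending in a period, the rest of the name, then ";" and the age); the real DESCRIPTION values may differ:

```r
# Hypothetical DESCRIPTION values; the second one has a missing age
description <- c("Smith, Mr. John;34", "Jones, Mrs. Ann Mary;", "Brown, Dr. Lee A.;51")

# Everything before ";" is the name part, everything after it is the age
name_part <- sub(";.*$", "", description)
age <- suppressWarnings(as.integer(sub("^.*;", "", description)))

# Text before the first comma
before_comma <- sub(",.*$", "", name_part)

# Title: text between the comma and the first literal period ("\\." is escaped)
title <- sub("^[^,]*,\\s*([^.]*)\\..*$", "\\1", name_part)

# Remainder of the name after the title
after_title <- sub("^[^,]*,\\s*[^.]*\\.\\s*", "", name_part)

data.frame(before_comma, title, after_title, age)
```

An unescaped "." in a regular expression matches any character, which is why the period is written as "\\." here.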

####################################################################################################

Step 2.2: Impute missing data

#### RStudio ####

# Impute Missing Values

# Import the split data (this assumes the output of Step 2.1 was saved as "Airline_data_Split2.csv")

myData <- read.csv("Airline_data_Split2.csv")

View(myData)

# Mean age for each title (rows with a missing age are ignored)

aggregate(Age~Title,myData,mean)

# Plot graph: original data

hist(myData$Age, freq=NULL, main='Age: Original Data',
     col='darkgreen', ylim=c(0,500),xlab = "Age", ylab = "Population")

# Simple mean imputation

myData$Age <- ifelse(is.na(myData$Age), mean(myData$Age, na.rm=TRUE), myData$Age)

# Convert numeric to integer (the decimal part is truncated)

myData$Age <- as.integer(myData$Age)

View(myData)

# Plot graph: new data

hist(myData$Age, freq=NULL, main='Age: New Data',
     col='red', ylim=c(0,500),xlab = "Age", ylab = "Population")

write.csv(myData, "Airline_data_ImputeMisssingValue2.csv")
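
The script computes the mean age per title but fills the gaps with the overall mean. As an alternative (not the method used in this project), missing ages could be filled with the mean of their own title group. A minimal sketch on a hypothetical `toy` data frame:

```r
# Hypothetical sample with the same column names as the split data
toy <- data.frame(
  Title = c("Mr", "Mr", "Mr", "Miss", "Miss", "Master"),
  Age   = c(30, NA, 40, 20, NA, 4)
)

# Mean age within each title, ignoring missing values
group_mean <- ave(toy$Age, toy$Title, FUN = function(a) mean(a, na.rm = TRUE))

# Fill only the missing entries with the mean of their group
toy$AgeImputed <- ifelse(is.na(toy$Age), group_mean, toy$Age)

toy
```

If every age in a group is missing, the group mean is NaN, so such groups would still need a fallback such as the overall mean.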

####################################################################################################

Step 2.3: Create a new column “Gender”

#### RStudio ####
# Step 2.3 Add a column "Gender"

# Import csv file

myData <- read.csv("Airline_data_ImputeMisssingValue2.csv")

View(myData)

# aggregate(): mean of each Title

aggregate(Age~Title,myData,mean)

# Add a new column "Gender" by copying Title

Gender <- myData$Title

myData <- cbind(myData,Gender)

View(myData)

# Replace values: map each title to a gender by matching the start of the title

# Prefix matching avoids chained substring replacement, where 'Mr' also rewrites 'Mrs'

# 'th' covers 'the Countess'; 'Mme' (Madame) is treated as female

title <- trimws(as.character(myData$Title))

is_female <- grepl("^(Lady|Miss|Mlle|Mme|Mrs|Ms|th)", title)

myData$Gender <- ifelse(is.na(title), NA, ifelse(is_female, "Female", "Male"))

View(myData)

# Save as csv file

write.csv(myData,"Airline_data_Gender.csv")

# Count males and females

table(myData$Gender)

# Pie chart with the sample size appended to each label

mytable <- table(myData$Gender)

lbls <- paste(names(mytable), "\n", mytable, sep="")

pie(mytable, labels = lbls, edges = 300, col = rainbow(2), radius = 0.9,
    main="Gender")
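
When recoding titles, substring replacement with gsub() can rewrite more than intended, because a short pattern such as 'Mr' also occurs inside 'Mrs'. A minimal sketch of the pitfall and of an exact-match lookup, using a hypothetical `titles` vector:

```r
titles <- c("Mr", "Mrs", "Miss", "Master")

# Substring replacement: "Mr" also matches inside "Mrs"
naive <- gsub("Mr", "Male", titles)
naive
# "Male" "Males" "Miss" "Master"

# Exact lookup with a named vector replaces whole values only
lookup <- c(Mr = "Male", Mrs = "Female", Miss = "Female", Master = "Male")
gender <- unname(lookup[titles])
gender
# "Male" "Female" "Female" "Male"
```

With a lookup vector, a title that is not listed comes back as NA instead of being partly rewritten, which makes unexpected values easy to spot.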

####################################################################################################

II. Summarization and Visualization

1. Preprocessing dataset (WEKA)

2. Attribute Selection (WEKA)

3. The section “Visualize” (WEKA)

4. Scatter plot (R)

#### RStudio ####
# import csv

myData <- read.csv("Airline_data_Gender.csv")

View(myData)

# scatterplot #1

plot(myData[,9:14])

# scatterplot #2

comp <- data.frame(myData[,10:14])

plot(comp, pch=16, col=rgb(0,0,0,0.5))

####################################################################################################
5.1 First clustering graph (WEKA)

5.2 Second Clustering graph (WEKA)

III. Analytics and Conclusion

1. Classifier Model – (WEKA)

7 attributes

4 attributes

2.1 Model - Decision Tree with 4 Attributes – J48 (WEKA)

2.2 Model – Decision Tree with 5 Attributes - (R)

< origin >

< Simple >

#### RStudio ####

# 3.2 Decision Tree

# source: http://www.rdatamining.com/examples/decision-tree

library(grid)

library(mvtnorm)

library(modeltools)

library(stats4)

library(party)

# Read CSV file

myData <- read.csv("Airline_data_Gender.csv")

View(myData)

# ------------------------------------------------------------------------------ #

# 4 Attributes: "SUCCESS","Age","SEATCLASS","Gender"

# Preprocess data

myData <- myData[,c("SUCCESS","Age","SEATCLASS","Gender")]

View(myData)

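# Implementation for Dataframe

str(myData)

# ctree needs a factor response for classification; R 4.0 and later read text columns as character

myData$Gender <- as.factor(myData$Gender)

myData_ctree <- ctree(Gender ~SUCCESS + Age + SEATCLASS, data = myData)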

print(myData_ctree)

plot(myData_ctree)

plot(myData_ctree, type="simple")
# ------------------------------------------------------------------------------ #

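# 5 Attributes: "SUCCESS","Age","SEATCLASS","GUESTS","Gender"

# Preprocess data: re-read the file, because myData was reduced to 4 columns above and no longer contains GUESTS

myData <- read.csv("Airline_data_Gender.csv")

myData <- myData[,c("SUCCESS","Age","SEATCLASS","GUESTS","Gender")]

myData$Gender <- as.factor(myData$Gender)

View(myData)

# Implementation for Dataframe

str(myData)

myData_ctree <- ctree(Gender ~SUCCESS + Age + SEATCLASS + GUESTS, data = myData)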

print(myData_ctree)

plot(myData_ctree)

plot(myData_ctree, type="simple")
# ------------------------------------------------------------------------------ #
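
One way to put a number on a fitted tree is to cross-tabulate its predictions against the true labels. The sketch below uses hand-made vectors (hypothetical values); with the party package, the predictions for the model above would come from predict(myData_ctree):

```r
# Hypothetical true labels and predictions
actual    <- factor(c("Male", "Male", "Female", "Female", "Male", "Female"))
predicted <- factor(c("Male", "Female", "Female", "Female", "Male", "Male"),
                    levels = levels(actual))

# Confusion matrix: rows are predictions, columns are true labels
confusion <- table(Predicted = predicted, Actual = actual)
confusion

# Accuracy is the share of cases on the diagonal
accuracy <- sum(diag(confusion)) / sum(confusion)
accuracy
```

Computing the same figure for the 4-attribute and 5-attribute trees gives a direct way to compare them.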

####################################################################################################
3. K-means clustering (k=3)

#### RStudio ####

# 3.3 k-means Clustering

# source: http://www.rdatamining.com/examples/kmeans-clustering

# Read CSV file

myData <- read.csv("Airline_data_Gender.csv")
View(myData)

# 4 Attributes: "SUCCESS","Age","SEATCLASS","Gender"

# Preprocess data

myData <- myData[,c("SUCCESS","Age","SEATCLASS","Gender")]

View(myData)

# Implementation for Dataframe

str(myData)

# Drop the Gender label; clustering uses only the three numeric attributes

new_myData <-myData

new_myData$Gender <- NULL

kc <- kmeans(new_myData, 3, nstart = 1)

# Compare the Gender label with the clustering result

table(myData$Gender, kc$cluster)

plot(new_myData[c("SUCCESS", "Age")], col=kc$cluster)

points(kc$centers[,c("SUCCESS", "Age")], col=1:3, pch=8, cex=2)
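
k-means measures Euclidean distance, so a column with a wide range such as Age can dominate columns like SUCCESS or SEATCLASS. A minimal sketch of one possible variation (not what the script above does): standardize the columns first and use several random starts. The `toy` data, the seed, and nstart = 25 are arbitrary assumptions for illustration:

```r
set.seed(42)  # arbitrary seed so the random starts are repeatable

# Hypothetical data with the same column names as above
toy <- data.frame(
  SUCCESS   = rbinom(60, 1, 0.4),
  Age       = round(runif(60, 1, 80)),
  SEATCLASS = sample(1:3, 60, replace = TRUE)
)

# Standardize each column to mean 0 and standard deviation 1
scaled <- scale(toy)

# Several random starts reduce the risk of a poor local optimum
kc_toy <- kmeans(scaled, centers = 3, nstart = 25)

# Cluster sizes
table(kc_toy$cluster)
```

Whether scaling is appropriate depends on the question; on unscaled data the clusters mostly reflect age bands.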

####################################################################################################

5. ROC Curve in WEKA
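
For reference, the points of a ROC curve and the area under it can also be computed by hand. This is a minimal sketch on hypothetical scores and labels, not the WEKA output of this project:

```r
# Hypothetical classifier scores and true binary labels (1 = positive)
labels <- c(1, 1, 0, 1, 0, 0, 1, 0)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1)

# Lower the threshold one case at a time and record both rates
ord <- order(scores, decreasing = TRUE)
tpr <- c(0, cumsum(labels[ord] == 1) / sum(labels == 1))
fpr <- c(0, cumsum(labels[ord] == 0) / sum(labels == 0))

# Area under the curve by the trapezoidal rule
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc
```

Calling plot(fpr, tpr, type = "l") draws the curve; an area close to 1 indicates a good binary classifier, while 0.5 corresponds to random guessing.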

IV. Discussion and Summary

The airline dataset contains many missing values, which makes it hard to interpret the data and draw further conclusions with spreadsheet applications such as Excel or Google Sheets. This project therefore uses metadata extraction to preprocess the dataset. In the second milestone, imputing the missing values is an important step. The project also derives a Gender column so that groups can be compared. R and WEKA are then used to create visualizations that show the relationships between attributes. The summary reports provide further information: they show that using more attributes leads to lower accuracy rates. The decision tree is the model of this project. Even though a decision tree is a simple graph, it still reveals important details of this airline dataset. The k-means clustering graphs show three clusters. From the result, middle-aged people appear to have a higher success rate than younger and older people. Finally, the ROC curves show that the binary classifier models perform well. [1][2][3]

References:

[1] "Introduction and regression", Ibm.com, 2017. [Online]. Available: https://www.ibm.com/developerworks/library/os-weka1/. [Accessed: 03-May-2017].

[2] "Data", Cs.cornell.edu, 2017. [Online]. Available: http://www.cs.cornell.edu/People/pabo/movie-review-data/. [Accessed: 03-May-2017].

[3] "Examples - RDataMining.com: R and Data Mining", Rdatamining.com, 2017. [Online]. Available: http://www.rdatamining.com/examples. [Accessed: 07-May-2017].


Friday, May 26, 2017

Airline Dataset: Preprocessing and clustering by RStudio & WEKA
