1. Introduction
This project explores two datasets. It comes from my DEAN 690 course at GMU.
The topic of the project is "Data Analytics Approach to Foreign Student Data
Analytics Job Opportunity Search".
2. Purpose
As technology grows, jobs are changing along with it [1]. Following this
trend, this article explores two datasets: Virginia job descriptions from 2016
and H-1B visa petitions from 2011-2016. It uses the two datasets to find links
between the jobs held by Americans and by international students, in order to
provide visualizations and suggestions for future students in the data science
field. The purpose of the article is to give a better understanding of data
analytics jobs in the Virginia area.
3. Datasets Description
1st Dataset: 2016 Virginia job descriptions
id, title, url, baseSalary_maxSalary, baseSalary_medianSalary, baseSalary_minSalary, baseSalary_salary, dateExpires, datePosted, educationRequirements, employmentType, experienceRequirements, hiringOrganization_geo_latitude, hiringOrganization_geo_longitude, hiringOrganization_location, hiringOrganization_logo, hiringOrganization_organizationCode, hiringOrganization_organizationDescription, hiringOrganization_organizationName, hiringOrganization_organizationTaxID, hiringOrganization_organizationUnit, hiringOrganization_url, incentiveCompensation, jobBenefits, jobDescription, jobLocation_address_countryCode, jobLocation_address_countryName, jobLocation_address_extendedAddress, jobLocation_address_fullText, jobLocation_address_locality, jobLocation_address_postOfficeBox, jobLocation_address_postalCode, jobLocation_address_region, jobLocation_address_regionCode, jobLocation_address_streetAddress, jobLocation_geo_latitude, jobLocation_geo_longitude, normalizedTitle_onetCode, normalizedTitle_onetName, numberOfOpenings, occupationalCategory, positionPeriod_endDate, positionPeriod_startDate, qualifications, responsibilities, salaryCurrency, skills, specialCommitments, veteranCommitment, and workHours.
2nd Dataset: H-1B Visa Petitions 2011-2016
CASE_STATUS, EMPLOYER_NAME, SOC_NAME, JOB_TITLE,
FULL_TIME_POSITION, PREVAILING_WAGE, YEAR, WORKSITE, lon, and lat.
Same in both datasets: job title and location.
Different: the 1st Dataset has the job description, skills, and job benefits;
the 2nd Dataset has the employer name and prevailing wage.
4. Strategy
1st Dataset + 2nd Dataset → Analytics → Visualization
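As a rough illustration of this pipeline, the sketch below labels the titles in two toy data frames with the same job categories, stacks them, and counts the rows per source. The data frames `va_jobs` and `h1b`, the helper `classify_job`, and every row in them are hypothetical; only the column names `title` and `JOB_TITLE` come from the datasets described above. It uses base R only, so it runs without extra packages.

```r
# Hypothetical toy rows standing in for the two datasets.
va_jobs <- data.frame(title = c("Senior Data Engineer", "Data Analyst"),
                      stringsAsFactors = FALSE)
h1b <- data.frame(JOB_TITLE = c("DATA SCIENTIST", "DATA ANALYST", "DATA ARCHITECT"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: map a raw title to one job category.
# Titles are upper-cased first so mixed-case titles are matched too.
classify_job <- function(x) {
  x <- toupper(x)
  ifelse(grepl("SCIENTIST", x), "Data Scientist",
  ifelse(grepl("ANALYST", x), "Data Analyst",
  ifelse(grepl("ENGINEER", x), "Data Engineer", "OTHERS")))
}

# Stack both sources into one table with a common "job" column.
combined <- rbind(
  data.frame(source = "VA 2016", job = classify_job(va_jobs$title),
             stringsAsFactors = FALSE),
  data.frame(source = "H-1B", job = classify_job(h1b$JOB_TITLE),
             stringsAsFactors = FALSE)
)

# Cross-tabulate job category by source; this is the input for a chart.
job_by_source <- table(combined$job, combined$source)
print(job_by_source)
```

Giving both sources the same `job` column is what makes them comparable in a single table or plot.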
5. 1st Dataset: Data Preprocessing, Result, and Word Cloud.
5.1 1st Dataset Data Preprocessing
### RStudio ###
# VA job
library(ggplot2)
library(dplyr)
library(stringr)
library(tidyr)
library(reshape2)
library(tidytext)
library(devtools)
library(formattable)
library(leaflet)
# Read tsv file
myData <- read.delim(file = 'joblistings_merged_parsed_unique_grpbyyear_2016.tsv', header = TRUE)
# View Data
View(myData)
ls(myData)
# Filter Data
myData <- myData[,c("title","experienceRequirements","hiringOrganization_organizationName","jobDescription","jobLocation_address_locality","normalizedTitle_onetName","responsibilities","skills")]
View(myData)
write.csv(myData, "VA_Job.csv")
# Clean Data
# Search key words "Data Scientist", "Data Analyst", and "Data Engineer"
# Note: str_detect() is case-sensitive
myData_related <- myData %>% filter(str_detect(title,"Data"))
myData_job <- myData_related %>%
  # Job category from the title; the first matching rule wins
  mutate(job = case_when(
    str_detect(title, "SCIENTIST") ~ "Data Scientist",
    str_detect(title, "ANALYST") ~ "Data Analyst",
    str_detect(title, "ENGINEER") ~ "Data Engineer",
    TRUE ~ "OTHERS")) %>%
  # Seniority level from the title; the first matching rule wins
  mutate(level = case_when(
    str_detect(title, "MANAGER") ~ "MANAGER",
    str_detect(title, "SENIOR") ~ "SENIOR",
    str_detect(title, "PRINCIPAL") ~ "PRINCIPAL",
    str_detect(title, "SR") ~ "SENIOR",
    str_detect(title, "DIRECTOR") ~ "DIRECTOR",
    str_detect(title, "VP") ~ "VP",
    str_detect(title, "VICE PRESIDENT") ~ "VP",
    str_detect(title, "LEAD") ~ "LEAD",
    str_detect(title, "ASSOCIATE") ~ "ASSOCIATE",
    str_detect(title, "SPECIALIST") ~ "SPECIALIST",
    str_detect(title, "JUNIOR") ~ "JUNIOR",
    TRUE ~ "UNSPECIFIED"))
table(myData_job$job, myData_job$level)
############################################################################
5.2 1st Dataset Result
> table(myData_job$job, myData_job$level)
              PRINCIPAL SENIOR UNSPECIFIED VP
Data Engineer         0      5           1  0
OTHERS                2     20        5999  3
The table shows the relationship between job titles and levels, but the
count for each job is low, which makes it hard to compare different titles
and levels. [2]
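One thing worth checking when counts look this low is letter case. Pattern matching in R is case-sensitive by default, so an upper-case pattern such as "ENGINEER" does not match the title "Data Engineer". Whether this explains the Virginia counts is an assumption that has not been verified against the data. The sketch below shows the difference on hypothetical titles, using base R `grepl()`.

```r
# Hypothetical titles in mixed and upper case.
titles <- c("Senior Data Engineer", "DATA ENGINEER", "Data Analyst")

# Default matching is case-sensitive: only the upper-case title matches.
case_sensitive <- grepl("ENGINEER", titles)

# ignore.case = TRUE matches the pattern regardless of letter case.
case_insensitive <- grepl("ENGINEER", titles, ignore.case = TRUE)

print(data.frame(titles, case_sensitive, case_insensitive))
```

With stringr, the equivalent is `str_detect(title, regex("ENGINEER", ignore_case = TRUE))`.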
5.3 1st Dataset Word Cloud
### RStudio ###
# Text mining
library(NLP)
library(tm)
library(SnowballC)
library(RColorBrewer)
library(wordcloud)
# Collapse all job localities into one comma-separated string
myData2 <- toString(myData$jobLocation_address_locality)
myData2
# Build a corpus from that string
myData3 <- Corpus(VectorSource(myData2))
# Remove Punctuation
myData3 <- tm_map(myData3, removePunctuation)
# Lower case
myData3 <- tm_map(myData3, content_transformer(tolower))
myData3 <- tm_map(myData3, PlainTextDocument)
# Remove Stop words
myData3 <- tm_map(myData3, removeWords, stopwords('english'))
myData3
# Word Cloud
wordcloud(myData3, max.words = 100, random.order = FALSE)
############################################################################
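A word cloud sizes each term by how often it occurs. As a minimal sketch of that counting step, the base R code below cleans a few hypothetical locality names (not taken from the dataset) and tabulates the term frequencies. It only approximates the `tm` pipeline above and skips stop-word removal.

```r
# Hypothetical locality values standing in for jobLocation_address_locality.
localities <- c("Fairfax", "Arlington", "Fairfax", "Richmond", "Arlington", "Fairfax")

# Collapse to one string, lower-case it, and strip punctuation.
text <- tolower(toString(localities))
text <- gsub("[[:punct:]]", "", text)

# Split on whitespace and count how often each term occurs.
words <- strsplit(text, "\\s+")[[1]]
freq <- sort(table(words), decreasing = TRUE)
print(freq)
```

The most frequent terms in `freq` are the ones a word cloud would draw largest.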
6. 2nd Dataset: Data Preprocessing, Preprocessing Result, and Data Visualization
6.1 Data Preprocessing
### RStudio ###
library(ggplot2)
library(dplyr)
library(stringr)
library(tidyr)
library(reshape2)
library(RColorBrewer)
library(lubridate)
library(tidytext)
library(devtools)
library(formattable)
library(plotly)
library(wordcloud)
library(tm)
library(qdap)
library(viridis)
library(leaflet)
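# Read Data
df <- read.csv("h1b_kaggle.csv")
# Drop the unnamed row-index column, if present
# (read.csv names it "X"; readr::read_csv names it "X1")
df$X <- NULL
df$X1 <- NULL
# View Data
View(df)
glimpse(df)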
# Text Cleaning
df_data_related <- df %>% filter(str_detect(JOB_TITLE,"DATA"))
df_job <- df_data_related %>%
  # Job category from the title; the first matching rule wins
  mutate(job = case_when(
    str_detect(JOB_TITLE, "SCIENTIST") ~ "Data Scientist",
    str_detect(JOB_TITLE, "ANALYST") ~ "Data Analyst",
    str_detect(JOB_TITLE, "ENGINEER") ~ "Data Engineer",
    TRUE ~ "OTHERS")) %>%
  # Seniority level from the title; the first matching rule wins
  mutate(level = case_when(
    str_detect(JOB_TITLE, "MANAGER") ~ "MANAGER",
    str_detect(JOB_TITLE, "SENIOR") ~ "SENIOR",
    str_detect(JOB_TITLE, "PRINCIPAL") ~ "PRINCIPAL",
    str_detect(JOB_TITLE, "SR") ~ "SENIOR",
    str_detect(JOB_TITLE, "DIRECTOR") ~ "DIRECTOR",
    str_detect(JOB_TITLE, "VP") ~ "VP",
    str_detect(JOB_TITLE, "VICE PRESIDENT") ~ "VP",
    str_detect(JOB_TITLE, "LEAD") ~ "LEAD",
    str_detect(JOB_TITLE, "ASSOCIATE") ~ "ASSOCIATE",
    str_detect(JOB_TITLE, "SPECIALIST") ~ "SPECIALIST",
    str_detect(JOB_TITLE, "JUNIOR") ~ "JUNIOR",
    TRUE ~ "UNSPECIFIED"))
table(df_job$job, df_job$level)
############################################################################
6.2 Preprocessing Result
> table(df_job$job, df_job$level)
               ASSOCIATE DIRECTOR JUNIOR LEAD MANAGER PRINCIPAL SENIOR SPECIALIST UNSPECIFIED  VP
Data Analyst         115        6     22   94      85       135   1972         50       12256  18
Data Engineer         54       18     12  130      71        82   1249         18        4016   8
Data Scientist        59        8     12   52      25        37    504          7        2573   4
OTHERS               371      243     41  821    1811       206   4361       3511       30974 116
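Because the four job categories differ a lot in size, raw counts are easier to compare once they are turned into row shares. The sketch below re-enters the counts from the table above by hand and uses base R `prop.table()` to compute, for each job, the percentage of petitions at each level. The object names `counts`, `level_names`, and `shares` are illustrative only.

```r
# Counts copied from the table(df_job$job, df_job$level) output above.
level_names <- c("ASSOCIATE", "DIRECTOR", "JUNIOR", "LEAD", "MANAGER",
                 "PRINCIPAL", "SENIOR", "SPECIALIST", "UNSPECIFIED", "VP")
counts <- rbind(
  "Data Analyst"   = c(115, 6, 22, 94, 85, 135, 1972, 50, 12256, 18),
  "Data Engineer"  = c(54, 18, 12, 130, 71, 82, 1249, 18, 4016, 8),
  "Data Scientist" = c(59, 8, 12, 52, 25, 37, 504, 7, 2573, 4),
  "OTHERS"         = c(371, 243, 41, 821, 1811, 206, 4361, 3511, 30974, 116)
)
colnames(counts) <- level_names

# margin = 1 divides each row by its row total, so every row sums to 1.
shares <- prop.table(counts, margin = 1)

# Show the shares as percentages with one decimal place.
print(round(100 * shares, 1))
```

Row shares let a small category such as Data Scientist be compared with a large one such as OTHERS on the same scale.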
The table makes it easy to compare job titles and levels. To analyze the
information, this project sets up a structure and then tries to link the
variables for further analysis. The first step is data preprocessing, which
keeps the existing data and deletes missing values. The second step is
filtering by job title and level. The third step continues in RStudio with
data visualization. The final step covers the conclusion and discussion.
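The missing-value step can be sketched in base R as follows. The data frame `petitions` and its rows are hypothetical; only the column names `JOB_TITLE`, `lon`, and `lat` come from the dataset. Treating "missing" as a missing coordinate is an assumption made for this example, since coordinates are what the map needs.

```r
# Hypothetical petitions, two of them with a missing coordinate.
petitions <- data.frame(
  JOB_TITLE = c("DATA ANALYST", "DATA ENGINEER", "DATA SCIENTIST"),
  lon = c(-77.3, NA, -77.1),
  lat = c(38.8, 38.9, NA),
  stringsAsFactors = FALSE
)

# complete.cases() is TRUE only for rows where both coordinates are present.
has_coords <- complete.cases(petitions[, c("lon", "lat")])

# Keep the existing data and drop the rows with missing values.
petitions_clean <- petitions[has_coords, ]
print(petitions_clean)
```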
6.3 Data Visualization
### RStudio ###
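# Visualization
# Color palette keyed on the job category
catpal <- colorFactor(viridis(10), df_job$job)
# Keep petitions that have coordinates and a status containing "CERTIFIED"
# (the pattern also matches statuses such as "CERTIFIED-WITHDRAWN")
df_map <- filter(df_job, !is.na(lat), !is.na(lon), str_detect(CASE_STATUS, "CERTIFIED"))
leaflet(data = df_map) %>%
  addProviderTiles("Stamen.TonerLite") %>%
  setView(lng = -110, lat = 25, zoom = 3) %>%
  # One circle per petition, colored by job, with the employer in the popup
  addCircleMarkers(~lon, ~lat, stroke = FALSE,
                   fillOpacity = 0.3,
                   radius = 2.5,
                   popup = ~EMPLOYER_NAME,
                   color = ~catpal(job)) %>%
  # Note: no layer above is assigned to the groups "job" or "level",
  # so this control does not toggle anything as written
  addLayersControl(
    baseGroups = c("job", "level"),
    options = layersControlOptions(collapsed = FALSE)
  ) %>%
  addLegend("topleft", pal = catpal, values = df_job$job,
            title = "Job Title",
            opacity = .8)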
############################################################################
The visualization shows two things: job opportunities cluster near big
cities, and many of the jobs are in the eastern area.
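These observations come from reading the map by eye. One way to check them numerically is to count petitions per worksite and per side of a dividing longitude. The sketch below does this on hypothetical rows; the data frame `jobs`, its values, and the cutoff of -100 degrees longitude for "east" are assumptions chosen only for illustration.

```r
# Hypothetical petitions with a worksite and a longitude.
jobs <- data.frame(
  WORKSITE = c("NEW YORK, NEW YORK", "NEW YORK, NEW YORK",
               "SAN FRANCISCO, CALIFORNIA", "RESTON, VIRGINIA"),
  lon = c(-74.0, -74.0, -122.4, -77.4),
  stringsAsFactors = FALSE
)

# Petitions per worksite, largest first.
by_city <- sort(table(jobs$WORKSITE), decreasing = TRUE)
print(by_city)

# Assumed cutoff: longitudes greater than -100 count as "east".
east_cutoff <- -100
jobs$region <- ifelse(jobs$lon > east_cutoff, "east", "west")
by_region <- table(jobs$region)
print(by_region)
```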
Structure of this project
Data Preprocessing → Filter Dataset → Data Visualization → Summary
References:
[1] "Functional Data Analysis, S+Functional Data Analysis User's Guide", Technometrics, vol. 48, no. 2, pp. 313-313, 2006.
[2] J. Silge, Text Mining with R, 1st ed. [S.l.]: O'Reilly Media, 2017.