
Saturday, June 3, 2017

2016 Virginia Job Descriptions vs. H-1B Visa Petitions 2011-2016 Datasets

1. Introduction
         This project explores two datasets. It comes from my DEAN 690 course at GMU.
The topic of the project is a data analytics approach to the data analytics job opportunity search for foreign students.
2. Purpose
          As technology grows, jobs are changing with it. [1] Following this trend, this article explores two datasets: Virginia job descriptions in 2016 and H-1B visa petitions from 2011 to 2016. It uses the two datasets to find links between jobs for Americans and jobs for international students, in order to provide visualizations and suggestions for future students in the data science field. The purpose of the article is to give a better understanding of data analytics jobs in the Virginia area.

3. Datasets Description
1st Dataset: 2016 Virginia job description
Source: Virginia Government Open Data (Link)
id, title, url, baseSalary_maxSalary, baseSalary_medianSalary, baseSalary_minSalary, baseSalary_salary, dateExpires, datePosted, educationRequirements, employmentType, experienceRequirements, hiringOrganization_geo_latitude, hiringOrganization_geo_longitude, hiringOrganization_location, hiringOrganization_logo, hiringOrganization_organizationCode, hiringOrganization_organizationDescription, hiringOrganization_organizationName, hiringOrganization_organizationTaxID, hiringOrganization_organizationUnit, hiringOrganization_url, incentiveCompensation, jobBenefits, jobDescription, jobLocation_address_countryCode, jobLocation_address_countryName, jobLocation_address_extendedAddress, jobLocation_address_fullText, jobLocation_address_locality, jobLocation_address_postOfficeBox, jobLocation_address_postalCode, jobLocation_address_region, jobLocation_address_regionCode, jobLocation_address_streetAddress, jobLocation_geo_latitude, jobLocation_geo_longitude, normalizedTitle_onetCode, normalizedTitle_onetName, numberOfOpenings, occupationalCategory, positionPeriod_endDate, positionPeriod_startDate, qualifications, responsibilities, salaryCurrency, skills, specialCommitments, veteranCommitment, and workHours.

2nd Dataset: H-1B Visa Petitions 2011-2016
Source: Kaggle (Link)
CASE_STATUS, EMPLOYER_NAME, SOC_NAME, JOB_TITLE, FULL_TIME_POSITION, PREVAILING_WAGE, YEAR, WORKSITE, lon, and lat.

Same: job title and location
Different: 1st Dataset (job description, skills, and job benefits)
           2nd Dataset (employer name and prevailing wage)

4. Strategy
1st Dataset + 2nd Dataset → Analytics → Visualization

5. 1st Dataset: Data Preprocessing, Result, and Word Cloud.
5.1 1st Dataset Data Preprocessing
### RStudio ###
# VA job
library(ggplot2)
library(dplyr)
library(stringr)
library(tidyr)
library(reshape2)
library(tidytext)
library(devtools)
library(formattable)
library(leaflet)

# Read tsv file
myData <- read.delim(file = 'joblistings_merged_parsed_unique_grpbyyear_2016.tsv', header = TRUE)

# View Data
View(myData)
ls(myData)

# Filter Data
myData <- myData[,c("title","experienceRequirements","hiringOrganization_organizationName","jobDescription","jobLocation_address_locality","normalizedTitle_onetName","responsibilities","skills")]
View(myData)

write.csv(myData, "VA_Job.csv")

# Clean Data
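# Keep titles containing "Data", then label "Data Scientist", "Data Analyst", and "Data Engineer".
# Note: str_detect() is case-sensitive, so "Data" and the upper-case patterns below
# only match titles written in exactly that case.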
myData_related <- myData %>% filter(str_detect(title,"Data"))

myData_job <- myData_related %>% 
  mutate(job = ifelse(str_detect(title, "SCIENTIST"),"Data Scientist",
               ifelse(str_detect(title, "ANALYST"),"Data Analyst",
               ifelse(str_detect(title, "ENGINEER"),"Data Engineer",
                                    "OTHERS"))))%>%
  mutate(level = ifelse(str_detect(title, "MANAGER"),"MANAGER",
                 ifelse(str_detect(title, "SENIOR"),"SENIOR",
                 ifelse(str_detect(title, "PRINCIPAL"),"PRINCIPAL",
                 ifelse(str_detect(title, "SR"),"SENIOR",
                 ifelse(str_detect(title, "DIRECTOR"),"DIRECTOR",
                 ifelse(str_detect(title, "VP"),"VP",
                 ifelse(str_detect(title, "VICE PRESIDENT"),"VP",
                 ifelse(str_detect(title, "LEAD"),"LEAD",
                 ifelse(str_detect(title, "ASSOCIATE"),"ASSOCIATE",
                 ifelse(str_detect(title, "SPECIALIST"),"SPECIALIST",
                 ifelse(str_detect(title, "JUNIOR"),"JUNIOR",
                 "UNSPECIFIED"))))))))))))
table(myData_job$job, myData_job$level)
############################################################################
5.2 1st Dataset Result
> table(myData_job$job, myData_job$level)
               
                PRINCIPAL SENIOR UNSPECIFIED   VP
  Data Engineer         0      5           1    0
  OTHERS                2     20        5999    3

        This table shows the relationship between job titles and levels, but the count in each category is low, which makes it hard to compare different titles and levels. [2] 
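
One likely contributor to the low counts is case sensitivity: the filter looks for "Data" while the label patterns are upper-case. Below is a minimal sketch of a case-insensitive variant on made-up titles. The helper has(), the toy data frame, and the shortened level list are assumptions for illustration and are not part of the original analysis; the \bsr\b pattern is a stricter substitute for the plain "SR" match.

```r
library(dplyr)
library(stringr)

# Toy titles in mixed case (hypothetical, not taken from the dataset)
toy <- data.frame(
  title = c("Senior Data Scientist", "DATA ANALYST", "Data Engineer II",
            "Sr. data analyst", "Database Administrator"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: case-insensitive pattern match
has <- function(x, pattern) str_detect(x, regex(pattern, ignore_case = TRUE))

toy_job <- toy %>%
  filter(has(title, "data")) %>%
  mutate(
    # case_when() picks the first matching rule, like the nested ifelse() chain
    job = case_when(
      has(title, "scientist") ~ "Data Scientist",
      has(title, "analyst")   ~ "Data Analyst",
      has(title, "engineer")  ~ "Data Engineer",
      TRUE                    ~ "OTHERS"
    ),
    level = case_when(
      has(title, "manager")         ~ "MANAGER",
      has(title, "senior|\\bsr\\b") ~ "SENIOR",
      TRUE                          ~ "UNSPECIFIED"
    )
  )

table(toy_job$job, toy_job$level)
```

Whether this changes the counts on the real file depends on how the titles are actually written there, so the result table above is left as reported.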

5.3 1st Dataset Word Cloud
### RStudio ###
# Text mining
library(NLP)
library(tm)
library(SnowballC)
library(RColorBrewer)
library(wordcloud)

myData2 <- toString(myData$jobLocation_address_locality)
myData2

# Build a corpus from the combined text
myData3 <- Corpus(VectorSource(myData2))

# Remove Punctuation
myData3 <- tm_map(myData3, removePunctuation)

# Lower case
myData3 <- tm_map(myData3, content_transformer(tolower))

myData3 <- tm_map(myData3, PlainTextDocument)

# Remove Stop words
myData3 <- tm_map(myData3, removeWords, stopwords('english'))

myData3
# Word Cloud
wordcloud(myData3, max.words = 100, random.order = FALSE)
############################################################################
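
A word cloud shows relative frequency only by font size, so a plain frequency table can be a useful cross-check. The sketch below uses made-up locality names (an assumption, not values from the dataset) and counts whole locality names, whereas the word cloud above splits multi-word names into separate words.

```r
# Toy locality values (hypothetical)
localities <- c("Fairfax", "Arlington", "Fairfax", "Richmond", "Fairfax", "Arlington")

# Lower-case the names, count them, and sort from most to least frequent
freq <- sort(table(tolower(localities)), decreasing = TRUE)

# Show the three most frequent localities
head(freq, 3)
```

On the real data, the same idea would apply to myData$jobLocation_address_locality.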

6. 2nd Dataset: Data Preprocessing, Preprocessing Result, and Data Visualization
6.1 Data Preprocessing
### RStudio ###
library(ggplot2)
library(dplyr)
library(stringr)
library(tidyr)
library(reshape2)
library(RColorBrewer)
library(lubridate)
library(tidytext)
library(devtools)
library(formattable)
library(plotly)
library(wordcloud)
library(tm)
library(qdap)
library(viridis)
library(leaflet)

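# Read Data
df <- read.csv("h1b_kaggle.csv")

# Drop the unnamed row-index column if present
# (read.csv() names an unnamed first column "X"; "X1" is kept for files read another way)
df <- df[, !(names(df) %in% c("X", "X1")), drop = FALSE]

# View Data
View(df)
glimpse(df)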

# Text Cleaning
df_data_related <- df %>% filter(str_detect(JOB_TITLE,"DATA"))

df_job <- df_data_related %>% 
  mutate(job = ifelse(str_detect(JOB_TITLE, "SCIENTIST"),"Data Scientist",
               ifelse(str_detect(JOB_TITLE, "ANALYST"),"Data Analyst",
               ifelse(str_detect(JOB_TITLE, "ENGINEER"),"Data Engineer",
                      "OTHERS"))))%>%
  mutate(level = ifelse(str_detect(JOB_TITLE, "MANAGER"),"MANAGER",
                 ifelse(str_detect(JOB_TITLE, "SENIOR"),"SENIOR",
                 ifelse(str_detect(JOB_TITLE, "PRINCIPAL"),"PRINCIPAL",
                 ifelse(str_detect(JOB_TITLE, "SR"),"SENIOR",
                 ifelse(str_detect(JOB_TITLE, "DIRECTOR"),"DIRECTOR",
                 ifelse(str_detect(JOB_TITLE, "VP"),"VP",
                 ifelse(str_detect(JOB_TITLE, "VICE PRESIDENT"),"VP",
                 ifelse(str_detect(JOB_TITLE, "LEAD"),"LEAD",
                 ifelse(str_detect(JOB_TITLE, "ASSOCIATE"),"ASSOCIATE",
                 ifelse(str_detect(JOB_TITLE, "SPECIALIST"),"SPECIALIST",
                 ifelse(str_detect(JOB_TITLE, "JUNIOR"),"JUNIOR",
                         "UNSPECIFIED"))))))))))))
table(df_job$job, df_job$level)
############################################################################
6.2 Preprocessing Result
> table(df_job$job, df_job$level)
                
                 ASSOCIATE DIRECTOR JUNIOR  LEAD MANAGER PRINCIPAL SENIOR SPECIALIST UNSPECIFIED    VP
  Data Analyst         115        6     22    94      85       135   1972         50       12256    18
  Data Engineer         54       18     12   130      71        82   1249         18        4016     8
  Data Scientist        59        8     12    52      25        37    504          7        2573     4
  OTHERS               371      243     41   821    1811       206   4361       3511       30974   116

         From this table, the titles and levels of jobs can be compared clearly. To analyze the information, this project sets up a structure and then tries to link the variables for further analytics. The first step is data preprocessing, which keeps the existing data and deletes missing values. The second step filters the job titles and levels. The third step continues in RStudio with data visualization. The final step covers the conclusion and discussion.
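
The missing-value step can be sketched as follows. This is only an illustration under assumptions: the toy rows are made up, and it drops rows that lack coordinates, which is the kind of missing value that matters for the map.

```r
# Toy rows shaped like the H-1B columns used for mapping (values are made up)
toy <- data.frame(
  JOB_TITLE = c("DATA ANALYST", "DATA ENGINEER", "DATA SCIENTIST"),
  lon = c(-77.3, NA, -122.4),
  lat = c(38.8, 37.5, NA),
  stringsAsFactors = FALSE
)

# Keep only rows where both coordinates are present
toy_complete <- toy[complete.cases(toy[, c("lon", "lat")]), ]

nrow(toy_complete)
```

Restricting complete.cases() to the two coordinate columns avoids discarding rows that are only missing a field the map does not use.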

6.3 Data Visualization
### RStudio ###
# Visualization
catpal = colorFactor(viridis(10),df_job$job)
leaflet(data = filter(df_job, !is.na(lat),!is.na(lon),str_detect(df_job$CASE_STATUS,"CERTIFIED"))) %>% 
  addProviderTiles("Stamen.TonerLite") %>% 
  setView(lng = -110, lat = 25, zoom = 3) %>%
  addCircleMarkers(~lon, ~lat, stroke = FALSE,
                   fillOpacity = 0.3,
                   radius = 2.5,
                   popup = ~EMPLOYER_NAME, 
                   color = ~catpal(job)) %>%
  addLayersControl(
    baseGroups = c("job", "level"),
    options = layersControlOptions(collapsed = FALSE)
  ) %>% 
  addLegend("topleft", pal = catpal, values = df_job$job, 
            title = "Job Title",
            opacity = .8)
############################################################################

From the visualization: job opportunities cluster near big cities, and many jobs are in the eastern region.
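
A count of certified petitions per worksite is one way to back up the map reading with numbers. The sketch below runs on made-up rows (the worksite values and statuses are assumptions, not results from the dataset) and matches the status exactly, unlike the str_detect() filter in the map code, which also keeps statuses that merely contain "CERTIFIED".

```r
library(dplyr)

# Toy petitions (hypothetical values)
toy <- data.frame(
  WORKSITE = c("NEW YORK, NEW YORK", "NEW YORK, NEW YORK",
               "SAN FRANCISCO, CALIFORNIA", "FAIRFAX, VIRGINIA",
               "NEW YORK, NEW YORK"),
  CASE_STATUS = c("CERTIFIED", "CERTIFIED", "DENIED",
                  "CERTIFIED", "CERTIFIED-WITHDRAWN"),
  stringsAsFactors = FALSE
)

# Count certified petitions per worksite, most frequent first
top_sites <- toy %>%
  filter(CASE_STATUS == "CERTIFIED") %>%
  count(WORKSITE, sort = TRUE)

top_sites
```

On the real data, replacing toy with df_job would give a ranked list of worksites for the data-related titles.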

Structure of this project
Data Preprocessing → Filter Dataset → Data Visualization → Summary

References:
[1] "Functional Data Analysis, S+Functional Data Analysis User's Guide", Technometrics, vol. 48, no. 2, pp. 313-313, 2006.
[2] J. Silge, Text Mining with R, 1st ed. [S.l.]: O'Reilly Media, 2017.
