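Datatools Course Project
JHU Data Science Specialization/Coursera The Data Scientist’s Toolbox/Week??/Course Project
This is a very introductory course: four weeks in total, with a short quiz each week. I believe this was the final project, though I no longer remember for certain.
The main goal is to get comfortable with R tools for handling data in formats such as xlsx and XML, and with useful R packages such as readr.
The code below downloads fairly large files, so it is not executed here; interested readers can try it themselves.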
Reading CSV: the 2006 American Community Survey (ACS)
The American Community Survey distributes downloadable data about United States communities. Download the 2006 microdata survey about housing for the state of Idaho using download.file() from here:
- (pid, Population CSV file)
- (hid, Household CSV file)
Code book: PUMS Data Dictionary - 2006 Housing (PDF)

    fileUrl <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv"
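Since the download is large, it may be worth skipping it when the file is already on disk. The following is a minimal sketch of that idea; `download_if_missing` is a hypothetical helper name, not something from the original post, and the local file name `Fss06pid.csv` is taken from the benchmark output further down.

```r
# Hypothetical helper (not from the original post): download a file only
# when it is not already present locally.
download_if_missing <- function(url, destfile) {
  if (!file.exists(destfile)) {
    # mode = "wb" keeps binary files such as xlsx intact on Windows
    download.file(url, destfile = destfile, mode = "wb")
  }
  invisible(destfile)
}

# Usage, left commented out because it needs network access:
# download_if_missing(
#   "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv",
#   "Fss06pid.csv"
# )
```

The same helper would work unchanged for the xlsx and XML files used later, which is the main reason for wrapping `download.file()` at all.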
Files of tens to hundreds of megabytes count as large here, so the cost of reading and processing them should be taken into account.

    # Unit: milliseconds
    # Unit: milliseconds
    # Unit: microseconds

    pander(arrange(A,mean))
expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
`PID <- fread(file = "Fss06pid.csv", fill = T)` | 13.4 | 13.4 | 13.4 | 13.4 | 13.4 | 13.4 | 1 |
`read.csv(file = "Fss06pid.csv")` | 422.6 | 422.6 | 422.6 | 422.6 | 422.6 | 422.6 | 1 |
`read_csv(file = "Fss06pid.csv")` | 627 | 627 | 627 | 627 | 627 | 627 | 1 |

    pander(arrange(B,mean))

expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
`HID <- fread(file = "Fss06hid.csv", fill = T)` | 17.08 | 17.08 | 17.08 | 17.08 | 17.08 | 17.08 | 1 |
`read_csv(file = "Fss06hid.csv")` | 519.9 | 519.9 | 519.9 | 519.9 | 519.9 | 519.9 | 1 |
`read.csv(file = "Fss06hid.csv")` | 971.6 | 971.6 | 971.6 | 971.6 | 971.6 | 971.6 | 1 |
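A comparison like the one in these tables can be approximated without the real download. The sketch below is not the author's benchmark: it swaps `microbenchmark` for base R's `system.time()`, reads a small made-up CSV, and only tries `fread` and `read_csv` if those packages happen to be installed. The data frame `toy` and its values are invented for illustration.

```r
# Build a small stand-in CSV so the sketch runs without the real ACS file.
set.seed(1)
toy <- data.frame(SERIALNO = 1:5000,
                  SEX      = sample(1:2, 5000, replace = TRUE),
                  pwgtp15  = sample(1:500, 5000, replace = TRUE))
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# Candidate readers; the optional ones are added only if installed.
readers <- list(read.csv = function(p) read.csv(p))
if (requireNamespace("data.table", quietly = TRUE)) {
  readers$fread <- function(p) data.table::fread(p)
}
if (requireNamespace("readr", quietly = TRUE)) {
  readers$read_csv <- function(p) suppressMessages(readr::read_csv(p, progress = FALSE))
}

# Elapsed wall-clock seconds for one read with each function.
elapsed <- sapply(readers, function(f) system.time(f(csv_path))[["elapsed"]])
timings <- data.frame(expr = names(elapsed), seconds = unname(elapsed))
timings[order(timings$seconds), ]
```

A single run per reader, as in the tables above (`neval` is 1), is noisy; repeating each read several times would give steadier numbers.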
Timing simple summary calculations

    pander(arrange(select(C,expr,mean),mean))

expr | mean |
---|---|
`mean(PID$pwgtp15, by = PID$SEX)` | 20.49 |
`median(PID$pwgtp15, by = PID$SEX)` | 91.38 |
`sapply(split(PID$pwgtp15, PID$SEX), mean)` | 250.4 |
`tapply(PID$pwgtp15, PID$SEX, mean)` | 262.4 |
`median(PID[PID$SEX == 1, ]$SERIALNO)` | 2322 |
`mean(PID[PID$SEX == 1, ]$pwgtp15)` | 2571 |
`mean(PID[PID$SEX == 1, ]$SERIALNO)` | 3693 |
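The expressions timed above do not all compute the same thing, which is worth keeping in mind when reading the timings. In base R, `mean()` has no `by` argument, so `mean(x, by = g)` silently ignores `by` and returns the overall mean; `tapply()` and `sapply(split(...))` return one mean per group. A small sketch with made-up values (the toy `PID` below only mimics the column names):

```r
# Toy stand-in for the PID table; values are invented for illustration.
PID <- data.frame(SEX     = c(1, 2, 1, 2, 1),
                  pwgtp15 = c(10, 20, 30, 40, 80))

overall   <- mean(PID$pwgtp15, by = PID$SEX)              # `by` is ignored
by_tapply <- tapply(PID$pwgtp15, PID$SEX, mean)           # one mean per SEX
by_split  <- sapply(split(PID$pwgtp15, PID$SEX), mean)    # same result
sex1_only <- mean(PID[PID$SEX == 1, ]$pwgtp15)            # a single group

overall     # 36
by_tapply   # SEX 1: 40, SEX 2: 30
```

So the fastest entry in the table is fast partly because it does less work: it never splits the data by `SEX`.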
Quiz 1
How many housing units in this survey were worth more than $1,000,000?
- 159
- 164
- 53
- 24
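One way to answer this is to count rows by the property-value code. The sketch below assumes that the property value is stored in a column named `VAL` and that code 24 stands for values of $1,000,000 or more; both are assumptions to check against the PUMS data dictionary, and the toy `HID` data frame is made up.

```r
# Toy stand-in for the household table; values are invented.
HID <- data.frame(SERIALNO = 1:6,
                  VAL      = c(24, 12, NA, 24, 3, 24))

# Assumption: VAL == 24 encodes "$1,000,000+" in the code book.
# na.rm = TRUE drops households with no recorded value.
n_million <- sum(HID$VAL == 24, na.rm = TRUE)
n_million   # 3 for this toy data
```

Applied to the real `HID` table, the same expression would give the quiz answer.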
Reading XLS/XLSX: the Natural Gas Acquisition Program
Documentation: Natural Gas Acquisition Program

    fileUrl <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FDATA.gov_NGAP.xlsx"

    NGAP <- read.xlsx(file="gov_NGAP.xlsx",
Zip | CuCurrent | PaCurrent | PoCurrent | Contact | Ext | Fax | Status | |
---|---|---|---|---|---|---|---|---|
74136 | 0 | 1 | 0 | 918-491-6998 | 0 | 918-491-6659 | NA | 1 |
30329 | 1 | 0 | 0 | 404-321-5711 | NA | NA | NA | 1 |
74136 | 1 | 0 | 0 | 918-523-2516 | 0 | 918-523-2522 | NA | 1 |
80203 | 0 | 1 | 0 | 303-864-1919 | 0 | NA | NA | 1 |
80120 | 1 | 0 | 0 | 345-098-8890 | 456 | NA | NA | 1 |

    cat("Sum of Zipcode and")
    # Sum of Zipcode and
    sum(NGAP$Zip * NGAP$Ext, na.rm = TRUE)
    # [1] 36534720
Quiz 3
Read rows 18-23 and columns 7-15 into R and assign the result to a variable called: dat
What is the value of:
sum(dat$Zip*dat$Ext,na.rm=T)
(original data source: http://catalog.data.gov/dataset/natural-gas-acquisition-program)
- 36534720
- 338924
- 33544718
- 0
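The quiz expression can be checked by hand against the five rows printed in the table above. The sketch below rebuilds just the `Zip` and `Ext` columns from that table; the commented `read.xlsx()` call shows how the row and column selection might look with the xlsx package, assuming the data sits on the first sheet.

```r
# With the real file, the selection could look like this (assumes sheet 1):
# dat <- xlsx::read.xlsx("gov_NGAP.xlsx", sheetIndex = 1,
#                        rowIndex = 18:23, colIndex = 7:15)

# Zip and Ext copied from the table printed above.
dat <- data.frame(Zip = c(74136, 30329, 74136, 80203, 80120),
                  Ext = c(0, NA, 0, 0, 456))

# Only the last row contributes: 80120 * 456. The NA product is dropped.
result <- sum(dat$Zip * dat$Ext, na.rm = TRUE)
result   # 36534720
```

Without `na.rm = TRUE` the missing `Ext` in the second row would turn the whole sum into `NA`.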
Reading XML: Baltimore restaurant data

    fileUrl <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"

    system.time(DOC <- xmlTreeParse("Frestaurants.xml", useInternal = TRUE))
    # user system elapsed

    rootNode <- xmlRoot(DOC)
    # [1] 127
Quiz 4
How many restaurants have zipcode 21231?
- 127
- 17
- 28
- 156
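Counting restaurants by zip code comes down to extracting every zip code node and comparing values. The sketch below uses the XML package on a tiny inline document; the element names (`response`, `row`, `name`, `zipcode`) and the restaurant names are assumptions made up for illustration, so the XPath may need adjusting to the real file's structure.

```r
library(XML)

# Made-up miniature document; the real file is assumed to have a similar
# <zipcode> element per restaurant.
xml_text <- paste0(
  "<response><row>",
  "<row><name>Cafe A</name><zipcode>21231</zipcode></row>",
  "<row><name>Cafe B</name><zipcode>21206</zipcode></row>",
  "<row><name>Cafe C</name><zipcode>21231</zipcode></row>",
  "</row></response>"
)

# asText = TRUE parses the string itself rather than treating it as a path.
doc <- xmlTreeParse(xml_text, asText = TRUE, useInternalNodes = TRUE)

# Pull the text of every <zipcode> node, then count the matches.
zips    <- xpathSApply(doc, "//zipcode", xmlValue)
n_21231 <- sum(zips == "21231")
n_21231   # 2 for this toy document
```

With the downloaded file, the already parsed `DOC` object could take the place of `doc` in the `xpathSApply()` call.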