R_for_Data_Science_Tibble&Import
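These notes cover chapters 9, 10, and 11 of the English edition.
Chapter 9 is the Introduction, so there is not much to say about it.
I am not at all familiar with the import material in chapter 11, so there is not much to say about that either.
10.2 Creating tibbles
- Converting a data frame with as_tibble()
The more I use tibbles, the more I like them.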
as_tibble(iris)
#> # A tibble: 150 x 5
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>
#> 1          5.1         3.5          1.4         0.2 setosa
#> 2          4.9         3            1.4         0.2 setosa
#> 3          4.7         3.2          1.3         0.2 setosa
#> 4          4.6         3.1          1.5         0.2 setosa
#> 5          5           3.6          1.4         0.2 setosa
#> 6          5.4         3.9          1.7         0.4 setosa
#> # … with 144 more rows
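There is one small catch: as_tibble() drops the row names during conversion.
This is in line with the tidyverse philosophy in general: no row names. The same goes for reading files with readr, merging data sets with left_join, and so on.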
> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
> mtcars %>%
+   as_tibble()
# A tibble: 32 x 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# … with 22 more rows
In that case you can turn the row names into a separate column:
> mtcars %>%
+   as_tibble(rownames = "myrowname")
# A tibble: 32 x 12
   myrowname         mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <chr>           <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Mazda RX4        21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2 Mazda RX4 Wag    21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3 Datsun 710       22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4 Hornet 4 Drive   21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5 Hornet Sportab…  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6 Valiant          18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7 Duster 360       14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8 Merc 240D        24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9 Merc 230         22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10 Merc 280         19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# … with 22 more rows
Another way to do it:
> mtcars %>%
+   as_tibble(rownames = NA) %>%
+   rownames_to_column(var = "myrowname")
# A tibble: 32 x 12
   myrowname         mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <chr>           <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Mazda RX4        21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2 Mazda RX4 Wag    21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3 Datsun 710       22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4 Hornet 4 Drive   21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5 Hornet Sportab…  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6 Valiant          18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7 Duster 360       14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8 Merc 240D        24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9 Merc 230         22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10 Merc 280         19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# … with 22 more rows
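If some later code still expects row names, the conversion can be reversed. Here is a minimal sketch of the round trip, assuming only the tibble package is loaded; the column name "model" and the object names are made up for illustration.

```r
library(tibble)

# Keep the row names as an ordinary column while working with the tibble.
cars_tbl <- as_tibble(mtcars, rownames = "model")

# Illustrative round trip: go back to a data.frame and restore the row names.
cars_df <- column_to_rownames(as.data.frame(cars_tbl), var = "model")

has_rownames(cars_tbl)      # the tibble itself carries no row names
has_rownames(cars_df)       # the data.frame has them again
head(rownames(cars_df), 3)
```

The point of the sketch is only that nothing is lost: the names live in a column while the data is a tibble, and column_to_rownames() puts them back when needed.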
- Creating a tibble from scratch with tibble()
tibble(
  x = 1:5,
  y = 1,
  z = x ^ 2 + y
)
#> # A tibble: 5 x 3
#>       x     y     z
#>   <int> <dbl> <dbl>
#> 1     1     1     2
#> 2     2     1     5
#> 3     3     1    10
#> 4     4     1    17
#> 5     5     1    26
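The call above leans on two tibble() behaviours: a length-1 input is recycled to the length of the other columns, and a column can refer to columns defined just before it. The sketch below checks both, plus one thing I believe to be true from the tibble documentation: inputs longer than 1 are not recycled. The object names are made up for illustration.

```r
library(tibble)

# y (length 1) is recycled to 5 rows; z is computed from x and y.
tb_demo <- tibble(x = 1:5, y = 1, z = x^2 + y)

# As far as I know, only length-1 inputs are recycled, so mismatched
# lengths raise an error instead of being recycled silently.
recycle_result <- tryCatch(
  tibble(x = 1:4, y = 1:2),
  error = function(e) "error"
)

tb_demo$z
recycle_result
```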
If you’re already familiar with data.frame(), note that tibble() does much less:
- it never changes the type of the inputs (e.g. it never converts strings to factors!) (as of R 4.0, data.frame() finally no longer converts strings to factors by default),
- it never changes the names of variables (this presumably refers to cases such as a column named 1, which data.frame() renames to X1),
> data.frame(`1` = 1:5, `2` = 1:5)
  X1 X2
1  1  1
2  2  2
3  3  3
4  4  4
5  5  5
> tibble(`1` = 1:5, `2` = 1:5)
# A tibble: 5 x 2
    `1`   `2`
  <int> <int>
1     1     1
2     2     2
3     3     3
4     4     4
5     5     5
- and it never creates row names.
It’s possible for a tibble to have column names that are not valid R variable names, aka non-syntactic names. For example, they might not start with a letter, or they might contain unusual characters like a space. To refer to these variables, you need to surround them with backticks, `:
tb <- tibble(
  `:)` = "smile",
  ` ` = "space",
  `2000` = "number"
)
tb
#> # A tibble: 1 x 3
#>   `:)`  ` `   `2000`
#>   <chr> <chr> <chr>
#> 1 smile space number
- Another way to create a tibble is with tribble(), short for transposed tibble. tribble() is customised for data entry in code: column headings are defined by formulas (i.e. they start with ~), and entries are separated by commas. This makes it possible to lay out small amounts of data in easy to read form.
tribble(
  ~x, ~y, ~z,
  #--|--|----
  "a", 2, 3.6,
  "b", 1, 8.5
)
#> # A tibble: 2 x 3
#>   x         y     z
#>   <chr> <dbl> <dbl>
#> 1 a         2   3.6
#> 2 b         1   8.5
I often add a comment (the line starting with #), to make it really clear where the header is.
tribble() can also pull off the following neat trick:
# tribble will create a list column if the value in any cell is
# not a scalar
tribble(
  ~x,  ~y,
  "a", 1:3,
  "b", 4:6
)
#> # A tibble: 2 x 2
#>   x     y
#>   <chr> <list>
#> 1 a     <int [3]>
#> 2 b     <int [3]>
Reference:
Row-wise tibble creation
The same trick as the tribble() example above, done with tibble():
> data.frame(x = c("a","b"),
+            y = I(list(1:3,4:6))) %>%
+   as_tibble()
# A tibble: 2 x 2
  x     y
1 a
2 b
> tibble(x = c("a","b"),
+        y = list(1:3,4:6))
# A tibble: 2 x 2
  x     y
  <chr> <list>
1 a     <int [3]>
2 b     <int [3]>
Reference:
Create a data.frame where a column is a list
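Once a list column exists, each cell is an ordinary R object and the usual list tools apply. A small sketch, assuming only the tibble package; tb_list is a made-up name.

```r
library(tibble)

tb_list <- tibble(x = c("a", "b"), y = list(1:3, 4:6))

is.list(tb_list$y)       # the column itself is a plain list
lengths(tb_list$y)       # number of elements held in each cell
tb_list$y[[2]]           # [[ pulls one cell out as an ordinary vector
sapply(tb_list$y, sum)   # apply a function cell by cell
```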
This reminds me of an interesting package, usethis:
library(usethis)
# Want to print friendly output to a user in a package (or to yourself in your own code?)
# The usethis ui_*() functions are perfect!
# Use ui_done() when something is done, like a file saved
ui_done("File saved at...")
## ✔ File saved at...
# ui_todo() is useful when you need your user to pay attention and do something!
ui_todo("Changes have been made, please review them!")
## ● Changes have been made, please review them!
# ui_oops() when something went wrong
ui_oops("That should not have happened")
## x That should not have happened
Reference:
usethis::ui_done()
- i know this one!
10.3 Tibbles vs. data.frame
There are two main differences in the usage of a tibble vs. a classic data.frame: printing and subsetting.
- Printing
Each column reports its type, a feature borrowed from str():
tibble(
  a = lubridate::now() + runif(1e3) * 86400,
  b = lubridate::today() + runif(1e3) * 30,
  c = 1:1e3,
  d = runif(1e3),
  e = sample(letters, 1e3, replace = TRUE)
)
#> # A tibble: 1,000 x 5
#>   a                   b              c     d e
#>   <dttm>              <date>     <int> <dbl> <chr>
#> 1 2020-01-15 20:43:23 2020-01-22     1 0.368 n
#> 2 2020-01-16 14:48:32 2020-01-27     2 0.612 l
#> 3 2020-01-16 09:12:12 2020-02-06     3 0.415 p
#> 4 2020-01-15 22:33:29 2020-02-05     4 0.212 m
#> 5 2020-01-15 18:57:45 2020-02-02     5 0.733 i
#> 6 2020-01-16 05:58:42 2020-01-29     6 0.460 n
#> # … with 994 more rows
To see more output than the default display, you can explicitly print() the data frame and control the number of rows (n) and the width of the display. width = Inf will display all columns:
nycflights13::flights %>%
  print(n = 10, width = Inf)
You can also control the default print behaviour by setting options:
- options(tibble.print_max = n, tibble.print_min = m): if more than n rows, print only m rows. Use options(tibble.print_min = Inf) to always show all rows.
- Use options(tibble.width = Inf) to always print all columns, regardless of the width of the screen.
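Because these are global options, it may be worth restoring the old values after changing them. The sketch below uses the fact that options() returns the previous settings; the thresholds 20 and 5 are arbitrary, and I am assuming the installed tibble/pillar version still honours the tibble.* option names.

```r
library(tibble)

# options() returns the previous values, which makes them easy to restore.
old_opts <- options(tibble.print_max = 20, tibble.print_min = 5)

# mtcars has 32 rows (more than 20), so only the first 5 should be printed.
print(as_tibble(mtcars))

# Put the previous settings back.
options(old_opts)
```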
- Subsetting
To pull out a single variable, use $ and [[. [[ can extract by name or position; $ only extracts by name but is a little less typing.
df <- tibble(
  x = runif(5),
  y = rnorm(5)
)
# Extract by name
df$x
#> [1] 0.7330 0.2344 0.6604 0.0329 0.4605
df[["x"]]
#> [1] 0.7330 0.2344 0.6604 0.0329 0.4605
# Extract by position
df[[1]]
#> [1] 0.7330 0.2344 0.6604 0.0329 0.4605
To use these in a pipe, you’ll need to use the special placeholder .:
df %>% .$x
#> [1] 0.7330 0.2344 0.6604 0.0329 0.4605
df %>% .[["x"]]
#> [1] 0.7330 0.2344 0.6604 0.0329 0.4605
This works with a data.frame too:
df <- data.frame(
  x = runif(5),
  y = rnorm(5)
)
> df %>%
+   .$x
[1] 0.03347872 0.27371447 0.96202331 0.78821730 0.32745451
> df %>%
+   .[["x"]]
[1] 0.03347872 0.27371447 0.96202331 0.78821730 0.32745451
> df %>%
+   .[[1]]
[1] 0.03347872 0.27371447 0.96202331 0.78821730 0.32745451
# Actually, you can also write it like this
df %>% "[["("x")
Compared to a data.frame, tibbles are more strict: they never do partial matching, and they will generate a warning if the column you are trying to access does not exist.
An example of partial matching:
df1 <- data.frame(xyz = "a")
df2 <- tibble(xyz = "a")
str(df1$x)
#>  Factor w/ 1 level "a": 1
str(df2$x)
#> Warning: Unknown or uninitialised column: 'x'.
#>  NULL
Reference:
Advanced_R-3.6.4 Subsetting
Another example of partial matching:
$ is a shorthand operator: x$y is roughly equivalent to x[["y"]]. It’s often used to access variables in a data frame, as in mtcars$cyl or diamonds$carat. One common mistake with $ is to use it when you have the name of a column stored in a variable:
var <- "cyl"
# Doesn't work - mtcars$var translated to mtcars[["var"]]
mtcars$var
#> NULL
# Instead use [[
mtcars[[var]]
#>  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
The one important difference between $ and [[ is that $ does (left-to-right) partial matching:
x <- list(abc = 1)
x$a
#> [1] 1
x[["a"]]
#> NULL
To help avoid this behaviour I highly recommend setting the global option warnPartialMatchDollar to TRUE:
options(warnPartialMatchDollar = TRUE)
x$a
#> Warning in x$a: partial match of 'a' to 'abc'
#> [1] 1
(For data frames, you can also avoid this problem by using tibbles, which never do partial matching.)
Reference:
Advanced_R-4.3.2 $
10.4 Interacting with older code
Some older functions don’t work with tibbles. If you encounter one of these functions, use as.data.frame() to turn a tibble back to a data.frame:
class(as.data.frame(tb))
#> [1] "data.frame"
The main reason that some older functions don’t work with tibble is the [ function. We don’t use [ much in this book because dplyr::filter() and dplyr::select() allow you to solve the same problems with clearer code (but you will learn a little about it in vector subsetting). With base R data frames, [ sometimes returns a data frame, and sometimes returns a vector. With tibbles, [ always returns another tibble.
With a data.frame, selecting a single column with [ automatically converts the result to a vector:
df <- data.frame(
  x = runif(5),
  y = rnorm(5)
)
> df[, "x"]
[1] 0.02206585 0.98926964 0.95333742 0.79946273 0.19327569
> df[, c("x","y")]
           x           y
1 0.02206585 -1.32245311
2 0.98926964  0.59576966
3 0.95333742  0.03922984
4 0.79946273  1.09332833
5 0.19327569  0.88358188
# To prevent this behaviour,
# add drop = FALSE
> df[, "x", drop = FALSE]
           x
1 0.55692342
2 0.06739173
3 0.08648150
4 0.84341912
5 0.93941534
With a tibble, however:
df <- tibble(
  x = runif(5),
  y = rnorm(5)
)
> df[, "x"]
# A tibble: 5 x 1
      x
  <dbl>
1 0.422
2 0.519
3 0.881
4 0.114
5 0.956
> df[, c("x","y")]
# A tibble: 5 x 2
      x      y
  <dbl>  <dbl>
1 0.422 -1.03
2 0.519  0.605
3 0.881  0.414
4 0.114  0.820
5 0.956 -0.391
- Exercise 10.1
How can you tell if an object is a tibble? (Hint: try printing mtcars, which is a regular data frame).
Use is_tibble() to check whether a data frame is a tibble or not. The mtcars data frame is not a tibble.
is_tibble(mtcars)
#> [1] FALSE
But the diamonds and flights data are tibbles.
is_tibble(ggplot2::diamonds)
#> [1] TRUE
is_tibble(nycflights13::flights)
#> [1] TRUE
is_tibble(as_tibble(mtcars))
#> [1] TRUE
More generally, you can use the class() function to find out the class of an object. Tibbles have the classes c("tbl_df", "tbl", "data.frame"), while old data frames will only have the class "data.frame".
class(mtcars)
#> [1] "data.frame"
class(ggplot2::diamonds)
#> [1] "tbl_df"     "tbl"        "data.frame"
class(nycflights13::flights)
#> [1] "tbl_df"     "tbl"        "data.frame"
If you are interested in reading more on R’s classes, read the chapters on object oriented programming in Advanced R.
Advanced R scared me off once I reached the object oriented programming chapters, but it really is extremely well written.
- Exercise 10.2
Compare and contrast the following operations on a data.frame and equivalent tibble. What is different? Why might the default data frame behaviors cause you frustration?
df <- data.frame(abc = 1, xyz = "a")
df$x
#> [1] a
#> Levels: a
df[, "xyz"]
#> [1] a
#> Levels: a
df[, c("abc", "xyz")]
#>   abc xyz
#> 1   1   a
tbl <- as_tibble(df)
tbl$x
#> Warning: Unknown or uninitialised column: 'x'.
#> NULL
tbl[, "xyz"]
#> # A tibble: 1 x 1
#>   xyz
#>   <fct>
#> 1 a
tbl[, c("abc", "xyz")]
#> # A tibble: 1 x 2
#>     abc xyz
#>   <dbl> <fct>
#> 1     1 a
The $ operator will match any column name that starts with the name following it. Since there is a column named xyz, the expression df$x will be expanded to df$xyz. This behavior of the $ operator saves a few keystrokes, but it can result in accidentally using a different column than you thought you were using.
With data.frames, the type of object that [ returns depends on the number of columns selected. If it is one column, it won’t return a data.frame, but instead will return a vector. With more than one column, it will return a data.frame. This is fine if you know what you are passing in, but suppose you did df[ , vars] where vars was a variable. Then what that code does depends on length(vars) and you’d have to write code to account for those situations or risk bugs.
The above is the answer given in the solutions.
To sum up, the differences between the data.frame and tibble results above come down to two data.frame behaviours: partial matching by the $ operator, and [ automatically dropping a single selected column down to a one-dimensional vector.
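For the df[ , vars] situation described above, one way to keep the return type stable is to pass drop = FALSE explicitly. This is only a sketch; pick_columns() is a made-up helper, not something from the book or from tibble.

```r
library(tibble)

# Hypothetical helper: always returns a data frame, whatever length(vars) is.
pick_columns <- function(data, vars) {
  data[, vars, drop = FALSE]
}

df_demo  <- data.frame(abc = 1, xyz = "a")
tbl_demo <- as_tibble(df_demo)

class(df_demo[, "xyz"])                # a bare vector: the single column was dropped
class(pick_columns(df_demo, "xyz"))    # still a data.frame
class(pick_columns(tbl_demo, "xyz"))   # still a tibble
```

With a tibble the helper changes nothing, since [ already returns a tibble there; it only matters for plain data frames.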
- Exercise 10.3
If you have the name of a variable stored in an object, e.g. var <- "mpg", how can you extract the reference variable from a tibble?
You can use the double bracket, like df[[var]]. You cannot use the dollar sign, because df$var would look for a column named var.
This may not be much use for a one-off, but it should be very handy when writing loops.
Let's try it:
df <- tibble(
  x = runif(5),
  y = rnorm(5)
)
a <- "x"
> df[a]
# A tibble: 5 x 1
       x
   <dbl>
1 0.617
2 0.971
3 0.866
4 0.105
5 0.0429
> df[, a]
# A tibble: 5 x 1
       x
   <dbl>
1 0.617
2 0.971
3 0.866
4 0.105
5 0.0429
> df[[a]]
[1] 0.61743318 0.97105570 0.86600921 0.10470175 0.04291076
It seems to work for a data.frame as well:
df <- data.frame(
  x = runif(5),
  y = rnorm(5)
)
a <- "x"
> df[a]
          x
1 0.5506573
2 0.2944493
3 0.7896432
4 0.6288798
5 0.6678818
> df[, a]
[1] 0.5506573 0.2944493 0.7896432 0.6288798 0.6678818
> df[[a]]
[1] 0.5506573 0.2944493 0.7896432 0.6288798 0.6678818
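Here is a rough sketch of the loop use case mentioned above: iterating over column names held in a variable and extracting each column with [[. The data and the names df_loop and col_means are made up for illustration.

```r
library(tibble)

df_loop <- tibble(x = c(1, 2, 3), y = c(10, 20, 30))

# var holds a column name; df_loop$var would look for a column literally named "var".
col_means <- numeric(0)
for (var in names(df_loop)) {
  col_means[var] <- mean(df_loop[[var]])
}

col_means
```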
- Exercise 10.4
Practice referring to non-syntactic names in the following data frame by:
- Extracting the variable called 1.
- Plotting a scatterplot of 1 vs 2.
- Creating a new column called 3 which is 2 divided by 1.
- Renaming the columns to one, two and three.
annoying <- tibble(
  `1` = 1:10,
  `2` = `1` * 2 + rnorm(length(`1`))
)
To extract the variable named 1:
annoying[["1"]]
#>  [1]  1  2  3  4  5  6  7  8  9 10
or
annoying$`1`
#>  [1]  1  2  3  4  5  6  7  8  9 10
Plotting a scatterplot of 1 vs 2.
ggplot(annoying, aes(x = `1`, y = `2`)) +
  geom_point()
To add a new column 3 which is 2 divided by 1:
mutate(annoying, `3` = `2` / `1`)
#> # A tibble: 10 x 3
#>     `1`   `2`   `3`
#>   <int> <dbl> <dbl>
#> 1     1 0.600 0.600
#> 2     2 4.26  2.13
#> 3     3 3.56  1.19
#> 4     4 7.99  2.00
#> 5     5 10.6  2.12
#> 6     6 13.1  2.19
#> # … with 4 more rows
or
annoying[["3"]] <- annoying$`2` / annoying$`1`
or
annoying[["3"]] <- annoying[["2"]] / annoying[["1"]]
To rename the columns to one, two, and three, run:
annoying <- rename(annoying, one = `1`, two = `2`, three = `3`)
glimpse(annoying)
#> Observations: 10
#> Variables: 3
#> $ one   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
#> $ two   <dbl> 0.60, 4.26, 3.56, 7.99, 10.62, 13.15, 12.18, 15.75, 17.76,…
#> $ three <dbl> 0.60, 2.13, 1.19, 2.00, 2.12, 2.19, 1.74, 1.97, 1.97, 1.97
Picked up a new function here: glimpse() (from the tibble package).
This is like a transposed version of print(): columns run down the page, and data runs across. This makes it possible to see every column in a data frame. It's a little like str() applied to a data frame but it tries to show you as much data as possible. (And it always shows the underlying data, even when applied to a remote data source.)
> glimpse(mtcars)
Observations: 32
Variables: 11
$ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.…
$ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, …
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, …
$ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 1…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.9…
$ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, …
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, …
$ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, …
$ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, …
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, …
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, …
- Exercise 10.5
What does tibble::enframe() do? When might you use it?
tibble::enframe() converts named vectors to a data frame with names and values.
enframe(c(a = 1, b = 2, c = 3))
#> # A tibble: 3 x 2
#>   name  value
#>   <chr> <dbl>
#> 1 a         1
#> 2 b         2
#> 3 c         3
enframe() also has a counterpart, deframe().
From: Converting vectors to data frames, and vice versa
enframe(1:3)
#> # A tibble: 3 x 2
#>    name value
#>   <int> <int>
#> 1     1     1
#> 2     2     2
#> 3     3     3
enframe(c(a = 5, b = 7))
#> # A tibble: 2 x 2
#>   name  value
#>   <chr> <dbl>
#> 1 a         5
#> 2 b         7
# This should look a lot like the tribble example above
enframe(list(one = 1, two = 2:3, three = 4:6))
#> # A tibble: 3 x 2
#>   name  value
#>   <chr> <list>
#> 1 one   <dbl [1]>
#> 2 two   <int [2]>
#> 3 three <int [3]>
tribble(
  ~name,   ~value,
  "one",   1,
  "two",   2:3,
  "three", 4:6
)
# A tibble: 3 x 2
  name  value
  <chr> <list>
1 one   <dbl [1]>
2 two   <int [2]>
3 three <int [3]>
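Since deframe() is described as the counterpart of enframe(), a quick round trip is a simple way to see how the two relate. A minimal sketch, assuming only the tibble package; the object names are made up.

```r
library(tibble)

v <- c(a = 1, b = 2, c = 3)

tb_v   <- enframe(v)     # named vector -> two-column tibble (name, value)
v_back <- deframe(tb_v)  # two-column tibble -> named vector

names(tb_v)
all.equal(v_back, v)     # the round trip should give back the original vector
```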