Five Tips to Improve Your R Code

Overview

1. More fun to sequence from 1
2. vector() what you c()
3. Ditch the which()
4. factor that factor!
5. First you get the $, then you get the power
Sign off

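@drsimonj here with five simple tricks that I find myself sharing all the time with fellow R users to improve their code!
1. More fun to sequence from 1
Next time you use the colon operator to create a sequence from 1 (like 1:n), try seq().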

# Sequence a vector
x <- runif(10)
seq(x)
#>  [1]  1  2  3  4  5  6  7  8  9 10

# Sequence an integer
seq(nrow(mtcars))
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
#> [24] 24 25 26 27 28 29 30 31 32

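The colon operator can produce unexpected results that cause all sorts of problems without you noticing! Take a look at what happens when you try to sequence along the length of an empty vector: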

# Empty vector
x <- c()
1:length(x)
#> [1] 1 0
seq(x)
#> integer(0)

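You'll also notice that this saves you from calling functions like length(). Applied to a vector, seq() automatically creates a sequence from 1 to the length of the vector. (The exception is a single number n, which seq() expands to the sequence from 1 to n.)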
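A hedged aside on the edge cases: seq() treats a single number as an upper bound rather than as a one-element vector, so base R also provides seq_along() and seq_len(), which always sequence by length. A minimal sketch with illustrative variable names:

```r
# A length-one numeric vector: seq() treats the value as an upper bound
x <- 5
seq(x)        # 1 2 3 4 5  (sequence up to the value)
seq_along(x)  # 1          (sequence along the vector)

# Both helpers are safe for empty input
empty <- c()
seq_along(empty)        # integer(0)
seq_len(length(empty))  # integer(0)

# seq_len() covers the "sequence an integer" case
seq_len(nrow(mtcars))   # 1, 2, ..., 32
```

Whether you prefer seq() or these helpers is a matter of taste; the helpers simply make the intent (by position versus by count) explicit.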
2. vector() what you c()
Next time you create an empty vector with c(), try replacing it with vector("type", length).

# A numeric vector with 5 elements
vector("numeric", 5)
#> [1] 0 0 0 0 0

# A character vector with 3 elements
vector("character", 3)
#> [1] "" "" ""

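Doing this improves memory usage and increases speed! You typically know upfront what type of values will go into a vector and how long the vector will be. Using c() means R has to slowly work both of these things out. So give it a boost with vector()!
A good example of this is a for loop. People often write loops by declaring an empty vector and growing it with c(), like this: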

x <- c()
for (i in seq(5)) {
  x <- c(x, i)
}

#> x at step 1 : 1
#> x at step 2 : 1, 2
#> x at step 3 : 1, 2, 3
#> x at step 4 : 1, 2, 3, 4
#> x at step 5 : 1, 2, 3, 4, 5

Instead, pre-define the type and length with vector(), and refer to positions by index, like this:

n <- 5
x <- vector("integer", n)
for (i in seq(n)) {
  x[i] <- i
}

#> x at step 1 : 1, 0, 0, 0, 0
#> x at step 2 : 1, 2, 0, 0, 0
#> x at step 3 : 1, 2, 3, 0, 0
#> x at step 4 : 1, 2, 3, 4, 0
#> x at step 5 : 1, 2, 3, 4, 5

Here's a quick speed comparison:

n <- 1e5

x_empty <- c()
system.time(for(i in seq(n)) x_empty <- c(x_empty, i))
#>    user  system elapsed 
#>  15.238   2.327  17.650

x_zeros <- vector("integer", n)
system.time(for(i in seq(n)) x_zeros[i] <- i)
#>    user  system elapsed 
#>   0.007   0.000   0.007

That should be convincing enough!
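As a small sanity check, the sketch below confirms that the two loop styles build the same vector, and that base R's type-specific constructors such as integer(n) are shorthand for vector("integer", n). The helper names grow_vector and fill_vector are made up for illustration and are not part of the original post.

```r
# Hypothetical helper: grow a vector with c() (slow for large n)
grow_vector <- function(n) {
  x <- c()
  for (i in seq_len(n)) x <- c(x, i)
  x
}

# Hypothetical helper: pre-allocate, then fill by index
fill_vector <- function(n) {
  x <- vector("integer", n)
  for (i in seq_len(n)) x[i] <- i
  x
}

# Both approaches produce the same integer vector
identical(grow_vector(1000), fill_vector(1000))

# integer(), numeric(), character() and logical() are shorthands
identical(integer(5), vector("integer", 5))
```

Wrapping each call in system.time() with a larger n should reproduce the kind of gap shown in the timings, though the exact numbers depend on your machine and R version.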
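3. Ditch the which()
Next time you use which(), try to ditch it! People often use which() to get indices from some boolean condition, and then select values at those indices. This is not necessary.
Getting vector elements greater than 5: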

x <- 3:7

# Using which (not necessary)
x[which(x > 5)]
#> [1] 6 7

# No which
x[x > 5]
#> [1] 6 7

Or counting the values greater than 5:

# Using which
length(which(x > 5))
#> [1] 2

# Without which
sum(x > 5)
#> [1] 2

Why should you ditch which()? It's often unnecessary, and a boolean vector is all you need.
For example, R lets you select the elements flagged as TRUE in a boolean vector:

condition <- x > 5
condition
#> [1] FALSE FALSE FALSE  TRUE  TRUE
x[condition]
#> [1] 6 7

Also, when combined with sum() or mean(), boolean vectors can be used to get the count or proportion of values meeting a condition:

sum(condition)
#> [1] 2
mean(condition)
#> [1] 0.4
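The reason sum() and mean() work here is that R coerces TRUE to 1 and FALSE to 0 in arithmetic. A short sketch of that coercion, reusing the same x and condition as the example:

```r
x <- 3:7
condition <- x > 5

# TRUE/FALSE become 1/0 when treated as numbers
as.integer(condition)  # 0 0 0 1 1

# The sum is therefore the number of TRUE values...
sum(condition) == length(x[condition])

# ...and the mean is that count divided by the vector length
isTRUE(all.equal(mean(condition), sum(condition) / length(condition)))
```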

which() tells you the indices of the TRUE values:

which(condition)
#> [1] 4 5

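And while the results are not wrong, it's just not necessary. For example, I often see people combining which() and length() to test whether any or all values are TRUE. Instead, you just need any() or all():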

x <- c(1, 2, 12)

# Using `which()` and `length()` to test if any values are greater than 10
if (length(which(x > 10)) > 0)
  print("At least one value is greater than 10")
#> [1] "At least one value is greater than 10"

# Wrapping a boolean vector with `any()`
if (any(x > 10))
  print("At least one value is greater than 10")
#> [1] "At least one value is greater than 10"

# Using `which()` and `length()` to test if all values are positive
if (length(which(x > 0)) == length(x))
  print("All values are positive")
#> [1] "All values are positive"

# Wrapping a boolean vector with `all()`
if (all(x > 0))
  print("All values are positive")
#> [1] "All values are positive"

Oh, and it saves you a little time...

x <- runif(1e8)

system.time(x[which(x > .5)])
#>    user  system elapsed 
#>   1.156   0.522   1.686

system.time(x[x > .5])
#>    user  system elapsed 
#>   1.071   0.442   1.662
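One difference to keep in mind before dropping which() everywhere (an aside that the examples above do not cover): the two forms treat missing values differently. which() skips NA, while logical indexing keeps an NA in the result. A minimal sketch with made-up data:

```r
# Made-up data containing a missing value
x <- c(3, NA, 7)

x[which(x > 5)]  # 7      (which() ignores the NA)
x[x > 5]         # NA 7   (logical indexing keeps it)

# Counting: the NA propagates unless you ask for it to be removed
sum(x > 5)                # NA
sum(x > 5, na.rm = TRUE)  # 1
```

So if your data may contain NA, either handle it explicitly (for example with na.rm = TRUE or !is.na()) or accept which() as a deliberate choice in that spot.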

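4. factor that factor!
Ever removed values from a factor and found you're stuck with old levels that don't exist anymore? I've seen all sorts of creative ways to deal with this. The simplest solution is often just to wrap it in factor() again.
This example creates a factor with four levels ("a", "b", "c" and "d"):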

# A factor with four levels
x <- factor(c("a", "b", "c", "d"))
x
#> [1] a b c d
#> Levels: a b c d
plot(x)

If you drop all values for one level ("d"), the level is still recorded in the factor:

# Drop all values for one level
x <- x[x != "d"]

# But we still have this level!
x
#> [1] a b c
#> Levels: a b c d
plot(x)

A super simple way to remove it is to use factor() again:

x <- factor(x)
x
#> [1] a b c
#> Levels: a b c
plot(x)

This is typically a good solution to a problem that gets a lot of people mad. So save yourself a headache and factor that factor!
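For completeness, base R also ships droplevels(), which is not mentioned in the tip above but serves the same purpose and works on whole data frames too. A minimal sketch, where the data frame d and its columns g and v are made up for illustration:

```r
# Recreate the factor from the example and drop one level's values
x <- factor(c("a", "b", "c", "d"))
x <- x[x != "d"]

levels(x)              # "a" "b" "c" "d"  (the unused level lingers)
levels(factor(x))      # "a" "b" "c"
levels(droplevels(x))  # "a" "b" "c"

# droplevels() also accepts a data frame and cleans every factor column
d <- data.frame(g = x, v = seq_along(x))
levels(droplevels(d)$g)  # "a" "b" "c"
```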
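5. First you get the $, then you get the power
Next time you want to extract values from a data.frame column where the rows meet a condition, specify the column with $ before the rows with [.
Say you want the horsepower (hp) of cars with 4 cylinders (cyl), using the mtcars data set. You can write either of these: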

# rows first, column second - not ideal
mtcars[mtcars$cyl == 4, ]$hp
#>  [1]  93  62  95  66  52  65  97  66  91 113 109

# column first, rows second - much better
mtcars$hp[mtcars$cyl == 4]
#>  [1]  93  62  95  66  52  65  97  66  91 113 109

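The tip here is to use the second approach.
But why is that?
First reason: do away with that pesky comma! When you specify rows before the column, you need to remember the comma: mtcars[mtcars$cyl == 4, ]$hp. When you specify the column first, you are now referring to a vector, and the comma is not needed!
Second reason: speed! Let's test it out on a larger data frame: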

# Simulate a data frame...
n <- 1e7
d <- data.frame(
  a = seq(n),
  b = runif(n)
)

# rows first, column second - not ideal
system.time(d[d$b > .5, ]$a)
#>    user  system elapsed 
#>   0.497   0.126   0.629

# column first, rows second - much better
system.time(d$a[d$b > .5])
#>    user  system elapsed 
#>   0.089   0.017   0.107

Worth it, right?
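A plausible explanation for the gap is that the rows-first form builds an intermediate data frame containing every column before $ picks one, whereas the column-first form only ever subsets a single vector. The sketch below checks that the two forms return the same values on mtcars; the variable names are illustrative, and the [[ variant is an addition for the case where the column name is held in a string.

```r
cond <- mtcars$cyl == 4

rows_first   <- mtcars[cond, ]$hp     # subset the data frame, then take the column
column_first <- mtcars$hp[cond]       # take the column, then subset the vector
double_brack <- mtcars[["hp"]][cond]  # same idea when the column name is a string

# All three give the same numeric vector
identical(rows_first, column_first)
identical(column_first, double_brack)
```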
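Still, if you want to hone your skills as an R data frame ninja, I suggest learning dplyr. You can get a good overview on the dplyr website, or really learn the ropes with online courses such as srcmini's Data Manipulation in R with dplyr.
Sign off
Thanks for reading, and I hope this was useful for you.
For updates on recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.
If you'd like the code that produced this blog, check out the blogR GitHub repository.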