---
title: "Frustration: One Year With R"
author: "Reece Goding"
output:
  github_document:
    toc: true
    number_sections: true
---
```{r, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE)
```
# Introduction
What follows is an account of my experiences from about one year of roughly daily R usage. It began as a list of things that I liked and disliked about the language, but grew to be something huge. Once the list exceeded ten thousand words, I knew that it must be published. By the time it was ready, it had nearly tripled in length. It took five months of weekends just to get it all in R Markdown.
This isn't an attack on R or a pitch for anything else. It is only an account of what I've found to be right and wrong with the language. Although the length of my list of what is wrong far exceeds that of what is right, that may be my failing rather than R's. I suspect that my list of what R does right will grow as I learn other languages and begin to miss some of R's benefits. I welcome any attempts to correct this or any other errors that you find. Some major errors will have slipped in somewhere or other.
## Length
To start, I must issue a warning: This document is **huge**. I have tried to keep everything contained in small sections, such that the reader has plenty of points where they can pause and return to the document later, but the word count is still far higher than I'm happy with. I have tried to not be too petty, but every negative point in here comes from an honest position of frustration. There are some things that I really love about R. I've even devoted [an entire section to them](#what-r-does-right). However, if there is one point that I really want this document to get across, it's that R is filled to the brim with small madnesses. Although I can name a few major issues with R, its ultimate problem is the sum of its little problems. This document couldn't be short.
Also, on the topic of the sections in this document, watch out for all of the internal links. Nothing in R Markdown makes them look distinct from external ones, so you might lose your place if you don't take care to open all of your links in a new tab/window.
## Experience
Before I say anything nasty about R, a show of good faith is in order. In my year with R, I have done the following:
* Added almost 100 [R solutions](https://github.com/ReeceGoding/Rosetta-Code-Submissions) to [Rosetta Code](https://rosettacode.org/wiki/Rosetta_Code).
* Asked over 100 Stack Overflow R questions.
* Read both editions of [*Advanced R*](https://adv-r.hadley.nz/) from cover to cover. I didn't do the exercises, but I'd recommend the books to any serious R user.
* Read the first edition of [*R for Data Science*](https://r4ds.had.co.nz/) from cover to cover. It's a good enough non-technical introduction to the Tidyverse and a handful of other popular parts of R's ecosystem. However, I can't give it a strong recommendation for a variety of reasons:
* A lot of the exercises didn't specify what they wanted from your answer. This made checking your solutions against anyone else's quite difficult.
* It deliberately avoids the fundamentals of programming -- e.g. making functions, loops, and if statements -- until the second half. I therefore suspect that any non-novice would be better off finding an introduction to the relevant packages with their favourite search engine.
* Despite my efforts, I can find no "*Tidyverse for Programmers*" book. When one is inevitably written, it will make this book redundant for many potential readers.
* Read [*The R Inferno*](https://www.burns-stat.com/pages/Tutor/R_inferno.pdf) and some other well-known PDFs and manuals, such as [*Rtips. Revival 2014!*](https://pj.freefaculty.org/R/Rtips.html) and the official [*An Introduction to R*](https://cran.r-project.org/doc/manuals/r-release/R-intro.html), [*R Language Definition*](https://cran.r-project.org/doc/manuals/r-release/R-lang.html), and [*R FAQ*](https://cran.r-project.org/doc/FAQ/R-FAQ.html) manuals. Out of all of these, I must recommend [*The R Inferno*](https://www.burns-stat.com/pages/Tutor/R_inferno.pdf). The page count may be intimidating, but it's a delightfully fast read that mirrors many of my points. In many cases I have pointed the reader straight to its relevant section. Its only true fault is its age. I wish that I could claim that this document is a sequel to it, but I'm writing to review rather than advise.
* Update: After publishing this review, I skimmed a handful of books by John Chambers. There were some gems in them and I've mentioned them where needed, but I don't expect that I will ever read those books closely. I read them far too quickly for me to be able to say anything insightful, but I will confess that I feel fundamentally opposed to any programming textbooks that lack exercises.
* Made minor contributions to open source R projects.
At minimum, I can say with confidence that unless I happen to pick up an R-focused statistics textbook -- [the *R FAQ* has some tempting items](https://cran.r-project.org/doc/FAQ/R-FAQ.html#What-documentation-exists-for-R_003f) -- I've already done all of the R-related reading that I ever plan to do. All that is left for me is to use the language more and more. I hope that this section shows that I've given it a good chance before writing this review of it.
## Ignorance
I am not an R expert. I freely admit that I am lacking in the following regards:
* You can never have done enough statistics with R. I've mostly used R as a programming language rather than a statistics tool. My arguments would certainly be stronger if I had some published stats work to back them up, even just blogs. I might correct this at some point.
* The above point makes me more ignorant of formula objects (e.g. expressions like `foo ~ log(bar) * bar^2`), the `plot()` function, and factor variables than I ought to be. I saw a lot of them during my degree, but have long since forgotten them and have never needed to really pick them back up. For similar reasons, I have nothing to say on how hard it can sometimes be to read data into R.
* I haven't used enough of the community's favourite libraries. My biggest regret is my near-total ignorance of `data.table`. From [what little I've seen](https://atrebas.github.io/post/2019-03-03-datatable-dplyr/), it's a real pleasure. More practice with `ggplot2`, the wider Tidyverse, and R Markdown is also in order. If I continue to use R, I will gradually master these. For now, it suffices to say that my experience with base R far exceeds my knowledge of both the Tidyverse and many other well-loved packages. If I've missed any gems, let me know.
* I know almost nothing about Shiny, but it appears to be far better than Power BI.
* My experience with R's competitors is minimal. In particular, I have virtually no experience with Python or Julia. Most of my points on R are about R on its own merits, rather than comparing it to its competition. I plan to pick up Python soon, but Julia is in my distant future.
* Although I have used SQL professionally, how it compares to R has rarely crossed my mind. This suggests that I'm missing something about both languages. I plan to one day read a SQL book while having `dplyr` loaded.
* R's functional aspects make me wish that I knew more Lisp. All that I've done is finish reading *Structure and Interpretation of Computer Programs*. I will learn more, but R's clear Scheme inspiration makes Lisp books a lot less fun to read. It's like I've already been spoiled on some of the best bits.
* I haven't done enough OOP in R. My only real experience is with S3. S4 looks enough like CLOS that I expect that I will revisit it at some point after picking up Common Lisp, but that will just be to play around.
* I have never made a package for R and have no experience with the ecosystem surrounding that (e.g. `roxygen2`). I have no plans for this.
* I have no experience in developing large projects in R. This is likely a part of why I have never felt the need to make significant use of its OOP. I do not expect this to change.
The above list is unlikely to be exhaustive. I'm not against reading another book about R as a programming language, but [*Advanced R*](https://adv-r.hadley.nz/) seems to be the only one that anyone ever mentions. For the foreseeable future, the main thing that I plan to do to improve my evaluation of R is to learn Python. I'll probably read a book on it.
## Assumed Knowledge
You'd be a fool to read this without some experience of R. I don't think that I've written anything that requires an expert level of understanding, but you're unlikely to get much out of this document without at least a basic idea of R. I've also mentioned the Tidyverse a few times without giving it much introduction, particularly its `tibble` package. If you care enough about R to consider reading this document, then you really ought to be familiar with the most popular parts of the Tidyverse. It's rare for any discussion of R to go long without some mention of `purrr`, `dplyr` or `magrittr`.
## Disclaimer
This document started out as personal notes that I had no intention of publishing. There's a good chance that I might have copied and pasted someone's example from somewhere and totally forgotten that it wasn't my own. If you spot any plagiarism, let me know.
# General Feelings
My overall feelings about R are tough to quantify. As I mentioned near the start, its ultimate problem is the sum of its little problems. However, if I must speak generally, then I think that the problem with R is that it's always some mix of the following:
1. A statistics language with countless useful libraries and an excellent collection of mathematical tools.
2. A Scheme-inspired language that tries to be functional while maintaining a C-like syntax.
3. Decades of haphazard patches for S.
4. A collection of [semantic semtex](https://wiki.c2.com/?SemanticSemtex) that is powerful in the hands of a master and crippling in the hands of a novice.
When it's anything but #3, R is great. Statisticians and mathematicians love it for #1 and programmers love it for #2 and #4. If it weren't for #3, R would be an amazing -- albeit, domain-specific -- language, but #3 is such a big factor that it makes the language unpredictable, inconsistent, and infuriating. Mixed with #4, it makes being an R novice hellish. It gives me little doubt that R is not the ideal tool for many of the jobs that it wants to do, but #1 and #2 leave me with equally little doubt that R can be a very good tool.
# What R Does Right
As a final show of good faith, here is what I think R does right. In summary, along with having some great functional programming toys, R has some domain-specific tools that can work excellently when they're in their element. Whatever the faults of R, it's always going to be my first choice for some problems.
## Mathematics and Statistics
R wants to be a mathematics and statistics tool. Many of its fundamental design choices support this. For example, vectors are primitive types and R isn't at all shy about giving you a table or matrix as output. Similarly, the base libraries are packed with maths and stats functions that are usually a good combination of relevant, generic, and helpful. Some examples:
* Lots of stats is made easy. Commands like `boxplot(data)` or `quantile(data)` just work and there are lots of handy functions like `colSums()`, `table()`, `cor()`, or `summary()`.
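For a quick taste, using the built-in `mtcars` data set:
```{r}
# One-liners that would each be real work in a general-purpose language.
table(mtcars$cyl)                  # counts of cars per cylinder count
colSums(mtcars[, c("mpg", "hp")])  # column totals
quantile(mtcars$mpg)               # five-number summary of fuel economy
```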
* R is **the** language of research-level statistics. If it's stats, R either has it built-in or has a library for it. It's impossible to visit a statistics Q&A website and not see R code. For this reason alone, R will never truly die.
* The generic functions in the base stats library work magic. Whenever you try to print or summarise a model from there, you're going to get all of the details that you could ever realistically ask for and you're going to get them presented in a very helpful way. For example
```{r, echo = TRUE}
model <- lm(mpg ~ wt, data = mtcars)
print(model)
summary(model)
```
shows us plenty of useful information and works just as well even if we change to another type of model. Your mileage may vary with packages, but it usually works as expected. Other examples are easy to come by, e.g. `plot(model)`.
* The rules for subsetting data, although requiring mastery, are extremely expressive. Coupled with sub-assignment tricks like `result[result < 0.5] <- 0`, which often do exactly what you think they will, you can really save yourself a lot of work. Being able to demand precisely what parts of your data that you want to see or change is a really great feature.
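A small sketch of that sub-assignment style, with data invented for illustration:
```{r}
# Censor every value below 0.5 in a single expression.
scores <- c(Alice = 0.9, Bob = 0.2, Carol = 0.7, Dan = 0.4)
scores[scores < 0.5] <- 0
scores
# Subsetting is just as expressive for reading: ask for exactly what you want.
scores[c("Alice", "Carol")]
```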
* The factor and ordered data types are definitely the sort of tools that I want to have in a stats language. [They're a bit unpredictable](#factor-variables), but they're great when they work.
* It's no surprise that an R terminal has fully replaced my OS's built-in calculator. It's my first choice for any arithmetical task. When checking a gaming problem, I once opened R and used `(0.2 * seq(1000, 1300, 50) + 999) / seq(1000, 1300, 50)`. That would've been several lines in many other languages. Furthermore, a general-purpose language that was capable of the same would've had a call to something long-winded like `math.vec.seq()` rather than just `seq()`. I find the cumulative functions, e.g. `cumsum()` and `cummax()`, similarly enjoyable.
* How many other languages have matrix algebra fully built-in? Solving systems of linear equations is just `solve()`.
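For example, a two-equation system takes two lines:
```{r}
# Solve 2x + y = 5 and x + 3y = 10 with built-in matrix algebra.
A <- matrix(c(2, 1,
              1, 3), nrow = 2, byrow = TRUE)
solve(A, c(5, 10))  # x = 1, y = 3
```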
* The `rep()` function is outstandingly versatile. I'd give examples, but those found in its documentation are more than sufficient. Open up R and run `example(rep)` if you want to see them. If tricks like `cbind(rep(1:6, each = 6), rep(1:6, times = 6))` have yet to become second nature, then you're really missing out.
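A few of its forms, for anyone who hasn't run `example(rep)` yet:
```{r}
rep(1:3, times = 2)           # 1 2 3 1 2 3
rep(1:3, each = 2)            # 1 1 2 2 3 3
rep(1:3, times = c(3, 0, 2))  # 1 1 1 3 3 -- a per-element repeat count
# The cbind() trick from above: every coordinate pair of a 6-by-6 grid.
head(cbind(rep(1:6, each = 6), rep(1:6, times = 6)))
```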
* On top of replacing your computer's calculator, R can replace your graphing calculator as well. Unless you need to tinker with the axes or stop the asymptotes causing you problems -- problems that your graphing calculator would give you anyway -- functions like `curve(x / (x^3 + 9), -10, 10)` (output below) do exactly what you would expect, exactly how you would expect it.
```{r, echo = FALSE}
curve(x / (x^3 + 9), -10, 10)
```
## Names and Data Frames
These seem like trivial features, but the language's deep integration of them is extremely beneficial for manipulating and presenting your data. They assist subsetting, variable creation, plotting, printing, and even [metaprogramming](#first-class-environments).
* The ability to name the components of vectors, e.g. `c(Fizz=3, Buzz=5)`, is a nice trick for toy programs. The same syntax is used to much greater effect with lists, data frames, and S4 objects. However, it's good to show how far you can get with even the most basic case. Here's my submission for a [General FizzBuzz](https://rosettacode.org/wiki/General_FizzBuzz) task:
```{r, eval = FALSE}
namedGenFizzBuzz <- function(n, namedNums)
{
  factors <- sort(namedNums) # Required by the task: We must go from least factor to greatest.
  for(i in 1:n)
  {
    isFactor <- i %% factors == 0
    print(if(any(isFactor)) paste0(names(factors)[isFactor], collapse = "") else i)
  }
}
namedNums <- c(Fizz=3, Buzz=5, Baxx=7) # Notice that we can name our inputs without a function call.
namedGenFizzBuzz(105, namedNums)
```
I've little doubt that an R guru could improve this, but the amount of expressiveness in each line is already impressive. A lot of that is owed to R's love for names.
* Having a tabular data type in your base library -- the data frame -- is very handy for when you want a nice way to present your results without having to bother importing anything. Due to this and the aforementioned ability to name vectors, my output in coding challenges often looks nicer than most other people's.
* I like how data frames are constructed. Even if you don't know any R at all, it's pretty obvious what `data.frame(who = c("Alice", "Bob"), height = c(1.2, 2.3))` produces and what adding the `row.names = c("1st subject", "2nd subject")` argument would do.
* As a non-trivial example of how far these features can get you, I've had some good fun making alists out of syntactically valid expressions and using only those alists to build a data frame where both the expressions and their evaluated values are shown:
```{r, echo = TRUE}
expressions <- alist(-x ^ p, -(x) ^ p, (-x) ^ p, -(x ^ p))
x <- c(-5, -5, 5, 5)
p <- c(2, 3, 2, 3)
output <- data.frame(x,
                     p,
                     setNames(lapply(expressions, eval), sapply(expressions, deparse)),
                     check.names = FALSE)
print(output, row.names = FALSE)
```
(stolen from my submission [here](https://rosettacode.org/wiki/Exponentiation_with_infix_operators_in_(or_operating_on)_the_base)). Did you notice that the output knew the names of `x` and `p` without being told them? Did you also notice that a similar thing happened after our call to `curve()` earlier on? Finally, did you notice how easy it was to get such neat output?
## Outstanding Packages
[I've already admitted a great deal of ignorance of this topic](#ignorance), but there are some parts of R's ecosystem that I'm happy to call outstanding. The below are all things that I'm sure to miss in other languages.
* `corrplot`: It has less than ten functions, but it only needed one to blow my mind. Once you've even as much as read [the introduction](https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html), you will never try to read a correlation matrix again.
* `ggplot2`: I'm not experienced enough to know what faults it has, but it's fun to use. That single fact makes it blow any other graphing software that I've used out of the water: **It's fun**.
* `magrittr`: It sold me on pipes. I'd say that any package that makes you consider changing your programming style is automatically outstanding. However, the real reason why I love it is because whenever I've run `bigLongExpression()` in my console and decided that I really wanted `foo()` of it, it's so much easier to press the up arrow and type CTRL+SHIFT+M+"foo" than it is to do anything that results in `foo(bigLongExpression())` appearing. Maybe there's a keyboard shortcut that I never learned, but this isn't the only reason why I love `magrittr`. [I'll say more about it much later](#magrittr).
* `R Markdown` has served me well in writing this document. It's buggier than I'd like, rarely has helpful error messages, and does things that I can't explain or fix even after setting a bounty on Stack Overflow, but it's still a great way to make a document from R. It's the closest thing that I know of to an R user's LaTeX. I had to wait on [this bug fix](https://github.com/rstudio/rmarkdown/pull/2093) before I could start numbering my sections. Hopefully it didn't break anything.
## Vectorization
[When it's not causing you problems](#vectorization-again), the vectorization can be the best thing about the language:
* The vector recycling rules are powerful when mastered. Expressions like `c("x", "y")[rep(c(1, 2), times = 4)]` let you do a lot with only a little work. My favourite ever FizzBuzz could well be
```{r, eval = FALSE}
x <- paste0(rep("", 100), c("", "", "Fizz"), c("", "", "", "", "Buzz"))
cat(ifelse(x == "", 1:100, x), sep = "\n")
```
I wish that I could claim credit for that, but I stole it from an old version of [this page](https://rosettacode.org/wiki/FizzBuzz#R) and improved it a little.
* Basically everything is a vector, so R comes with some great vector-manipulation tools like `ifelse()` (seen above) and makes it very easy to use a function on a collection. Can you believe that `mtcars / 20` actually works?
* Tricks like `array / seq_along(array)` save a lot of loop writing.
* Even simple things like being able to subtract a vector from a constant (e.g. `10 - 1:5`) and get a sensible result are a gift when doing mathematics.
* Vectorization of functions is sometimes very useful, particularly when it lets you do what should've been two loops worth of work in one line. You'd be amazed by how often you can get away with calling `foo(1:100)` without needing to vectorize `foo()` yourself.
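The points above fit in one small chunk:
```{r}
x <- c(2, 4, 6, 8)
x / seq_along(x)  # 2 2 2 2 -- element-wise, no loop
10 - 1:5          # 9 8 7 6 5 -- scalars recycle over vectors
square <- function(n) n^2
square(1:5)       # 1 4 9 16 25 -- many functions vectorize for free
```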
## Functional Programming
R's done a good job of harnessing the power of functional languages while maintaining a C-like syntax. It makes no secret of being inspired by Scheme and has reaped many of its benefits.
### First-class Functions
It's impossible to not notice that functions are first-class in R. You're almost forced to learn functional programming idioms like mapping functions, higher-order functions, and anonymous functions. This is a good thing. Where else do you find a language with enough useful higher-order functions for the community to be able to discourage new users from writing loops? Some examples:
* All of the functional programming toys that you could want are easily found in R, e.g. closures, anonymous functions, and higher-order functions like `Map()`, `Filter()`, and `Reduce()`. Once you're used to them, you can write some very expressive code.
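For example:
```{r}
nums <- 1:10
Filter(function(n) n %% 2 == 0, nums)  # 2 4 6 8 10
Reduce(`+`, nums)                      # 55
Map(function(a, b) a * b, 1:3, 4:6)    # a list: 4, 10, 18
```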
* The apply family of functions is basically a set of DSL mapping functions for stats. Both `apply()` and `tapply()` can produce some very concise code, as can related functions like `by()`.
* Where else can you write functions that are both anonymous and recursive? Not that you should, of course.
* First-class functions sometimes interact with R's vectorization obsession in a very entertaining way. In how many other languages do you see somebody take a list of functions and, in a single line, call them all with a vector as a single argument to each function? Code like `lapply(listOfFuns, function(f) f(1:10))` is entirely valid. It calls each function in `listOfFuns` with the entire vector `1:10` as their first argument.
* Code like `Vectorize(foo)(1:100)` is not particularly hard to understand, but I'd struggle to name another language that lets me do the same thing with so much ease.
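As a sketch, with a made-up scalar-only function:
```{r}
# classify() only handles a single value...
classify <- function(n) if (n > 0) "positive" else "non-positive"
# ...but Vectorize() maps it over a whole vector in one call.
Vectorize(classify)(c(-1, 0, 1))
```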
### First-class Environments
Not only are functions first-class in R, environments are too. You therefore have lots of control over what environment an expression is evaluated in. This is an amazing source of power that tends to scare off beginners, but I cannot overstate how much of an asset it can be. If you're not familiar with the below, look it up. You will not regret it.
* Because R's environments are first-class, functions like `with()` and `within()` can generate them on the fly. I've seen this called "*data masking*". [*Advanced R* has a whole chapter on it](https://adv-r.hadley.nz/translation.html). It lets you do things like "*treat this list of functions as if it were a namespace, so I can write code that uses function names that I wouldn't dare use elsewhere*". This can also be used with data. For example, `tapply(mtcars$mpg, list(mtcars$cyl, mtcars$gear), mean)` uses `mtcars` far too many times, but `with(mtcars, tapply(mpg, list(cyl, gear), mean))` gives us an easy fix. Ad-hoc namespaces are an amazing thing to have, particularly when using functions that don't have a `data` argument (e.g. `plot()`).
* Modelling functions like `lm()` use the data-masking facilities that I've just described, as do handy functions like `subset()`. This saves incredible amounts of typing and massively increases the readability of your stats code. For example, `aggregate(mpg ~ cyl + gear, mtcars, mean)` returns very similar output to my above calls to `tapply()` without needing the complexity of using `with()`. It also allows for ridiculously concise code like `aggregate(. ~ cyl + gear, mtcars, mean)`.
* You can write your own data-masking functions. Doing so relies on controlling the non-standard evaluation of some of your arguments and is the closest thing that R has to metaprogramming. The names mechanisms do a lot to remove any ambiguity from your attempts at this. Stealing an example from the documentation, do I even need to explain what `transform(airquality, new = -Ozone, Temp = (Temp-32)/1.8)` does? Being able to do all of that in one line is outstanding. Without R allowing developers to add new functions like this, the Tidyverse would've been impossible.
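A toy sketch of such a function -- `keep_if()` is a name that I've invented for illustration, not anything from base R -- built from nothing but `substitute()` and `eval()`:
```{r}
# Evaluate the bare condition inside the data frame, like subset() does.
keep_if <- function(data, condition) {
  rows <- eval(substitute(condition), data, parent.frame())
  data[rows, , drop = FALSE]
}
keep_if(mtcars, mpg > 30)  # columns usable as bare names, no mtcars$ needed
```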
You might have spotted a pattern by now. R often lets you do very much with very little.
### Generic Functions
Generic function OO is pretty nice to have, even if I wouldn't use anything more complicated than S3. Being able to call `foo(whatever)` and be confident that it's going to do what I mean is always nice. Some positives of R's approach are:
* As mentioned [earlier on](#mathematics-and-statistics), S3 is used excellently in the base R stats library. Functions like `print()`, `plot()`, and `summary()` almost always tell me everything that I wanted to know and tell me them with great clarity.
* When you're not trapped by [the technicalities](#generic-functions-again), S3 is an outstandingly simple tool that does exactly what R needs it to do. Have a look at all of the methods that the pre-loaded libraries define for `plot()`
```{r, echo = TRUE}
methods(plot)
```
Because a statistician often only needs to dispatch on the type of model being used, S3 is the perfect tool to make functions like `plot()` easy to extend, meaning that it's easy to give your users exactly what they want. This isn't just theoretical either. The output for `methods(plot)` gets a lot longer if I go through my list of packages and start loading some random number of them. Go try it yourself!
* S3 generics and objects are very easy to write. The trade-off is that they don't do anything to protect you from yourself. However, being able to tell R to shut up and do what I want is a nice part of S3.
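A complete S3 class, constructor, and method fits in a few lines (the class name here is invented for illustration):
```{r}
as_temperature <- function(x) structure(x, class = "temperature")
print.temperature <- function(x, ...) {
  cat(format(unclass(x)), "degrees Celsius\n")
  invisible(x)
}
print(as_temperature(21.5))  # dispatches to print.temperature
```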
* I like the idea of S3's group generics, but I don't like not being able to make my own. However, I think that you can do it for S4.
* I have it on good authority that biology people often need to dispatch on more than one type of model at a time. This means that they shower the S4 object system with greater praise than what I've just given S3. Apparently, the Bioconductor project is the outstanding example of their love of it.
* S4 has multiple inheritance and multiple dispatch. I'm not going to say that multiple inheritance is a good thing, but it's not always found in other OOP systems.
* RC and the `R6` package are about as close as you're ever going to get to having Java-like OOP in a mostly functional language.
## Syntax
Some of the syntax is nice:
* [It can cause you problems](#sequences), but the `:` operator is handy for things like `for(i in 1:20){...}`.
* The `for` loop syntax is always the same: `for(element in vector){...}`. This means that there is no difference between the typical "*do n times*" case like `for(i in 1:n)` and the "*for every member of this collection*" case like `for(v in sample(20))`. I appreciate the consistency.
* The `...` notation has a very nice "*do what I mean*" feel, particularly when you're playing around with anonymous functions.
* Because of `repeat` loops, you never need to write `while(TRUE)`.
* [Although I have major issues with them](#subsetting), the rules for accessing elements sometimes give nice results. For example `array[c(i, j)] <- array[c(j, i)]` swaps elements `i` and `j` in a very clean way.
* It's nice to be able to do many variable assignments in one line e.g. `Alice <- Bob <- "Married"`. The best examples are when you do something like `lastElement <- output[lastIndex <- lastIndex + 1] <- foo`, letting you avoid having to do anything twice.
* The syntax for manipulating environments makes sense. You have to learn the difference between `<-` and `<<-`, but having environments use a subset of the list syntax was a very good idea. It was a similarly good idea to have a lot of R's internals (e.g. quoted function calls) be pairlists. This lets them be manipulated in exactly the same way as lists. The similarities between lists, pairlists, environments, and data frames go deeper than you may expect. For example, the `eval()` function lets you evaluate an expression in the specified environment, but it's happy to take any of the data types that I've just listed in place of an environment. At times, R almost lets you forget that lists and environments aren't the same thing.
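A quick demonstration of `eval()` accepting all of these interchangeably:
```{r}
expr <- quote(x + y)
eval(expr, list(x = 1, y = 2))            # 3 -- a list in place of an environment
eval(expr, data.frame(x = 1:3, y = 4:6))  # 5 7 9 -- or a data frame
e <- new.env()
e$x <- 10
e$y <- 20
eval(expr, e)                             # 30 -- or a real environment
```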
* The function names for making and manipulating S4 objects and functions are generally what you would expect them to be. For example, once you know `setClass()` and `setGeneric()`, you can probably guess what the corresponding function for methods is called.
## Miscellaneous Positives
* The built-in vectors `letters` and `LETTERS` come in handy surprisingly often. You'll see me use them a lot.
* The base library surprises me from time to time. It's always worth putting what you want in to a search engine; Sometimes, you'll find it. My most recent surprises were `findInterval()` and `cut()`.
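As a sketch of those two, with made-up break points:
```{r, echo = TRUE}
breaks <- c(0, 10, 20, 30)
findInterval(c(5, 25), breaks) #Which bin each value falls in.
cut(c(5, 25), breaks) #The same idea, but returning factor labels.
```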
* The `na.print` argument to `print()`, trivial as it is, can be a thing of beauty.
* R's condition handling and recovery system is built atop S3, making it extremely customisable by letting you add and handle custom metadata pretty much however you want. It also has some nice built-in types of conditions like errors, warnings, and messages, as well as having `finally` blocks in `tryCatch()`. The only real oddity of the system is that its handlers are functions of the condition, meaning that you will have to write strange code like `tryCatch(code, error = function(unused) "error", warning = function(unused) "warning")`. However, this is the price that you pay for being able to use code like `tryCatch(code, myError = function(e) paste0(e$message, " was triggered by ", e$call, ". Try ", e$suggestion))`. As a final point of interest, I've heard that R's condition handling system is one of the best copies of Common Lisp's, which I've heard awesome things about.
* Speaking of Lisp, [statistical Lisps used to be a thing](https://en.wikipedia.org/wiki/XLispStat). I've heard rumours of them still being used in Japan, but I can't find anything to back that up. Everything that I've found says that [R killed them off](https://www.jstatsoft.org/article/view/v013i07/v13i07.pdf). As far as I know, nobody's tried to make another such Lisp since the 90's. The fact that R can claim to have eradicated an entire category of language design is a great point in its favour. It's also possible evidence that I'm correct to say that R resembling C is to its benefit. However, I'd be overjoyed to hear of such Lisps making a comeback. Imagine if we're just one good Clojure library away from R surrendering to its Scheme roots and birthing a modern statistical Lisp.
* Base R is quite stable. Breaking changes are almost unheard of. I don't agree that they should be trying so hard to maintain compatibility with S, but this is an undeniable benefit of that decision.
* The fact that the Tidyverse is just an R library, rather than an entirely separate language, is a testament to R's metaprogramming. I like being able to define new infix operators and replacement functions, but the Tidyverse went above and beyond. Where else do you see an entire library of pipes? Until very recently, base R didn't even have pipes!
* The Tidyverse is proof that people are trying to fix R. Although that comes with the implication that R is broken, the fact that people are both willing and able to fix it definitely says something nice about R.
* I generally like the RStudio IDE. When Emacs is the only alternative that anyone takes seriously, you know that you've done a good job.
* There is only one implementation of R that anyone's ever heard of, so you never need to worry about undefined behaviour.
* There's something hilarious about R being a language where dangerous forbidden techniques actually run. In most other languages, comments that read `# FORBIDDEN` would indicate that the code produces some sort of error. [Not R](https://gist.github.com/hadley/1986a273e384fb2d4d752c18ed71bedf).
# What R Does Wrong
This is where this document starts to get long. Brace yourself. I really don't want to give off the impression that I hate R, but there are just too many things wrong with it. Again, R's ultimate problem is the sum of its small madnesses. No language is perfectly consistent or without compromises, but R's choices of compromises and inconsistencies are utterly unpredictable. I could deal with a handful of problems like the many that will follow, but this is far more than a handful.
March 2023 update: It occurs to me that I don't really have a summary of where R goes wrong. This is consistent with my above claim that R's ultimate problem is the sum of its little problems. However, one pattern has become obvious both from learning other languages and from reading the section headings: R doesn't have the right data structures; Roughly half of the subsections here are dedicated to complaining about them. This is no small complaint. [One of the big rules in the Unix Philosophy is that data structures are central to programming](https://www.catb.org/~esr/writings/taoup/html/ch01s06.html#rule5). If your data structures are wrong, then finding the correct algorithm becomes much harder. It's little wonder that a focus of the Tidyverse is to clean up one of R's primary data structures (the data frame) and then stick to it as much as possible.
## Lists
We'll start gentle. R's list type is an unavoidable part of the language, but it's very strange. As the following examples show, it's frequently a special case that you can rarely avoid.
* https://stackoverflow.com/questions/2050790/ does a good job of demonstrating that the list type is not like anything that another language would prepare you for. It and its many answers are very much worth a read.
* Lists are the parent class of data frames. Data frames are mandatory for anyone who wants to do stats in R and most of the problems with lists are inherited by data frames. This makes the oddities of lists unavoidable.
* Particularly when extracting single elements of lists, you need to be vigilant for whether R is going to give what you wanted or the list containing what you wanted. Most of this comes down to learning the distinction between `[` and `[[` and `sapply()` and `lapply()`. It's not too bad, but it's a complication.
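A minimal example of the distinction:
```{r, echo = TRUE}
x <- list(a = 1, b = 2)
x["a"] #A list containing what you wanted.
x[["a"]] #The element itself.
```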
* Because they won't attempt to coerce your inputs to a common type and because, unless you count matrices, you cannot nest vectors (e.g. `c(c(1, 2), c(3, 4))` is `1:4`), lists are what you're most likely to use when you want to put two or more vectors in one place. A lot of your lists will therefore be nested structures. This is not inherently a problem, but extracting elements from nested structures is hard, both in a general sense and specifically for R's lists. R does little to help you with this. Give https://stackoverflow.com/q/9624169/ and some of its answers a read. Why does this simple question get seven different answers? Do we really need libraries, anonymous functions, or knowing that `[` is a function, just for what ought to be a commonplace operation?
* Some common R functions do not work properly with lists. Functions like `sort()` and `order()` will not work at all, even if your list only contains numbers, and others will work but misbehave. For example, what do you expect to get from `c(someList, someVectorWithManyElements)`? You might expect a list that is now one item longer. Instead, you get your original list with every element of the vector appended to it in new slots, i.e. a list that is `length(someVectorWithManyElements)` longer.
```{r, echo = TRUE}
c(list(1, 2, 3), LETTERS[1:5])
```
The same output is given by `append()`. To get `list(1, 2, 3, LETTERS[1:5])`, you must do something like `x <- list(1, 2, 3); x[[4]] <- LETTERS[1:5]`.
* Note the use of `[[4]]` and not `[4]` above. Using `[4]` gets you a warning and the output `list(1, 2, 3, "A")`. The `[` version is intended for cases like `x[4:8] <- LETTERS[1:5]`, which gives the same output as `c()` did above. The `[`/`[[` distinction is a beginner's nightmare, as is R's tendency to give you many ways to do the same thing.
* Primarily due to the commonality of data frames, R has a handful of functions that are essentially "*foo, but the list version*". `lapply()` is the most obvious example.
* A few functions, such as `strsplit()`, can catch you off guard by returning a list when there's no obvious reason why a vector or matrix wouldn't have done. For `strsplit()` in particular, I think that the idea is that it's designed to be used on character vectors of lengths greater than one. However, in my experience, I almost always want a length-one version. I'd far rather have that function and `lapply()`/`sapply()` it as need be than have to constantly use `strsplit("foo")[[1]]`. Similarly, some functions, e.g. `merge()`, insist on returning data frames even when the inputs were matrices. Coercing these unwanted outputs in to what you actually wanted is often harder than it has any right to be.
I think that the ultimate problem with lists is that the right way to use them is not easy to guess from your knowledge of the language's other constructs. If everything in R worked like lists do, or if lists weren't so common, then you wouldn't really mind. As it is, you'll often make mistakes with lists and have to guess your way through correcting them. This isn't terrible. It's just annoying.
## Strings
R's strings suck. The overarching problem is that because there is no language-level distinction between character vectors and their individual elements, R's vectorization means that almost everything that you want to do with a string needs to be done by knowing the right function to use (rather than by using R's ordinary syntax). I find that the correct functions can be hard to find and use. Although it doesn't fix many of these issues, the common sentiment of "*just use `stringr`/`stringi`*" is difficult to dismiss.
* Technically, R doesn't even have a type for strings. You would want a string to be a vector of characters, but R's characters are already vectors, so R can't have a normal string type. Despite this, the documentation often uses the word "*string*". [The language definition will tell you how to make sense of that](https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Vector-objects), but I don't think that information is found anywhere in the sorts of documentation that you'll find in your IDE.
* It's a pain to have to account for how R has two types of empty string: `character(0)` and `""`.
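For example:
```{r, echo = TRUE}
length(character(0)) #An empty character vector.
length("") #A length-one vector holding an empty string.
nchar("") #Zero characters, despite the length of 1.
```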
* Character vectors aren't trivially split in to the characters that make up each element. For example, `"dog"[1]` is `"dog"` because `"dog"` is a vector of length one. The idiomatic way to split up a string in to its characters -- `strsplit("dog", "")` -- returns a list, so rather than just getting the `"d"` from `"dog"` by doing `"dog"[1]`, you have to do something like `unlist(strsplit("dog", ""))[1]` or `strsplit("dog", "")[[1]][1]`. The `substr()` function can serve you better for trivial jobs, but you often need `strsplit()`.
* Here's a challenge: Find the function that checks if `"es"` is in `"test"`. You'll be on for a while.
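Spoiler, for when you give up: one answer is `grepl()`, with `fixed = TRUE` if you want to dodge the regex machinery.
```{r, echo = TRUE}
grepl("es", "test", fixed = TRUE) #Would you have ever guessed that name?
```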
* Many of R's best functions for handling strings expect you to know regex and are all documented in the same place (grep {base}, titled "*Pattern Matching and Replacement*"). If you don't know regex -- exactly what I'd expect of R's target audience -- then you're thrice damned:
* Firstly, you're going to have a very hard time figuring out that the functions in `?grep` are probably what you need. A glance at their documentation suggests that they're difficult materials and therefore presumably not required for your task.
* Secondly, you're going to struggle to find the correct function in that documentation because the information that you need is surrounded by concepts that are not familiar to you. This reinforces your initial impression that you're in the wrong place, letting you stumble over the first point again.
* Thirdly and finally, the function names will be meaningless to you -- "*what the heck does `regexpr()` mean and how does it relate to `regexec()`?*" -- leaving you with no straws to clutch at.
* The right function for the job can still be tough to use. Compare the `stringr` answer to [this question](https://stackoverflow.com/questions/12427385/) to the base R answers. Or better yet, use `gregexpr()` or `gregexec()` for any task and then tell me with a straight face that you both understand their outputs and find them easy to work with.
```{r, echo = TRUE}
gregexpr("a", c("greatgreat", "magic", "not"))
```
* The most useful function for printing strings seems, counter-intuitively, to be `cat()` rather than `print()` or `format()`. For example, `print()` ignores your `\n` characters. The only time where `print()` comes in really handy for string-related stuff is when your inputs are either quoted or lists. In both cases, `print()` accepts these but `cat()` does not. Without significant coercion (mostly `deparse()`), I've yet to find a way to mix `\n` with quoted input. Most of my attempts to do anything clever that mixes printing and lists end with me giving up and using a data frame.
* [Without defining a new operator](https://stackoverflow.com/q/4730551/), you can't add strings in the way that other languages have taught you to, i.e. `"a"+"b"`. [John Chambers is against fixing this](https://www.stat.math.ethz.ch/pipermail/r-devel/2006-August/039004.html). I'm not convinced that he's wrong, but it is annoying.
* If you're converting numbers to characters, or using a function like `nchar()` that's meant for characters but accepts numbers, a shocking number of things will break when your numbers get big enough for R to automatically start using scientific notation.
```{r, echo = TRUE}
nchar(10000)
nchar(100000)
a <- 10000
nchar(a) == nchar(a * 100)
```
You're supposed to use `format()` to coerce these sorts of numbers in to characters, but you won't know about that until something breaks and `nchar()`'s documentation doesn't seem to mention it (try `?as.character`). The `format()` function also has a habit of requiring you to set the right flags to get what you want. `trim = TRUE` comes up a lot. If you're using a package or unfamiliar function, you're forced to check to see if the author dealt with these issues before you use their work. I'd rather just have a generic `nchar()`-like function that does what I mean. Would you believe that `nchar()`'s documentation says it's a generic? It's not lying and it later tells you that `nchar()` coerces non-characters to characters, but R sure does know how to mess with your expectations.
## Variable Manipulation
R has some problems with its general facilities for manipulating variables. Some of the following will be seen **every** time that you use R.
* It lacks both `i++` and `x += i`. It also lacks anything that would make these unnecessary, such as Python's `enumerate`.
* One day, you'll be tripped up by R's hierarchy of how it likes to simplify mixed types outside of lists. The basics are documented with the `c()` function. For example, `c(2, "2")` returns `c("2", "2")`. [An exercise from *Advanced R*]( https://adv-r.hadley.nz/vectors-chap.html#exercises-4) presents a few troubling cases:
* "*Why is `1 == "1"` true?*"
* "*Why is `-1 < FALSE` true?*"
* "*Why is `"one" < 2` false?*".
* To get complete information about the typing and structure of something, you will almost certainly need to call several functions. For example, do any of the following tell you everything about `x`?
```{r, echo = TRUE}
x <- diag(3)
x
typeof(x)
class(x)
attributes(x)
str(x)
dput(x) #Dirty trick, don't use in practice.
```
Among these, `str()` is the closest. However, you can see that it doesn't give you all of the class information. This doesn't improve if you add non-implicit classes to `x`, but I'm avoiding that topic for as long as I can.
* R likes to use "*double*" and "*numeric*" almost interchangeably. You've just seen one such example (`str(x)` vs `typeof(x)`).
* Integers are almost second class. `?integer` suggests that they're mostly for talking to other languages, but the problem seems to go deeper than that. It's as if R tries to avoid integers unless you tell it not to. For example, `4` is a double, not an integer. Why? Unless you're very careful, any integer that you give to R will eventually be coerced to a double.
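A quick demonstration:
```{r, echo = TRUE}
typeof(4) #A double, not an integer.
typeof(4L) #The L suffix is how you ask for an integer.
typeof(4L + 1) #Mixing with a double silently gives a double back.
```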
* There's no trivial way to express a condition like `1 < x < 5`. In a maths language, I'd expect that exact syntax to work. There's probably a good reason why it doesn't, and it's not at all hard to build an equivalent condition, but it still annoys me from time to time. I suspect that the `<-` syntax is to blame.
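The workaround is easy enough; it's just not the maths notation:
```{r, echo = TRUE}
x <- 3
1 < x & x < 5 #What you must write instead of 1 < x < 5.
```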
* The distinction between `<-` and `=` is something that you'd have to look up. I'd try to explain the difference, but from what I've gathered, the difference only matters when using `=` rather than `<-` causes bugs. Like most R users, I've picked up the habit of "*use `=` only for the arguments of functions and use `<-` everywhere else*".
* `<-` was designed for keyboards that don't exist any more. It's a pain to type on a modern system. [IDEs can fix this](https://github.com/ReeceGoding/Frustration-One-Year-With-R/issues/7).
* The day that you accidentally have `<` rather than `<-` without it throwing an error will be an awful one. The reverse can also happen. For example, there are two things that you could have meant by `if(x<-2)`.
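A sketch of the danger:
```{r, echo = TRUE}
x <- 10
if (x<-2) "assigned" #Parsed as x <- 2, which is truthy, so this runs.
x #Silently overwritten.
x <- 10
if (x < -2) "compared" else "not run" #The spaced version is a comparison.
```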
* `Y<--2` is a terrible way to have to say "*set `Y` to be equal to negative two*". `Y<<--2` is even worse.
* `<<-` is probably the only good thing about the convention of using `<-`, but it's only useful if you either know what a closure is and have reason to use one or if you're some sort of guru with R's first-class environments. You can sometimes use `<<-` to great effect without deliberately writing a closure, but it always feels like a hack because you're implicitly using one. For example, `replicate(5, x <- x+1)` and `replicate(5, x <<- x+1)` are very different, with the `<<-` case being a very cool trick,
```{r, echo = TRUE}
x <- 1
replicate(5, x <- x+1)
x
replicate(5, x <<- x+1)
x
```
but it only works because `replicate()` quietly wraps its second argument in an anonymous function.
* The idiomatic way to add an item to the end of a collection is `a[length(a) + 1] <- "foo"`. This is rather verbose and [a bit unpredictable when adding a collection to a list](#lists).
* A quote from [the language definition](https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Argument-evaluation): "*supplied arguments and default arguments are treated differently*". This usually doesn't trip you up, but you're certain to discover it on the first day that you use `eval()`. It has `parent.frame()` as one of its default arguments, but calling `eval()` with that argument supplied manually will produce different results than letting it be supplied by default.
```{r, echo = TRUE}
x <- 1
(function(x) eval(quote(x + 1)))(100)
(function(x) eval(quote(x + 1), envir = parent.frame()))(100)
```
An easier-to-discover example can be found in section 8.3.19 of [*The R Inferno*](https://www.burns-stat.com/pages/Tutor/R_inferno.pdf).
* Argument names can be partially matched. See [this link](https://rosettacode.org/wiki/Named_parameters#R) for some examples. I can't tell if it's disgusting or awesome, but it's definitely dangerous. If I called `f(n = 1)`, I probably didn't mean `f(nukeEarth = 1)`! At least it throws an error if it fails to partially match (e.g. if there were multiple valid partial matches). More on that [when I cover the `$` operator](#dangers-of-).
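To make that `f(n = 1)` example concrete (the function is, of course, made up):
```{r, echo = TRUE}
f <- function(nukeEarth = 0) nukeEarth
f(n = 1) #Partially matches nukeEarth. Hopefully that's what you meant!
```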
* Functions with a `...` argument won't throw errors when they've been called with arguments that they don't have or, even worse, ones that you misspelled. *Advanced R* has a great example in [its Functions chapter](https://adv-r.hadley.nz/functions.html#exercises-17). Would you have guessed that `sum(1, 2, 3, na.omit = TRUE)` returns `7`, not `6`? Similarly, [the Functionals chapter](https://adv-r.hadley.nz/functionals.html#argument-names) shows how this can lead to baffling errors and what strange things you have to do to help your users avoid them.
* `NaN`, `NULL`, and `NA` [have been accused](https://github.com/ReeceGoding/Frustration-One-Year-With-R/issues/5) of inconsistencies and illogical outputs, making it impossible to form a consistent mental model of them.
* For many other examples, see section 8 of [*The R Inferno*](https://www.burns-stat.com/pages/Tutor/R_inferno.pdf).
## Switch
R has some strange ideas about switch statements:
* It's not a special form of any kind; Its syntax is that of a normal function call. If I'm being consistent in my formatting, then I should be calling it "*R's `switch()` function*".
* It's only [about 70 lines of C code](https://github.com/wch/r-source/blob/trunk/src/main/builtin.c#L1009), suggesting that it can't be all that optimised.
* R doesn't have one switch statement; it has two. There is one where it switches on a numeric input and another for characters. The numeric version makes the strange assumption that the first argument (i.e. the argument being switched on) can be safely translated to a set of cases that must follow an ordering like "*if input is 1, do the first option, if 2, do the second...*". There is no flexibility like letting you start at 2, having jumps higher than 1, or letting you supply a default case. Reread that last one: R has a switch without defaults! It's frankly the worst switch that I've ever seen. The other version, the one that switches on characters, is more sensible. I'd give examples, but I don't know how to demonstrate a non-feature.
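The closest that I can get to a demonstration is showing what the character version has that the numeric version lacks:
```{r, echo = TRUE}
switch(2, "foo", "bar", "baz") #Numeric: the cases can only be 1, 2, 3...
switch("b", a = "foo", b = "bar", "baz") #Character: named cases...
switch("z", a = "foo", b = "bar", "baz") #...plus an unnamed default. The numeric version gets nothing like it.
```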
* As is a trend in R, both versions of switch are capable of silently doing nothing. For example, these do nothing:
```{r, echo = TRUE}
switch(3, "foo", "bar")
switch("c", a = "foo", b = "bar")
print(switch("c", a = "foo", b = "bar")) #Showing off the return value.
```
and they do it silently, returning `NULL`. I'd expect a warning message informing me of this, but there is no such thing. If you want that behaviour, then you have to write it yourself e.g. `switch("c", a = "foo", b = "bar", stop("Invalid input"))` or `switch("c", a = "foo", b = "bar", warning("Invalid input"))`. You can't do that with the numeric version, because R has a switch without defaults.
## Subsetting
Now for the nasty stuff. R's rules for selecting elements, deleting elements, and any other sort of subsetting require mastery. They're extremely powerful once mastered, but until that point, they make using R a nightmare. For a stats language, this is unforgivable. Before mentioning any points that are best put in their own subsections, I'll cover some more general points:
* You never quite know whether you want to use `x`, the name of `x`, or a function like `subset()`, `which()`, `Find()`, `Position()`, or `match()`. R's operators make this even more of a mess. You either want `$`, `[`, `@` or even `[[`. Making the wrong choice of `[x]`, `[x,]`, `[,x]` or `[[x]]` is another frequent source of error. You will get used to it eventually, but your hair will not survive the journey. [Similar stories](https://www.talyarkoni.org/blog/2012/06/08/r-the-master-troll-of-statistical-languages/) can be found about the apply family.
* The `[[` operator has been accused of inconsistent behaviour. [*Advanced R* covers this better than I could](https://adv-r.hadley.nz/subsetting.html#subsetting-oob). The short version is that it sometimes returns `NULL` and other times throws an error. Personally, I've never noticed these because I've rarely tried to subset `NULL` and I don't see any reason why you would use `[[` on an atomic vector. As far as I know, `[` does the same job. The only exception that I can think of is if your atomic vector was named. For example:
```{r, echo = TRUE}
a <- c(Alice = 1, Bob = 2)
a["Alice"]
a[["Alice"]]
```
* When assigning to anything more than one-dimensional, `object` and `object[]` behave differently. Compare:
```{r, echo = TRUE}
(c <- b <- diag(3))
b[] <- 5
c <- 5
b
c
```
This kind of makes sense, but it will trip you up.
* The syntax for deleting elements of collections by index can be rather verbose. You can't just pop out an element; you have to write `vect <- vect[-index]` or `vect <- vect[-c(index, nextIndex, ...)]`.
* R is 1-indexed, but accessing element 0 of a vector gives an empty vector rather than an error. This probably makes sense considering that index -1 deletes element 1, but it's a clear source of major errors.
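Compare:
```{r, echo = TRUE}
x <- c(10, 20, 30)
x[0] #Not an error. Just an empty vector.
x[-1] #Negative indices delete, so element 1 is dropped.
```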
* With the sole exception of environments, every named object in R is allowed to have duplicate names. I guarantee that will one day break your subsetting (e.g. see section 8.1.19 of [*The R Inferno*](https://www.burns-stat.com/pages/Tutor/R_inferno.pdf)). Fortunately, the constructor for data frames has a `check.names` argument that corrects duplicates by default. Unfortunately, it does this silently, so you may not notice that some of your column names have been changed. Another oddity is that many functions that work on data frames, most notably `[`, will silently correct duplicated names even if you told the original data frame to not do so. Why even let me have duplicated names if you're going to make it so hard to keep them?
```{r, echo = TRUE}
data.frame(x = 1:3, x = 11:13)
#Notice the x.1? You didn't ask for that. To get x twice, you need this:
correctNames <- data.frame(x = 1:3, x = 11:13, check.names = FALSE)
correctNames
correctNames[1:3, ]#As expected.
correctNames[1:2]#What?
```
Not only is this behaviour inconsistent, it is **silent**; No warnings or errors are thrown by the above code.
Tibbles are much better about this:
```{r, echo = TRUE}
library(tibble)
#We can't repeat our original first line, because tibble(x = 1:5, x = 11:15) throws an error:
## > tibble(x = 1:5, x = 11:15)
## Error: Column name `x` must not be duplicated.
## Use .name_repair to specify repair.
#We follow the error's advice.
#The .name_repair argument provides a few useful options, so we must pick one.
correctNames <- tibble(x = 1:5, x = 11:15, .name_repair = "minimal")
correctNames
correctNames[1:3,]#Good
correctNames[1:2]#Still good!
```
This may seem like an isolated example. It isn't. A related example is that the `check.names` argument in `data.frame()` is very insistent on silently doing things, even to the point of overruling arguments that you explicitly set. For example, these column names aren't what I asked for.
```{r, echo = TRUE}
as.data.frame(list(1, 2, 3, 4, 5), col.names = paste("foo=bar", 6:10))
as.data.frame(list(1, 2, 3, 4, 5), col.names = paste("foo=bar", 6:10), check.names = FALSE)#The fix.
```
I think R does this to ensure your names are suitable for subsetting. Subsetting with non-unique or non-syntactic column names could be a pain, but the decision to not inform the user of this correction is baffling. Even if you're fortunate enough to notice the silent changes, the lack of a warning message will leave you with no idea how to correct them. You could perhaps argue that duplicated names are the user's fault and they deserve what they get, but that argument falls apart for non-syntactic names. Who hasn't put a space or an equals sign in their column names before? Mind, tibbles aren't much better when it comes to non-syntactic names. Neither `tibble("Example col" = 4)` nor `data.frame("Example col" = 4)` warn you of the name change.
* For what I believe to be memory reasons, objects of greater than one dimension are stored in column order rather than row order. Quick, what output do you expect from `matrix(1:9, 3, 3)`?
```{r, echo = TRUE}
matrix(1:9, 3, 3)
```
This gives us a matrix with first row `c(1, 4, 7)`. This goes against the usual English convention of reading from left to right. It is also inconsistent with functions like `apply()`, where `MARGIN = 1` corresponds to their by-row version and `MARGIN = 2` is for by-column (if R privileges columns, shouldn't they be the `= 1` case?). This means that you can never really be sure if R is working in column order or row order. This is bad enough on its own, but it can also be a source of subtle bugs when working with matrices. Many mathematical functions don't see any difference between a matrix and its transpose.
* There is no nice way to access the last element of a vector. The idiomatic way is `x[length(x)]`. The only good part of this is that `x[length(x) - 0:n]` is a very nice way to get the last `n + 1` elements. You could use `tail()`, but Stack Overflow tells me it's very slow.
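For example:
```{r, echo = TRUE}
x <- c(10, 20, 30, 40, 50)
x[length(x)] #The idiomatic last element.
x[length(x) - 0:2] #The last three elements, newest first.
tail(x, 1) #Clearer, but reportedly slower.
```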
* The `sort()` and `order()` functions are the main ways to sort stuff in R. If you're trying to sort some data by a particular variable, then R counter-intuitively wants you to use `order()` rather than `sort()`. The syntax for `order()` doesn't help matters. It returns a permutation, so rather than `order(x, params)`, you will want `x[order(params),]`. My only explanation for this is that it makes `order()` much easier to use with the `with()` function. For example, `data[with(data, order(col1, col2, col3)),]` is perhaps more pleasant to write than the hypothetical `order(data, data$col1, data$col2, data$col3)`. The Tidyverse's `dplyr` solves these problems: `dplyr::arrange(data, col1, col2, col3)` does what you think. I'd much rather use `arrange(mtcars, cyl, disp)` than `mtcars[with(mtcars, order(cyl, disp)),]`.
* The `order()` case above illustrates another frequent annoyance with subsetting. Rather than asking for what you want, you often need to generate a vector that matches up to it. A collection of booleans (R calls these "*logical vectors*") is one of the most common ways to do this, with `duplicated()` being a typical example.
```{r, echo = TRUE}
head(Nile)
duplicated(head(Nile))
head(Nile)[duplicated(head(Nile))]
```
This means that you will usually be asking for `items[bools]` (and maybe `[,bools]` or `[bools,]`...) in order to get the items that you want. There is great power in being able to do this, but having to do it is annoying and can catch you off guard. For example, what do you expect `lower.tri()` to return when called on a matrix? What you wanted from `lower.tri(mat)` is probably what you get from `mat[lower.tri(mat)]`. Also, don't expect a helpful error message if your construction of `bools` is wrong. As I'll discuss later on, [the vector recycling](#vectorization-again) rules will often make an incorrect construction give sensible-looking output.
* For reasons that I cannot explain, `aperm(x, params)` is the correct syntax, not `x[aperm(params)]` or anything like it. I think that it's trying to be consistent with R's ideas of how to manipulate matrices, but it's yet another source of confusion. I don't want to have to think about if I'm treating my data like a matrix or like a data frame.
* Good luck trying to figure out how to find a particular sequence of elements within a vector. For example, try finding if/where the unbroken vector `1:3` has occurred in `sample(6, 100, replace = TRUE)`. You're best off just writing the `for` loop.
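Here's a sketch of that loop. The function name and its details are my own invention:
```{r, echo = TRUE}
#Returns the starting index of every occurrence of needle in haystack.
findRuns <- function(needle, haystack) {
  hits <- integer(0)
  n <- length(needle)
  for (i in seq_len(length(haystack) - n + 1)) {
    if (all(haystack[i:(i + n - 1)] == needle)) {
      hits <- c(hits, i)
    }
  }
  hits
}
findRuns(1:3, c(5, 1, 2, 3, 6, 1, 2, 3))
```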
### Combining Operators
This one isn't too bad, but it's worth a mention. Combining operations can lead to some counter-intuitive results:
* If `a <- 1:5`, what do you expect to get from `a[-1] <- 12:15`? Do you expect `a[1]` to be removed or not? This is great once you know how it works, but it's confusing to a beginner.
```{r, echo = TRUE}
a <- 1:5
a[-1] <- 12:15
a
```
* Because `data[-index]` can be used to remove elements and `data["colName"]` can be used to select elements, you might expect `data[-"colName"]` or `data[-c("colName1", "colName2")]` to work. You would be wrong. Both throw errors.
```{r, echo = TRUE}
## > mtcars[-"wt"]
## Error in -"wt" : invalid argument to unary operator
```
* Attempting to remove both by index and by name at the same time will never work. For example, `mtcars[-c(1, "cyl")]` is an error and `mtcars[c(1, "cyl")] <- NULL` will only remove the `cyl` variable. Weirdly enough, I can't actually show this `mtcars[c(1, "cyl")] <- NULL` example. R is perfectly happy to show it, but R Markdown isn't. What happens is that `c(1, "cyl")` is coerced to `c("1", "cyl")`. After this, R does not inform you that there is no `1` column to remove.
Now for the serious stuff...
### Removing Dimensions
This issue is notorious: R likes to remove unnecessary dimensions from your data in ways that are not easily predicted, forcing you to waste time preventing them. Rumour has it that this can be blamed on S being designed as a calculator rather than as a programming language. I can't cite that, but it's easy to believe. No programmer would include any of the below in a programming language.
* Unless you add `, drop=FALSE` to all of your data selection/deletion lines, you run the risk of having all of your code that expects your data to have a particular structure unexpectedly break. This gives no errors or warnings. Compare:
```{r, echo = TRUE}
(mat <- cbind(1:4, 4:1))
mat[, -1]
mat[, -1, drop=FALSE]
```
and you will see that one of these is not a matrix. Data frames have the same issue unless you do all of your subsetting in a 1D form.
```{r, echo = TRUE}
mat <- cbind(1:4, 4:1)
(frame <- as.data.frame(mat))
frame[, -1]
frame[, -1, drop=FALSE]
frame[-1]#1D subsetting
```
The Tidyverse, specifically `tibble`, does its best to remove this.
```{r, echo = TRUE}
library(tibble)
mat <- cbind(1:4, 4:1)
(tib <- as_tibble(mat))
tib[, -1]
tib[, -1, drop=FALSE]
tib[-1]
tib[, -1, drop=TRUE]
```
You can think of tibbles as having `drop=FALSE` as their default. I can't explain why base R doesn't do the same. It must be a compromise of some sort, either for matrix algebra or for making interactive console use nicer.
* Update: Section 6.8 of the *Software for Data Analysis: Programming with R* book by John Chambers offers a partial explanation: "_The default is, and always has been, `drop=TRUE`; probably an unwise decision on our part long ago, but now one of those back-compatibility burdens that are unlikely to be changed._"
* The `drop` argument is even stranger than I'm letting on. Its defaults differ depending on whether only one column remains or only one row remains. To quote the documentation (`?"[.data.frame"`): "_The default is to drop if only one column is left, but **not** to drop if only one row is left_". Unlike the previous point, I can sort of make sense of this. For example, a single column can only ever be one type (even if that may be a container for mixed types, such as a list) but a single row could easily be a mix of types. Dropping a row of mixed types will just give you a really ugly list, so you'd much rather have a data frame. With a column, it's only with years of experience that the community has realised that they probably still want the data frame; it's nowhere near as obvious that the vector is not preferable.
* As you can tell by taking a close look at the documentation for `[` and that of `[.data.frame`, the `drop` argument does not do the same thing for arrays and matrices as it does for data frames. This means that my earlier example could be dishonest. However, the confusion that you would need to overcome in order to check for if I've been dishonest is so great that it proves that there's definitely something wrong with the `drop` argument.
* You may think that `object` and `object[,]` are the same thing. They are not. You would expect and get an error if `object` is one-dimensional. However, if it's a data frame or matrix with one of its dimensions having size 1, then you do not get an error and `object` and `object[,]` are very different.
```{r, echo = TRUE}
library(tibble)
colMatrix <- matrix(1:3)
colMatrix
colMatrix[,]
rowMatrix <- matrix(1:3, ncol = 3)
rowMatrix
rowMatrix[,]
colFrame <- as.data.frame(colMatrix)
colFrame
colFrame[,]
rowFrame <- as.data.frame(rowMatrix)
rowFrame
rowFrame[,]
colTib <- as_tibble(colMatrix)
colTib
colTib[,]
rowTib <- as_tibble(rowMatrix)
rowTib
rowTib[,]
```
Can you guess why? It's because the use of `[` makes R check if it should be dropping dimensions. This makes `object` and `object[,,drop=FALSE]` equivalent, whereas `object[,]` is a vector rather than whatever it was originally. Tibbles, of course, don't have this issue.
* If you've struggled to read this section, then you're probably now aware of another point: It's really easy to get the commas for `drop=FALSE` mixed up. What do you think `data[4, drop=FALSE]` is? If `data` is a data frame, you get column 4 and a warning that the `drop` argument was ignored. Did you expect row 4? Whether you did or not, you should be able to see why somebody may come to the opposite answer. Although I see no sensible alternative, the `drop` argument needing its own comma is terrible syntax for a language where a stray comma is the difference between your data's life and death. This is made even worse by the syntax for `[` occasionally needing stray commas. Expressions like `data[4,]` are commonplace in R, so it's far too easy to forget that you needed the extra comma for the `drop` argument.
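To see the comma trap in action:
```{r, echo = TRUE}
identical(mtcars[4, drop = FALSE], mtcars[4]) #Column 4 and a warning, not row 4.
head(mtcars[4, , drop = FALSE])               #Row 4 needs its own comma.
```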
### Dangers of $
The `$` operator is both silently hazardous and redundant:
* As an S3 generic, you can never be certain that `$` does what you want it to when you use it on a class from a package. For example, it's common knowledge that base R's `$` and the Tidyverse's `$` are not the same thing. In fact, `$` does not even behave consistently in base R. Compare the following partial matching behaviour:
```{r, echo = TRUE}
library(tibble)
list(Bob = 5, Dobby = 7)$B
env <- list2env(list(Bob = 5, Dobby = 7))
env$B
data.frame(Bob = 5, Dobby = 7)$B
tibble(Bob = 5, Dobby = 7)$B
```
For what it's worth, replacing `Dobby` with `Bobby` gives more consistent results.
```{r, echo = TRUE}
library(tibble)
list(Bob = 5, Bobby = 7)$B
env <- list2env(list(Bob = 5, Bobby = 7))
env$B
data.frame(Bob = 5, Bobby = 7)$B
tibble(Bob = 5, Bobby = 7)$B
```
In theory, I should note that `[` and `[[` are also S3 generics and therefore should share this issue. Aside from the `drop` issues above, I rarely notice such misbehaviour in practice.
* Consistency aside, partial matching is inherently dangerous. `data$Pen` might give the `Penetration` column if you forgot that you removed the `Pen` column. By default, R does not give you any warnings when partial matches happen, so you won't have any idea that you got the wrong column.
* The documentation for `$` points out its redundancy in base R: "*`x$name` is equivalent to `x[["name", exact = FALSE]]`*". In other words, even if I want the behaviour of `$`, I can get it with `[[`. Another benefit of `[[` is that it will only partially match if you tell it to (use `exact = FALSE`). That matters because...
* The partial matching of `$` can be even worse than I've just described. If there are multiple valid partial matches, rather than get any of them, you get `NULL`. This is what happened with the Bob/Bobby example above. To give another example, `mtcars$di` and `mtcars$dr` both give sensible output because there is only one valid partial match, but `mtcars$d` is just `NULL`. I'm largely okay with this behaviour, but you don't even get a warning!
```{r, echo = TRUE}
mtcars$di
mtcars$dr
mtcars$d
```
* Tibbles try to fix the partial-matching issues of `$` by completely disallowing partial matching. They will not partially match even if you tell them to with `[[, exact=FALSE]]`. If you try to partially match anyway, it will give you a warning and return `NULL`. I sometimes wonder if it should be an error.
```{r, echo = TRUE}
library(tibble)
mtTib <- as_tibble(mtcars)
mtTib$di
mtTib$dr
mtTib$d
mtcars[["d", exact = FALSE]]
mtTib[["d", exact = FALSE]]
```
* On the base R side, there is a global option that makes `$` give you warnings whenever partial matching happens. It's disabled by default. Common sense suggests it should be otherwise.
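The option in question is `warnPartialMatchDollar` (off by default; `warnPartialMatchArgs` and `warnPartialMatchAttr` are its siblings):
```{r, echo = TRUE}
options(warnPartialMatchDollar = TRUE)
list(Bob = 5, Dobby = 7)$B #Still matches, but now at least warns.
options(warnPartialMatchDollar = FALSE)
```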
* The `$` operator is another case of R quietly changing your data structures. For example, I would call `mtcars$mpg` unreadable.
```{r, echo = TRUE}
mtcars$mpg
typeof(mtcars$mpg)
```
You probably wanted `mtcars["mpg"]`
```{r, echo = TRUE}
mtcars["mpg"]
typeof(mtcars["mpg"])
```
and you definitely did not want `mtcars[, "mpg"]` or `mtcars[["mpg"]]`, which both give the same output as using `$`.
```{r, echo = TRUE}
mtcars[, "mpg"]
mtcars[["mpg"]]
```
Would you have guessed that? Tibbles share the above behaviour with `$` and `[[`, but keep `["name"]` and `[, "name"]` identical due to their promise to not drop dimensions with `[`.
* The `$` operator does not have any uses beyond selection. For example, there is no way to combine `$` with operators like `-` and there's no way to pass arguments like `drop=FALSE` to it.
* `$` is not allowed for atomic vectors like `c(fizz=3, buzz=5)`, unlike `[` and `[[`. This is particularly annoying when dealing with named matrices because you end up having to use `mat[, "x"]` where `mat$x` should have done.
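For example:
```{r, echo = TRUE}
v <- c(fizz = 3, buzz = 5)
v["fizz"]
v[["fizz"]]
## > v$fizz
## Error in v$fizz : $ operator is invalid for atomic vectors
```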
* Section 8.1.21 of [*The R Inferno*](https://www.burns-stat.com/pages/Tutor/R_inferno.pdf): There exists a `$<-` operator. You hardly ever see it used. [*The R Inferno*](https://www.burns-stat.com/pages/Tutor/R_inferno.pdf) points out that it does not do partial matching, even for lists, unlike `$`. This is actually documented behaviour -- in fact, `?Extract` mentions it twice -- but I challenge you to find it. I can see why it would be difficult to make a `$<-` with partial matching, but making `$<-` inconsistent with `$` is just laughable.
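A sketch of that inconsistency with a toy list:
```{r, echo = TRUE}
l <- list(Bob = 5)
l$B       #Partial matching finds Bob.
l$B <- 99 #But assignment creates a brand new element named B.
l
```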
In conclusion, once you know the difference between `["colname"]` and `[, "colname"]`, `$` is only useful if it's making your code cleaner, saving you typing, or if you actually want the partial matching. Personally, I'm uncomfortable with the inherent risks of partial matching, so `$` is only really useful for interactive use and my IDE's auto-completion. That might even be its intended job. But if that is the case, nobody warns you of it.
### Indistinguishable Errors
When dealing with any sort of collection, any of the following mistakes can give indistinguishable results. This can make your debugging so messy that by the time that you're done, you don't know what was broken.
* Trying to select an incorrect sequence of elements. This can be caused by `:` or `seq()` misbehaving or by simple user error. [A tiny bit more on that later](#sequences).
* The vector recycling rules silently causing the vector that you used to select elements to be recycled in an undesired way. [More on that later](#vectorization-again).
* Selecting an out-of-bounds value. You almost never get an error or warning when you do this. For example, both out-of-bounds positive numbers and logical vectors that are longer than the vector that you're subsetting silently return `NA` for the inappropriate values.
```{r, echo = TRUE}
length(LETTERS)
LETTERS[c(1, 5, 20, 100)]
LETTERS[rep(TRUE, 100)]
```
Again, as with many of the issues that we've mentioned recently, **this happens silently**.
* Accessing/subsetting a collection in the wrong way. For example, wrongly using any of `[c(x, y)]`, `[x, y]`, or `[cbind(x, y)]`, selecting `[x]` rather than `[x, ]`, `[[x]]`, or `[, x]`, using the wrong `rbind()/cbind()`, or an error in your call to anything like `subset()` or `within()`.
* Selecting element 0.
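That last one is silent too:
```{r, echo = TRUE}
LETTERS[0] #A zero-length vector. No error, no warning.
```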
* Any sort of off-by-one error, e.g. a modulo mistake, a genuine fencepost error, or R's 1-indexing causing you to trip up.
* Misuse of searching functions like `which()`, `duplicated()`, or `match()`.
This list also reveals another issue with subsetting: There are too many ways to do it...
### Named Atomic Vectors
...and they don't all work everywhere. For example, there's a wide range of tools for using names to work with lists and data frames, but very few of them work for named atomic vectors (which includes named matrices).
* The `$` operator simply does not work.
* Although `namedVector["name"]` can be used for subsetting and subassignment, `namedVector["name"] <- NULL` throws an error. For a list or data frame, this would have deleted the selected data points.
```{r, echo = TRUE}
typeof(letters)
named <- setNames(letters, LETTERS)
tail(named)
named["Z"]
named["Z"] <- "Super!"
tail(named)
#So subsetting and subassignment work just fine. However, for NULL...
## > named["Z"] <- NULL
## Error in named["Z"] <- NULL : replacement has length zero
#But for a data frame, this is just fine.
(data <- data.frame(A = 1, B = 2, Z = 3))
data["Z"] <- NULL
data
```
Incidentally, `anyAtomicVector[index] <- NULL` is also an error. e.g. `LETTERS[22] <- NULL`.
* Sorry, did I say that `namedVector["name"]` works for subsetting?
```{r, echo = TRUE}
a <- diag(3)
colnames(a) <- LETTERS[1:3]
a
a["A"]
a["Z"]
```
Long story short, named atomic vectors make a distinction between names and colnames that data frames do not.
```{r, echo = TRUE}
a <- diag(3)
colnames(a) <- LETTERS[1:3]
colnames(a)
names(a)
names(mtcars)
colnames(mtcars)
identical(names(mtcars), colnames(mtcars))
```
So what happens when you give an atomic vector plain old names rather than colnames? For a non-matrix, it works fine (see the `named <- setNames(letters, LETTERS)` example above). For a matrix - and presumably for any array, but let's not get into that distinction - it's a little bit more complicated. Look closely at this output before reading further.
```{r, echo = TRUE}
a <- diag(3)
(a <- setNames(a, LETTERS[1:3]))
a["A"]
a["Z"]#For a data frame, this would be an error...
```
When you try to give an atomic vector ordinary names, R will only try to name it element-by-element (even if said vector has dimensions). Data frames, on the other hand, treat names as colnames. R ultimately sees named matrices as named atomic vectors that happen to have a second dimension. This means that you can subset them with both `["name"]` and `[, "name"]` and get different results.
```{r, echo = TRUE}
a <- setNames(diag(3), LETTERS[1:3])
colnames(a) <- LETTERS[1:3]
a
a["A"]
a["Z"]
a[, "A"]
#I'd love to show a[, "Z"], but it throws the error "Error in a[, "Z"] : subscript out of bounds".
#This is clearly consistent with a["Z"] and my earlier bits on out-of-bounds stuff not throwing errors.
```
Of course, `["name"]` and `[, "name"]` aren't identical for data frames either, but let's not get back into talking about the `drop` argument. Starting to see what I mean about R being inconsistent?
* You cannot use named atomic vectors to generate environments. This means that awesome tricks like `within(data, remove(columnIDoNotWant, anotherColumn))` work for lists and data frames but not for named atomic vectors.
```{r, echo = TRUE}
#Data frames are fine.
head(within(mtcars, remove("mpg")))
#Named atomic vectors are not.
## > within(setNames(letters, LETTERS), remove("Z"))
## Error in UseMethod("within") :
## no applicable method for 'within' applied to an object of class "character"
```
* When you want to work with the names of named atomic vectors, you probably want to access their names directly and use expressions like `namedVect[!names(namedVect) %in% c("remove", "us")]`.
```{r, echo = TRUE}
namedVect <- setNames(letters, LETTERS)
namedVect[!names(namedVect) %in% c("A", "Z")]
```
However, this is a bad habit for non-atomic vectors because, unless you take the precautions [mentioned earlier](#removing-dimensions), `[` likes to remove duplicated names and unnecessary dimensions from your data.
* Don't think that functional programming will save you from my previous point. The base library's higher-order functions don't play nice with the `names()` function. I think it's got something to do with `lapply()` using `X[[i]]` under the hood (see its documentation).
```{r, echo = TRUE}
namedVect <- setNames(letters, LETTERS)
Filter(function(x) names(x) == "A", namedVect)
head(lapply(namedVect, function(x) names(x) == "A"))
head(sapply(namedVect, function(x) names(x) == "A"))
```
Did you notice that `Filter` and `lapply`'s arguments are in inconsistent orders? A little bit more on that [much later](#the-functions).
From the above few points, you can see that it's hard to find a way to manipulate named atomic vectors by their names that is both safe for them and for other named objects. The only one that comes to mind is to use `[` with the aforementioned precautions. That's bad enough on its own -- it makes R feel unsafe and inconsistent -- but it also makes named atomic vectors feel like an afterthought. I find that most of my code that makes extended use of named atomic vectors comes out looking disturbingly unidiomatic. A little bit more on that [when I talk about matrices](#extended-example-matrices).
### Silence
I've already given a few examples of R either silently doing nothing or silently doing what you don't want. Let's have a few more:
* Again, much of what I've listed in the [Indistinguishable Errors](#indistinguishable-errors) and [Removing Dimensions](#removing-dimensions) sections occurs silently.
* As documented [here](https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Indexing-by-vectors), negative out-of-bounds values are silently disregarded when deleting elements. For example, if you have `x <- 1:10`, then `x[-20]` returns an unmodified version of `x` without warning or error.
```{r, echo = TRUE}
x <- 1:10
x[20]
x
x[-20]
identical(x, x[-20])
```
Given that `x[20]` is `NA` -- a questionable decision in and of itself -- is this the behaviour that you expected?
* Subassigning `NULL` to a column that your data does not have does not give a warning or error. For example, trying to access `mtcars["weight"]` is an error, but `mtcars["weight"] <- NULL` silently does nothing. `$` and `$<-` have the same issue.
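A sketch, using a throwaway data frame so as not to touch `mtcars`:
```{r, echo = TRUE}
d <- data.frame(mpg = 1:2)
## > d["weight"]
## Error in `[.data.frame`(d, "weight") : undefined columns selected
d["weight"] <- NULL #No error, no warning, no effect.
d
```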
* Using `within()` to remove unwanted columns from your data, e.g. `within(data, rm(colName1, colName2))`, does nothing to any columns with duplicated names. Again, no warning or error...
```{r, echo = TRUE}
dupe <- cbind(mtcars, foo = 3, foo = 4)
head(dupe)
head(within(dupe, rm(carb, foo)))
```
By the way, `cbind()` doesn't silently correct duplicated column names. By now, you probably expected otherwise. This is documented behaviour, but I don't think that anyone ever bothered to read the docs for `cbind()`.
* Using `subset()` rather than `within()` is sometimes suggested for operations like what I was trying to do in the previous point. For example, you can remove columns with `subset(data, select = -c(colName1, colName2))`. However, for duplicated names, I'd argue that `subset()` is even weirder than `within()`. With `subset()`, attempting to remove a duplicated column by name will only remove the first such column and removing any non-duplicated column will change the names of your duplicated columns.
```{r, echo = TRUE}
#First, I'll show subset() working as normal and save us some space.
mtcars2 <- subset(mtcars, mpg > 25, select = -c(cyl, disp, hp, wt))
mtcars2
dupe <- cbind(mtcars2, foo = 3, foo = 4, foo = 5)
dupe
subset(dupe, select = -foo)#Names have silently changed and only one foo was dropped.
subset(dupe, select = -c(foo, foo))#Identical to previous.
subset(dupe, select = -carb)#Foo's names have silently changed, despite us not touching foo!
subset(dupe, select = -c(carb, foo))#Names have silently changed and only one foo was dropped.
```
I think that the worst example here is `subset(dupe, select = -carb)`. I didn't touch `foo`, so why change it? I'd rather have `within()`'s silent inaction than `subset()`'s silent sabotage.
Needless to say, there will be more examples of R silently misbehaving later on in this document. This was just a good place to throw in a few that are specific to subsetting.
### Subsetting by Predicates
This should be easy, shouldn't it? Go through the data and only give me the bits that have the property that I'm asking for. What could possibly go wrong? Turns out, it's quite a lot. Even predicates as simple as "*does the element equal `x`?*" are a minefield. I understand why these examples are the way that they are -- really, I do -- but how to delete unwanted elements is one of the first things that you're going to want to learn in a stats language. For something that you're going to want to be able to do on day one of using R, there are far too many pitfalls.
* You might think that `setdiff()` is sufficient for removing data -- it's certainly the first tool that a mathematician would reach for -- but it has the side-effect of removing duplicate entries from the original vector and destroying your data structures by applying `as.vector()` to them.
```{r, echo = TRUE}
Nile
setdiff(Nile, 1160)#Not a time series any more.
setdiff(Nile, 0)#Hey, where did the other 1160s go?
```
It's safer when you're dealing with names, e.g. `data[setdiff(names(data), "nameOfThingToDelete")]`
```{r, echo = TRUE}
head(mtcars)
head(mtcars[setdiff(names(mtcars), "wt")])
```
but anything that's only sometimes safe doesn't fill me with confidence.
* Because `which()` is an extremely intuitive function for extracting/changing subsets of your data and for dealing with missing values (see [*The R Inferno*](https://www.burns-stat.com/pages/Tutor/R_inferno.pdf), section 8.1.12), it is one of the first things that a beginner will learn about. However, although your intuition is screaming for you to do it, you almost never want to use `data <- data[-which(data==thingToDelete)]`. When `which()` finds no matches, it evaluates to something of length 0. This makes `data[-which(data==thingToDelete)]` also return something of length 0, deleting your data.
```{r, echo = TRUE}
Nile
Nile[-which(Nile==1160)]#This is fine.
which(Nile==11600)
Nile[-which(Nile==11600)]#This is not.
```
What you probably expected was `which()` leaving your data unchanged when it has not found a match. You might also have expected a warning or error, but surely you've learned your lesson by now? Anyway, section 8.1.13 of [*The R Inferno*](https://www.burns-stat.com/pages/Tutor/R_inferno.pdf) offers some ways to get this behaviour, but the only practical-looking suggestion is `data[!(data %in% thingToDelete)]`. I think that you can get away with removing the round brackets there.
```{r, echo = TRUE}
Nile[!Nile %in% 1160]
Nile[!Nile %in% 11600]
```
That's mostly okay. However, `identical(Nile, Nile[!Nile %in% 11600])` is `FALSE`. Can you guess why? It's like R has no always safe ways to subset.
* At least removing elements that are equal to a particular number is simple for vectors. Even for lists, it's just `data[data!=x]`. It's maybe not what a beginner would guess ("*I have to write `data` twice?*"), but it's simple enough.
* For removing a vector from a list of vectors, you're going to want to learn some functional programming idioms. Not hard if you're a programmer, but shouldn't this be easier in a stats and maths tool? Anyway, you probably want `Filter(function(x) all(x!=vectorToDelete), data)`. You can also do it with the apply family, but I don't see why you would.
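A sketch of that idiom:
```{r, echo = TRUE}
vecs <- list(1:3, 4:6, 7:9)
Filter(function(x) all(x != 1:3), vecs) #Drops the 1:3 element.
```
Note that `x != vectorToDelete` itself recycles, so this comparison is only trustworthy when all of the vectors share a length; `!identical(x, vectorToDelete)` avoids that.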
* Removing what you don't want from a data frame largely comes down to mastering the subsetting rules, a nightmare that I've spent the previous few thousand words covering. I often end up with very ugly lines like `outcomes[outcomes$playerChoice == playerChoice & outcomes$computerChoice == computerChoice, "outcome"]`
* Before you ask, `subset()`, `with()`, and `within()` aren't good enough either. I've already mentioned some of their issues, but [more on them later](#non-standard-evaluation).
Overall, it's like R has no safe ways to subset. What is safe for one job is often either unsafe, invalid, or inconsistent with another. R's huge set of subsetting tools is useful -- maybe even good -- once mastered, but until then you're forced to adopt a guess-and-check style of programming and pray that you get a useful error/warning message when you get something wrong. Worse still, these prayers are rarely answered and, in the cases where R silently does something that you didn't want, they're outright mocked. Do you understand how damning that is for a stats language? I can't stress this point enough. Subsetting in R should be easy and intuitive. Instead, it's something that I've managed to produce thousands of words of complaints about and it still trips me up with [alarming regularity](https://github.com/ReeceGoding/Frustration-One-Year-With-R/issues/3), despite my clear knowledge of the correct way to do things. If I want a vector of consonants, you can bet that I'm going to write `letters[-c("a", "e", "i", "o", "u")]`, `letters[-which(letters == c("a", "e", "i", "o", "u"))]`, and `letters[c("a", "e", "i", "o", "u") %in% letters]` before remembering the right way to do it. If I'm still making those mistakes for something simple, then I can only imagine what it's like for a true beginner doing something complicated.
## Vectorization Again
[You've heard the good](#vectorization), now for the bad. R's vectorization is probably the best thing about the language and it will work miracles when you're doing mathematics. However, it will trip you up in other areas. A lot of these points are minor, but when they cause you problems their source can be tough to track down. This is because R is working as intended and therefore not giving you any warnings or errors (spotting a pattern?). Furthermore, if you have correctly identified that you have a vectorization problem, then pretty much any function in R could be to blame, because most of R's functions are vectorized.
* The commonality of vectors leads to some new syntax that must be memorised. For example, `if(x|y)` and `if(x||y)` are very different and using `&&` rather than `&` can be fatal. Compare the following:
```{r, echo = TRUE}
mtcars[mtcars$mpg < 20 && mtcars$hp > 150,]
mtcars[mtcars$mpg < 20 & mtcars$hp > 150,]
```
Personally, I find that it's easy to remember to use `&&` for `if` but I often forget to use `&` for subsetting. It looks like a future version is going to make `||` and `&&` throw warnings when given vectors of length greater than one; newer releases of R have since made this an error.
* The `if` statements accept vectors of length greater than 1 as their predicate, but will only pay attention to the very first element. This throws a warning and there is a global option to make it an error instead, but I can't see why R accepts such predicates at all. Why would I ever use `if(c(TRUE, FALSE))` to mean "*if the first element of my vector is true, then...*"? This is also what the `&&` and `||` syntax is for (e.g. `c(TRUE, FALSE) && c(TRUE, FALSE)` is `TRUE`), but I still don't see why anyone would use several logical vectors and only be interested in their first elements.
* It appears that newer versions of R have done something about this. Specifically, the warning has been replaced with an error.
* When dealing with anything 2D, you need to be very careful to not mix up any of `length()`, `lengths()`, `nrow()`, or `ncol()`. In particular, `length()` is so inconsistent that I'm unsure why they let it work for 2D structures ([probably something to do with it being an internal generic](#internal-generics)). For example, the length of a data frame is its number of columns and the length of a matrix is its number of elements.
```{r, echo = TRUE}
(a <- diag(4))
(b <- as.data.frame(a))
length(a)
length(b)
```
* Vectors are collections and therefore inherit the previous section's issues about selecting elements.
* Because virtually everything is already a vector, you never know what to use when you want a collection or anything nested. Lists? Arrays? `c()`? Data frames? One of `cbind()`/`rbind()`? Matrices? You get used to it eventually, but it takes a while to understand the differences.
* Some functions are vectorized in such a way that you're forced to remember the difference between how they behave for n length-one vectors and how they behave for the corresponding single vector of length n. For example, `paste("Alice", "Bob", "Charlie")` is not the same as `paste(c("Alice", "Bob", "Charlie"))`.
```{r, echo = TRUE}
paste("Alice", "Bob", "Charlie")
paste(c("Alice", "Bob", "Charlie"))
paste("Alice", "Bob", "Charlie", collapse = "")
paste(c("Alice", "Bob", "Charlie"), collapse = "")
```
I'm not saying that this doesn't make sense, but it is a source of unpredictability.
* Another unpredictable example: What does `max(100:200, 250:350, 276)` return? You might be surprised to discover that the output is the single number `350`, rather than a vector of many outputs.
```{r, echo = TRUE}
max(100:200, 250:350, 276)
```
The fix for this isn't some `collapse`-like argument like it is for `paste()`, it's an entirely different function: `pmax()`. Why?
```{r, echo = TRUE}
pmax(100:200, 250:350, 276)
```
* A further annoyance comes from how many things behave differently on vectors of length one. For example, `sample(1:5)` is exactly the same as `sample(5)`, which is bound to give you bugs when you use `sample(5:n)` for changing `n`.
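A sketch of that trap, along with the `resample()` helper that `?sample` itself suggests as the fix:
```{r, echo = TRUE}
sample(5:6) #Permutes the two values, as you'd expect.
sample(5:5) #Surprise: this is sample(5), a permutation of 1:5.
resample <- function(x, ...) x[sample.int(length(x), ...)]
resample(5:5) #Always just 5.
```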
* R has rules for recycling vector elements when you try to get it to do something with several vectors that don't all have the same length. You saw this abused when I gave the `x <- paste0(rep("", 100), c("", "", "Fizz"), c("", "", "", "", "Buzz"))` FizzBuzz example. When recycling occurs, R only throws a warning if the longest vector's length is not a multiple of the others. For example, neither `Map(sum, 1:6, 1:3)` nor that FizzBuzz line warn you that recycling has occurred, but `Map(sum, 1:6, 1:4)` will.
```{r, echo = TRUE}
Map(sum, 1:6, 1:3)
Map(sum, 1:6, 1:4)
```
The first case -- where no warnings are given -- can be an unexpected source of major error. [The authors of the Tidyverse seem to agree with me](https://r4ds.had.co.nz/vectors.html#scalars-and-recycling-rules). For example, you're only allowed to recycle vectors of length 1 when constructing a tibble, so `tibble(1:4, 1:2)` will throw a clear error message whereas `data.frame(1:4, 1:2)` silently recycles the second argument. Similarly, `map2(1:6, 1:3, sum)` is an error, but `map2(1:6, 1, sum)` is not.
```{r, echo = TRUE}
library(tibble)
## > tibble(1:4, 1:2)
## Error: Tibble columns must have compatible sizes.
## * Size 4: Existing data.
## * Size 2: Column at position 2.
## ℹ Only values of size one are recycled.
## Run `rlang::last_error()` to see where the error occurred.
data.frame(1:4, 1:2)
library(purrr)
## > map2(1:6, 1:3, sum)
## Error: Mapped vectors must have consistent lengths:
## * `.x` has length 6
## * `.y` has length 3
map2(1:6, 1, sum)
```
* Section 8.1.6 of [*The R Inferno*](https://www.burns-stat.com/pages/Tutor/R_inferno.pdf): The recycling of vectors lets you attempt to do things that look correct to a novice and make sense to a master, but are almost certainly not what was wanted. For example, `c(4, 6) == 1:10` is `TRUE` only in its sixth element. The recycling rules turn it into `c(4, 6, 4, 6, 4, 6, 4, 6, 4, 6) == 1:10`. Again, there is no warning given to the user unless the longest vector's length is not a multiple of the other's. In this case, what you wanted was probably `c(4, 6) %in% 1:10`, maybe with a call to `all()`.
```{r, echo = TRUE}
c(4, 6) == 1:10
c(4, 6, 4, 6, 4, 6, 4, 6, 4, 6) == 1:10
c(4, 6) %in% 1:10
all(c(4, 6) %in% 1:10)
```
* Some functions don't recycle in the way that you would expect. For example, read the documentation for `strsplit()` and ask yourself if you expect `strsplit("Alice", c("l", "c"))` and `strsplit("Alice", "l")` to give the same output. If you think that they don't, you're wrong. If you expected the first option to warn you about the `"c"` part not being used, you're sane, but wrong. If you want to see how the second argument is supposed to work, re-run the earlier code with `c("Alice", "Boblice")` as your first argument.
```{r, echo = TRUE}
strsplit("Alice", c("l", "c"))
strsplit("Alice", "l")
strsplit(c("Alice", "Boblice"), c("l", "c"))
```
* Remember what I said about needing to generate the correct logical vector when you want to subset a collection? Logical vectors are also recycled when subsetting collections. Because this vector recycling does not always throw warnings or errors, it's a new Hell. I'm honestly not sure if the exact rules for when this does/doesn't throw warnings/errors are documented anywhere. [The language definition](https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Indexing-by-vectors) claims that using a logical vector to subset a longer vector follows the same rules as when you're using two such vectors for arithmetic (i.e. you get a warning if the larger of the two's length isn't a multiple of the smaller's). However, I know this to be false.
```{r, echo = TRUE}
a <- 1:10
a + rep(1, 9) #Arithmetic; Gives a warning.
a[rep(TRUE, 9)] #Logical subsetting; 10 results without warning.
a[c(TRUE, FALSE, TRUE)] #Again, 10 results. Shouldn't it be either 10 with a warning or just 3?
```
I'll take this chance to repeat my claim that this is extremely powerful if used correctly, but the potential for errors slipping through unnoticed is huge. This toy example isn't so bad, but wait until these errors creep into your dataset with 50 rows and columns, leaving you with no idea where it all went wrong. The first time this really caught me out was when I used the same logical vector for two similar datasets of slightly different sizes. I had hoped that if anything went wrong, I'd get an error. Because I didn't, I continued on without knowing that half of my data was now ruined.
* Logical vectors also recycle `NA` without warning. I can't point to any documentation that contradicts this, but it will always catch you off guard. On the bright side, this is consistent with the addition and subsetting rules for numeric vectors with `NA`s.
```{r, echo = TRUE}
arithmetic <- c(2, NA)
arithmetic + c(11, 12, 13, 14) #Keeps NA and recycles.
logic <- c(TRUE, FALSE, TRUE, NA)
LETTERS[logic]
LETTERS[arithmetic] #Keeps NA and recycling is not expected.
```
* You sometimes have to tell R that you wanted to work on the entire vector rather than its elements. For example, `rep(matrix(1:4, nrow = 2, ncol = 2), 5)` will not repeat the matrix 5 times, it will repeat its elements 5 times. The fix is to use `rep(list(matrix(1:4, nrow = 2, ncol = 2)), 5)` instead.
```{r, echo = TRUE}
m <- matrix(1:4, nrow = 2, ncol = 2)
rep(m, 5)
rep(list(m), 5)
```
Similarly, you might think that `vect %in% listOfVectors` will work, but it will instead check if the elements of `vect` are elements of `listOfVectors`. Again, the solution is to wrap the vector in a list. For example, you want `list(1:4) %in% list(5:10, 10:15, 1:4)` not `1:4 %in% list(5:10, 10:15, 1:4)`.
```{r, echo = TRUE}
list(1:4) %in% list(5:10, 10:15, 1:4)
1:4 %in% list(5:10, 10:15, 1:4)
```
You might be surprised that the last result was entirely `FALSE`. After all, some of `1:4` is in the last element of the list. I'll leave that one to you.
Again, for the most part, these aren't major issues. I don't particularly like the inconsistency between functions like `paste()` and `max()`, but the only true minefield is the vector recycling rules. When they silently do things that you don't want, you're screwed.
## R Won't Help You
R makes no secret of being essentially half a century of patches for S. Many things disagree with each other, lack any clear conventions, or are just plain bad, but show no signs of changing. Because so many packages depend on these inconsistencies, I don't think that they will ever be removed from base R. R could be salvaged if its means of helping you manage the inconsistency were up to scratch -- e.g. the documentation, the function/argument names, or the warning/error messages -- but they're not. It's therefore hard to guess about anything or to help yourself when you've guessed wrong. These sound like minor complaints, but R can be so poor in these regards that it becomes a deal-breaker for the entire language. If there's one thing that will make you quit R forever, it's this. It may sound like I'm being harsh, but I'm not alone in saying it. Both *Advanced R* and [*The R Inferno*](https://www.burns-stat.com/pages/Tutor/R_inferno.pdf) can barely go a section without pointing out an inconsistency in R.
Really, this is R's biggest issue. You can get used to the arcane laws powering R's subsetting and vectorization, the abnormalities of its variable manipulations, and its tendency to do dangerous things without warning you. However, this is the one thing that you can never learn to live with. R is openly, dangerously, and eternally inconsistent and also does a poor job of helping you live with that. In the very worst cases, you can't find the relevant documentation, the thing that's conceptually close to what you're after doesn't link to it, the examples are as poor as they are few, the documentation is simultaneously incomplete and filled with irrelevant information while assuming familiarity with something alien, the error messages don't tell you what line threw the errors that your inevitable misunderstandings caused, the dissimilarity between what you're working with and the rest of the language makes it impossible to guess where you've slipped up, there's undocumented behaviour that you need to look at the C code to discover, and you know that none of this will ever be fixed!
These issues tend to overlap, but I've done my best to split this up into sections that cover each aspect of this problem. All in all, this section came out to be shorter than I expected. However, I hope that I have made the magnitude of some of these points clear.
### The Documentation
If R had outstanding documentation, then I could live with its inconsistencies. Sadly, it doesn't. The documentation does almost nothing to help you in this regard and has more than its fair share of issues:
* Some of the docs are ancient and therefore have examples that are either terrible, few in number, or non-existent. The references in these docs suggest that this is a disease inherited from S, but sometimes it's really unforgivable:
* The Examples section in the documentation for control flow shows no examples of how to use `if`, `while`, `repeat`, `break`, or `next`. They're explained in the actual text, but I expect the Examples section to give examples!
* The documentation for the list data type has five examples, many of which are for the other seven functions that it shares its documentation with, despite it being an absolutely fundamental data type. For some reason, that documentation also includes a library call. And yes, that does mean that some of the functions don't have examples.
* For stats functions, we typically see documentation for a set of algorithms. However, said documentation will often have far fewer examples than there are members in said set. The `quantile()` function's docs are an extreme example of this. A similar sin can be found in the docs for `lm()` and `glm()`. However, their See Also sections link to a lot of functions that use them in their own examples, so I can just barely forgive this.
The Tidyverse seems to be far better in this regard, with the examples often taking up almost as much room as the actual documentation. However, I don't like how its docs often don't have a Value section, like a lot of base R's docs do.
* Some of the docs have no examples at all e.g. `UseMethod()`, `vcov()`, and `xtfrm()`.
* Some of the docs will document many seemingly identical things and not tell you how they differ. For example, can you tell from the documentation if there's a difference between `rm()` and `remove()`? An even worse case is trying to figure out the difference between `resid()` and `residuals()`. The documentation correctly tells you that one is an alias for another, but then it tells you that `resid()` is intended to encourage you to not do a certain thing. This implies that `residuals()` does not have that same intention, incorrectly hinting that they might have different behaviour.
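For what it's worth, a quick sketch (the variable names are mine) suggesting that they really do behave identically:

```{r, echo = TRUE}
x <- 1
y <- 2
rm(x)      #Removes x from the current environment.
remove(y)  #Does exactly the same thing to y.
exists("x")
exists("y")
```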
* In some of the standard libraries, you can find functions without any documentation. For example, `MASS::as.fraction()` is totally undocumented.
* The [*R Language Definition*](https://cran.r-project.org/doc/manuals/r-release/R-lang.html) is incomplete. I imagine that this will really bother some people on principle alone. Personally, I would be satisfied if it were incomplete in the sense of "*each section is complete and correct, but the document is missing many key sections*". However, it's really more like a rough draft. It has [sentences that stop mid-word](https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Inheritance), [prompts for where to write something later](https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Propagation-of-names), and lots of information that is either clearly incomplete or very out of date.
* A lot of R's base functions are not written in R, so if you really want to understand how an R function works, you need to learn an extra language. I find that a lot of the power users have gotten used to reading the C source code for a lot of R. That wouldn't be so bad, but...
* For a long time, I didn't know why many of my technical questions on Stack Overflow were answered by direct reference to R's code, without any mention of its documentation. I eventually learned that R's functions occasionally have undocumented behaviour, meaning that you can't trust anything other than the code. For example:
* Where do the docs tell you that the `expr` argument in `replicate()` gets wrapped in an anonymous function, meaning that you can't use it to do `<-` variable assignment to its calling environment (e.g. code like `n <- 0; replicate(5, n <- n + 1)` does not change `n`)? You might just spot this if you check the R code, but even then it's not clear.
```{r, echo = TRUE}
replicate
```
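To see the consequence directly (the counter `n` is my own toy example):

```{r, echo = TRUE}
n <- 0
replicate(5, n <- n + 1) #Each assignment happens inside the hidden anonymous function.
n #Still 0.
```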
* Where do `rep()`'s docs tell you that it's a special kind of generic where your extensions to it won't dispatch properly? Even the R code -- `function (x, ...) .Primitive("rep")` -- won't help you here.
* Where do `lapply()` and `Filter()`'s docs tell you that they don't play nice with the `names()` function? Again, even the R code won't help here.
```{r, echo = TRUE}
lapply
```
You can sometimes find parts of the documentation that very vaguely hint to this misbehaviour, but such things are rarely specific or said at a non-expert level: Their meaning is only obvious in retrospect. On the rare occasion that the documentation is specific about the misbehaviour, it can be incomplete. For example, the documentation for `choose()` tells you that it behaves differently for small `k`, but what is "*small `k`*"? I think that it's 29 or less, but that assumes that I've found the correct C code (I think it's [this](https://github.com/wch/r-source/blob/trunk/src/nmath/choose.c)?) and read it correctly.
* In the same vein as the `choose()` example, functions in the base stats library do not always tell you which calculation method they used. This can make you falsely assume that a figure was calculated exactly. For example, `prop.test()` computes an approximation, but the only mention of this in its documentation is the See Also section saying "*`binom.test()` for an exact test of a binomial hypothesis*". Not only is this in a terrible place, it only suggests that an approximation has been used in `prop.test()`. The details of the approximation are left for the reader to guess.
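As a sketch of why this matters (42 successes out of 100 is an arbitrary example of mine), the approximate and exact tests give different p-values for the same data:

```{r, echo = TRUE}
prop.test(42, 100)$p.value #Normal approximation with continuity correction.
binom.test(42, 100)$p.value #Exact binomial test.
```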
* Some functions act very strangely because they're designed with S compatibility in mind. This issue goes on to damage the documentation for said functions. For example, have a look at the docs for the `seq()` function. It won't tell you what `seq_along()` does, but it will tell you what to use `seq_along()` instead of! I'll let [Stack Overflow](https://stackoverflow.com/questions/59378862/) explain `seq.int()`'s documentation issues. Said documentation is so poor that I've been scared out of using the function. I really don't know why R pays this price: Who is still using S? Another example is the `**` operator. I'll let the Arithmetic Operators documentation (try `?'**'`) speak for itself on that one. Its three sentences on the topic are `**`'s only documentation. Given that you shouldn't use it, it would be harsh for me to say more. For further reading, I will only give [this](https://rosettacode.org/wiki/Exponentiation_order#R).
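If you're curious, a tiny sketch shows that the parser silently rewrites `**` as `^`:

```{r, echo = TRUE}
2 ** 3
quote(2 ** 3) #The parser has already turned it into 2^3.
```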
* As the previous example shows, backwards compatibility is a priority for R. This means that its inconsistencies will almost certainly never be fixed. Things would be better if the docs did a better job of helping you, but this section demonstrates ad nauseam that they do not. One wonders if there's ever been any real interest in fixing it.
* Some docs assume stats knowledge even when there should be no need to. If you don't know what "*sweeping out*" is, you will never understand the docs for `sweep()`. I find `rmultinom()`'s docs to be similarly lacking. It talks about "*the typical multinomial experiment*" as if you'll know what that is. Its Details section tells you the mathematical technicalities, but if I wanted that then I would've gone to Wikipedia. All that they had to do was give an example about biased dice and that would've told the reader all that they will need to know. A similar case can be made about `rbinom()`, but I can forgive that on the grounds of "*who uses R without knowing at least that much stats?*".
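For the record, "*sweeping out*" just means combining each row or column of a matrix with a summary statistic. A minimal sketch (the matrix is my own example):

```{r, echo = TRUE}
m <- matrix(1:6, nrow = 2)
sweep(m, MARGIN = 2, STATS = colMeans(m)) #Subtracts each column's mean from that column.
```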
* The docs often do a bad job of linking to other relevant functions. For example, `match()`'s doesn't tell you about `Position()`, `subset()`, `which()`, or the various grep things, `mapply()`'s doesn't tell you about `Map()`, and `rbinom()`'s doesn't tell you about `rmultinom()`.
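As a sketch of how closely one of those pairs is related, `Map()` behaves like `mapply()` with `SIMPLIFY = FALSE`:

```{r, echo = TRUE}
mapply(`+`, 1:3, 4:6) #Simplifies to a vector.
Map(`+`, 1:3, 4:6) #Always returns a list.
identical(Map(`+`, 1:3, 4:6), mapply(`+`, 1:3, 4:6, SIMPLIFY = FALSE))
```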
* I sometimes can't understand how to search for functions in the documentation. For example, `Filter()`'s docs are in the "*funprog {base}*" category, but putting `?funprog` into R won't return those docs. Another oddity is that it's sometimes case sensitive. For example, `?Extract` works but `?extract` doesn't. In case you missed it, there is no `Extract()` or `extract()` function.
* I find that the documentation tries to cover too many functions at once. For example, in order to understand any particular function in the funprog or grep documentation, you're probably going to have to go as far as understanding all of them. The worst case is the Condition Handling and Recovery documentation (`?tryCatch`), which lists about 30 functions, forever dooming me to never really understand any more of R's exception system than `stop()` and `stopifnot()`. A much smaller example is that both `abs()` and `sqrt()` are documented in the same place, despite barely having anything in common and not sharing this documentation with anything else. This issue also compromises the quality of the examples that are given. For example, the funprog documentation gives no examples of how to use `Map()`, `Find()`, or `Position()`, something that never would have happened if they were alone in their own documentation pages. Then again, `which()` and `arrayInd()` are the only functions in their documentation, and `arrayInd()` has no examples, so maybe I'm giving R too much credit. After all, like I hinted at earlier, even totally fundamental stuff like lists have more functions in their documentation than examples.
* The docs sometimes spend a distracting amount of time comparing their subjects to other languages that you might not know. The best example is the funprog docs, which are needlessly cluttered with mentions of Common Lisp. A close second to this is the documentation for pairlists, which even in [the language definition](https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Pairlist-objects) have little more description than "*Pairlist objects are similar to Lisp’s dotted-pair lists*". My favourite example is probably "*regexpr and gregexpr with perl = TRUE allow Python-style named captures*", if only because it manages to mention two languages in a totally unexpected way. I should also mention that I've already complained about how some functions are so obsessed with S compatibility that both their documentation and functionality are compromised. As a final but forgivable case, `sprintf()` is deliberately about C-style stuff and therefore never shuts up about C, making the R documentation pretty difficult for anyone who doesn't know C.
* If pairlists are not really intended for use by normal users, why are they documented in the exact same place as normal lists, which are critical to normal R usage?
* Guidelines for unusual operators, such as using `[` as a function, are rather hard to find in the documentation. One example that I found particularly annoying is in the `names()` documentation. It can't make its mind up about whether it wants to talk about the `names(x) <- value` version or the `"names<-"(x, value)` version. The only place where it's apparent that there's a meaningful difference between the two is in the second part of the Values section, which says:
* "*For `names<-`, the updated object. (Note that the value of `names(x) <- value` is that of the assignment, `value`, not the return value from the left-hand side.)*"
...Wasn't that helpful? You'll only really be able to understand it if you understand the abstract notion of R's replacement functions, but nothing in the `names()` documentation points you to that. In fact, unless you find the correct section of the language definition, you're never going to find it at all (I'm not linking to that, go prove my point and find it yourself!).
Don't get me wrong, R's documentation isn't terrible. Its primary issue is that it does a poor job of helping you navigate R's inconsistencies. If the examples were plentiful and the docs for each function linked to plenty of other related functions without themselves being cluttered with mentions of other functions and languages, then it would go a long way towards stopping R from tripping people up.
### The Functions
There are several inconsistencies in R's functions and how you use them. This means that you either have to adopt a guess-and-check style of coding or constantly double-check the documentation before using a lot of R's functions. Neither are satisfactory.
* There are a few too many functions that have names synonymous with "*do more than once*". There's `replicate()`, `repeat` loops, and `rep()`. Good luck remembering which does what.
* Why do we have both `structure()` and `str()` or `seq()` and `sequence()`, all of which are different, while having `rm()`/`remove()` and `residuals()`/`resid()`, which are not? The potential for confusion is obvious: If I were to write a new function, `Pos()`, should you or should you not assume that it's an alias for `Position()`?
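A quick sketch of how different these pairs are (the class name is made up):

```{r, echo = TRUE}
str(1:3) #Prints a compact summary of an object's structure.
structure(1:3, class = "myClass") #Attaches attributes to an object.
seq(2, 3)
sequence(c(2, 3)) #Concatenates seq_len(2) and seq_len(3).
```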
* There is no consistent convention for function names in the base libraries, even for related functions. I struggle to think of a function-naming scheme that isn't found somewhere in R. For example, the documentation for `mean()` links to both `colMeans()` and `weighted.mean()`. Similarly, the `seq()` documentation contains both `seq.int()` and `seq_len()`. I also don't like how there's both `readline()` and `readLines()` or `nrow()` and `NROW()`. Or how about `all.equal()` and `anyDuplicated()`? There's even all of those functions with leading capitals like `Vectorize()` or the funprog stuff. I could go on...
* The above issue gets even worse if we discuss functions that you'd expect to exist but don't. For example, we have `write()` but not `read()` (the equivalent is probably `scan()`).
* Argument names are also inconsistent. Most of the apply family calls its function argument `FUN`, but `rapply()` and the funprog stuff use `f`.
* Even when argument names are consistent, their behaviour may not be. For example, `complex(real = 1, imaginary = 2, length.out = 0)` and `rep_len(complex(real = 1, imaginary = 2), length.out = 0)` do not have the same return value. If you ask me, it's `complex()` that has the wrong behaviour here. I can't see anywhere in its documentation mentioning that the other arguments can overrule the `length.out = 0` argument and give you vectors larger than what you asked for. At least throw a warning!
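A minimal sketch of the mismatch:

```{r, echo = TRUE}
length(complex(real = 1, imaginary = 2, length.out = 0)) #Asked for 0, got 1.
length(rep_len(complex(real = 1, imaginary = 2), length.out = 0)) #0, as asked.
```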
* Related functions sometimes expect their arguments to be given in a different order. For example, except for `mapply()`, the entire apply family wants the data to come before the function, whereas all of the funprog functions (e.g. `Map()`, `Filter()`, etc), want the reverse. When you realise that you picked the wrong function for a job, this makes rewriting your code infuriating.
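A small sketch of the reversal:

```{r, echo = TRUE}
sapply(1:3, sqrt) #The apply family: data first, then the function.
Map(sqrt, 1:3) #The funprog functions: function first, then the data.
```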
* Functions that should be related in theory are not always related in practice. For example, `subset()` is not documented with the Set Operations (`union()`, `setdiff()`, etc) and works on completely different principles. The Set Operations are the extremely dangerous functions that remove duplicates from their inputs and apply `as.vector()` to them. The `subset()` function is a non-standard evaluation tool like `within()`, making it completely different and [dangerous in a different way](#non-standard-evaluation). Finally, despite it being documented with the Set Operations, none of these warnings apply for `is.element()`. I regret every time that I wrote off someone's advice to use `subset()` because of my (entirely reasonable!) assumption that it would be a (dangerous) Set Operation.
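A sketch of the silent duplicate removal that makes the Set Operations so dangerous:

```{r, echo = TRUE}
union(c(1, 1, 2), numeric(0)) #The duplicate is silently dropped.
intersect(c(1, 1, 2), c(1, 2)) #Same story.
```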
* Functions with related names sometimes have different effects. For example, here is a damning quote from [section 3.2.4 of *Advanced R*](https://adv-r.hadley.nz/vectors-chap.html#testing-and-coercion):
* "*Generally, you can test if a vector is of a given type with an `is.*()` function, but these functions need to be used with care. `is.logical()`, `is.integer()`, `is.double()`, and `is.character()` do what you might expect: they test if a vector is a character, double, integer, or logical. Avoid `is.vector()`, `is.atomic()`, and `is.numeric()`: they don’t test if you have a vector, atomic vector, or numeric vector; you’ll need to carefully read the documentation to figure out what they actually do.*"
Another example is that `any()`, `all()`, and `identical()` are all predicate functions, but `all.equal()` and `anyDuplicated()` are definitely not.
* Similar to the above, from [the solutions to *Advanced R*](https://advanced-r-solutions.rbind.io/vectors.html#lists):
* "*Note that `as.vector()` and `is.vector()` use different definitions of "vector!"*".
The above quote is then followed by showing that `is.vector(as.vector(mtcars))` returns `FALSE`. I've found similar issues with `as.matrix()` and `is.matrix()`.
* The language can't really decide if it wants you to be using lambdas. The apply family has arguments like `...` and `MoreArgs` to make it so you don't always have to do so, but the funprog stuff gives you no such choice. I almost always find that I want the lambdas, so the apply family's tools to help you avoid them only serve to complicate the documentation.
* As an enjoyable example of how these inconsistencies can ruin your time with R, read the documentation for `Vectorize()`. It's packed with tips for avoiding these pitfalls.
### Extended Example: Matrices
Let's talk about matrices. I've already discussed some oddities like how functions like `[`, `$` and `length()` treat them in ways that seem inconsistent with either the rest of the language or your expectations, but let's go deeper:
* [As covered earlier](#named-atomic-vectors), matrices want to have rownames and colnames rather than names. This gives us a few more inconsistencies to deal with that I didn't mention at the time. The rest of the language has trained you to use `setNames(data, names)`. When you do this, `data` is returned with its column names changed without any changes to `data`. However, matrices want `colnames(data) <- names` and the obvious equivalent for `rownames()`. This modifies `data` and does not return it.
```{r, echo = TRUE}
a <- b <- diag(3)
(colnames(a) <- c("I", "Return", "Me"))
a #Changed
setNames(b, c("I", "Return", "b"))
b #Not changed
```
Not only are the function names inconsistent (why not `colNames()`?), the syntax is wildly so. Also, take a look at the incomprehensible error message that `colnames()` gives if you use `diag(3)` directly rather than assigning it to a variable beforehand.
```{r, echo = TRUE}
a <- diag(3)
colnames(a) <- c("Not", "A", "Problem")
## > colnames(diag(3)) <- c("Big", "Bad", "Bug")
## Error in colnames(diag(3)) <- c("Big", "Bad", "Bug") :
## target of assignment expands to non-language object
## > colnames(a <- diag(3)) <- c("Has", "Similar", "Problem")
## Error in colnames(a <- diag(3)) <- c("Has", "Similar", "Problem") :
## object 'a' not found
```
`setNames()` has no such issue.
```{r, echo = TRUE}
setNames(diag(3), c("Works", "Just", "Fine"))
setNames(a <- diag(3), c("Works", "Just", "Fine"))
```
In truth, I don't mind either `colnames()` or `setNames()`. I just wish that R would pick one way of handling names and stick to it.
* Unlike anything else in R that I can think of, matrices are happy to let you work by row and even have dedicated functions for it, with `rowSums()` and `apply(..., MARGIN = 1)` being the obvious examples. There is a good reason for this difference -- matrices are always one type, unlike things like data frames -- but it's still an inconsistency. This inconsistency leads to code that is tough to justify. For instance, I frequently find that I want to treat the output of `expand.grid()` as a matrix. `unique(t(apply(expand.grid(1:4, 1:4, 1:4, 1:4), 1, sort)))` is one of my recent examples. This isn't too bad, but I honestly have no idea why I needed the `t()`. Experience has taught me not to question it, which is pretty bad in and of itself. R's inconsistency eventually makes you either fall into the habit of not questioning sudden transformations of your data or forces you to become completely paralysed when trying to understand what ought to be trivial operations in your code. Doubts like "*is there really no better way? R is supposed to be good with this sort of stuff*" become frequent when wanting to work by row.
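To illustrate the dedicated row-wise tools (the matrix is my own toy example):

```{r, echo = TRUE}
m <- matrix(1:6, nrow = 2)
rowSums(m)
apply(m, MARGIN = 1, sum) #The general row-wise workhorse.
```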
* March 2023 update: I don't think that there's anywhere in R's docs that tells you that `expand.grid()` is used to make Cartesian products. This compares poorly to languages such as Racket, which calls the practically equivalent function `cartesian-product`. Similarly, Python just calls it `product`. In both cases, they return a collection of lists/tuples where each list/tuple would be a row in `expand.grid()`'s data frame. Nested collections aren't the easiest things to deal with, but they seem to produce more intuitive code than the messes that you get from trying to treat a data frame like a matrix. Maybe the lesson here is that you should iterate through a data frame's rows the hard way rather than using functions that let you think of it as a matrix? I think that this explains the doubt that I've mentioned above; R's data structures have a habit of making the right way the hard way.
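For anyone who hasn't met it, a minimal sketch of `expand.grid()` as a Cartesian product:

```{r, echo = TRUE}
expand.grid(x = 1:2, y = c("a", "b")) #One combination per row.
```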
* So what happens if, when manipulating a matrix, you write the `sapply()` that the rest of the language has taught you to expect? At best, it gets treated like a vector in column-order.
```{r, echo = TRUE}
(mat <- matrix(1:9, nrow = 3, byrow = TRUE))
sapply(mat, max)
```
At worst, it doesn't do anything like what you wanted.
```{r, echo = TRUE}
mat <- matrix(1:9, nrow = 3, byrow = TRUE)
sapply(mat, sum)
```
The trick for avoiding this is to use numbers as your data argument and let subsetting be the function.
```{r, echo = TRUE}
mat <- matrix(1:9, nrow = 3, byrow = TRUE)
sapply(1:3, function(x) sum(mat[x, ]))
sapply(1:3, function(x) max(mat[x, ]))
```
Better yet, just use `apply()`.
```{r, echo = TRUE}
mat <- matrix(1:9, nrow = 3, byrow = TRUE)
apply(mat, MARGIN = 1, sum)
apply(mat, MARGIN = 1, max)
```
But why did we have to learn any of this in the first place?
* Your turn: What does `seq_along(diag(3))` return? `1:3` or `1:9`? What if you added a row? What if you added a column? Or is the name of that function `seq.along()`? Are you sure? Tempted to check the docs? Which docs? Feeling helpless? You should!
* Many functions that are designed for matrices should be forgotten about everywhere else. Several guides warn against using `apply()` on non-matrices and I wouldn't dare use `t()` on a non-matrix. Try `t(iris)`.
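Here's a sketch of why (using `head()` to keep the output small): the `Species` factor makes `t()` silently coerce every measurement to character.

```{r, echo = TRUE}
t(head(iris, 3)) #A character matrix!
```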
* I always expect `c()` of a matrix to work in row-order. It doesn't. However, that's probably more the fault of `c()` and I than it is of matrices. There are times when I can't explain `c(mtcars)` to myself.
* Named matrices are named atomic vectors, so they break in the ways [discussed earlier](#named-atomic-vectors). This puts you in a dilemma when you're using data that's essentially only one type: Do you keep it as a matrix and lose the awesome subsetting powers of a data frame? Or do you make it in to a data frame and lose the power to work by row that matrices give you? At times, I'm tempted to forget that I named the matrix in the first place and just manipulate it like a mathematician. None of these solutions are good.
Overall, matrices are so inconsistent with the rest of the language that your matrix-manipulation code never looks right. It leaves you with an awful sense of unease.
### The Error Messages
Something to mention while we've still got some bad error messages fresh in our minds: People often say that R's error messages aren't very good and I'm starting to agree. Errors like "*`dim(X)` must have a positive length*" are useless when you're not told which function in the line that threw the error had that error, what `X` is, or in the very worst cases, what line the error was even in. This means that almost any error that R throws is going to require you looking through both the result of `traceback()` (to find where the error happened) and the documentation (to identify the problematic argument). It seems that this issue gets even worse when you try to do statistics. Warnings like "*Warning message: In `ks.test(foo, bar)` : ties should not be present for the Kolmogorov-Smirnov test*" don't even tell you where the tie was. Was it in one of my arguments? Is it some technical detail of the test? Somewhere safe to ignore? You don't know and R won't tell you unless you study the documentation. If worst comes to worst, you have to read the code or learn the secret for getting `traceback()` to work on warning messages. And yes, that last bit is something that you have to learn. It makes warning messages a lot harder to debug than errors.
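Here's a sketch of that first error (the call is my own toy example). Notice that the message talks about `X` without ever telling you what `X` was:

```{r, echo = TRUE}
tryCatch(apply(1:3, MARGIN = 1, sum), error = conditionMessage)
```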
Of course, the more worrying (and frequent?) issue is when R gives you no warnings/errors at all. I'd much rather have a bad error message than none at all, but a bad error message is still annoying.
### Mapply Challenge
Maybe you think I'm clutching at straws? I admit, I sometimes wonder if my outrage is unjustified. Let's settle this with a challenge. If you win, then by all means close this document and write me off as a madman. If you lose, then maybe I've got a point.
***
**CHALLENGE**
Taking into account R's vector recycling rules, figure out how `mapply()`'s `MoreArgs` and `...` arguments differ and when you would want to pass something as a `MoreArgs` argument rather than in the `...` argument. No cheating by going online (trust me, it won't help). Solve this without leaving your R IDE. You're encouraged to check the documentation.
If my criticisms are true, you will find that `mapply()`'s documentation is of little help and that your confidence in your R knowledge is too small to make an educated guess.
***
**HINT 1**
Don't try to cheat by looking at `mapply()`'s code; most of it is in C and therefore will be of no help to you.
***
**HINT 2**
You might think that the documentation for `sapply()` will help you, but it'll actually mislead you because `mapply()`'s `...` is essentially `sapply()`'s `X` and `sapply()`'s `...` is most like `mapply()`'s `MoreArgs`.
Solution below. Time to stop scrolling.
***
**SOLUTION**
.
.
.
.
.
.
.
**How do `MoreArgs` and `...` differ?**
It's tough to explain. `mapply()` uses the default vector recycling rules for the `...` arguments but reuses every element of `MoreArgs` for each call. Because the `MoreArgs` argument must be a list and R recycles the elements of lists (e.g. using a length one list as a `...` argument will have the element of that list reused for each call), the difference is subtle to the point of near invisibility. Ultimately, `MoreArgs = list(a, b, c)` is equivalent to using `list(a)`, `list(b)`, and `list(c)` as three separate `...` arguments. The answer is therefore that `MoreArgs` only exists as syntactic sugar for this `...` case.
**When should you use `MoreArgs` rather than `...`?**
Beyond what I've already said, I barely have any idea. If you want to keep some function arguments fixed for each call, then just use an anonymous function. I struggle to invent a useful example of where I'd even consider using `MoreArgs`, never mind one that doesn't look tailor-made to make the anonymous function option look better. The one and only example that the documentation gives for using `MoreArgs` does not help here. Their example of `mapply(rep, times = 1:4, MoreArgs = list(x = 42))` is identical to `mapply(rep, times = 1:4, list(x = 42))`. Read that again: **You can get identical functionality by deleting the thing that they're trying to demonstrate!**
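Don't take my word for it; here's a sketch checking that deletion:

```{r, echo = TRUE}
identical(mapply(rep, times = 1:4, MoreArgs = list(x = 42)),
          mapply(rep, times = 1:4, list(x = 42)))
```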
**Bonus**
Did you notice that the documentation for `mapply()` has a notable omission? It doesn't mention this, but you can call `mapply()` without the `...` argument, e.g. `mapply(rep, MoreArgs = list(1:4))`. You won't get sensible output, but you also don't get any warnings or errors.
***
If I've won this challenge, then allow me to take a victory lap by making the following point: By giving you the options of using `...`, `MoreArgs`, or an anonymous function to do the same task, R gives you plenty of room to confuse yourself without providing any help in its documentation. Either provide fewer options, document them better, or make them so commonplace and consistent within the language that I only need to understand it once in order to understand it everywhere!