---
title: "Stat 33A - Lecture Notes 4"
date: September 13, 2020
output: pdf_document
---
File Systems
============
For this section, see the slides for notes.
The R Working Directory
=======================
The **working directory** is the reference point R uses for relative paths.
When navigating files in R, use the working directory to save time and make
code reproducible.
Important functions:
* `getwd()` -- get the working directory
* `setwd()` -- set the working directory
* `list.files()` -- list files in a directory
Try running these in the console first (rather than using `Ctrl` + `Enter` in a
notebook chunk).
Use `getwd()` to get the working directory:
```{r}
getwd()
```
The output on your computer will probably be different!
Use `setwd()` to set the working directory:
```{r}
setwd("..")
getwd()
setwd("/home/nick")
```
Use `list.files()` to list files in a directory:
```{r}
list.files()
list.files("/")
```
The output is an empty vector if:
* The path you provided is incorrect.
* The path you provided leads to a file, not a directory.
* There are no files in the directory.
For example, if we make a deliberate typo:
```{r}
list.files("foo")
```
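Since all three cases produce the same empty output, it can help to check the path before listing it. Here is a minimal sketch using the built-in `file.exists()` and `dir.exists()` functions. The helper name `describe_path()` is made up for this example and is not part of R:

```{r}
# Hypothetical helper (not part of R): explain why list.files() might
# return an empty vector for a given path.
describe_path = function(path) {
  if (!file.exists(path)) {
    "path does not exist"
  } else if (!dir.exists(path)) {
    "path is a file, not a directory"
  } else if (length(list.files(path)) == 0) {
    "directory is empty"
  } else {
    "directory contains files"
  }
}

# Create an empty directory and an empty file inside R's temporary directory.
empty_dir = tempfile("empty")
dir.create(empty_dir)
some_file = tempfile("file")
file.create(some_file)

describe_path(file.path(empty_dir, "foo"))  # no such path
describe_path(some_file)                    # a file
describe_path(empty_dir)                    # a directory with no files
```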
## R Markdown Files
RStudio tracks the working directory for the console and each R Markdown file
separately.
So:
* Running `getwd()` from the notebook with `Ctrl` + `Enter` displays the
NOTEBOOK's working directory.
* Typing `getwd()` in the "Console" window and pressing `Enter` displays the
CONSOLE's working directory.
```{r}
getwd()
```
In the notebook, if you use `setwd()` it only lasts for __that chunk__ and is
then reset:
```{r}
setwd("/home/nick")
```
So in subsequent chunks it looks like you didn't call `setwd()`:
```{r}
getwd()
```
Why does RStudio do this? It is a bad practice to include `setwd()` in your
notebooks, because people you share the notebook with, like your colleagues,
instructor, or employer, might not have the same directories on their computer
as the ones you have on your computer. The next section has more details about
this.
By default, RStudio does the right thing and sets the notebook's working
directory to the place where the notebook is saved. Then you can use relative
paths (see below) to load and save files from the notebook.
If you really want to set the working directory in a notebook, it is possible
to override RStudio. See <https://yihui.org/knitr/options/> for details.
## Editing Paths
R also has functions to make it easier to edit/create paths:
* `normalizePath()` -- convert relative path to absolute path
* `file.path()` -- combine parts of a path
* `dirname()` -- get all except last component of path
* `basename()` -- get last component of path
You can use `normalizePath()` to see the absolute path that a relative path or
shortcut stands for:
```{r}
getwd()
list.files()
normalizePath("data")
```
The path `~` is your **home directory**:
```{r}
normalizePath("~")
```
Your home directory is probably different!
The `file.path()` function combines parts of a path:
```{r}
file.path("path", "to", "file")
```
The `dirname()` and `basename()` functions get parts of a path:
```{r}
dirname("/home/nick/TODO.md")
basename("/home/nick/TODO.md")
```
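These functions can be combined to build one path from another. A short sketch follows. The file name `dogs_full.rds` is hypothetical, and none of these functions check whether a path actually exists:

```{r}
# file.path() joins the parts with "/" on every platform.
path = file.path("data", "dogs", "dogs_sample.rds")
path

dirname(path)    # everything except the last component
basename(path)   # only the last component

# Keep the directory but swap in a different (hypothetical) file name.
other = file.path(dirname(path), "dogs_full.rds")
other
```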
## Reproducible Analyses
Plan ahead so that other people can run your code and reproduce your results.
Good habits:
* Putting your notebook(s) and data in the project directory.
* Using paths relative to the project directory.
Bad habits:
* Calling `setwd()` in R notebooks and scripts.
* Using absolute paths.
It's okay to use `setwd()` in the *R console* to set the working
directory to your project directory.
Data Frames
===========
The first step of an analysis is to load a data set.
Many different file formats exist for storing data. They broadly fall into two
categories:
1. **Plain-text**, which means the format is human-readable. You can open and
edit these in any text editor.
2. **Binary**, which means the format is not human-readable. You need specific
software to open and edit these. Compared to plain-text, most binary formats
are faster to read and write, and use less space.
R provides a binary data format called **RDS** (R data, serialized). The
extension on an RDS file is usually `.rds`.
Any R object can be stored in an RDS file.
Use `saveRDS()` to save an object to an RDS file:
```{r}
x = seq(1, 100, 0.8)
x
saveRDS(x, "myvector.rds")
```
This is a good way to save your work after a long computation.
Use `readRDS()` to load an object from an RDS file:
```{r}
y = readRDS("myvector.rds")
y
```
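To check that nothing is lost in the round trip, you can compare the loaded object to the original with `identical()`. A small sketch, which writes to a temporary file (an assumption made here so the example leaves nothing in the working directory):

```{r}
# tempfile() returns a fresh path inside R's temporary directory.
rds_path = tempfile(fileext = ".rds")

# RDS can store any R object, not only vectors.
original = list(numbers = c(1.5, 2, 3), label = "hi")
saveRDS(original, rds_path)

restored = readRDS(rds_path)
identical(original, restored)
```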
## Data Frames
In statistics, we frequently work with 2-dimensional tables of data.
For a tabular data set, typically:
* Each row corresponds to a single case or subject. These are called
**observations**.
* Each column corresponds to something the data measures. These are called
**features**, **covariates**, or variables.
A "variable" means something else in R, so I'll avoid using it to refer to
columns.
R's data structure for tabular data is the **data frame**.
Let's load a data frame:
```{r}
dogs = readRDS("data/dogs/dogs_sample.rds")
```
This data set is available on the bCourse.
The Dogs Data Set is based on:
<https://informationisbeautiful.net/visualizations/best-in-show-whats-the-top-data-dog/>
## Inspecting Objects
Printing is not the only way to inspect data, and has drawbacks:
1. Slow (especially if you're knitting a notebook)
2. Hard to read if there are lots of columns
R provides functions to inspect objects.
We already saw one of these:
```{r}
class(dogs)
```
Use `head()` to print the first 6 rows (or elements):
```{r}
head(dogs)
```
Use `tail()` for the last 6:
```{r}
tail(seq(1, 100))
```
Use `dim()` to print the dimensions:
```{r}
dim(dogs)
```
Alternatively, use `ncol()` and `nrow()`:
```{r}
ncol(dogs)
nrow(dogs)
```
Use `names()` to print the column (or element) names:
```{r}
names(dogs)
```
Use `rownames()` to print the row names:
```{r}
rownames(dogs)
```
Use `str()` to print a structural summary:
```{r}
str(dogs)
```
Use `summary()` to print a statistical summary:
```{r}
summary(dogs)
```
## More about Data Frames
R uses data frames to represent tabular data.
A data frame is a list of column vectors. So:
* Elements of a column must all have the same type (like a vector).
* Elements of a row can have different types (like a list).
* Every row must be the same length.
In addition, every column must be the same length.
This idea is reflected in the type of a data frame:
```{r}
typeof(dogs)
```
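You can see the same structure on a small data frame built by hand. The `pets` data below is made up for illustration and is not the dogs data:

```{r}
pets = data.frame(
  name = c("Rex", "Ada", "Bo"),
  weight = c(30.5, 8.2, 12.0),
  stringsAsFactors = FALSE   # keep `name` as strings in older versions of R
)

typeof(pets)    # the underlying type is a list
length(pets)    # one list element per column

# Each column is a vector with a single type.
column_types = sapply(pets, typeof)
column_types
```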
## Accessing Columns
Recall that lists can have named elements:
```{r}
list(a = 5, b = "hi")
```
The dollar sign operator `$` extracts a named element from a list.
It's especially useful for getting columns from data frames:
```{r}
dogs$breed
dogs$weight
```
You can also use `$` to set an element:
```{r}
dogs$wt_by_height = dogs$weight / dogs$height
dogs
```
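The same operator works on an ordinary list, which is one way to see that columns are just named list elements. A sketch with made-up values (`item` and `pets` are not part of the dogs data):

```{r}
item = list(a = 5, b = "hi")
item$a            # extract the element named "a"
item$c = TRUE     # assigning to a new name appends an element
names(item)

pets = data.frame(weight = c(30, 8), height = c(60, 25))
pets$ratio = pets$weight / pets$height   # same idea: adds a column
pets$ratio
```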
## Deconstructing Data Frames
The `unclass()` function removes an object's class attribute, so R falls back
to treating the object according to its type.
You can use `unclass()` to inspect the internals of an object.
For example, you can see that a data frame is a list:
```{r}
unclass(dogs)
```
Factors
=======
Again we'll use the dogs data:
```{r}
dogs = readRDS("data/dogs/dogs_sample.rds")
class(dogs$breed)
class(dogs$group)
```
R represents categorical data using the class `factor`:
```{r}
dogs$group
```
The categories of a factor are called **levels**.
You can list the levels with the `levels()` function:
```{r}
levels(dogs$group)
```
Factors remember all possible levels even if you take a subset:
```{r}
dogs$group[c(1, 2, 3)]
```
This is one way factors are different from strings.
For example:
```{r}
x = c("sporting", "herding", "hound")
x[1]
```
You can make a factor forget levels that aren't present with `droplevels()`:
```{r}
new_groups = dogs$group[c(1, 2, 3)]
droplevels(new_groups)
```
You can create a factor with the `factor()` function:
```{r}
factor(c("red", "red", "blue", "red"))
```
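By default `factor()` uses the sorted unique values as the levels. The `levels` parameter lets you choose the categories and their order yourself. A brief sketch, where the extra level `"green"` is made up for illustration:

```{r}
colors = c("red", "red", "blue", "red")

# Default: levels are the sorted unique values.
levels(factor(colors))

# Explicit levels set the order and may include categories that do not
# appear in the data.
color_factor = factor(colors, levels = c("red", "green", "blue"))
levels(color_factor)
```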
## Counting Things
The `table()` function returns the frequency of each value in a vector.
This is especially useful for factors:
```{r}
table(dogs$group)
table(c(1, 1, 2, 3))
```
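Because a factor remembers all of its levels, `table()` also reports levels that never occur, with a count of 0. A sketch with a made-up factor:

```{r}
color_factor = factor(c("red", "red", "blue"),
  levels = c("red", "green", "blue"))

counts = table(color_factor)
counts             # "green" is listed with a count of 0
counts["red"]      # look up one count by level name

# Drop unused levels first if you only want the levels that are present.
present = table(droplevels(color_factor))
present
```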
## Deconstructing Factors
Internally, R represents factors as integer vectors:
```{r}
typeof(dogs$group)
unclass(dogs$group)
```
Each integer corresponds to one level of the factor.
This representation uses less memory than repeating each level's
name.
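The link between the integers and the levels can be checked by hand: indexing the levels by the integer codes recovers the original labels. A sketch with a made-up factor:

```{r}
group_factor = factor(c("hound", "herding", "hound", "toy"))

levels(group_factor)            # sorted: "herding" "hound" "toy"
codes = as.integer(group_factor)
codes                           # position of each value in the levels

# Use the codes to index the levels and get the labels back.
labels = levels(group_factor)[codes]
labels
```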
File Formats
============
Recall there are two kinds of file formats: plaintext and binary.
The RDS format is a binary format.
## Plaintext Formats for Tabular Data
Several plaintext file formats are designed just for tabular data:
* Delimited files
    + Comma-separated value (CSV) files
    + Tab-separated value (TSV) files
* Fixed-width files
For example, suppose you download the Significant Volcanic Eruption Database
from:
https://www.ngdc.noaa.gov/nndc/struts/form?t=102557&s=50&d=50
This file is also on the bCourse.
The website says the file is tab-delimited, so use `read.delim()`:
```{r}
volcano = read.delim("data/volcano/volerup.txt")
```
Many things can go wrong when you read tabular data from a plaintext file:
* Extra lines in the file
* No header in the file
* Incorrect column classes
These can generally be fixed by setting parameters in the read function.
You can read more about the parameters in `?read.table`.
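As an illustration, here is a sketch that fixes all three problems at once. The file contents and column names are made up. The parameters `skip`, `header`, `col.names`, and `colClasses` are documented in `?read.table`:

```{r}
# A made-up tab-delimited file with an extra first line and no header.
messy_path = tempfile(fileext = ".txt")
writeLines(c(
  "Exported by a hypothetical tool",
  "1\tapple\t2.5",
  "2\tbanana\t4"
), messy_path)

fruit = read.delim(
  messy_path,
  skip = 1,                                    # ignore the extra line
  header = FALSE,                              # the file has no header row
  col.names = c("id", "name", "price"),        # so supply column names
  colClasses = c("integer", "character", "numeric")
)
str(fruit)
```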
The most common plaintext data format is CSV.
Use `write.csv()` to write CSV files:
```{r}
write.csv(volcano, "volcano.csv")
```
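One detail to be aware of: by default `write.csv()` also writes the row names as an unnamed first column, which comes back as a column named `X` when you read the file with `read.csv()`. A sketch with a made-up data frame, written to a temporary file:

```{r}
fruit = data.frame(id = c(1, 2), name = c("apple", "banana"),
  stringsAsFactors = FALSE)
csv_path = tempfile(fileext = ".csv")

# Default: the row names are written as an extra first column.
write.csv(fruit, csv_path)
with_rownames = names(read.csv(csv_path))
with_rownames

# row.names = FALSE writes only the data columns.
write.csv(fruit, csv_path, row.names = FALSE)
without_rownames = names(read.csv(csv_path))
without_rownames
```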
Related functions for reading plaintext tabular data:
* `read.csv()` -- read CSV files
* `read.table()` -- read delimited files in general
* `read.fwf()` -- read fixed-width files
## Binary Formats for Tabular Data
There are also a few binary formats for tabular data:
* Excel spreadsheets
* Feather
R doesn't provide built-in support for these, but check CRAN for packages.
For instance:
* `readxl` -- a package with functions for reading Excel spreadsheets
* `arrow` -- a package with functions for reading Feather files
## Non-tabular Data
Many packages are available for non-tabular file formats.
For example:
* `jsonlite` -- read JavaScript Object Notation (JSON) files
* `xml2` -- read Extensible Markup Language (XML) files
* `rvest` -- read Hypertext Markup Language (HTML) files
The built-in `readLines()` function can read lines from any plaintext file.
For example:
```{r}
readLines("volcano.csv", 1)
```
Think of `readLines()` as a last resort.
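One reasonable use is peeking at the first few lines of an unfamiliar file to decide which read function and parameters to use. A sketch with a made-up file:

```{r}
peek_path = tempfile(fileext = ".csv")
writeLines(c("id,name", "1,apple", "2,banana"), peek_path)

# The second argument (n) limits how many lines are read.
first_line = readLines(peek_path, n = 1)
first_line

# Without n, readLines() returns every line as one string each.
all_lines = readLines(peek_path)
length(all_lines)
```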