In this video you will learn how to import your flat files into R.
Want to take the interactive coding exercises and earn a certificate? Join DataCamp today, and start our intermediate R tutorial for free: https://www.datacamp.com/courses/importing-data-into-r
In this first chapter, we'll start with flat files. They're typically simple text files that contain table data. Have a look at states.csv, a flat file containing comma-separated values. The data lists basic information on some US states. The first line here gives the names of the different columns or fields. After that, each line is a record, and the fields are separated by a comma, hence the name comma-separated values. For example, there's the state Hawaii with the capital Honolulu and a total population of 1.42 million.
What would that data look like in R? Well, actually, the structure nicely corresponds to a data frame in R, that ideally looks like this: the rows in the data frame correspond to the records and the columns of the data frame correspond to the fields. The field names are used to name the data frame columns. But how to go from the CSV file to this data frame?
The mother of all these data import functions is the read.table() function. It can read in any file in table format and create a data frame from it. The number of arguments you can specify for this function is huge, so I won't go through each and every one of these arguments. Instead, let's have a look at the read.table() call that imports states.csv and try to understand what happens.
The first argument of the read.table() function is the path to the file you want to import into R. If the file is in your current working directory, simply passing the filename as a character string works.
If your file is located somewhere else, things get tricky. Depending on the platform you're working on, Linux, Microsoft, Mac, whatever, file paths are specified differently. To build a path to a file in a platform-independent way, you can use the file.path() function.
Now for the header argument. If you set this to TRUE, you tell R that the first row of the text file contains the variable names, which is the case here. read.table() sets this argument FALSE by default, which would mean that the first row is already an observation.
Next, sep is the argument that specifies how fields in a record are separated. For our csv file here, the field separator is a comma, so we use a comma inside quotes.
Finally, the stringsAsFactors argument is pretty important. It's TRUE by default, which means that columns, or variables, that are strings, are imported into R as factors, the data structure to store categorical variables. In this case, the column containing the country names shouldn't be a factor, so we set stringsAsFactors to FALSE.
If we actually run this call now, we indeed get a data frame with 5 observations and 4 variables, that corresponds nicely to the CSV file we started with.
The read table function works fine, but it's pretty tiring to specify all these arguments every time, right? CSV files are a common and standardized type of flat files. That's why the utils package also provides the read.csv function. This function is a wrapper around the read.table() function, so read.csv() calls read.table() behind the scenes, but with different default arguments to match with the CSV format.
More specifically, the default for header is TRUE and for sep is a comma, so you don't have to manually specify these anymore. This means that this read.table() call from before is thus exactly the same as this read.csv() call.
Apart from CSV files, there are also other types of flat files. Take this tab-delimited file, states.txt, with the same data:
To import it with read.table(), you again have to specify a bunch of arguments. This time, you should point to the .txt file instead of the .csv file, and the sep argument should be set to a tab, so backslash t.
You can also use the read.delim() function, which again is a wrapper around read.table; the default arguments for header and sep are adapted, among some others. The result of both calls is again a nice translation of the flat file to a an R data frame.
Now, there's one last thing I want to discuss here. Have a look at this US csv file and its european counterpart, states_eu.csv.
You'll notice that the Europeans use commas for decimal points, while normally one uses the dot. This means that they can't use the comma as the field-delimiter anymore, they need a semicolon. To deal with this easily, R provides the read.csv2() function. Both the sep argument as the dec argument, to tell which character is used for decimal points, are different. Likewise, for read.delim() you have a read.delim2() alternative. Can you spot the differences again? This time, only the dec argument had to change.