Learning Objectives
By the end of this lesson the learner will:
- Access and run a SQL query in R.
- Create a database from an existing .csv file using SQL syntax.
- Recognize how scripted database interactions enhance the reproducibility of data analysis.
Interacting with databases through scripted languages can offer advantages over querying databases via a GUI interface. A GUI interface for your database is easier to use and allows the quick viewing of adhoc queries. Querying a database with a programmatic interface (in this case R, but it could be any language) are slightly more complicated. However the trade-off is that data manipulations are preserved in the code. Aggregations, summaries and other database operations are preserved. Therefore those pre-analysis data manipulation steps are not lost and can be reproduced later by yourself or others.
R can connect to databases through a number of packages. In our case we will use RSQLite to connect to existing SQLite3 databases. However you should be able to connect to almost any database in R via JDBC or ODBC, or specific database packages (such as we are doing, or MySQL ).
To begin these exercises we’ll connect to the mammal database you’ve just created. We’ll need the RSQLite package so be sure to install it first. install.packages('RSQLite')
.
library(RSQLite)
## Set dbname and driver out of convenience
myDB <- "data/portal_mammals.sqlite"
conn <- dbConnect(drv = SQLite(), dbname= myDB)
Now we’re connected to our database, let’s explore the table structure in the database. Remember we could see the list of tables in the SQLite Firefox gui. In R, we’ll need to use SQL commands.
We can view information about the database structure which includes a list of all tables like this:
dbGetQuery(conn, "SELECT type, tbl_name FROM sqlite_master")
The RSQLite
package also has functions that can be used to list both tables and fields within a table. Here you can see the types and names of fields and get a count of records.
dbListTables(conn)
#get list of fields in the surveys table
dbListFields(conn,"surveys")
dbGetQuery(conn,"SELECT count(*) FROM surveys")
It’s good practice to always close a connection that you open in R. Let’s do that next. Note that once you’ve closed a connection, you will have to open a new connection to query and import the data again.
dbDisconnect(conn)
We’ve now covered the basics of connecting to a database and exploring its basic structure. From here we can write queries to access subsets of the data within our database, using the same methods that we used in SQLite.
Let’s try some basic queries from the previous lesson. Querying simply takes a connection to a database and query as inputs and returns a dataframe with the results.
Before we can do this, we need to re-establish a database connection.
## Set dbname and driver out of convenience
myDB <- "data/portal_mammals.sqlite"
conn <- dbConnect(drv = SQLite(), dbname= myDB)
Next, we build our query. We can use the dbGetQuery
function to run the query and access data returned in a data.frame
object.
q <- 'SELECT DISTINCT year, species_id FROM surveys'
result <- dbGetQuery(conn, q)
head(result)
Challenge
- Write a query that returns counts of genus by
plot_id
You can join multiple tables together in SQL using the following syntax where foreign key refers to your unique id (e.g.,
species_id
):SELECT table.col, table.col FROM table1 JOIN table2
ON table1.key = table2.key
JOIN table3 ON table2.key = table3.keyWrite a query that joins the species, plot, and survey tables together. The query should return counts of genus by plot type. Then create a bar plot of your results in R.
q <- "SELECT d.plot_type , c.genus, count(*)
FROM
(SELECT a.genus, b.plot_id
FROM species a
JOIN surveys b
ON a.species_id = b.species_id) c
JOIN plots d
ON c.plot_id = d.plot_id
GROUP BY d.plot_type,c.genus"
result <- dbGetQuery(conn,q)
head(result)
All we’ve done so far is execute the same sorts of queries that can easily be made with a GUI. Now let’s try leveraging the power of scripted queries. Imagine you want to know how many rodents were found every other year. To get this we’ll get the range of dates from the database, sequence them by two and make new queries.
yearRange <- dbGetQuery(conn,"SELECT min(year),max(year) FROM surveys")
years <- seq(yearRange[,1],yearRange[,2],by=2)
Next we’ll build our query string using the paste()
function.
q <- paste("
SELECT a.year,b.taxa,count(*) as count
FROM surveys a
JOIN species b
ON a.species_id = b.species_id
AND b.taxa = 'Rodent'
AND a.year in (",
paste(years,collapse=",")
,")
GROUP BY a.year, b.taxa",
sep = "" )
rCount <- dbGetQuery(conn,q)
head(rCount)
With the nested paste commands we were able to construct a query programmatically, without having to type out all the years. This could also be done with a for loop, especially if the query to be constructed is more complicated.
We can use R to create a new database and associated structure (also known as the schema) databases from existing csv
files. Let’s recreate the mammals database that we’ve been working with, in R. First let’s read in the csv
files.
Up until now we been working with an existing data. However we can use R as a way to build up databases from existing flat files. We’ll use the flat files that make up the mammals database to recreate it. First let’s read in the files.
species <- read.csv("data/species.csv")
surveys <- read.csv("data/surveys.csv")
plots <- read.csv("data/plots.csv")
Next we’ll need to create a database and add our tables.
myDB <- "portalR.db"
myConn <- dbConnect(drv = SQLite(), dbname= myDB)
dbListTables(myConn)
By opening up a connection, you’ve created a new database. If you list the tables you’ll see that the database is empty. Next we’ll add the dataframes we just read in to the database.
dbWriteTable(myConn,"species",species)
dbListTables(myConn)
dbGetQuery(myConn,"SELECT * from species limit 10")
If you check the location of our database you’ll see that data is automatically being written to disk. Not only does R and RSQLite provide easy ways to query existing databases, it also allows you to easily create your own databases from flat files.
If you check the location of your database you’ll see that data is automatically being written to disk. Not only does R and RSQLite provide easy ways to query existing databases, it also allows you to easily create your own databases from flat files.
Challenge
Add the remaining tables to the existing database. Open the new database using the Firefox SQLite manager to verify that the database was built successfully.
Extra Challenge
Run some of your queries from earlier in the lesson to verify that you have faithfully recreated the mammals database.
Don’t forget to close your database connection when you are done with it!
Data Carpentry, 2016. License. Questions? Feedback? Please file an issue on GitHub.