Learning Objectives

Following this assignment students should be able to:

Excel

  • Good data entry practices - formatting data tables in spreadsheets
  • How to avoid common formatting mistakes
  • Approaches for handling dates in spreadsheets
  • Basic quality control and data manipulation in spreadsheets
  • Exporting data from spreadsheets

OpenRefine

  • Import data
  • Clean data
  • Reproduce analysis

Reading

Spreadsheet

Lecture Notes


Exercises

  1. -- Basic-01 --

    Questions

    • How many people have used spreadsheets in their research?
    • How many people have accidentally done something that made them frustrated or sad?
  2. -- Basic-02 --

    Excel

    1. Download the data by clicking here to get it from FigShare.
    2. Open up the data in a spreadsheet program.
    3. You can see that there are two tabs. Two field assistants conducted the surveys, one in 2013 and one in 2014, and they both kept track of the data in their own way. Now you’re the person in charge of this project and you want to be able to start analyzing the data.
    4. With the person next to you, identify what is wrong with this spreadsheet. Also discuss the steps you would need to take to clean up the 2013 and 2014 tabs, and to put them all together in one spreadsheet.
  3. -- Basic-03 --

    Openrefine

    1. Using faceting, find out how many years are represented in the census.
    2. Is the column formatted as Number, Date, or Text? How does changing the format change the faceting display?
    3. Which years have the most and least observations?
  4. -- Basic-04 --

    Openrefine

    1. Split the column scientific name into genus and species
    2. Explore the Undo / Redo operations
    3. Which transform solves the isues
  5. -- Basic-05 --

    OpenRefine

    1. What scientific names (genus and species) are selected by searching for ‘bai’?
    2. How would you restrict this to one of the species selected?
    3. How many rows are matched for each of the species?
    4. Use include / exclude to select entries from one of the species.
  6. -- Basic-06 --

    OpenRefine

    1. Continue to work with both species.
    2. Sort by month. How can you ensure that months are in order?
    3. Sort the data by plot. What years were observations recorded for the filtered set?
    4. How do you sort your data by month?
    5. How do you sort the subset in chronological order?
  7. -- Basic-07 --

    OpenRefine

    1. Use the full set again.
    2. Transform the columns recordID, dy, mo, period, plot_id to numbers.
    3. Edit several cells of the column “dy”to contain text or empty cells. Explore the numeric facet on that column.
    4. Create a scatterplots by plotting pairs of numeric columns.
    5. Why does the scatterplot recordID vs period have the pattern is does?
  8. -- Basic-08 --

    OpenRefine

    1. Create a project with the bed file: syst-nocallsCG69.bed https://bit.ly/2Q9gA05

    Try to determine

    a) the number of no-call regions that are larger than 1000 bases long in chromosome 21
    
    b) the length of the longest region in chromosome 1