ENVIRON-175

Programming with Big Environmental Datasets



Gleb Satyukov
Senior Research Engineer | Data Science Instructor


https://app.sli.do/event/eTREo1Uv2ucHsqP4Amos6q

Slides


R Basics 1: https://environ-175.com/basics/1

R Basics 2: https://environ-175.com/basics/2

R Basics 3: https://environ-175.com/basics/3

R Basics 4: https://environ-175.com/basics/4

R Basics 5: https://environ-175.com/basics/5

Slides


R Advanced 1: https://environ-175.com/advanced/1

R Advanced 2: https://environ-175.com/advanced/2

R Advanced 3: https://environ-175.com/advanced/3

R Advanced 4: https://environ-175.com/advanced/4

R Advanced 5: https://environ-175.com/advanced/5

Slides


R Spatial 1: https://environ-175.com/spatial/1

R Spatial 2: https://environ-175.com/spatial/2

R Spatial 3: https://environ-175.com/spatial/3

R Spatial 4: https://environ-175.com/spatial/4

R Spatial 5: https://environ-175.com/spatial/5

Schedule


Agenda for today

Use OR logic

Use AND logic

Date Formats

Appending data

New library and functions!


library(lubridate) # For working with dates

select()
as.Date()
bind_rows()
case_when()
            

Conditional Operations


Building on top of our ifelse() function

We specify what to do if a condition is true, and what to do if that condition is not true, for example:


#############################
# IFELSE() USED AS A FUNCTION
#############################

# General form:
# ifelse(<SOME CONDITION>, <RESULT IF TRUE>, <RESULT IF FALSE>)

temperature <- c(65, 72)  # example values

ifelse(temperature < 70, "Cold", "Warm")

          

Combining conditions


| symbol is used for OR logic

& symbol is used for AND logic


! symbol is used for NOT (negation) logic

e.g. != is used for NOT EQUALS TO conditions
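
A minimal sketch of these operators applied to a whole vector at once. The `temperature` values and the variable names are made up for illustration:

```r
# Hypothetical temperature readings (degrees Fahrenheit)
temperature <- c(55, 68, 75, 90)

# OR: TRUE when at least one side is TRUE
is_extreme <- temperature < 60 | temperature > 85

# AND: TRUE only when both sides are TRUE
is_mild <- temperature >= 60 & temperature <= 85

# NOT: flips TRUE and FALSE
not_mild <- !is_mild

# NOT EQUAL TO
not_68 <- temperature != 68

print(is_extreme)
print(is_mild)
print(not_68)
```

Each comparison is checked element by element, so the result has one TRUE or FALSE per temperature.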

order of operations


We use parentheses to specify the order of operations:


x <- 5

if ((x > 3) & (x < 10)) {
  print("x is between 3 and 10")
}
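
Parentheses matter because R evaluates & before |, much like multiplication comes before addition. A small sketch with made-up variable names:

```r
# & is evaluated before |, so this reads as TRUE | (FALSE & FALSE)
without_parens <- TRUE | FALSE & FALSE

# Parentheses force the OR to be evaluated first
with_parens <- (TRUE | FALSE) & FALSE

print(without_parens)  # TRUE
print(with_parens)     # FALSE
```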
          

Project Example


You can combine as many conditions as you want!


#####################
# STEP 7. EXTRACT DAY OF WEEK AND DROP WEEKENDS
#####################

#Create variable to determine the weekends
fbi_drop <- mutate(fbi_drop, dow=wday(fbi_date))

#Drop weekends
fbi_nowkd = filter(fbi_drop,
                   dow==2 | dow==3 | dow==4 | dow==5 | dow==6)
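
When every condition compares the same variable, the %in% operator is a shorter way to write the same filter. This is a sketch, not the project code: the one-week `fbi_drop` table below is a hypothetical stand-in for the real data.

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in: one week of dates, Sunday through Saturday
fbi_drop <- tibble(fbi_date = as.Date("2021-03-14") + 0:6)

# wday() returns 1 for Sunday through 7 for Saturday (lubridate default)
fbi_drop <- mutate(fbi_drop, dow = wday(fbi_date))

# Long form with OR
fbi_nowkd <- filter(fbi_drop,
                    dow == 2 | dow == 3 | dow == 4 | dow == 5 | dow == 6)

# Shorter equivalent: keep rows whose dow is one of 2 through 6
fbi_nowkd_short <- filter(fbi_drop, dow %in% 2:6)

print(fbi_nowkd_short)
```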
          

Evaluations


Certain values may evaluate to TRUE or FALSE:

0 evaluates to FALSE

1 (or any other non-zero number) evaluates to TRUE

⚠️ Note


Be careful using as.logical() because it may not behave the way you expect

It is much better to specify the conditions explicitly whenever possible

Try it yourself in the RStudio console
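
A few lines to try in the console. They sketch some of the surprises with as.logical(); `x` is just an example variable:

```r
as.logical(0)       # FALSE
as.logical(1)       # TRUE
as.logical(2)       # TRUE: any non-zero number counts as TRUE
as.logical("TRUE")  # TRUE
as.logical("yes")   # NA: not a spelling that R recognizes
as.logical("0")     # NA: the text "0" is not the number 0

# An explicit comparison makes the intent clear
x <- 2
x == 1              # FALSE
```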

Logic Gates

https://en.wikipedia.org/wiki/Truth_table

Amusement Park Rides


OR Logic - Venn Diagram

Condition A | Condition B

AND Logic - Venn Diagram

Condition A & Condition B

name == "Emma" & age == 30

Person Name Age
A Emma 45
B Emma 30
C Ryo 30


name == "Emma" | age == 30

Person Name Age
A Emma 45
B Emma 30
C Ryo 30
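
The two conditions can be checked against the same table in R. A minimal sketch, assuming the table is stored in a tibble called `people` (the column names are chosen for illustration):

```r
library(dplyr)

people <- tibble(
  person = c("A", "B", "C"),
  name   = c("Emma", "Emma", "Ryo"),
  age    = c(45, 30, 30)
)

# AND: both conditions must hold, so only person B is kept
both <- filter(people, name == "Emma" & age == 30)

# OR: one condition is enough, so A, B, and C are all kept
either <- filter(people, name == "Emma" | age == 30)

print(both)
print(either)
```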


Switch / Dispatch functions


A function that diverts to other functions

Or a function that returns different values in different scenarios, different results in different cases

The same effect can be achieved with multiple if/else statements, but those can get complicated quickly

CASE_WHEN() function


Using case_when() you can apply multiple conditions to create new variables, for example:


people <- mutate(people, age_group = case_when(
    age < 18 ~ "Child",
    age >= 18 & age < 65 ~ "Adult",
    age >= 65 ~ "Senior"
  ))
          

The tilde ~ symbol separates a condition (on the left) from a value to return (on the right)
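
A runnable version of the example, assuming a small made-up `people` table. The missing age is included to show what happens when no condition matches:

```r
library(dplyr)

# Hypothetical data; the NA age matches none of the conditions
people <- tibble(age = c(7, 18, 42, 65, NA))

# Conditions are checked from top to bottom; the first match wins
people <- mutate(people, age_group = case_when(
  age < 18 ~ "Child",
  age >= 18 & age < 65 ~ "Adult",
  age >= 65 ~ "Senior"
))

print(people)
```

Rows that match no condition get NA, so it is worth checking the new variable for missing values afterwards.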

Project Example


DST starts on a different date every year


#######################
# STEP 5. FORMAT DATE of DST
#######################

fbi_clean <- mutate(fbi_clean,
                    change_date=case_when(
                      year==2021 ~ "March 14 2021",
                      year==2022 ~ "March 13 2022"
                    )
)
          


Date Formats in R

📅 Date Formats differ


In the U.S. we write dates as:

month/day/year


In Europe, you might write the date as:

day/month/year

Internal Representation


as.character(...) -> "character" type

as.numeric(...) -> "numeric" type


as.Date(...) -> "Date" class (stored internally as the number of days since 1970-01-01)
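
A short sketch to inspect what as.Date() returns; `d` is just an example variable:

```r
d <- as.Date("2021-03-14")

class(d)       # "Date"
typeof(d)      # "double": stored as a number under the hood
as.numeric(d)  # number of days since 1970-01-01

# Because dates are numbers, arithmetic works
d + 1                      # the next day
as.Date("2021-03-21") - d  # a time difference of 7 days
```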

📅 Some Examples


Some examples of date formatting:


as.Date("03/14/2021", format = "%m/%d/%Y")

as.Date("14-Mar-21", format = "%d-%b-%y")

as.Date("Sunday, March 14, 2021", format = "%A, %B %d, %Y")
          

Note: as.Date() can only handle dates, not times!

Common Date Formats


  • United States: MM/DD/YYYY
            e.g. 03/14/2021 is formatted as %m/%d/%Y

  • Europe (many countries): DD/MM/YYYY
            e.g. 14/03/2021 is formatted as %d/%m/%Y
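
The format matters because the same text can mean two different dates. A small sketch with a made-up ambiguous date:

```r
ambiguous <- "03/04/2021"

us_date <- as.Date(ambiguous, format = "%m/%d/%Y")  # read as March 4
eu_date <- as.Date(ambiguous, format = "%d/%m/%Y")  # read as April 3

print(us_date)
print(eu_date)

# A format that does not match the text gives NA rather than an error
mismatch <- as.Date("2021-03-14", format = "%m/%d/%Y")
print(mismatch)
```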

Other International Examples


  • ISO 8601 (International Standard): YYYY-MM-DD
            e.g. 2021-03-14 is formatted as %Y-%m-%d

  • Japan: YYYY年MM月DD日
            e.g. 2021年03月14日 is %Y年%m月%d日

  • Text formats: March, 14 2021 is %B, %d %Y

Project Example


We are converting a date from text to a Date object

(R's internal representation of a date)


# Converting text to Date objects
as.Date("March 14 2021", format = "%B %d %Y")
          

We need to tell R which format our dates are stored in

In this example the format is: "%B %d %Y"
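
In the project the same conversion is applied to a whole column. A minimal sketch, assuming a two-row stand-in for `fbi_clean` and an English locale (month names such as "March" are matched in the current locale):

```r
library(dplyr)

# Hypothetical miniature version of the project data
fbi_clean <- tibble(year = c(2021, 2022))

fbi_clean <- mutate(fbi_clean,
                    change_date = case_when(
                      year == 2021 ~ "March 14 2021",
                      year == 2022 ~ "March 13 2022"
                    ),
                    # Convert the text column to Date objects
                    change_date = as.Date(change_date, format = "%B %d %Y"))

class(fbi_clean$change_date)
print(fbi_clean)
```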

STRPTIME


Stands for "string parse time"

Its format codes are a very common way to describe dates and times, used in many programming languages

The format codes go back to at least PWB/UNIX 1.0, released in 1977

https://en.wikipedia.org/wiki/C_date_and_time_functions#History

strptime Format Codes

Code   Meaning                  Example
%Y     4-digit year             2021
%y     2-digit year             21
%m     2-digit month            03
%B     Full month name          March
%b     Abbreviated month        Mar
%d     2-digit day              14
%A     Full weekday name        Sunday
%a     Abbreviated weekday      Sun
%j     Day of year (001–366)    073
%%     Literal percent sign     %

Check the docs for ?Date

Check the docs for ?strptime
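
The same codes also work in the other direction: format() turns a Date back into text. A short sketch using the example date from the table:

```r
d <- as.Date("2021-03-14")

format(d, "%Y")          # "2021"
format(d, "%y")          # "21"
format(d, "%m/%d/%Y")    # "03/14/2021"
format(d, "%j")          # "073"
format(d, "%B %d, %Y")   # month name depends on the locale
```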

Interesting Facts


There is no year 0!

https://en.wikipedia.org/wiki/Year_zero

Y2K - Year 2000 problem

https://en.wikipedia.org/wiki/Year_2000_problem

The year 2038 - a problem?

https://en.wikipedia.org/wiki/Year_2038_problem
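
The 2038 date can be worked out in R. This sketch assumes the usual description of the problem: time stored as seconds since 1970-01-01 in a signed 32-bit integer. The variable names are illustrative.

```r
# Largest value a signed 32-bit integer can hold
max_seconds <- 2^31 - 1

# Interpret it as seconds since 1970-01-01 00:00:00 UTC
last_moment <- as.POSIXct(max_seconds, origin = "1970-01-01", tz = "UTC")

format(last_moment, "%Y-%m-%d %H:%M:%S")  # "2038-01-19 03:14:07"
```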

Lubridate library



https://lubridate.tidyverse.org/

Lubridate Cheatsheet

lubridate Date Parsing


Some example functions used to parse dates:


library(lubridate)

ymd("2021-03-14")      # Year-Month-Day
dmy("14/03/2021")      # Day-Month-Year
mdy("March 14, 2021")  # Month-Day-Year
          

lubridate Date Parsing


The lubridate library can also handle times!


library(lubridate)

# Can also handle times
# With hours, minutes, seconds
ymd_hms("2021-03-14 09:45:00")

# Parsing multiple formats in a vector
parse_date_time(
      c("14-03-2021", "2021/03/14"),
      orders = c("dmy", "ymd")
)
          

And you don't have to specify a format string like %Y-%m-%d
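
A quick side-by-side sketch showing that base R and lubridate arrive at the same Date; the variable names are illustrative:

```r
library(lubridate)

# Base R: the format string must be spelled out
base_date <- as.Date("03/14/2021", format = "%m/%d/%Y")

# lubridate: the function name gives the order of the parts
lub_date <- mdy("03/14/2021")

base_date == lub_date  # TRUE: both are the same Date

# lubridate accepts different separators for the same order
mdy("03-14-2021")
mdy("03.14.2021")
```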

Other Helpful functions


wday() to get the day of the week (by default 1 = Sunday, 7 = Saturday)

year() to easily extract year from date

month() to easily extract month

day() to easily extract day of month


Note: the variable needs to be in date format first
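
A short sketch of these helpers on a single date, assuming lubridate's default settings (weeks start on Sunday); `d` is just an example variable:

```r
library(lubridate)

d <- ymd("2021-03-14")  # a Sunday

wday(d)                # 1
wday(d, label = TRUE)  # the weekday as a labelled factor
year(d)                # 2021
month(d)               # 3
day(d)                 # 14
```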

DATA DATA DATA DATA DATA DATA

File Storage


There are benefits to storing data in separate files:


Latency optimization in download speeds

Space optimization through sharding

Offer users the choice of which files to download

Appending Data


Suppose we have some Great Blue Heron nest counts from two separate years, stored in two different tables:


library(tibble)
library(dplyr)

# Data from 2022
herons_2022 <- tibble(
  location = c("Via Marina", "Palawan Way"),
  nests = c(14, 9),
  year = 2022
)
          

Appending Data


And the data collected in 2023:


# Data from 2023
herons_2023 <- tibble(
  location = c("Via Marina", "Palawan Way"),
  nests = c(10, 6),
  year = 2023
)
          

Appending Data with bind_rows()


bind_rows() stacks the rows of the two data frames together, one on top of the other, for example:


# Combine data using bind_rows()
herons_all <- bind_rows(herons_2022, herons_2023)

print(herons_all)
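
A self-contained sketch to check the result. The optional `.id` argument of bind_rows() records which table each row came from; the labels "survey_2022" and "survey_2023" are made up for illustration.

```r
library(dplyr)

herons_2022 <- tibble(location = c("Via Marina", "Palawan Way"),
                      nests = c(14, 9), year = 2022)
herons_2023 <- tibble(location = c("Via Marina", "Palawan Way"),
                      nests = c(10, 6), year = 2023)

herons_all <- bind_rows(herons_2022, herons_2023)

nrow(herons_all)  # 4: two rows from each year

# Optional: add a column naming the source table of each row
herons_labeled <- bind_rows(survey_2022 = herons_2022,
                            survey_2023 = herons_2023,
                            .id = "source")
print(herons_labeled)
```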
          

Uneven Tables: Extra Column

Now let's say the 2023 data also includes the number of chicks observed, but we do not have that data for 2022:


# Updated 2023 data with an extra column
herons_2023 <- tibble(
  location = c("Via Marina", "Palawan Way"),
  nests = c(10, 6),
  chicks = c(5, 3),
  year = 2023
)
          

Uneven Tables: Extra Column

Rows from 2022 don't have a chicks column, so bind_rows() fills in NA for those missing values


# Combine again
herons_all <- bind_rows(herons_2022, herons_2023)

print(herons_all)
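
A self-contained sketch of what the NA values mean for later calculations, using the same two tables:

```r
library(dplyr)

herons_2022 <- tibble(location = c("Via Marina", "Palawan Way"),
                      nests = c(14, 9), year = 2022)
herons_2023 <- tibble(location = c("Via Marina", "Palawan Way"),
                      nests = c(10, 6), chicks = c(5, 3), year = 2023)

herons_all <- bind_rows(herons_2022, herons_2023)

# Which rows are missing a chick count?
filter(herons_all, is.na(chicks))

# Totals must skip the NA values explicitly
sum(herons_all$chicks)                # NA
sum(herons_all$chicks, na.rm = TRUE)  # 8
```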
          

Crime Rates

After Daylight Saving Time

Environmental conditions and crime rates


Lead -> Health + Brain Development

Temperature -> Aggression

Crop Loss -> Civil Unrest

Sunlight -> Crime?

DST - DAYLIGHT SAVING TIME


Rule used for daylight saving time in the U.S.:

On the second Sunday in March, we move our clocks forward one hour
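
The rule can also be computed instead of typed in by hand. This is a sketch, not part of the project code: `second_sunday_march()` is a hypothetical helper written for illustration, built on lubridate's make_date() and wday().

```r
library(lubridate)

# Hypothetical helper: date of the second Sunday in March for a given year
second_sunday_march <- function(year) {
  march_first <- make_date(year, 3, 1)
  # With week_start = 7, wday() gives 1 for Sunday through 7 for Saturday
  days_to_first_sunday <- (8 - wday(march_first, week_start = 7)) %% 7
  # One more week after the first Sunday
  march_first + days_to_first_sunday + 7
}

second_sunday_march(2021)  # "2021-03-14"
second_sunday_march(2022)  # "2022-03-13"
```

Both results agree with the dates used in the project's case_when() step.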


Question:

After DST begins, does the sun set one hour later or one hour earlier than the day before?

PDT vs PST


PDT - Pacific Daylight Time

PST - Pacific Standard Time

Checklist

1. Set up R (e.g. libraries, clear environment, directory)

2. Import two FBI data files

3. Append data files (new!)

4. Clean up data, i.e. paste year, month, and day together

5. Create date variables (new!)

6. Calculate days to DST

7. Extract day of week and drop weekends

8. Make two-line scatterplot (and export it!)
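
Steps 4 to 7 can be sketched on a tiny made-up table. Everything here is an assumption for illustration: the `fbi` table, its `year`, `month`, and `day` columns, and the new names `date_text` and `days_to_dst` are hypothetical, and the real project data will look different.

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for the appended FBI data (steps 2 and 3)
fbi <- tibble(
  year  = c(2021, 2021, 2022, 2022),
  month = c(3, 3, 3, 3),
  day   = c(13, 16, 11, 14)
)

fbi <- mutate(fbi,
  # Step 4: paste year, month, and day together as text
  date_text = paste(year, month, day, sep = "-"),
  # Step 5: create the date variables
  fbi_date = as.Date(date_text, format = "%Y-%m-%d"),
  change_date = case_when(
    year == 2021 ~ as.Date("2021-03-14"),
    year == 2022 ~ as.Date("2022-03-13")
  ),
  # Step 6: days to DST (negative means before the change)
  days_to_dst = as.numeric(fbi_date - change_date),
  # Step 7: day of week, 1 = Sunday through 7 = Saturday
  dow = wday(fbi_date)
)

# Step 7 continued: drop weekends
fbi_nowkd <- filter(fbi, dow %in% 2:6)

print(fbi_nowkd)
```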

Final Result

R Advanced 4: Assignment 9


Will be published on Canvas today

The assignment is due next Monday

Due Date: Monday May 12, 2025 at 11:59 pm PT