ENVIRON-175

Programming with Big Environmental Datasets



Gleb Satyukov
Senior Research Engineer | Data Science Instructor


Slido

Attendance etc










https://app.sli.do/event/b8fioU6GZiEq8b7SCKuZw7

Slides


R Basics 1: https://environ-175.com/basics/1

R Basics 2: https://environ-175.com/basics/2

R Basics 3: https://environ-175.com/basics/3

R Basics 4: https://environ-175.com/basics/4

R Basics 5: https://environ-175.com/basics/5

Slides


R Advanced 1: https://environ-175.com/advanced/1

R Advanced 2: https://environ-175.com/advanced/2

R Advanced 3: https://environ-175.com/advanced/3

R Advanced 4: https://environ-175.com/advanced/4

R Advanced 5: https://environ-175.com/advanced/5

Slides


R Spatial 1: https://environ-175.com/spatial/1

R Spatial 2: https://environ-175.com/spatial/2

R Spatial 3: https://environ-175.com/spatial/3

R Spatial 4: https://environ-175.com/spatial/4

R Spatial 5: https://environ-175.com/spatial/5

Schedule

Agenda for today

ggplot2

Debugging

Reshaping data

Hexadecimal Colors

New library and functions!


            library(tidyr)

            pivot_wider()
            pivot_longer()
            

Reminder about the best practices


Attention to detail

Clean your environment

Use proper file paths

Use proper code spacing

Use inline and block comments!!

Use correct variable names (lowercase)

Save charts programmatically with ggsave
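A minimal sketch of the last point, assuming ggplot2 is installed. The names `toy_data`, `chart`, and `path_chart`, and the output file name, are placeholders invented for this illustration.

```r
library(ggplot2)

# Toy data for illustration only
toy_data <- data.frame(
  year = c(2010, 2011, 2012),
  value = c(2.3, 5.0, 3.6)
)

# Build the chart and keep it in a variable
chart <- ggplot(toy_data, aes(x = year, y = value)) +
  geom_line() +
  labs(x = "Year", y = "Value")

# Save the chart from code instead of exporting it by hand.
# tempdir() is used so the example runs anywhere; in a project
# this would point at your own output folder.
path_chart <- file.path(tempdir(), "toy_chart.png")
ggsave(filename = path_chart, plot = chart, width = 6, height = 4, dpi = 300)
```

Passing `plot = chart` explicitly avoids relying on whichever plot happened to be displayed last.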

Best Practices 2.0


Using Global Variables

Set a directory using path_main

Inspecting the data using head()

Keep data in a dedicated data folder

Be consistent with your use of quotes (' vs ")

And more best practices coming soon!

Follow instructions in the assignments exactly
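The points about path_main, a dedicated data folder, and head() might look like this in practice. This is only a sketch: the folder location, the file name `toy_exports.csv`, and the variable names other than `path_main` are assumptions made for the example.

```r
# Global variable holding the project directory
# (tempdir() is a stand-in so the example runs on any machine)
path_main <- tempdir()

# Keep data in a dedicated data folder inside the project directory
path_data <- file.path(path_main, "data")
dir.create(path_data, showWarnings = FALSE)

# Write a small toy file so there is something to import
path_file <- file.path(path_data, "toy_exports.csv")
write.csv(
  data.frame(country = c("Italy", "UK"), exports = c(2.3, 4.4)),
  file = path_file,
  row.names = FALSE
)

# Import the data and inspect the first rows
toy_exports <- read.csv(path_file)
head(toy_exports)
```

Building every path from `path_main` with file.path() means only one line has to change when the project moves to another computer.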

Slido

Legend

Criticisms of Using a Legend


  • Requires eye movement between plot and legend
  • Forces mental matching between legend entries and visual marks
  • Often inaccessible to colorblind users
  • Takes up valuable space that could be used for the actual data
  • Inconsistent use across plots can confuse viewers

Direct labeling with geom_text()
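One possible way to label lines directly, sketched with toy data. The data frame, its column names, and the nudge and axis limit values are all assumptions chosen for this example.

```r
library(ggplot2)

# Toy data for illustration only
toy_data <- data.frame(
  year = rep(2010:2012, times = 2),
  value = c(2.3, 5.0, 3.6, 4.4, 1.0, 2.9),
  group = rep(c("Italy", "UK"), each = 3)
)

# Keep only the last year of each line; the labels are placed there
label_data <- toy_data[toy_data$year == max(toy_data$year), ]

chart <- ggplot(toy_data, aes(x = year, y = value, color = group)) +
  geom_line() +
  # Write the group name next to the end of its own line
  geom_text(
    data = label_data,
    aes(label = group),
    hjust = 0,
    nudge_x = 0.05
  ) +
  # Leave room on the right so the labels are not cut off
  scale_x_continuous(breaks = 2010:2012, limits = c(2010, 2012.5)) +
  # The labels replace the legend
  theme(legend.position = "none") +
  labs(x = "Year", y = "Value")
```

Because each label sits next to its line, the reader no longer has to match colors against a separate legend.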

Coercion (casting)

Coercion refers to the implicit or explicit conversion of an object's type (class) to another, often to ensure compatibility with a function or operation.

For example, converting from Character to Numeric:


        text <- "10"
        number <- as.numeric(text)

        class(number)
        # [1] "numeric"

        numbers <- c("1", "2", "3")
        real_numbers <- as.numeric(numbers)

        class(real_numbers)
        # [1] "numeric"
          

Note


If character values are not valid numbers, converting to numeric will return NA and show you a warning:


        as.numeric("asdf")
        # [1] NA
        # Warning message:
        # NAs introduced by coercion
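The examples so far show explicit coercion. A short sketch of implicit coercion, plus one way to find the entries that failed to convert, is given below. The variable names are invented for the example.

```r
# Implicit coercion: c() silently converts everything to one common type
mixed <- c(1, "a", TRUE)
class(mixed)
# [1] "character"

# Logical values become 0 and 1 when used in arithmetic
sum(c(TRUE, FALSE, TRUE))
# [1] 2

# Explicit coercion: invalid entries become NA, so check for them afterwards
raw_values <- c("10", "20", "asdf")
clean_values <- suppressWarnings(as.numeric(raw_values))

# Which entries could not be converted?
raw_values[is.na(clean_values)]
# [1] "asdf"
```

suppressWarnings() is used here only because the NA is expected and checked on the next line; in general the warning is useful and should not be hidden.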
          

Coercion (casting)

Converting from Numeric to Character:


        numbers <- c(10, 20, 30)
        text <- as.character(numbers)
        # [1] "10" "20" "30"

        as.character("asdf")
        # [1] "asdf"

        as.character(FALSE)
        # [1] "FALSE"
          

Slido


What does as.character(NA) return?











Troubleshooting / Debugging


Admiral Grace Hopper



First mention of debugging:

https://en.wikipedia.org/wiki/Debugging

https://en.wikipedia.org/wiki/Troubleshooting

ggplot2: Grammar of Graphics


The first argument of ggplot() is always the data

aes = aesthetic

geom = geometry

labs = labels

...

R Basics 4: https://environ-175.com/basics/4/#/30
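The building blocks listed above can be combined as in the sketch below. It uses `airquality`, a dataset that ships with R, as a stand-in for course data; the axis labels and title are choices made for this example.

```r
library(ggplot2)

# airquality: daily air quality measurements in New York, May to September 1973
chart <- ggplot(data = airquality, aes(x = Temp, y = Ozone)) +  # data + aesthetics
  geom_point() +                                                # geometry
  labs(                                                         # labels
    x = "Temperature (degrees F)",
    y = "Ozone (ppb)",
    title = "Ozone vs. temperature"
  )
```

Some Ozone readings are missing, so printing the chart drops those rows and shows a warning.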

What's Wrong With This?



        ggplot(data = so2_data, aes(x = year, y = value)) +
          geom_line(
            data = filter(so2_data, dist_group == "Close"),
            aes(color = "Close")
          ) +

          geom_point(
            data = filter(so2_data, dist_group == "Close"),
            aes(color = "Close")
          ) +

          labs(x = "Year", y = "SO2 Value")
          
  • We are mapping the string "Close" to the color aesthetic
  • "Close" is a string, not a variable — it is treated as a constant
  • This tells R to use the same color for all lines labeled "Close"
  • ggplot2 treats this like a new group and creates a legend entry
  • This is only helpful for adding a fixed legend label for a specific highlight

Correct Example


        ggplot(so2_data, aes(x = year, y = value, color = dist_group)) +
          geom_line() +
          geom_point() +
          labs(x = "Year", y = "SO2 Value")
          
  • This way color is mapped to the actual variable dist_group
  • Each group gets its own color and line
  • Legend entries are automatically created
  • Standard way to distinguish groups in a plot
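To try this without the course data, here is a runnable sketch with a toy stand-in for so2_data. The values, the second group name "Far", and the legend title are invented for illustration.

```r
library(ggplot2)

# Toy stand-in for so2_data (values invented for illustration)
so2_toy <- data.frame(
  year = rep(2000:2002, times = 2),
  value = c(10, 8, 7, 6, 6, 5),
  dist_group = rep(c("Close", "Far"), each = 3)
)

# Color is mapped to the variable, so each group gets its own color
chart <- ggplot(so2_toy, aes(x = year, y = value, color = dist_group)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "SO2 Value", color = "Distance group")

# Count the distinct colors ggplot2 assigned to the line layer
n_colors <- length(unique(layer_data(chart, 1)$colour))
n_colors
# [1] 2
```

layer_data() returns the data ggplot2 computed for one layer, which is a handy way to check what a mapping actually did when a chart does not look as expected.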

Hexadecimal Colors

Base 16


Counting continues with letters A through F

Each digit runs from 0 (darkest) to F (lightest)


0 1 2 3 4 5 6 7 8 9 A B C D E F
          

Why 16 values?


16 * 16 = 256


1 bit = 0 or 1

1 byte = 8 bits

Each byte has 2 ^ 8 = 256 unique options

(Bit and Byte Explained in 6 Minutes)

Some Examples


#000000 - black

#FFFFFF - white

#FF0000 - red

#00FF00 - green

#0000FF - blue

#2774AE - UCLA Blue
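The examples above can be checked in R. The sketch below uses only base R functions (strtoi, col2rgb, rgb) and reads each pair of hex digits as one byte for red, green, and blue; the variable name `ucla_blue` is just for the example.

```r
# One pair of hex digits is one byte: a value from 0 to 255
strtoi("00", base = 16L)
# [1] 0
strtoi("FF", base = 16L)
# [1] 255

# Split UCLA Blue (#2774AE) into its red, green, and blue parts
strtoi(c("27", "74", "AE"), base = 16L)
# [1]  39 116 174

# col2rgb() does the same split for a full color code
ucla_blue <- col2rgb("#2774AE")

# rgb() goes the other way, from the three values back to a hex code
rgb(39, 116, 174, maxColorValue = 255)
# [1] "#2774AE"
```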

Colors


Make sure you are using an appropriate palette

https://colormoods.co/
https://colorbrewer2.org/
https://coolors.co/generate
HTML Color Codes
Image Color Picker

NIH: Types of Color Vision Deficiency
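As one possible starting point, base R (version 4.0.0 or later) ships the Okabe-Ito palette, which was designed to remain distinguishable under common forms of color vision deficiency. The sketch below applies it with scale_color_manual(); the toy data and variable names are assumptions for this example, and the links above offer many other palettes.

```r
library(ggplot2)

# Take the first three Okabe-Ito colors as plain hex codes.
# unname() matters: a named vector would be matched against group names.
okabe_ito <- unname(palette.colors(n = 3, palette = "Okabe-Ito"))

# Skip the first entry (black) and use the next two colors
group_colors <- okabe_ito[2:3]

# Toy data for illustration only
toy_data <- data.frame(
  year = rep(2010:2012, times = 2),
  value = c(2.3, 5.0, 3.6, 4.4, 1.0, 2.9),
  group = rep(c("Italy", "UK"), each = 3)
)

chart <- ggplot(toy_data, aes(x = year, y = value, color = group)) +
  geom_line() +
  scale_color_manual(values = group_colors) +
  labs(x = "Year", y = "Value", color = "Country")
```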

Slido

ROYGBIV










Brexit










Hypothesis


Brexit raised trade costs to other European countries

This should reduce exports from the UK to Europe

Brexit might also have reduced trade with non-EU countries, because the EU had trade agreements with those countries that the UK was no longer part of

Impacts of international trade


1. Leads to specialization in labor markets

2. Has a big impact on the environment


Transporting goods over long distances adds to global pollution levels

Some countries have lax environmental regulations, so importing goods from abroad could "offshore" pollution to poorer countries

How did Brexit affect exports?


We will use WTO data for the years 2010 through 2022

This data includes exports for most countries

https://stats.wto.org/

WTO Data

Exports by Country and Year

This data is in a so-called long format:

country year exports
Italy 2010 2.3
Italy 2011 5.0
Italy 2012 3.6
UK 2010 4.4
UK 2011 1.0
UK 2012 2.9

Exports by Country and Year


This data is in a so-called wide format:

country 2010 2011 2012
Italy 2.3 5.0 3.6
UK 4.4 1.0 2.9

Exports by Year and Country

This data is also in a wide format:

year Italy UK ...
2010 2.3 4.4 ...
2011 5.0 1.0 ...
2012 3.6 2.9 ...
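Both wide layouts can be produced from the same long table with pivot_wider(). This is a sketch using the toy numbers from these slides; the data frame names are invented for the example.

```r
library(tidyr)

# The long-format table from the slides
exports_long <- data.frame(
  country = rep(c("Italy", "UK"), each = 3),
  year = rep(2010:2012, times = 2),
  exports = c(2.3, 5.0, 3.6, 4.4, 1.0, 2.9)
)

# One row per country, one column per year
wide_by_year <- pivot_wider(
  exports_long,
  names_from = year,
  values_from = exports
)

# One row per year, one column per country
wide_by_country <- pivot_wider(
  exports_long,
  names_from = country,
  values_from = exports
)
```

`names_from` picks the column whose values become the new column names, and `values_from` picks the column that fills the cells.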

WTO Data - Wide Format

Note


You can conduct analysis on both long and wide format data, but long format is almost always preferred

People often store data in wide format, and convert it to long format after importing

The process for converting the data from wide to long is known as "reshaping"

Exports by Country and Year


We need to specify what the column names (not the values) represent

In this case, the column names 2010, 2011, and 2012 represent "year"

country 2010 2011 2012
Italy 2.3 5.0 3.6
UK 4.4 1.0 2.9

Exports by Country and Year


And the values in the cells represent "exports"

country 2010 2011 2012
Italy 2.3 5.0 3.6
UK 4.4 1.0 2.9

Exports by Country and Year

This is our data after reshaping it from wide to long:

country year exports
Italy 2010 2.3
Italy 2011 5.0
Italy 2012 3.6
UK 2010 4.4
UK 2011 1.0
UK 2012 2.9
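The reshape from wide to long can be sketched with pivot_longer(). The data frame names are invented for the example, and `names_transform` is an optional extra that turns the year labels into integers.

```r
library(tidyr)

# The wide-format table from the slides
# (check.names = FALSE keeps "2010" etc. as column names)
exports_wide <- data.frame(
  country = c("Italy", "UK"),
  `2010` = c(2.3, 4.4),
  `2011` = c(5.0, 1.0),
  `2012` = c(3.6, 2.9),
  check.names = FALSE
)

exports_long <- pivot_longer(
  exports_wide,
  cols = -country,                          # reshape every column except country
  names_to = "year",                        # what the column names represent
  values_to = "exports",                    # what the cell values represent
  names_transform = list(year = as.integer) # store year as a number, not text
)
```

Without `names_transform`, year would be a character column, which is a common source of trouble when it is later used on a plot axis.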

Checklist


1. Clean the Environment

2. Load all required libraries

3. Import international trade data

4. Reshape year columns from wide to long

5. Collapse-sum exports by country and year

6. Rescale our exports variable

7. Make two-line scatterplot (and export/save it!)
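The checklist could be sketched end to end as follows. Everything specific here is an assumption for illustration: a toy table stands in for the WTO file, the `product` column and the unit conversion (millions to billions) are invented, and the chart is saved to a temporary folder.

```r
# 1. Clean the environment
rm(list = ls())

# 2. Load all required libraries
library(dplyr)
library(tidyr)
library(ggplot2)

# 3. Import international trade data
#    (toy table in place of the real file, which would be read with read.csv())
trade_wide <- data.frame(
  country = c("Italy", "Italy", "UK", "UK"),
  product = c("A", "B", "A", "B"),
  `2010` = c(1000, 1300, 2400, 2000),
  `2011` = c(2000, 3000, 400, 600),
  `2012` = c(1600, 2000, 1400, 1500),
  check.names = FALSE
)

# 4. Reshape year columns from wide to long
trade_long <- pivot_longer(
  trade_wide,
  cols = c(`2010`, `2011`, `2012`),
  names_to = "year",
  values_to = "exports",
  names_transform = list(year = as.integer)
)

# 5. Collapse-sum exports by country and year
trade_sum <- trade_long %>%
  group_by(country, year) %>%
  summarise(exports = sum(exports), .groups = "drop")

# 6. Rescale the exports variable (assumed here: millions to billions)
trade_sum <- trade_sum %>%
  mutate(exports = exports / 1000)

# 7. Make a two-line scatterplot and save it
chart <- ggplot(trade_sum, aes(x = year, y = exports, color = country)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Exports (billions)", color = "Country")

path_chart <- file.path(tempdir(), "exports_by_country.png")
ggsave(filename = path_chart, plot = chart, width = 6, height = 4)
```

With these toy numbers, the collapsed table matches the small Italy and UK example used on the earlier slides.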

Diff-in-diff


Difference in differences

https://en.wikipedia.org/wiki/Difference_in_differences

We need to compare UK trade with that of another country (the control group) that had a similar trend before Brexit (the treatment)

The hard part is choosing the correct control group

Maybe Ireland?
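The arithmetic behind difference in differences can be sketched in a few lines. The numbers below are invented purely to show the calculation and are not real trade figures.

```r
# Average exports before and after Brexit (numbers invented for illustration)
uk_pre <- 100
uk_post <- 90
control_pre <- 80
control_post <- 85

# First difference: change over time within each country
uk_change <- uk_post - uk_pre
control_change <- control_post - control_pre

# Second difference: UK change relative to the control group's change
did_estimate <- uk_change - control_change
did_estimate
# [1] -15
```

The control group's change stands in for what would have happened to the UK without Brexit, which is why the choice of control group matters so much.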

Italy as Control Group

Final Result

R Advanced 3: Assignment 8


Will be published on Canvas today

The assignment is due this Friday

Due Date: Friday May 9, 2025 at 11:59 pm PT