Jekyll2021-08-13T21:02:41+00:00https://swampthingecology.org/blog/feed.xmlPaul Julian II, PhDEcologist, Wetland Biogeochemist, Data-scientist, lover of Rstats.South Florida Estuaries2021-08-13T00:00:00+00:002021-08-13T00:00:00+00:00https://swampthingecology.org/blog/south-florida-estuaries<script src="https://swampthingecology.org/blog/knitr_files/2021-08-13-Estuary_files/header-attrs-2.7/header-attrs.js"></script>
<section class="main-content">
<p><strong>Keywords:</strong> Okeechobee, water level, LOSOM</p>
<p>Estuaries are the transition zone from fresh to salt water, and to function properly they require a balance of freshwater inputs. These inputs create a gradient of fresh to saline waters that is important for the composition and distribution of vegetative communities and wildlife. However, too much freshwater in the estuary can be detrimental to the health of the ecosystem.</p>
<p>Here we have two estuaries and their upstream watersheds: the Caloosahatchee and St Lucie river estuaries and watersheds. Both are connected to Lake Okeechobee thanks to the landscape modifications of the early 1900s.</p>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2021-08-13-Estuary_files/figure-html/unnamed-chunk-1-1.png" alt="Caloosahatchee and St Lucie watersheds." />
<p class="caption">
Caloosahatchee and St Lucie watersheds.
</p>
</div>
<p>The Caloosahatchee estuary receives freshwater either from the upstream watershed in the form of runoff or from discharges from Lake Okeechobee. Similarly, the St Lucie Estuary (on the east coast) receives either basin runoff from a comparatively smaller watershed or discharges from the Lake. The St Lucie river has the added benefit of being able to return water to Lake Okeechobee (AKA backflow) rather than discharging it to the Estuary. While this backflow may prevent extreme or damaging discharges to the St Lucie Estuary, it is detrimental to lake ecology by adding nutrients to an already nutrient-rich system. In extreme cases some water can also be returned to the lake from the Caloosahatchee via S-77, but this is rare.</p>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2021-08-13-Estuary_files/figure-html/unnamed-chunk-3-1.png" alt="May 1978 - Apirl 2021 peroid of record annual mean discharge volume from source within the Caloosahatchee and St Lucie watersheds to each estuary. Values above bars are average discharge volume in Ac-Ft WY^-1^." />
<p class="caption">
May 1978 - April 2021 period of record annual mean discharge volume from sources within the Caloosahatchee and St Lucie watersheds to each estuary. Values above bars are average discharge volume in Ac-Ft WY<sup>-1</sup>.
</p>
</div>
<p>Estuaries rely on a balance of fresh and saline water. Here in the Caloosahatchee River Estuary the main source of freshwater is the S-79 structure. As discharges from this structure increase, the estuary becomes fresher, with discharges above 2,600 cfs (~5,157 Ac-Ft d<sup>-1</sup>) causing significant damage to marine species.</p>
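<p>For reference, the unit conversion above is straightforward to check in <code>R</code> (a quick back-of-the-envelope sketch, not part of the original analysis):</p>
<pre><code># Convert a discharge in cubic feet per second (cfs) to acre-feet per day
cfs_to_acftday <- function(q_cfs){
  q_cfs * 86400 / 43560  # 86,400 seconds per day; 43,560 cubic feet per acre-foot
}
cfs_to_acftday(2600)     # ~5,157 Ac-Ft d-1</code></pre>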
<p><img src="https://swampthingecology.org/blog\images\20210813_Estuary\Surf_bot_da_ff_GAM.gif" width="75%" style="display: block; margin: auto;" /></p>
<hr />
</section>Water Level Limbo2021-07-13T00:00:00+00:002021-07-13T00:00:00+00:00https://swampthingecology.org/blog/water-level-limbo<script src="https://swampthingecology.org/blog/knitr_files/2021-07-14-LakeLevel_files/header-attrs-2.7/header-attrs.js"></script>
<section class="main-content">
<div id="finding-lake-os-water-level-sweet-spot" class="section level2">
<h2><em>Finding Lake O’s Water Level Sweet Spot</em></h2>
<p><strong>Keywords:</strong> Okeechobee, water level, LOSOM</p>
<p>Original article published as a <a href="http://www.sccf.org/news/blog/finding-lake-os-water-level-sweet-spot" target="_blank">SCCF</a> Wednesday Update.</p>
<hr />
<p>When it comes to water, Florida has two seasons: wet and dry. This seasonality in rainfall causes water levels within lakes and wetlands to fluctuate with large seasonal and within-year (intra annual) variability that can affect ecosystem function and structure. This is especially true for Lake Okeechobee, the largest freshwater lake in the Southeast. Lake Okeechobee supports a large freshwater recreational fishery and is an integral part of South Florida’s hydroscape.</p>
<p>Lake Okeechobee is the beating heart of the Everglades. Throughout the seasons, water levels rise and fall, providing freshwater to the downstream Everglades and coastal estuaries. Historically, this large, shallow lake would overflow to the south and west during periods of high water, providing a sheetflow of water to the south into the Everglades. To the west, Lake Okeechobee’s littoral zone (nearshore area from the high-water line occupied by wetlands) adapted to high and low water events by expanding and contracting across the landscape. As extensive levee, canal, and lock/gate systems were constructed around the lake in the early and mid-1900s, the lake became encircled by infrastructure. This infrastructure ultimately isolated the lake’s littoral zone to a fraction of its historic size.</p>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog\images\20210714_Lake\cochran.jpg" alt="Photo Credit: Charles Hanlon, South Florida Water Management District Edge of the littoral zone at Cochrans Pass (April 2021)." width="50%" />
<p class="caption">
Edge of the littoral zone at Cochrans Pass (April 2021). Photo Credit: Charles Hanlon, South Florida Water Management District.
</p>
</div>
<p>Since being impounded more than 60 years ago, water levels within Lake Okeechobee have been managed in part by the U.S. Army Corps of Engineers through a series of regulation schedules. Under these regulation schedules, Lake Okeechobee has experienced periods of extreme high and low water conditions, which have resulted in detrimental effects to the near-shore ecosystems.</p>
<p>Across the lake, high water levels can negatively impact plants and animals due to deep water, flooding of the littoral zone, erosion, nutrient transport, and algae blooms. Low water levels can dry out portions of the littoral zone, spread exotic species, impact the fishery, and reduce prey for snail kites and alligators. Moderate water levels—“the sweet spot”—can provide optimal recruitment of the fishery, provide ample forage for wading birds and snail kites, and allow for plants to proliferate across the lake.</p>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog\images\20210714_Lake\combo_ZachWelch_labels.png" alt="The littoral zone of Lake Okeechobee is composed of a variety of different plant species all adapted for different growing conditions. From left to right: an upper broadleaf marsh system, mid-level floating leaf marsh, and an outer submerged and emergent vegetation marsh. Photo Credit: Zach Welch" width="100%" />
<p class="caption">
The littoral zone of Lake Okeechobee is composed of a variety of different plant species all adapted for different growing conditions. From left to right: an upper broadleaf marsh system, mid-level floating leaf marsh, and an outer submerged and emergent vegetation marsh. Photo Credit: Zach Welch
</p>
</div>
<p>Efforts are underway to revisit the Lake Okeechobee System Operating Manual and revise how water is managed for Lake Okeechobee. With the recent completion of the Herbert Hoover Dike improvements, a greater range of water levels can be exercised in Lake Okeechobee. Currently, all of the alternative plans proposed by the U.S. Army Corps of Engineers, which manages the lake, intend to keep the lake higher than existing conditions. (However, as previously noted, high water levels can have harmful effects on the lake’s overall health.) Some plans add an extra foot to water levels and reduce the amount of time the lake spends within the moderate “sweet spot” range.</p>
<p>High and low water levels can be tolerated for short periods of time, and sometimes are needed for the ecology of the lake, but prolonged and extreme events can cause significant ecological consequences and impact water quality within the lake. How a prospective plan manages these events is important. Ultimately, a plan that optimizes water levels for the ecology of the lake is needed—and this has to be done in a way that minimizes the damaging discharges to coastal estuaries.</p>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog\images\20210714_Lake\LOK_rayshader.png" alt="Lake Okeechobee at 9 and 17 feet National Geodetic Vertical Datum (NGVD) of 1929. Water levels on the lake are measured relative to sea level, NGVD 1929 is the vertical datum used by the South Florida Water Management District and U.S. Army Corps of Engineers." width="100%" />
<p class="caption">
Lake Okeechobee at 9 and 17 feet National Geodetic Vertical Datum (NGVD) of 1929. Water levels on the lake are measured relative to sea level, NGVD 1929 is the vertical datum used by the South Florida Water Management District and U.S. Army Corps of Engineers.
</p>
</div>
<p><strong>Hydrologic Modeler Paul Julian’s position is funded jointly by SCCF and The Conservancy of Southwest Florida.</strong></p>
<hr />
<p>Want to know more about SCCF? Visit our <a href="http://www.sccf.org/" target="_blank">webpage</a>.</p>
<p>More information on LOSOM can be found at the USACE LOSOM project <a href="https://www.saj.usace.army.mil/LOSOM/" target="_blank">webpage</a></p>
</div>
</section>Evaluating Algal Dynamics within the Okeechobee-Caloosahatchee System2021-06-30T00:00:00+00:002021-06-30T00:00:00+00:00https://swampthingecology.org/blog/evaluating-algal-dynamics-within-the-okeechobee-caloosahatchee-system<script src="https://swampthingecology.org/blog/knitr_files/2021-06-30-Algae_files/header-attrs-2.7/header-attrs.js"></script>
<section class="main-content">
<div id="its-not-easy-being-green-evaluating-algal-dynamics-within-the-okeechobee-caloosahatchee-system" class="section level2">
<h2><em>It’s Not Easy Being Green: Evaluating Algal Dynamics within the Okeechobee-Caloosahatchee System</em></h2>
<p><strong>Keywords:</strong> Okeechobee, Caloosahatchee, Algae</p>
<p>Original article published as a <a href="http://www.sccf.org/our-work/wednesday-update" target="_blank">SCCF</a> Wednesday Update.</p>
<hr />
<p><img src="https://swampthingecology.org/blog\images\20210630_Algae\Franklin Lock 5.19.21 long shot.jpg" width="50%" style="display: block; margin: auto;" /></p>
<p>As summertime temperatures begin to warm and seasonal rains sweep across Southwest Florida, you may notice a change in conditions on the waterways. During this time of the year, the occurrence of algae within Lake Okeechobee and the Caloosahatchee River becomes more noticeable.</p>
<p>Visually, algae blooms can appear as streaks of green, discolored water, or floating mats of green, blue, and white, depending on the species. Under the right conditions, blooms of some algae species, such as Microcystis (in freshwater) or Karenia brevis (in saltwater), can be classified as harmful algal blooms (HABs) that produce toxins which kill fish and other sea life. Other algae are nontoxic but can also lead to fish kills and impact benthic communities by consuming dissolved oxygen and changing the color of the water.</p>
<p>Over the past two decades, algal biomass (measured as suspended chlorophyll-a in the water) has significantly increased at Franklin Lock (S-79). This increase in algal biomass is important as the S-79 structure is fed by both Lake Okeechobee and the upstream C-43 canal as they discharge freshwater to the Caloosahatchee River estuary.</p>
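<p>As an aside, a trend like this is often screened with a non-parametric test such as Mann-Kendall. Here is a rough sketch of what that looks like in <code>R</code>, using synthetic annual chlorophyll-a values purely for illustration (this is not the SCCF analysis or the actual S-79 data):</p>
<pre><code># Synthetic annual mean chlorophyll-a (ug/L) with an upward drift, for illustration only
set.seed(1)
chla <- data.frame(year = 2000:2020)
chla$chla_ugL <- 4 + 0.15*(chla$year - 2000) + rnorm(nrow(chla), 0, 0.5)

# Kendall's tau between time and concentration is equivalent to a Mann-Kendall trend test
with(chla, cor.test(year, chla_ugL, method = "kendall"))</code></pre>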
<p>Increased nutrient (nitrogen and phosphorus) loading has been identified as a major factor contributing to an increase in algal blooms in the lake and estuaries. However, within the Okeechobee-Caloosahatchee system, no one thing can be singled out as the ultimate driver of algae; rather, it’s a combination of several factors. Algal growth and bloom proliferation can be driven by several factors: light availability (how much light travels through the water column); water temperature; nutrient concentration; and hydrology (water level and discharge).</p>
<p><img src="https://swampthingecology.org/blog\images\20210630_Algae\Algae Franklin Lock 5.19.21.png" width="50%" style="display: block; margin: auto;" /></p>
<p>Currently underway, the Lake Okeechobee System Operating Manual (LOSOM) planning effort intends to change how water is managed for Lake Okeechobee. A specific topic of interest is understanding how the different water management schemes will affect the risk of algal bloom formation and transport within the Caloosahatchee and St Lucie estuaries. This metric is important to reduce the potential risk of HABs within our local waters which can lead to primary effects—fish kills and human health impacts—and secondary issues, such as environmental degradation and negative impacts on the local economy.</p>
<p>To evaluate algal bloom risk to the estuaries, the U.S. Army Corps of Engineers (USACE) will compare discharges from Lake Okeechobee during the time of the year where algal bloom potential is highest (June – August). This evaluation is based on the concept of moving water with algae from Lake Okeechobee along the C-43 canal to the Caloosahatchee estuary. Based on the available data, an algal biomass transport hypothesis from the lake to the estuary does not paint the entire picture. Other processes contribute to algae bloom formation and transport within the Okeechobee-Caloosahatchee system.</p>
<p>As part of the LOSOM planning effort, SCCF provided these recommendations: developing a more robust monitoring network to assess changes in algae; evaluating algal bloom potential relative to the amount of time water moves from the lake to the estuary; and including other factors, such as temperature and light availability. Ultimately, our goal is to develop an operations plan that reduces the risk of algal blooms in the estuaries and balances the needs of the Caloosahatchee and St Lucie estuaries, Lake Okeechobee, and the Southern Everglades to improve the ecology and sustainability of our system.</p>
<p>By evaluating the existing science, assessing the LOSOM alternatives, and studying nutrient loading from Lake Okeechobee and the upstream basin and the resulting loads to the estuary, we are gaining a better understanding of algal dynamics within the Okeechobee-Caloosahatchee system. As water management changes for Lake Okeechobee, we continue to develop our understanding of algal and nutrient dynamics to inform management and policy decisions.</p>
<p><strong>Hydrologic Modeler Paul Julian’s position is funded jointly by SCCF and The Conservancy of Southwest Florida.</strong></p>
<hr />
<p>More information on LOSOM can be found at the USACE LOSOM project <a href="https://www.saj.usace.army.mil/LOSOM/" target="_blank">webpage</a></p>
</div>
</section>Legacy Phosphorus In Lake Okeechobee2021-04-22T00:00:00+00:002021-04-22T00:00:00+00:00https://swampthingecology.org/blog/legacy-phosphorus-in-lake-okeechobee<script src="https://swampthingecology.org/blog/knitr_files/2021-04-22-GEER_files/header-attrs-2.7/header-attrs.js"></script>
<section class="main-content">
<p><strong>Keywords:</strong> Okeechobee, Phosphorus, Sediments</p>
<p><img src="https://swampthingecology.org/blog\images\20210422_GEER\GEER.jpg" width="50%" style="display: block; margin: auto;" /></p>
<p>As presented at the Greater Everglades Ecosystem Restoration Conference 2021.</p>
<p>Here’s the almost 15 minute presentation I gave at the GEER conference on Thursday April 22, 2021 in the <em>Ecological Processes in Lake Okeechobee</em> session, moderated by Todd Z Osborne (University of Florida Whitney Lab) and Paul Jones (South Florida Water Management District).</p>
<p>Enjoy!!</p>
<p><img src="https://swampthingecology.org/blog\images\20210422_GEER\Julian_titleslide.png" width="50%" style="display: block; margin: auto;" /></p>
<p><a href="https://youtu.be/avazGnAAPco" target="_blank">GEER 2021 Presentation - Click to Watch!</a></p>
<center>
I’m Calling To You Like A Long Lost Friend: Legacy Phosphorus In Lake Okeechobee <em>Click to watch (it will redirect to YouTube)</em>.
</center>
<ul>
<li><p>Nutrient inputs are highly <em>variable</em>, driven by upstream sources, with potential impacts to downstream systems</p></li>
<li><p>Within-lake trends in nutrients are <em>variable</em>, driven by ecosystem-specific factors and water-sediment feedback mechanisms.</p></li>
</ul>
<p><img src="https://swampthingecology.org/blog\images\20210422_GEER\TP_SW_GAM.gif" width="25%" style="display: block; margin: auto;" /></p>
<ul>
<li>Spatial and temporal trends in lake sediment TP concentrations are apparent with spatial trends mirroring the water column.</li>
</ul>
<p><img src="https://swampthingecology.org/blog\images\20210422_GEER\LakeOSed_GAM.gif" width="25%" style="display: block; margin: auto;" /></p>
<ul>
<li>The difference from input to output and water column to sediment is largely due to (high) internal loading.</li>
</ul>
<p>The words of the talk were <em>variable</em> and <em>dynamic</em>; whilst these words are often overused, they adequately describe the water quality and overall biogeochemical cycling of nutrients in Lake Okeechobee.</p>
<hr />
<p><strong>Abstract:</strong> Lake Okeechobee displays many features of a shallow, polymictic lake including frequent mixing of the water column and resuspension of unconsolidated sediments, and internal loading of nutrients to name a few. Additionally, the Lake has characteristically high phosphorus (P) loading due to changes in land use and drainage patterns upstream. The lake provides essential ecosystem services in the form of water supply, flood protection, navigation, and recreation, as well as vital habitat for south Florida’s flora and fauna. However, these values are threatened by current and historic excessive inputs of P influencing endo- and exogenic processes leading to fish-kills, hypoxic events, algal blooms, and degraded aquatic habitat.</p>
<p>Over the last decade and a half, nutrient loading to the lake has significantly increased. Utilizing the long-term ambient monitoring network, this study evaluated water column total nitrogen (TN), total P (TP), and chlorophyll-a (Chl-a) concentrations over 23 years (May 1996 – April 2020). Water quality trends across Lake Okeechobee varied spatially, with significantly declining trends in TN and Chl-a and increasing trends in TP. Coupled with these trends, the lake has notable water column nutrient gradients. Lake sediments are a long-term integrator of ecosystem conditions; over the last 30 years, four lake sediment surveys have been completed. Using data from these surveys, sediment N and P concentrations were evaluated both spatially and temporally to assess the change in sediment nutrients throughout the Lake. Despite the lake’s shallow bathymetry and the occurrence of frequent mixing events (i.e. high winds, hurricanes, drought), lake sediments have remained relatively stable, although notable shifts in sediment TP and TN concentrations have been observed.</p>
<p>The nutrient balance of Lake Okeechobee and the understanding of endo- and exogenic drivers of nutrient mobilization are important to aid in the restoration of the Lake and the Greater Everglades. As restoration activities progress, it is expected that nutrient inputs to the lake will decline. However, given the volume of N and P stored in the lake’s sediments, internal loading could result in delayed improvements to nutrient concentrations within the Lake. Despite the potential for delayed results, continued study and restoration activities are crucial to preserving our long-lost friend.</p>
<hr />
</section>Changes to CRS in R2021-01-21T00:00:00+00:002021-01-21T00:00:00+00:00https://swampthingecology.org/blog/changes-to-crs-in-r<script src="https://swampthingecology.org/blog/knitr_files/2021-01-21-CRS_files/header-attrs-2.6/header-attrs.js"></script>
<section class="main-content">
<p><strong>Keywords:</strong> R, spatial data, coordinates</p>
<pre><code>There is nothing more deceptive than an obvious fact.
- Arthur Conan Doyle, The Boscombe Valley Mystery</code></pre>
<p>If you haven’t heard already, big changes are afoot in the <a href="https://rspatial.org/" target="_blank">R-spatial</a> community. <img src="https://swampthingecology.org/blog\images\20210121_CRS\sherlock_afoot.gif" width="50%" style="display: block; margin: auto;" /></p>
<p>…if you were/are like me, you experienced a mix of emotions. But not to worry, there are loads of resources and a lot of really smart people working the issues right now.</p>
<p><img src="https://swampthingecology.org/blog\images\20210121_CRS\sherlock_shcoked.gif" width="40%" style="display: block; margin: auto;" /></p>
<p>…so expect lots of blog posts and resources.</p>
<hr />
<p>The CliffsNotes version (the short, short version) is that changes in how coordinate reference systems (CRS) are represented have finally caught up with how spatial data is handled in R packages (or maybe it’s the other way around). In a vignette titled <a href="https://rgdal.r-forge.r-project.org/articles/CRS_projections_transformations.html" target="_blank"><em>“Why have CRS, projections and transformations”</em></a>, <a href="https://twitter.com/RogerBivand" target="_blank">Roger Bivand</a> explains the nitty gritty.</p>
<p>Here are some more resources:</p>
<ul>
<li>YouTube lecture by Roger Bivand (<a href="https://youtu.be/2H1Tn4oN32M" target="_blank">link</a>)</li>
<li>Associated material (<a href="https://rsbivand.github.io/ECS530_h20/ECS530_III.html" target="_blank">link</a>)</li>
<li>Bivand, R.S. Progress in the R ecosystem for representing and handling spatial data. J Geogr Syst (2020). <a href="https://doi.org/10.1007/s10109-020-00336-0" target="_blank">https://doi.org/10.1007/s10109-020-00336-0</a></li>
</ul>
<p>Roger also penned a post explaining the migration specifics for the <code>rgdal</code>, <code>sp</code> and <code>raster</code> packages with respect to reading, writing, projecting, and transforming objects using PROJ strings (<a href="https://cran.r-project.org/web/packages/rgdal/vignettes/PROJ6_GDAL3.html#Migration_to_PROJ6GDAL3" target="_blank"><em>“Migration to PROJ6/GDAL3”</em></a>). It gets rather complex but is a good resource.</p>
<p>Another resource I came across in my sleuthing and troubleshooting is a post by Edzer Pebesma and Roger Bivand, titled <a href="https://www.r-spatial.org/r/2020/03/17/wkt.html" target="_blank"><em>“R spatial follows GDAL and PROJ development”</em></a>, discussing how <a href="https://gdal.org/" target="_blank">GDAL</a> and <a href="https://proj.org" target="_blank">PROJ</a> (formerly proj.4) relate to geospatial tools, including several <code>R</code> packages. As an example, they outline the dependencies for the <code>sf</code> package, pictured here:</p>
<p><img src="https://keen-swartz-3146c4.netlify.com/images/sf_deps.png" width="75%" style="display: block; margin: auto;" /></p>
<p>Also something worth reiterating here, briefly:</p>
<ul>
<li><p>PROJ provides methods for coordinate representation, conversion (projection) and transformation, and</p></li>
<li><p>GDAL allows reading and writing of spatial raster and vector data in a standardized form, and provides a high-level interface to PROJ for these data structures, including the representation of coordinate reference systems (CRS)</p></li>
</ul>
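<p>A quick way to check which GDAL and PROJ versions your own installation is linked against (not covered in the resources above, but handy when troubleshooting these warnings):</p>
<pre><code># External library versions used by the rgdal/sp stack
rgdal::rgdal_extSoftVersion()

# ...and by sf (GEOS, GDAL and PROJ)
sf::sf_extSoftVersion()</code></pre>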
<hr />
<p>We are ultimately dealing with coordinate reference systems (or CRS), but they also go by another name…spatial reference system (SRS). This will make more sense soon. As summarized by <a href="https://github.com/inbo" target="_blank">INBO</a>, a CRS is defined by several elements:</p>
<ul>
<li>a coordinate system,</li>
<li>a ‘datum’; it localizes the geodetic coordinate system relative to the Earth and needs a geometric definition of the ellipsoid, and</li>
<li>for projected CRSes only, the coordinate conversion parameters that determine the conversion from geodetic to projected coordinates.</li>
</ul>
<p><a href="https://github.com/inbo" target="_blank">INBO</a> did a fantastic tutorial (<a href="https://inbo.github.io/tutorials/tutorials/spatial_crs_coding/" target="_blank">https://inbo.github.io/tutorials/tutorials/spatial_crs_coding/</a>) briefly discussing on the changes and walking through the how-to for <code>sp</code>, <code>sf</code> and <code>raster</code> packages. The <code>rgdal</code> package leans heavily on the <code>sp</code> package…incase you were worried.</p>
<hr />
<p>Here are some examples and things that I have learned dealing with this issue. Nothing special, and I suggest visiting the resources identified above (especially <a href="https://inbo.github.io/tutorials/tutorials/spatial_crs_coding/" target="_blank">https://inbo.github.io/tutorials/tutorials/spatial_crs_coding/</a>). I am partial to the <code>sp</code> and <code>rgdal</code> packages, as these are what I initially learned and got comfortable using. So let’s load <code>rgdal</code>.</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="fu">library</span>(rgdal)</span></code></pre></div>
<p>In the “good ol’ days” you could define a CRS with this:</p>
<pre><code>utm17 <- CRS("+proj=utm +zone=17 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs")</code></pre>
<p>Do this now and you get…</p>
<pre><code>## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO", prefer_proj =
## prefer_proj): Discarded datum Unknown based on GRS80 ellipsoid in CRS definition</code></pre>
<p>Fast-forward to now. There might be several ways to do this but the easiest I found is</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a>utm17 <span class="ot"><-</span> <span class="fu">CRS</span>(<span class="at">SRS_string=</span><span class="st">"EPSG:4326"</span>)</span></code></pre></div>
<p>Notice the argument <code>SRS_string</code> … as in spatial reference system! (I just picked that up writing this post).</p>
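<p>One side note of my own: <code>EPSG:4326</code> is actually the geographic WGS 84 CRS (longitude/latitude), so the <code>utm17</code> name above is a bit of a misnomer. If you truly want a projected UTM zone 17N definition, the equivalent call would use a code like <code>EPSG:32617</code> (WGS 84 / UTM zone 17N):</p>
<pre><code># Geographic WGS 84 (longitude/latitude)
wgs84 <- CRS(SRS_string = "EPSG:4326")

# WGS 84 / UTM zone 17N (projected, metres)
utm17n <- CRS(SRS_string = "EPSG:32617")</code></pre>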
<p>Another thing in the update is the use of WKT (well-known text) over that of PROJ strings. WKT strings are interesting and provide lots of good information on the CRS (or SRS), if you’re into that kind of thing. To make a WKT you use the <code>wkt()</code> function.</p>
<div class="sourceCode" id="cb6"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a>utm17 <span class="ot"><-</span> <span class="fu">CRS</span>(<span class="at">SRS_string=</span><span class="st">"EPSG:4326"</span>)</span>
<span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a>utm17.wkt<span class="ot">=</span><span class="fu">wkt</span>(utm17)</span>
<span id="cb6-4"><a href="#cb6-4" aria-hidden="true" tabindex="-1"></a>utm17.wkt</span></code></pre></div>
<pre><code>## [1] "GEOGCRS[\"WGS 84 (with axis order normalized for visualization)\",\n DATUM[\"World Geodetic System 1984\",\n ELLIPSOID[\"WGS 84\",6378137,298.257223563,\n LENGTHUNIT[\"metre\",1]],\n ID[\"EPSG\",6326]],\n PRIMEM[\"Greenwich\",0,\n ANGLEUNIT[\"degree\",0.0174532925199433],\n ID[\"EPSG\",8901]],\n CS[ellipsoidal,2],\n AXIS[\"geodetic longitude (Lon)\",east,\n ORDER[1],\n ANGLEUNIT[\"degree\",0.0174532925199433,\n ID[\"EPSG\",9122]]],\n AXIS[\"geodetic latitude (Lat)\",north,\n ORDER[2],\n ANGLEUNIT[\"degree\",0.0174532925199433,\n ID[\"EPSG\",9122]]]]"</code></pre>
<p>or you can print the WKT to be more readable/organized with:</p>
<div class="sourceCode" id="cb8"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="fu">cat</span>(utm17.wkt)</span></code></pre></div>
<pre><code>## GEOGCRS["WGS 84 (with axis order normalized for visualization)",
## DATUM["World Geodetic System 1984",
## ELLIPSOID["WGS 84",6378137,298.257223563,
## LENGTHUNIT["metre",1]],
## ID["EPSG",6326]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8901]],
## CS[ellipsoidal,2],
## AXIS["geodetic longitude (Lon)",east,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433,
## ID["EPSG",9122]]],
## AXIS["geodetic latitude (Lat)",north,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433,
## ID["EPSG",9122]]]]</code></pre>
<p>Further down the road when you are doing analyses or even plotting in some packages (e.g. <code>tmap</code>) you might get a bunch of warnings like:</p>
<pre><code>Warning message:
In sp::proj4string(obj) : CRS object has comment, which is lost in output</code></pre>
<p>This shouldn’t stop any of the operations but you can “mute” the warnings by running <code>options("rgdal_show_exportToProj4_warnings"="none")</code> in your console. I keep mine “un-muted” to make sure I don’t inadvertently miss something.</p>
<p>If you want to transform a dataset from one datum to another, you will need to use the WKT string. For instance, I use several different state agency spatial datasets, one of which uses <code>NAD83 HARN</code> (which is a discarded datum…I’m still learning about what this means), and I usually work in <code>UTM</code>. I find UTM CRSes easier to work with in general. Going back to the example dataset…if I read the file into <code>R</code> I get:</p>
<pre><code>dat<-readOGR(shapefile) #just as an example
Warning message:
In OGRSpatialRef(dsn, layer, morphFromESRI = morphFromESRI, dumpSRS = dumpSRS, :
Discarded datum NAD83_High_Accuracy_Reference_Network in CRS definition: +proj=tmerc +lat_0=24.3333333333333 +lon_0=-81 +k=0.999941177 +x_0=200000.0001016 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=us-ft +no_defs</code></pre>
<p>That was enough to make my head spin…but if you notice, it’s just a warning message and the file is still read into the <code>R</code> environment. Now to transform the CRS:</p>
<pre><code>dat.tran<-spTransform(dat,utm17.wkt)</code></pre>
<p>But let’s say you are making a <code>SpatialPointsDataFrame</code>; one of the arguments is <code>proj4string</code> (which we are moving away from, and the motivation for this whole post!).</p>
<p>Here is some data…</p>
<div class="sourceCode" id="cb13"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a>dat2<span class="ot"><-</span><span class="fu">data.frame</span>(<span class="at">SITE=</span><span class="fu">c</span>(<span class="dv">1</span>,<span class="dv">2</span>,<span class="dv">3</span>),</span>
<span id="cb13-2"><a href="#cb13-2" aria-hidden="true" tabindex="-1"></a> <span class="at">UTMX=</span><span class="fu">c</span>(<span class="dv">590382</span>,<span class="dv">583910</span>,<span class="dv">585419</span>),</span>
<span id="cb13-3"><a href="#cb13-3" aria-hidden="true" tabindex="-1"></a> <span class="at">UTMY=</span><span class="fu">c</span>(<span class="dv">2830587</span>,<span class="dv">2821685</span>,<span class="dv">2819900</span>))</span>
<span id="cb13-4"><a href="#cb13-4" aria-hidden="true" tabindex="-1"></a>dat2</span></code></pre></div>
<pre><code>## SITE UTMX UTMY
## 1 1 590382 2830587
## 2 2 583910 2821685
## 3 3 585419 2819900</code></pre>
<div class="sourceCode" id="cb15"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb15-1"><a href="#cb15-1" aria-hidden="true" tabindex="-1"></a>dat2.shp<span class="ot"><-</span><span class="fu">SpatialPointsDataFrame</span>(dat2[,<span class="fu">c</span>(<span class="st">"UTMX"</span>,<span class="st">"UTMY"</span>)],</span>
<span id="cb15-2"><a href="#cb15-2" aria-hidden="true" tabindex="-1"></a> <span class="at">data=</span>dat2,</span>
<span id="cb15-3"><a href="#cb15-3" aria-hidden="true" tabindex="-1"></a> <span class="at">proj4string=</span>utm17)</span></code></pre></div>
<p>This is as far as I have been able to work through these changes. They aren’t huge-scale changes to existing workflows, but they’re enough to cause some heartburn.</p>
<p><img src="https://swampthingecology.org/blog\images\20210121_CRS\sherlock_drugged.gif" width="50%" style="display: block; margin: auto;" /></p>
<p>Hope this was helpful (sorry for all the Sherlock gifs)…keep coding, friends.</p>
<p><img src="https://swampthingecology.org/blog\images\20210121_CRS\sherlock_smile.gif" width="40%" style="display: block; margin: auto;" /></p>
<hr />
</section>Nearest Neighbor and Hot Spot Analysis - Geospatial data analysis in #rstats. Part 3b2020-10-08T00:00:00+00:002020-10-08T00:00:00+00:00https://swampthingecology.org/blog/nearest-neighbor-and-hot-spot-analysis---geospatial-data-analysis-in-#rstats.-part-3b<section class="main-content">
<p><strong>Keywords:</strong> geostatistics, R, nearest neighbor, Getis-Ord</p>
<p>As promised, here is another follow-up to our geospatial data analysis blog series. So far we have covered interpolation, spatial autocorrelation and the basics of Hot-Spot (Getis-Ord) analysis.</p>
<ul>
<li><p>Part I: <a href="https://swampthingecology.org/blog/geospatial-data-analysis-in-rstats.-part-1/" target="_blank">Interpolation</a></p></li>
<li><p>Part 2: <a href="https://swampthingecology.org/blog/geospatial-data-analysis-in-rstats.-part-2/" target="_blank">Spatial Autocorrelation</a></p></li>
<li><p>Part 3: <a href="https://swampthingecology.org/blog/hot-spot-analysis-geospatial-data-analysis-in-rstats.-part-3/" target="_blank">Hot Spot Analysis</a></p></li>
</ul>
<p>In this post we will discuss nearest neighbor estimates and how they can affect hot spot detection. In essence this is <strong>“Getis-Ord Strikes Back”</strong> (sorry, my Star Wars nerd is showing).</p>
<hr />
<p>Let’s take a step back before jumping back into nearest neighbor (see my post on <a href="https://swampthingecology.org/blog/geospatial-data-analysis-in-rstats.-part-2/" target="_blank">Moran’s <em>I</em></a>). Most spatial statistics compare a test statistic estimated from the data to an expected value given the null hypothesis of complete spatial randomness (CSR; <span class="citation">Fortin and Dale (<a href="#ref-fortin_spatial_2005" role="doc-biblioref">2005</a>)</span>; <em>not to be confused with <code>CRS(...)</code>, coordinate reference system</em>). CSR is a point process model whose expectation can be derived from a particular distribution, in most cases a Poisson <span class="citation">(Diggle <a href="#ref-diggle_spatio-temporal_2006" role="doc-biblioref">2006</a>)</span>. A common theme in the analysis of spatial point patterns, whether with Moran’s <em>I</em>, Getis-Ord <em>G</em> or Ripley’s <em>K</em>, is that CSR serves as a dividing hypothesis <span class="citation">(Cox <a href="#ref-cox_role_1977" role="doc-biblioref">1977</a>)</span>, which leads to classification of patterns as random (complete spatial randomness), under-dispersed (clumped or aggregated), or over-dispersed (spaced or regular).</p>
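<p>If you want to see what CSR looks (and tests) like in practice, here is a small illustration of my own (not part of the analysis below) that simulates a homogeneous Poisson pattern and a clustered pattern with the <code>spatstat</code> package and screens each with a quadrat count test:</p>
<pre><code>library(spatstat)

set.seed(42)
csr.pp   <- rpoispp(lambda = 100)      # complete spatial randomness (Poisson) on the unit square
clust.pp <- rThomas(10, 0.05, 10)      # a clustered (Thomas) process for comparison

quadrat.test(csr.pp, nx = 4, ny = 4)   # should be consistent with CSR
quadrat.test(clust.pp, nx = 4, ny = 4) # typically rejects CSR (under-dispersed/clumped)</code></pre>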
<!-- resource:
https://joparga3.github.io/spatial_point_pattern/
https://www.seas.upenn.edu/~ese502/NOTEBOOK/Part_I/2_Models_of_Spatial_Randomness.pdf
https://training.fws.gov/courses/references/tutorials/geospatial/CSP7304/documents/PointPatterTutorial.pdf
### Models of Spatial Randomness
The _Principle of Insufficient Reason_ or Laplace Principle asserts that if there is no information to indicate that either of two events is more likely than others, then they should be treated as equally likely. Translating this into a graphical explanation, if we have an area divided in equal areas, there is no reason to believe that this point is more likely to appear in either left half or the (identical) right half. If we look at the image below, for the first case, any given point should have the same probability (1/2) of appearing in either half of the area. If we divide the areas again by half, then points should have the same probability (1/4) of appearing in any of the 4 squares and so on.
<img src="https://swampthingecology.org/blog/knitr_files/2020-10-08-NN_HotSpot_files/figure-html/Laplace-1.png" style="display: block; margin: auto;" />
Therefore the assumptions of spatially random models are:
1. Without any given information on the likelihood of events occurring being different across the dataset (study area), the probability should be the same for all events across the study area (Laplace Principal).
2. Locations of points have no influence on one another (i.e. spatial autocorrelation)
-->
<p>Below we are going to import some data, use different techniques to estimate nearest neighbor and see how that affects Hot spot detection.</p>
<div id="lets-get-started" class="section level3">
<h3>Let’s get started</h3>
<p>Before we get too deep into things here are the necessary packages we will be using.</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb1-1" title="1"><span class="co">## Libraries</span></a>
<a class="sourceLine" id="cb1-2" title="2"><span class="co"># read xlsx files</span></a>
<a class="sourceLine" id="cb1-3" title="3"><span class="kw">library</span>(readxl)</a>
<a class="sourceLine" id="cb1-4" title="4"></a>
<a class="sourceLine" id="cb1-5" title="5"><span class="co"># Geospatial </span></a>
<a class="sourceLine" id="cb1-6" title="6"><span class="kw">library</span>(rgdal)</a>
<a class="sourceLine" id="cb1-7" title="7"><span class="kw">library</span>(rgeos)</a>
<a class="sourceLine" id="cb1-8" title="8"><span class="kw">library</span>(raster)</a>
<a class="sourceLine" id="cb1-9" title="9"><span class="kw">library</span>(spdep)</a></code></pre></div>
<p>Same data and links from last post.</p>
<ul>
<li><p>Download the data (as a zip file) <a href="https://www.epa.gov/sites/production/files/2014-03/sf1data.zip" target="_blank">here</a>!</p></li>
<li><p>Download the Water Conservation Areas shapefile <a href="https://www.swampthingecology.org/blog/data/hotspot/WCAs.zip" target="_blank">here</a>!</p></li>
</ul>
<div class="sourceCode" id="cb2"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb2-1" title="1"><span class="co"># Define spatial datum</span></a>
<a class="sourceLine" id="cb2-2" title="2">utm17<-<span class="kw">CRS</span>(<span class="st">"+proj=utm +zone=17 +datum=WGS84 +units=m"</span>)</a>
<a class="sourceLine" id="cb2-3" title="3">wgs84<-<span class="kw">CRS</span>(<span class="st">"+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"</span>)</a>
<a class="sourceLine" id="cb2-4" title="4"></a>
<a class="sourceLine" id="cb2-5" title="5"><span class="co"># Read shapefile</span></a>
<a class="sourceLine" id="cb2-6" title="6">wcas<-<span class="kw">readOGR</span>(GISdata,<span class="st">"WCAs"</span>)</a>
<a class="sourceLine" id="cb2-7" title="7">wcas<-<span class="kw">spTransform</span>(wcas,utm17)</a>
<a class="sourceLine" id="cb2-8" title="8"></a>
<a class="sourceLine" id="cb2-9" title="9"><span class="co"># Read the spreadsheet</span></a>
<a class="sourceLine" id="cb2-10" title="10">p12<-readxl<span class="op">::</span><span class="kw">read_xls</span>(<span class="st">"data/P12join7FINAL.xls"</span>,<span class="dt">sheet=</span><span class="dv">2</span>)</a>
<a class="sourceLine" id="cb2-11" title="11"></a>
<a class="sourceLine" id="cb2-12" title="12"><span class="co"># Clean up the headers</span></a>
<a class="sourceLine" id="cb2-13" title="13"><span class="kw">colnames</span>(p12)<-<span class="kw">sapply</span>(<span class="kw">strsplit</span>(<span class="kw">names</span>(p12),<span class="st">"</span><span class="ch">\\</span><span class="st">$"</span>),<span class="st">"["</span>,<span class="dv">1</span>)</a>
<a class="sourceLine" id="cb2-14" title="14">p12<-<span class="kw">data.frame</span>(p12)</a>
<a class="sourceLine" id="cb2-15" title="15">p12[p12<span class="op">==-</span><span class="dv">9999</span>]<-<span class="ot">NA</span></a>
<a class="sourceLine" id="cb2-16" title="16">p12[p12<span class="op">==-</span><span class="fl">3047.6952</span>]<-<span class="ot">NA</span></a>
<a class="sourceLine" id="cb2-17" title="17"></a>
<a class="sourceLine" id="cb2-18" title="18"><span class="co"># Convert the data.frame() to SpatialPointsDataFrame</span></a>
<a class="sourceLine" id="cb2-19" title="19">vars<-<span class="kw">c</span>(<span class="st">"STA_ID"</span>,<span class="st">"CYCLE"</span>,<span class="st">"SUBAREA"</span>,<span class="st">"DECLONG"</span>,<span class="st">"DECLAT"</span>,<span class="st">"DATE"</span>,<span class="st">"TPSDF"</span>)</a>
<a class="sourceLine" id="cb2-20" title="20">p12.shp<-<span class="kw">SpatialPointsDataFrame</span>(<span class="dt">coords=</span>p12[,<span class="kw">c</span>(<span class="st">"DECLONG"</span>,<span class="st">"DECLAT"</span>)],</a>
<a class="sourceLine" id="cb2-21" title="21"> <span class="dt">data=</span>p12[,vars],<span class="dt">proj4string =</span>wgs84)</a>
<a class="sourceLine" id="cb2-22" title="22"><span class="co"># transform to UTM (something I like to do...but not necessary)</span></a>
<a class="sourceLine" id="cb2-23" title="23">p12.shp<-<span class="kw">spTransform</span>(p12.shp,utm17)</a>
<a class="sourceLine" id="cb2-24" title="24"></a>
<a class="sourceLine" id="cb2-25" title="25"><span class="co"># Subset the data for wet season data only and only WCA sites</span></a>
<a class="sourceLine" id="cb2-26" title="26">p12.shp2<-<span class="kw">subset</span>(p12.shp,CYCLE<span class="op">%in%</span><span class="kw">c</span>(<span class="dv">0</span>,<span class="dv">2</span>))</a>
<a class="sourceLine" id="cb2-27" title="27">p12.shp.wca<-p12.shp2[wcas,]</a>
<a class="sourceLine" id="cb2-28" title="28"></a>
<a class="sourceLine" id="cb2-29" title="29"><span class="co"># Double check for NAs in the dataset</span></a>
<a class="sourceLine" id="cb2-30" title="30"><span class="kw">subset</span>(p12.shp.wca<span class="op">@</span>data,<span class="kw">is.na</span>(TPSDF)<span class="op">==</span>T)</a>
<a class="sourceLine" id="cb2-31" title="31"></a>
<a class="sourceLine" id="cb2-32" title="32"><span class="co"># Remove NA sample</span></a>
<a class="sourceLine" id="cb2-33" title="33">p12.shp.wca<-<span class="kw">subset</span>(p12.shp.wca,<span class="kw">is.na</span>(TPSDF)<span class="op">==</span>F)</a></code></pre></div>
<p>Here is a quick map of the subsetted data</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb3-1" title="1"><span class="kw">par</span>(<span class="dt">mar=</span><span class="kw">c</span>(<span class="fl">0.1</span>,<span class="fl">0.1</span>,<span class="fl">0.1</span>,<span class="fl">0.1</span>),<span class="dt">oma=</span><span class="kw">c</span>(<span class="dv">0</span>,<span class="dv">0</span>,<span class="dv">0</span>,<span class="dv">0</span>))</a>
<a class="sourceLine" id="cb3-2" title="2"><span class="kw">plot</span>(wcas)</a>
<a class="sourceLine" id="cb3-3" title="3"><span class="kw">plot</span>(p12.shp.wca,<span class="dt">pch=</span><span class="dv">21</span>,<span class="dt">bg=</span><span class="kw">adjustcolor</span>(<span class="st">"dodgerblue1"</span>,<span class="fl">0.5</span>),<span class="dt">cex=</span><span class="dv">1</span>,<span class="dt">add=</span>T)</a>
<a class="sourceLine" id="cb3-4" title="4">mapmisc<span class="op">::</span><span class="kw">scaleBar</span>(utm17,<span class="st">"bottomright"</span>,<span class="dt">bty=</span><span class="st">"n"</span>,<span class="dt">cex=</span><span class="dv">1</span>,<span class="dt">seg.len=</span><span class="dv">4</span>)</a></code></pre></div>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2020-10-08-NN_HotSpot_files/figure-html/unnamed-chunk-3-1.png" alt="Monitoring location from R-EMAP Phase I, wet season sampling (cycles 0 and 2) within the Water Conservation Areas." />
<p class="caption">
Monitoring locations from R-EMAP Phase I, wet season sampling (cycles 0 and 2) within the Water Conservation Areas.
</p>
</div>
</div>
<div id="nearest-neighbor" class="section level3">
<h3>Nearest Neighbor</h3>
<p>As discussed in our prior blog post, average nearest neighbor (ANN) analysis measures the average distance from each point in the study area to its nearest point. In some cases, this method can be sensitive to which distance bands are identified, and that sensitivity can be carried forward into other analyses that rely on nearest neighbor spatial weighting. However, the ANN statistic is one of many distance-based point pattern statistics that can be used to spatially weight a dataset for spatial statistical evaluation. Others include the K, L and pair correlation functions (g; not to be confused with Getis-Ord <em>G</em>) <span class="citation">(Gimond <a href="#ref-gimond_intro_2020" role="doc-biblioref">2020</a>)</span>.</p>
<!-- https://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/h-how-average-nearest-neighbor-distance-spatial-st.htm#:~:text=The%20average%20nearest%20neighbor%20ratio,covering%20the%20same%20total%20area). -->
<p>One way to spatially weight the data is by using the <code>dnearneigh()</code> function, which identifies neighbors within the lower and upper bounds (provided to the function) by Euclidean distance. Here is where the selection of “distance bands” matters. This function was used in the initial <a href="https://swampthingecology.org/blog/hot-spot-analysis-geospatial-data-analysis-in-rstats.-part-3/" target="_blank">Hot-Spot</a> blog post. Let’s see how changing the upper bound in <code>dnearneigh()</code> can affect the outcome.</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb4-1" title="1"><span class="co"># Find distance range</span></a>
<a class="sourceLine" id="cb4-2" title="2">ptdist=<span class="kw">pointDistance</span>(p12.shp.wca)</a>
<a class="sourceLine" id="cb4-3" title="3"></a>
<a class="sourceLine" id="cb4-4" title="4">min.dist<-<span class="kw">min</span>(ptdist); <span class="co"># Minimum</span></a>
<a class="sourceLine" id="cb4-5" title="5"></a>
<a class="sourceLine" id="cb4-6" title="6">q10.dist<-<span class="kw">as.numeric</span>(<span class="kw">quantile</span>(ptdist,<span class="dt">probs=</span><span class="fl">0.10</span>)); <span class="co"># Q10</span></a>
<a class="sourceLine" id="cb4-7" title="7">q25.dist<-<span class="kw">as.numeric</span>(<span class="kw">quantile</span>(ptdist,<span class="dt">probs=</span><span class="fl">0.25</span>)); <span class="co"># Q25</span></a>
<a class="sourceLine" id="cb4-8" title="8">q75.dist<-<span class="kw">as.numeric</span>(<span class="kw">quantile</span>(ptdist,<span class="dt">probs=</span><span class="fl">0.75</span>)); <span class="co"># Q75</span></a>
<a class="sourceLine" id="cb4-9" title="9"></a>
<a class="sourceLine" id="cb4-10" title="10"><span class="co"># Using 25th percentile distance for upper bound</span></a>
<a class="sourceLine" id="cb4-11" title="11">nb.q10<-<span class="kw">dnearneigh</span>(<span class="kw">coordinates</span>(p12.shp.wca),min.dist,q10.dist)</a>
<a class="sourceLine" id="cb4-12" title="12"></a>
<a class="sourceLine" id="cb4-13" title="13"><span class="co"># Using 25th percentile distance for upper bound</span></a>
<a class="sourceLine" id="cb4-14" title="14">nb.q25<-<span class="kw">dnearneigh</span>(<span class="kw">coordinates</span>(p12.shp.wca),min.dist,q25.dist)</a>
<a class="sourceLine" id="cb4-15" title="15"></a>
<a class="sourceLine" id="cb4-16" title="16"><span class="co"># Using 75th percentile distance for upper bound</span></a>
<a class="sourceLine" id="cb4-17" title="17">nb.q75<-<span class="kw">dnearneigh</span>(<span class="kw">coordinates</span>(p12.shp.wca),min.dist,q75.dist)</a></code></pre></div>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2020-10-08-NN_HotSpot_files/figure-html/unnamed-chunk-5-1.png" alt="Neighborhood network with different upper bound values" />
<p class="caption">
Neighborhood network with different upper bound values
</p>
</div>
<p>As you can see, the number of links between locations increases as the upper bound is expanded, thereby increasing the average number of links within the network. How would this potentially influence the detection of clusters within the dataset? Remember the last <a href="https://swampthingecology.org/blog/hot-spot-analysis-geospatial-data-analysis-in-rstats.-part-3/" target="_blank">Hot-Spot</a> blog post? Well, let’s run through the code; below, the 10th percentile is used as the upper bound as an example.</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb5-1" title="1"><span class="co"># Convert nearest neighbor to a list</span></a>
<a class="sourceLine" id="cb5-2" title="2">nb_lw<-<span class="kw">nb2listw</span>(nb.q10)</a>
<a class="sourceLine" id="cb5-3" title="3"></a>
<a class="sourceLine" id="cb5-4" title="4"><span class="co"># local G</span></a>
<a class="sourceLine" id="cb5-5" title="5">local_g<-<span class="kw">localG</span>(p12.shp.wca<span class="op">$</span>TPSDF,nb_lw)</a>
<a class="sourceLine" id="cb5-6" title="6"></a>
<a class="sourceLine" id="cb5-7" title="7"><span class="co"># convert to matrix</span></a>
<a class="sourceLine" id="cb5-8" title="8">local_g.ma=<span class="kw">as.matrix</span>(local_g)</a>
<a class="sourceLine" id="cb5-9" title="9"></a>
<a class="sourceLine" id="cb5-10" title="10"><span class="co"># column-bind the local_g data</span></a>
<a class="sourceLine" id="cb5-11" title="11">p12.shp.wca<-<span class="kw">cbind</span>(p12.shp.wca,local_g.ma)</a>
<a class="sourceLine" id="cb5-12" title="12"></a>
<a class="sourceLine" id="cb5-13" title="13"><span class="co"># change the names of the new column</span></a>
<a class="sourceLine" id="cb5-14" title="14"><span class="kw">names</span>(p12.shp.wca)[<span class="kw">ncol</span>(p12.shp.wca)]=<span class="st">"localg.Q10"</span></a>
<a class="sourceLine" id="cb5-15" title="15"></a>
<a class="sourceLine" id="cb5-16" title="16"><span class="co"># determine p-value of z-score</span></a>
<a class="sourceLine" id="cb5-17" title="17">p12.shp.wca<span class="op">$</span>pval.q10<-<span class="st"> </span><span class="dv">2</span><span class="op">*</span><span class="kw">pnorm</span>(<span class="op">-</span><span class="kw">abs</span>(p12.shp.wca<span class="op">$</span>localg.Q10))</a>
<a class="sourceLine" id="cb5-18" title="18"></a>
<a class="sourceLine" id="cb5-19" title="19"><span class="co"># See if any site is a "Hot-Spot"</span></a>
<a class="sourceLine" id="cb5-20" title="20"><span class="kw">subset</span>(p12.shp.wca<span class="op">@</span>data,localg.Q10<span class="op">></span><span class="dv">0</span><span class="op">&</span>pval.q10<span class="op"><</span><span class="fl">0.05</span>)<span class="op">$</span>STA_ID</a></code></pre></div>
<pre><code>## [1] "M009" "M011" "M012" "M014" "M015" "M024" "M025" "M027" "M028" "M029"
## [11] "M032" "M033" "M034" "M260" "M261" "M262" "M274" "M276" "M278" "M280"
## [21] "M282"</code></pre>
<p>Looks like a couple of sites are considered Hot-Spots. Now do the same thing for <code>nb.q25</code> and <code>nb.q75</code> (a small helper for repeating those steps is sketched below) and this is what you get.</p>
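<p>This wrapper (my own convenience function, not from the original post) bundles the weight-list conversion, local <em>G</em> calculation and p-value step so the same procedure can be applied to each neighbour object:</p>
<pre><code># Local Getis-Ord G* and two-sided p-values for a given neighbour object
# (assumes p12.shp.wca and the nb.* objects created above already exist)
local_g_stats <- function(nb, values){
  lw <- nb2listw(nb)                 # neighbours -> spatial weights list
  g  <- as.numeric(localG(values, lw))
  data.frame(localg = g, pval = 2*pnorm(-abs(g)))
}

q25.res <- local_g_stats(nb.q25, p12.shp.wca$TPSDF)
q75.res <- local_g_stats(nb.q75, p12.shp.wca$TPSDF)

# Hot-Spot sites under each distance band
p12.shp.wca$STA_ID[q25.res$localg > 0 & q25.res$pval < 0.05]
p12.shp.wca$STA_ID[q75.res$localg > 0 & q75.res$pval < 0.05]</code></pre>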
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2020-10-08-NN_HotSpot_files/figure-html/unnamed-chunk-8-1.png" alt="Soil total phosphorus hot-spots identified using the Getis-Ord $G_{i}^{*}$ spatial statistic based on different nearest neighbor bands." />
<p class="caption">
Soil total phosphorus hot-spots identified using the Getis-Ord <span class="math inline">\(G_{i}^{*}\)</span> spatial statistic based on different nearest neighbor bands.
</p>
</div>
<p>Hot-Spots are identified with <span class="math inline">\(G_{i}^{*}\)</span> > 0 and associated with significant <span class="math inline">\(\rho\)</span> values (in this case our <span class="math inline">\(\alpha\)</span> is 0.05). Alternatively, “Cold-Spots”, or areas associated with clustering of relatively low values, are identified with <span class="math inline">\(G_{i}^{*}\)</span> < 0 (and significant <span class="math inline">\(\rho\)</span> values). Across the three different distance bands, you can see a potential shift in Hot-Spots and the occurrence (and shift) of Cold-Spots across the study area.</p>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2020-10-08-NN_HotSpot_files/figure-html/unnamed-chunk-9-1.png" alt="Number of sites identified as Hot-Spots across the study by Nearest Neigbhor upper bound band." />
<p class="caption">
Number of sites identified as Hot-Spots across the study by Nearest Neighbor upper bound band.
</p>
</div>
<p>An alternative to selecting distance bands is to use a different approach, such as the K-function or k nearest neighbors. The K-function summarizes the distances between points for <em>all</em> distances <span class="citation">(Gimond <a href="#ref-gimond_intro_2020" role="doc-biblioref">2020</a>)</span>. This approach can also be sensitive to its tuning (here, the number of neighbors, k), but less so than the distance bands above. With k nearest neighbors via <code>knearneigh()</code>, if k gets large the function will give a warning letting you know, but will still compute the values anyway.</p>
<pre><code>Warning messages:
1: In knearneigh(p12.shp.wca, k = 45) :
k greater than one-third of the number of data points</code></pre>
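<p>For what it’s worth, the curve in the figure below can be generated with a few lines (a sketch of my own, looping over candidate k values and averaging the neighbour distances):</p>
<pre><code># Average distance to the k nearest neighbours for a range of k values
k.vals <- 1:45
ann.k <- sapply(k.vals, function(k){
  knn <- knearneigh(coordinates(p12.shp.wca), k = k)
  mean(unlist(nbdists(knn2nb(knn), coordinates(p12.shp.wca))))
})
plot(k.vals, ann.k, type = "b", las = 1,
     xlab = "Number of neighbours (k)", ylab = "Mean neighbour distance (m)")</code></pre>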
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2020-10-08-NN_HotSpot_files/figure-html/unnamed-chunk-10-1.png" alt="The affect of the number of nearest neighbors on average nearest neighbor distance." />
<p class="caption">
The effect of the number of nearest neighbors on average nearest neighbor distance.
</p>
</div>
<p>Based on the plot above, a <code>k=6</code> seems to be conservative enough. As suggested in the last blog post this could be done by…</p>
<div class="sourceCode" id="cb8"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb8-1" title="1">k1<-<span class="kw">knn2nb</span>(<span class="kw">knearneigh</span>(p12.shp.wca,<span class="dt">k=</span><span class="dv">6</span>))</a></code></pre></div>
<!-- Some resources
https://daviddalpiaz.github.io/r4sl/knn-class.html
-->
<div class="sourceCode" id="cb9"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb9-1" title="1"><span class="co"># Convert nearest neighbor to a list</span></a>
<a class="sourceLine" id="cb9-2" title="2">nb_lw<-<span class="kw">nb2listw</span>(k1)</a>
<a class="sourceLine" id="cb9-3" title="3"></a>
<a class="sourceLine" id="cb9-4" title="4"><span class="co"># local G</span></a>
<a class="sourceLine" id="cb9-5" title="5">local_g<-<span class="kw">localG</span>(p12.shp.wca<span class="op">$</span>TPSDF,nb_lw)</a>
<a class="sourceLine" id="cb9-6" title="6"></a>
<a class="sourceLine" id="cb9-7" title="7"><span class="co"># convert to matrix</span></a>
<a class="sourceLine" id="cb9-8" title="8">local_g.ma=<span class="kw">as.matrix</span>(local_g)</a>
<a class="sourceLine" id="cb9-9" title="9"></a>
<a class="sourceLine" id="cb9-10" title="10"><span class="co"># column-bind the local_g data</span></a>
<a class="sourceLine" id="cb9-11" title="11">p12.shp.wca<-<span class="kw">cbind</span>(p12.shp.wca,local_g.ma)</a>
<a class="sourceLine" id="cb9-12" title="12"></a>
<a class="sourceLine" id="cb9-13" title="13"><span class="co"># change the names of the new column</span></a>
<a class="sourceLine" id="cb9-14" title="14"><span class="kw">names</span>(p12.shp.wca)[<span class="kw">ncol</span>(p12.shp.wca)]=<span class="st">"localg.k"</span></a>
<a class="sourceLine" id="cb9-15" title="15"></a>
<a class="sourceLine" id="cb9-16" title="16"><span class="co"># determine p-value of z-score</span></a>
<a class="sourceLine" id="cb9-17" title="17">p12.shp.wca<span class="op">$</span>pval.k<-<span class="st"> </span><span class="dv">2</span><span class="op">*</span><span class="kw">pnorm</span>(<span class="op">-</span><span class="kw">abs</span>(p12.shp.wca<span class="op">$</span>localg.k))</a></code></pre></div>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2020-10-08-NN_HotSpot_files/figure-html/unnamed-chunk-13-1.png" alt="Soil total phosphorus hot-spots identified using the Getis-Ord $G_{i}^{*}$ spatial statistic with k-function nearest neighbor spatial weighting." />
<p class="caption">
Soil total phosphorus hot-spots identified using the Getis-Ord <span class="math inline">\(G_{i}^{*}\)</span> spatial statistic with k-function nearest neighbor spatial weighting.
</p>
</div>
<p>Using k-nearest neighbor weighting, hot-spots occur in the same general area as in the other evaluations presented above. As suggested in the original Hot-Spot blog, the selection of spatial weights is important and the test is sensitive to the weights assigned.</p>
<p>The next post will cover how spatial aggregation can play a role in Hot-Spot detection. Until then I’ll leave you with this quote that helps put spatial statistical analysis into perspective.</p>
<blockquote>
<p>“The first law of geography: Everything is related to everything else, but near things are more related than distant things.” <span class="citation">(Tobler <a href="#ref-tobler_computer_1970" role="doc-biblioref">1970</a>)</span></p>
</blockquote>
<hr />
<div id="refs" class="references">
<div id="ref-cox_role_1977">
<p>Cox, D. R. 1977. “The Role of Significance Tests.” <em>Scandinavian Journal of Statistics</em> 4 (2): 49–63. <a href="https://www.jstor.org/stable/4615652">https://www.jstor.org/stable/4615652</a>.</p>
</div>
<div id="ref-diggle_spatio-temporal_2006">
<p>Diggle, Peter J. 2006. “Spatio-Temporal Point Processes: Methods and Applications.” <em>Monographs on Statistics and Applied Probability</em> 107: 1–45.</p>
</div>
<div id="ref-fortin_spatial_2005">
<p>Fortin, Marie-Josée, and Mark R. T. Dale. 2005. <em>Spatial Analysis: A Guide for Ecologists</em>. Cambridge University Press.</p>
</div>
<div id="ref-gimond_intro_2020">
<p>Gimond, Manuel. 2020. <em>Intro to GIS and Spatial Analysis</em>. <a href="https://mgimond.github.io/Spatial/index.html">https://mgimond.github.io/Spatial/index.html</a>.</p>
</div>
<div id="ref-tobler_computer_1970">
<p>Tobler, W. R. 1970. “A Computer Movie Simulating Urban Growth in the Detroit Region.” <em>Economic Geography</em> 46 (June): 234. <a href="https://doi.org/10.2307/143141">https://doi.org/10.2307/143141</a>.</p>
</div>
</div>
</div>
</section>Hot Spot Analysis - Geospatial data analysis in #rstats. Part 32020-09-18T00:00:00+00:002020-09-18T00:00:00+00:00https://swampthingecology.org/blog/hot-spot-analysis---geospatial-data-analysis-in-#rstats.-part-3<section class="main-content">
<p><strong>Keywords:</strong> geostatistics, R, hot-spot, Getis-Ord</p>
<hr />
<p>Continuing our series on geospatial analysis we are diving deeper into spatial statistics Hot-spot analysis. In my prior posts I presented spatial interpolation techniques such as <a href="https://swampthingpaul.github.io/blog/geospatial-data-analysis-in-rstats.-part-1/" target="_blank">kriging</a> and spatial auto-correlation with <a href="https://swampthingpaul.github.io/blog/geospatial-data-analysis-in-rstats.-part-2/" target="_blank">Moran’s <em>I</em></a>.</p>
<p>Kriging is a valuable tool to detect spatial structure and patterns across a particular area. These spatial models rely on understanding spatial correlation and auto-correlation. A common feature of spatial correlation/auto-correlation analyses is that they are applied on a global scale (the entire dataset). In some cases, it may be warranted to examine patterns at a more local (fine) scale. The Getis-Ord <em>G</em> statistic provides information on local spatial structure and can identify areas of high (or low) clustering. This clustering is operationally defined as <strong>hot-spots</strong> and is identified by comparing the sum of a particular variable within a local neighborhood network to the global sum across the area-of-interest extent <span class="citation">(Getis and Ord <a href="#ref-getis_analysis_2010" role="doc-biblioref">2010</a>)</span>.</p>
<p><span class="citation">Getis and Ord (<a href="#ref-getis_analysis_2010" role="doc-biblioref">2010</a>)</span> introduced a family of measures of spatial associated called <em>G</em> statistics. When used with spatial auto-correlation statistics such as Moran’s <em>I</em>, the <em>G</em> family of statistics can expand our understanding of processes that give rise to spatial association, in detecting local hot-spots (in their original paper they used the term “pockets”). The Getis-Ord statistic can be used in the global (<span class="math inline">\(G\)</span>) and local (<span class="math inline">\(G_{i}^{*}\)</span>) scales. The global statistic (<span class="math inline">\(G\)</span>) identifies high or low values across the entire study area (i.e. forest, wetland, city, etc.), meanwhile the local (<span class="math inline">\(G_{i}^{*}\)</span>) statistic evaluates the data for each feature within the dataset and determining where features with high or low values (“pockets” or hot/cold) cluster spatially.</p>
<p>At this point I would probably throw some equations around and give you the mathematical nitty gritty. Given I am not a maths wiz and <span class="citation">Getis and Ord (<a href="#ref-getis_analysis_2010" role="doc-biblioref">2010</a>)</span> provides all the detail (lots of nitty and a little gritty) in such an eloquent fashion I’ll leave it up to you if you want to peruse the manuscript. The Getis-Ord statistic has been applied across several different fields including crime analysis, epidemiology and a couple of forays into biogeochemistry and ecology.</p>
<div id="play-time" class="section level2">
<h2>Play Time</h2>
<p>For this example I will be using a dataset from the United States Environmental Protection Agency (USEPA) as part of the Everglades Regional Environmental Monitoring Program (<a href="https://www.epa.gov/everglades/environmental-monitoring-everglades" target="_blank">R-EMAP</a>).</p>
<div id="some-on-the-dataset" class="section level3">
<h3>Some on the dataset</h3>
<p>The Everglades R-EMAP program has been monitoring the Everglades ecosystem since 1993 using a probability-based sampling approach covering ~5000 km<sup>2</sup> from a multi-media perspective (water, sediment, fish, etc.). This large-scale sampling has occurred in four phases: Phase I (1995 - 1996), Phase II (1999), Phase III (2005) and Phase IV (2013 - 2014). For the purposes of this post, we will focus on sediment/soil total phosphorus concentrations collected during the wet-season sampling of Phase I (April 1995 & May 1996).</p>
</div>
<div id="analysis-time" class="section level3">
<h3>Analysis time!!</h3>
<p>Here are the necessary packages.</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb1-1" title="1"><span class="co">## Libraries</span></a>
<a class="sourceLine" id="cb1-2" title="2"><span class="co"># read xlsx files</span></a>
<a class="sourceLine" id="cb1-3" title="3"><span class="kw">library</span>(readxl)</a>
<a class="sourceLine" id="cb1-4" title="4"></a>
<a class="sourceLine" id="cb1-5" title="5"><span class="co"># Geospatial </span></a>
<a class="sourceLine" id="cb1-6" title="6"><span class="kw">library</span>(rgdal)</a>
<a class="sourceLine" id="cb1-7" title="7"><span class="kw">library</span>(rgeos)</a>
<a class="sourceLine" id="cb1-8" title="8"><span class="kw">library</span>(raster)</a>
<a class="sourceLine" id="cb1-9" title="9"><span class="kw">library</span>(spdep)</a></code></pre></div>
<p>In case you are not sure whether you have these packages installed, here is a quick function that will check for the packages and, if needed, install them from CRAN.</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb2-1" title="1"><span class="co"># Function</span></a>
<a class="sourceLine" id="cb2-2" title="2">check.packages <-<span class="st"> </span><span class="cf">function</span>(pkg){</a>
<a class="sourceLine" id="cb2-3" title="3"> new.pkg <-<span class="st"> </span>pkg[<span class="op">!</span>(pkg <span class="op">%in%</span><span class="st"> </span><span class="kw">installed.packages</span>()[, <span class="st">"Package"</span>])]</a>
<a class="sourceLine" id="cb2-4" title="4"> <span class="cf">if</span> (<span class="kw">length</span>(new.pkg)) </a>
<a class="sourceLine" id="cb2-5" title="5"> <span class="kw">install.packages</span>(new.pkg, <span class="dt">dependencies =</span> <span class="ot">TRUE</span>)</a>
<a class="sourceLine" id="cb2-6" title="6"> <span class="kw">sapply</span>(pkg, require, <span class="dt">character.only =</span> <span class="ot">TRUE</span>)</a>
<a class="sourceLine" id="cb2-7" title="7">}</a>
<a class="sourceLine" id="cb2-8" title="8"></a>
<a class="sourceLine" id="cb2-9" title="9">pkg<-<span class="kw">c</span>(<span class="st">"openxlsx"</span>,<span class="st">"readxl"</span>,<span class="st">"rgdal"</span>,<span class="st">"rgeos"</span>,<span class="st">"raster"</span>,<span class="st">"spdep"</span>)</a>
<a class="sourceLine" id="cb2-10" title="10"><span class="kw">check.packages</span>(pkg)</a></code></pre></div>
<p>Download the data (as a zip file) <a href="https://www.epa.gov/sites/production/files/2014-03/sf1data.zip" target="_blank">here</a>!</p>
<p>Download the Water Conservation Area shapefile <a href="https://www.swampthingecology.org/blog/data/hotspot/WCAs.zip" target="_blank">here</a>!</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb3-1" title="1"><span class="co"># Define spatial datum</span></a>
<a class="sourceLine" id="cb3-2" title="2">utm17<-<span class="kw">CRS</span>(<span class="st">"+proj=utm +zone=17 +datum=WGS84 +units=m"</span>)</a>
<a class="sourceLine" id="cb3-3" title="3">wgs84<-<span class="kw">CRS</span>(<span class="st">"+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"</span>)</a>
<a class="sourceLine" id="cb3-4" title="4"></a>
<a class="sourceLine" id="cb3-5" title="5"><span class="co"># Read shapefile</span></a>
<a class="sourceLine" id="cb3-6" title="6">wcas<-<span class="kw">readOGR</span>(GISdata,<span class="st">"WCAs"</span>)</a>
<a class="sourceLine" id="cb3-7" title="7">wcas<-<span class="kw">spTransform</span>(wcas,utm17)</a>
<a class="sourceLine" id="cb3-8" title="8"></a>
<a class="sourceLine" id="cb3-9" title="9"><span class="co"># Read the spreadsheet</span></a>
<a class="sourceLine" id="cb3-10" title="10">p12<-readxl<span class="op">::</span><span class="kw">read_xls</span>(<span class="st">"data/P12join7FINAL.xls"</span>,<span class="dt">sheet=</span><span class="dv">2</span>)</a>
<a class="sourceLine" id="cb3-11" title="11"></a>
<a class="sourceLine" id="cb3-12" title="12"><span class="co"># Clean up the headers</span></a>
<a class="sourceLine" id="cb3-13" title="13"><span class="kw">colnames</span>(p12)<-<span class="kw">sapply</span>(<span class="kw">strsplit</span>(<span class="kw">names</span>(p12),<span class="st">"</span><span class="ch">\\</span><span class="st">$"</span>),<span class="st">"["</span>,<span class="dv">1</span>)</a>
<a class="sourceLine" id="cb3-14" title="14">p12<-<span class="kw">data.frame</span>(p12)</a>
<a class="sourceLine" id="cb3-15" title="15">p12[p12<span class="op">==-</span><span class="dv">9999</span>]<-<span class="ot">NA</span></a>
<a class="sourceLine" id="cb3-16" title="16">p12[p12<span class="op">==-</span><span class="fl">3047.6952</span>]<-<span class="ot">NA</span></a>
<a class="sourceLine" id="cb3-17" title="17"></a>
<a class="sourceLine" id="cb3-18" title="18"><span class="co"># Convert the data.frame() to SpatialPointsDataFrame</span></a>
<a class="sourceLine" id="cb3-19" title="19">vars<-<span class="kw">c</span>(<span class="st">"STA_ID"</span>,<span class="st">"CYCLE"</span>,<span class="st">"SUBAREA"</span>,<span class="st">"DECLONG"</span>,<span class="st">"DECLAT"</span>,<span class="st">"DATE"</span>,<span class="st">"TPSDF"</span>)</a>
<a class="sourceLine" id="cb3-20" title="20">p12.shp<-<span class="kw">SpatialPointsDataFrame</span>(<span class="dt">coords=</span>p12[,<span class="kw">c</span>(<span class="st">"DECLONG"</span>,<span class="st">"DECLAT"</span>)],</a>
<a class="sourceLine" id="cb3-21" title="21"> <span class="dt">data=</span>p12[,vars],<span class="dt">proj4string =</span>wgs84)</a>
<a class="sourceLine" id="cb3-22" title="22"><span class="co"># transform to UTM (something I like to do...but not necessary)</span></a>
<a class="sourceLine" id="cb3-23" title="23">p12.shp<-<span class="kw">spTransform</span>(p12.shp,utm17)</a>
<a class="sourceLine" id="cb3-24" title="24"></a>
<a class="sourceLine" id="cb3-25" title="25"><span class="co"># Subset the data for wet season data only</span></a>
<a class="sourceLine" id="cb3-26" title="26">p12.shp.wca2<-<span class="kw">subset</span>(p12.shp,CYCLE<span class="op">%in%</span><span class="kw">c</span>(<span class="dv">0</span>,<span class="dv">2</span>))</a>
<a class="sourceLine" id="cb3-27" title="27">p12.shp.wca2<-p12.shp.wca2[<span class="kw">subset</span>(wcas,Name<span class="op">==</span><span class="st">"WCA 2A"</span>),]</a></code></pre></div>
<p>Here is a quick map of the data.</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb4-1" title="1"><span class="kw">par</span>(<span class="dt">mar=</span><span class="kw">c</span>(<span class="fl">0.1</span>,<span class="fl">0.1</span>,<span class="fl">0.1</span>,<span class="fl">0.1</span>),<span class="dt">oma=</span><span class="kw">c</span>(<span class="dv">0</span>,<span class="dv">0</span>,<span class="dv">0</span>,<span class="dv">0</span>))</a>
<a class="sourceLine" id="cb4-2" title="2"><span class="kw">plot</span>(p12.shp,<span class="dt">pch=</span><span class="dv">21</span>,<span class="dt">bg=</span><span class="st">"grey"</span>,<span class="dt">cex=</span><span class="fl">0.5</span>)</a>
<a class="sourceLine" id="cb4-3" title="3"><span class="kw">plot</span>(wcas,<span class="dt">add=</span>T)</a></code></pre></div>
<p><img src="https://swampthingecology.org/blog/knitr_files/2020-09-17-HotSpot_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /></p>
<p>Much like other spatial statistics (e.g. Moran’s <em>I</em>), the <em>G</em> statistics rely on spatially weighting the data. In my last <a href="https://swampthingpaul.github.io/blog/geospatial-data-analysis-in-rstats.-part-2/" target="_blank">post</a> we discussed average nearest neighbor (ANN). Average nearest neighbor analysis measures the average distance from each point in the area of interest to its nearest point. As a quick reminder, here is how ANN changes with the degree of clustering.</p>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2020-09-17-HotSpot_files/figure-html/f11-diff-patterns-1.png" alt="Three different point patterns: a single cluster (top left), a dual cluster (top center) and a randomly scattered pattern (top right). Three different ANN vs. neighbor order plots. The black ANN line is for the first point pattern (single cluster); the blue line is for the second point pattern (double cluster) and the red line is for the third point pattern." />
<p class="caption">
Three different point patterns: a single cluster (top left), a dual cluster (top center) and a randomly scattered pattern (top right). Three different ANN vs. neighbor order plots. The black ANN line is for the first point pattern (single cluster); the blue line is for the second point pattern (double cluster) and the red line is for the third point pattern.
</p>
</div>
<p>For demonstration purposes we are going to look at a subset of the entire dataset: the data within Water Conservation Area 2A.</p>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2020-09-17-HotSpot_files/figure-html/unnamed-chunk-5-1.png" alt="Soil total phosphorus concentration within Water Conservation Area 2A during Phase I sampling." />
<p class="caption">
Soil total phosphorus concentration within Water Conservation Area 2A during Phase I sampling.
</p>
</div>
<p>Most examples of Getis-Ord analysis across the internet look at polygon-type data (e.g. city blocks, counties, watersheds, etc.). For this example, we are evaluating point data.</p>
<p>Let’s determine the spatial weights (nearest neighbor distances). Since we are looking at point data, we need to do something slightly different than what was done with Moran’s <em>I</em> in the prior post. The <code>dnearneigh()</code> function uses a matrix of point coordinates combined with distance thresholds. To work with the function, coordinates need to be extracted from the data using <code>coordinates()</code>. To find the distance range in the data we can use the <code>pointDistance()</code> function. We don’t want to include all possible connections, so we set the upper distance bound in <code>dnearneigh()</code> to the mean distance across the site.</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb5-1" title="1"><span class="co"># Find distance range</span></a>
<a class="sourceLine" id="cb5-2" title="2">ptdist=<span class="kw">pointDistance</span>(p12.shp.wca2)</a>
<a class="sourceLine" id="cb5-3" title="3"></a>
<a class="sourceLine" id="cb5-4" title="4">min.dist<-<span class="kw">min</span>(ptdist); <span class="co"># Minimum</span></a>
<a class="sourceLine" id="cb5-5" title="5">mean.dist<-<span class="kw">mean</span>(ptdist); <span class="co"># Mean</span></a>
<a class="sourceLine" id="cb5-6" title="6"></a>
<a class="sourceLine" id="cb5-7" title="7">nb<-<span class="kw">dnearneigh</span>(<span class="kw">coordinates</span>(p12.shp.wca2),min.dist,mean.dist)</a></code></pre></div>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2020-09-17-HotSpot_files/figure-html/unnamed-chunk-7-1.png" alt="Neighborhood network for WCA-2A sites" />
<p class="caption">
Neighborhood network for WCA-2A sites
</p>
</div>
<p>Another spatial weighting approach is to apply k-nearest neighbor distances, which can also be passed to <code>nb2listw()</code>. In general there are minor differences in how these spatial weights are calculated, and the choice can be data specific. For the purposes of our example we will use the distance-based (Euclidean) neighbors above, but for completeness below is the k-nearest neighbor approach.</p>
<div class="sourceCode" id="cb6"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb6-1" title="1">k1<-<span class="kw">knn2nb</span>(<span class="kw">knearneigh</span>(p12.shp.wca2))</a></code></pre></div>
<div id="global-g" class="section level4">
<h4>Global <em>G</em></h4>
<p>Now that we have the nearest neighbor values we need to convert the data into a list for both the global and local <em>G</em> statistics. For the global <em>G</em> (<code>globalG.test(...)</code>), it is recommended that the spatial weights be binary; therefore, in the <code>nb2listw()</code> function we need to use <code>style="B"</code>.</p>
<div class="sourceCode" id="cb7"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb7-1" title="1">nb_lw<-<span class="kw">nb2listw</span>(nb,<span class="dt">style=</span><span class="st">"B"</span>)</a></code></pre></div>
<p>Now to evaluate the dataset from the Global <em>G</em> statistic.</p>
<div class="sourceCode" id="cb8"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb8-1" title="1"><span class="kw">globalG.test</span>(p12.shp.wca2<span class="op">$</span>TPSDF,nb_lw,<span class="dt">alternative=</span><span class="st">"two.sided"</span>)</a></code></pre></div>
<pre><code>##
## Getis-Ord global G statistic
##
## data: p12.shp.wca2$TPSDF
## weights: nb_lw
##
## standard deviate = 0.092001, p-value = 0.9267
## alternative hypothesis: two.sided
## sample estimates:
## Global G statistic Expectation Variance
## 0.48775147 0.48333333 0.00230619</code></pre>
<p>In the output, <code>standard deviate</code> is the <span class="math inline">\(z_{G}\)</span>-score of the Global <em>G</em> statistic (the observed statistic relative to its expectation, in standard deviation units), reported together with the <em>p</em>-value of the test. Other information in the output includes the observed statistic, its expectation and variance.</p>
<p>The Global <em>G</em> results suggest that there is no significant high/low clustering globally across the dataset.</p>
<p>If you want more information on the Global test, <a href="https://www.esri.com/en-us/home" target="_blank">ESRI</a> provides a <a href="https://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/h-how-high-low-clustering-getis-ord-general-g-spat.htm" target="_blank">robust review</a> including all the additional <a href="https://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/h-general-g-additional-math.htm" target="_blank">maths</a> that is behind the statistic.</p>
</div>
<div id="local-g" class="section level4">
<h4>Local <em>G</em></h4>
<p>Similar to the Global test, the local <em>G</em> test uses nearest neighbors. Unlike the Global test, the spatial weights can be row standardized (the default setting in <code>nb2listw()</code>).</p>
<div class="sourceCode" id="cb10"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb10-1" title="1">nb_lw<-<span class="kw">nb2listw</span>(nb)</a>
<a class="sourceLine" id="cb10-2" title="2"></a>
<a class="sourceLine" id="cb10-3" title="3"><span class="co"># local G</span></a>
<a class="sourceLine" id="cb10-4" title="4">local_g<-<span class="kw">localG</span>(p12.shp.wca2<span class="op">$</span>TPSDF,nb_lw)</a></code></pre></div>
<p>The output of the function is a list of <span class="math inline">\(z_{G_{i}^{*}}\)</span>-scores, one for each site. A little extra coding is needed to determine <em>p</em>-values and identify hot/cold spots. Essentially, the values need to be extracted from the <code>local_g</code> object and a <em>p</em>-value calculated from each z-score.</p>
<div class="sourceCode" id="cb11"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb11-1" title="1"><span class="co"># convert to matrix</span></a>
<a class="sourceLine" id="cb11-2" title="2">local_g.ma=<span class="kw">as.matrix</span>(local_g)</a>
<a class="sourceLine" id="cb11-3" title="3"></a>
<a class="sourceLine" id="cb11-4" title="4"><span class="co"># column-bind the local_g data</span></a>
<a class="sourceLine" id="cb11-5" title="5">p12.shp.wca2<-<span class="kw">cbind</span>(p12.shp.wca2,local_g.ma)</a>
<a class="sourceLine" id="cb11-6" title="6"></a>
<a class="sourceLine" id="cb11-7" title="7"><span class="co"># change the names of the new column</span></a>
<a class="sourceLine" id="cb11-8" title="8"><span class="kw">names</span>(p12.shp.wca2)[<span class="kw">ncol</span>(p12.shp.wca2)]=<span class="st">"localg"</span></a></code></pre></div>
<p>Let’s determine the <code>two.sided</code> <em>p</em>-value.</p>
<div class="sourceCode" id="cb12"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb12-1" title="1">p12.shp.wca2<span class="op">$</span>pval<-<span class="st"> </span><span class="dv">2</span><span class="op">*</span><span class="kw">pnorm</span>(<span class="op">-</span><span class="kw">abs</span>(p12.shp.wca2<span class="op">$</span>localg))</a></code></pre></div>
<p>Based on the <span class="math inline">\(z_{G_{i}^{*}}\)</span>-scores and <span class="math inline">\(\rho\)</span>-value we operationally define a hot-spot as <span class="math inline">\(z_{G_{i}^{*}}\)</span>-scores > 0 and <span class="math inline">\(\rho\)</span>-value < <span class="math inline">\(\alpha\)</span> (usually 0.05). Let see if we have hot-spots.</p>
<div class="sourceCode" id="cb13"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb13-1" title="1"><span class="kw">subset</span>(p12.shp.wca2<span class="op">@</span>data,localg<span class="op">></span><span class="dv">0</span><span class="op">&</span>pval<span class="op"><</span><span class="fl">0.05</span>)<span class="op">$</span>STA_ID</a></code></pre></div>
<pre><code>## [1] "M258"</code></pre>
<p>We have one site identified as a hot-spot. Let’s map it out too.</p>
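<p>A minimal sketch of how such a map could be drawn, assuming the objects created above (the published figure may differ):</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Color sites by hot-spot status (z-score > 0 and p-value < 0.05)
hot<-with(p12.shp.wca2@data,localg>0&pval<0.05)

par(mar=c(0.1,0.1,0.1,0.1))
plot(subset(wcas,Name=="WCA 2A"))
plot(p12.shp.wca2,pch=21,bg=ifelse(hot,"red","grey"),add=TRUE)</code></pre></div>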
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2020-09-17-HotSpot_files/figure-html/unnamed-chunk-15-1.png" alt="Soil total phosphorus hot-spots identified using the Getis-Ord $G_{i}^{*}$ spatial statistic." />
<p class="caption">
Soil total phosphorus hot-spots identified using the Getis-Ord <span class="math inline">\(G_{i}^{*}\)</span> spatial statistic.
</p>
</div>
<p>For context, this soil TP hot-spot occurs near discharge locations into Water Conservation Area 2A. Historically, run-off from the upstream agricultural area was diverted to this area to protect both the agricultural area and the downstream urban areas. Restoration activities have since eliminated these direct discharges and water quality has improved. However, we still see the legacy effect of past water management. If you’re interested in how the system is responding, check out the <a href="https://www.sfwmd.gov/science-data/scientific-publications-sfer" target="_blank">South Florida Environmental Report</a>; here is last year’s <a href="https://apps.sfwmd.gov/sfwmd/SFER/2020_sfer_final/v1/chapters/v1_ch3a.pdf" target="_blank">Everglades Water Quality</a> chapter.</p>
<p>If you would like more background on hot-spot analysis, ESRI produces a pretty good resource on <a href="https://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/h-how-hot-spot-analysis-getis-ord-gi-spatial-stati.htm" target="_blank">Getis-Ord <span class="math inline">\(G_{i}^{*}\)</span></a>.</p>
<p>This analysis can also be spatially aggregated (illustrated by the ESRI graphic below) in R by creating a grid, aggregating the data, estimating the nearest neighbors and evaluating at a local or global scale (maybe we will get to that another time).</p>
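<p>As a rough sketch of what that workflow could look like in <code>R</code> (an illustration only, not the ESRI implementation; the cell size, the name of the attribute column and the handling of cells with no neighbors would all need thought for real data):</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># 1. Create a grid over the study area and aggregate points to cells
grd<-raster(extent(p12.shp.wca2))
res(grd)<-1000                                   # 1-km cells (arbitrary choice)
projection(grd)<-projection(p12.shp.wca2)
grd.tp<-rasterize(p12.shp.wca2,grd,field="TPSDF",fun=mean)

# 2. Convert occupied cells to polygons and build contiguity-based neighbors
grd.poly<-rasterToPolygons(grd.tp)               # empty (NA) cells are dropped
nb.grd<-poly2nb(grd.poly)

# 3. Evaluate the local G statistic on the gridded values
grd.lw<-nb2listw(nb.grd,style="B",zero.policy=TRUE)
grd.poly$localg<-as.vector(localG(grd.poly@data[,1],grd.lw,zero.policy=TRUE))</code></pre></div>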
<p><img src="https://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/GUID-D66FFAA9-4DA8-4883-960F-A807F32CF89D-web.png" width="75%" style="display: block; margin: auto;" /></p>
<hr />
<div id="refs" class="references">
<div id="ref-getis_analysis_2010">
<p>Getis, Arthur, and J. K. Ord. 2010. “The Analysis of Spatial Association by Use of Distance Statistics.” <em>Geographical Analysis</em> 24 (3): 189–206. <a href="https://doi.org/10.1111/j.1538-4632.1992.tb00261.x">https://doi.org/10.1111/j.1538-4632.1992.tb00261.x</a>.</p>
</div>
</div>
</div>
</div>
</div>
</section>Too much outside the box - Outliers and Boxplots2020-01-24T00:00:00+00:002020-01-24T00:00:00+00:00https://swampthingecology.org/blog/too-much-outside-the-box---outliers-and-boxplots<section class="main-content">
<p><strong>Keywords:</strong> boxplots, outlier, data analysis</p>
<hr />
<p>In a recent commentary due out in <a href="https://www.springer.com/journal/227" target="_blank">Marine Biology</a> soon (hopefully) I argue against the use of boxplots as a method of outlier detection. It also seems that boxplots are very popular with people holding strong opinions …</p>
<p><img src="https://swampthingecology.org/blog\images\20200124_Boxplot\tweet.png" width="50%" style="display: block; margin: auto;" /></p>
<p>Before we get too far into the weeds, let’s present the classical definition of what an outlier is. Here I use <span class="citation">Gotelli and Ellison (<a href="#ref-gotelli_primer_2013" role="doc-biblioref">2013</a>)</span>, but across the statistical literature outliers are generally defined/described similarly.</p>
<blockquote>
<p>“…extreme data points that are not characteristic of the distribution they were sampled…” <span class="citation">(Gotelli and Ellison <a href="#ref-gotelli_primer_2013" role="doc-biblioref">2013</a>)</span>.</p>
</blockquote>
<p>What would a classic example of this definition look like in “real data” (below is generated data…technically not real data)?</p>
<p>Here is how the data was generated for demonstration purposes</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb1-1" title="1"><span class="kw">set.seed</span>(<span class="dv">123</span>)</a>
<a class="sourceLine" id="cb1-2" title="2"><span class="co"># "Data</span></a>
<a class="sourceLine" id="cb1-3" title="3">N.val<-<span class="dv">100</span></a>
<a class="sourceLine" id="cb1-4" title="4">x.val<-<span class="kw">seq</span>(<span class="dv">0</span>,<span class="dv">1</span>,<span class="dt">length.out=</span>N.val)</a>
<a class="sourceLine" id="cb1-5" title="5">m<-<span class="dv">5</span></a>
<a class="sourceLine" id="cb1-6" title="6">b<-<span class="dv">1</span></a>
<a class="sourceLine" id="cb1-7" title="7">error.val<-<span class="dv">1</span></a>
<a class="sourceLine" id="cb1-8" title="8">y.val<-((m<span class="op">*</span>x.val)<span class="op">+</span>b)<span class="op">+</span><span class="kw">rnorm</span>(N.val,<span class="dv">0</span>,error.val)</a>
<a class="sourceLine" id="cb1-9" title="9"></a>
<a class="sourceLine" id="cb1-10" title="10"><span class="co"># Outlier</span></a>
<a class="sourceLine" id="cb1-11" title="11">y.val.out<-y.val[<span class="dv">95</span>]<span class="op">+</span><span class="fl">2.5</span></a></code></pre></div>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2020-01-24-Boxplot_files/figure-html/unnamed-chunk-4-1.png" alt="Visual example of an outlier based on the definition above." />
<p class="caption">
Visual example of an outlier based on the definition above.
</p>
</div>
<p>Clearly, based on the example above, the <span style="color:red">red</span> point in the plot to the left doesn’t really look like it belongs. A quick density plot of the data with and without the point (use <code>plot(density(...))</code>) gives you a sense of whether the extreme data point is outside of the data distribution. The plot to the right shows the data distribution and mean (dashed) without the extreme value, relative to the extreme value (<span style="color:red">red</span> line).</p>
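<p>Using the simulated data from the code above, a quick sketch of that density comparison might look something like this:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Density of the data without the extreme value, with the extreme value marked
plot(density(y.val),main="",xlab="y")
abline(v=mean(y.val),lty=2)      # mean of the data without the extreme value
abline(v=y.val.out,col="red")    # the extreme value</code></pre></div>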
<p>The next step to really determine if it’s an outlier would be to conduct an outlier test on your data. Outliers can distort the data distribution, affect predictions (if used in a model) and affect the overall accuracy of estimates if they are not detected and handled, especially in bi-variate analyses (such as linear modeling). Most of the information you will see on the internet and in some textbooks says that boxplots are a good way to identify outliers. I fully endorse using boxplots as a first look at the data, just to get a sense of things, as they were intended by <span class="citation">Tukey (<a href="#ref-tukey_exploratory_1977" role="doc-biblioref">1977</a>)</span>. That’s right, <a href="https://en.wikipedia.org/wiki/John_Tukey" target="_blank">Dr. John W Tukey</a> was the mastermind behind the boxplot…you may remember him from such statistical analyses as <a href="https://en.wikipedia.org/wiki/Tukey%27s_range_test" target="_blank">Tukey’s range test/HSD</a> or the <a href="https://en.wikipedia.org/wiki/Tukey_lambda_distribution" target="_blank">Tukey lambda distribution</a>.</p>
<p>Overall, boxplots are extremely helpful for quickly visualizing the central tendency and spread of the data. Don’t confuse the central tendency and spread with the mean and standard deviation, as these values are not usually displayed in boxplots.</p>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2020-01-24-Boxplot_files/figure-html/unnamed-chunk-5-1.png" alt="Components of a classic Tukey boxplot." />
<p class="caption">
Components of a classic Tukey boxplot.
</p>
</div>
<p>At their root, boxplots provide no information on the underlying data distribution and offer a somewhat arbitrary detection of extreme values, especially for non-normal data distributions <span class="citation">(Kampstra <a href="#ref-kampstra_beanplot:_2008" role="doc-biblioref">2008</a>; Krzywinski and Altman <a href="#ref-krzywinski_visualizing_2014" role="doc-biblioref">2014</a>)</span>. A univariate boxplot simply flags as extreme any values that fall more than 1.5 times the inter-quartile range (IQR) below the first or above the third quartile <span class="citation">(Tukey <a href="#ref-tukey_exploratory_1977" role="doc-biblioref">1977</a>)</span>. As discussed above, outliers are extreme values outside the distribution of the data. Since the IQR statistics (i.e. median, 25th quantile, 75th quantile, etc.) are distribution-free calculations, values falling outside the IQR fences are not defined with reference to any distribution. Below are four examples of data pulled from different distributions with a mean of zero (<span class="math inline">\(\mu = 0\)</span>) and standard deviation of one (<span class="math inline">\(\sigma = 1\)</span>). In these cases, especially for the normal and skewed normal distributions, the median, 25<sup>th</sup> quantile and 75<sup>th</sup> quantile values do not differ greatly, but the number of flagged “outliers” does differ.</p>
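<p>To make the 1.5 times IQR rule concrete, here is a small illustration on simulated normal data (not the data behind the figure below):</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># How many values does the 1.5*IQR rule flag in a normal sample?
set.seed(123)
x<-rnorm(10000)

q<-quantile(x,c(0.25,0.75))
fences<-c(q[1]-1.5*IQR(x),q[2]+1.5*IQR(x))
sum(x<fences[1]|x>fences[2])    # values outside the fences

length(boxplot.stats(x)$out)    # similar count using boxplot.stats()</code></pre></div>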
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2020-01-24-Boxplot_files/figure-html/unnamed-chunk-7-1.png" alt="Boxplot and distribution plots of uniform, normal and skewed normal distributions with μ = 0 and σ = 1 (mean and standard deviation) and an N = 10,000." />
<p class="caption">
Boxplot and distribution plots of uniform, normal and skewed normal distributions with μ = 0 and σ = 1 (mean and standard deviation) and an N = 10,000.
</p>
</div>
<p>The boxplot examples above show the span of over 10,000 values pulled from uniform, normal and skewed normal distributions. An immediately obvious observation is that the uniform distribution does not generate any extreme values, while the others generate some depending on the skewness of the distribution. <span class="citation">Kampstra (<a href="#ref-kampstra_beanplot:_2008" role="doc-biblioref">2008</a>)</span> suggests that even for normal distributions the number of extreme values identified will increase with sample size. This is demonstrated below where, as sample size increases, the number of extreme values identified also increases. Furthermore, as sample size increases the IQR estimate narrows, which you would expect given the central limit theorem. This sample-size dependence ultimately makes individual “outlier” detection problematic.</p>
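<p>A quick way to see this sample-size dependence yourself (simulated data, not the exact code behind the figure below):</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Number of boxplot-flagged values as sample size increases (normal data)
set.seed(123)
n.vals<-c(50,100,500,1000,5000,10000)
n.out<-sapply(n.vals,function(n) length(boxplot.stats(rnorm(n))$out))
data.frame(n=n.vals,flagged=n.out)</code></pre></div>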
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2020-01-24-Boxplot_files/figure-html/unnamed-chunk-9-1.png" alt="Number of potential outliers detected using a univariate boxplot (top) and inter-quartile range as a function of sample size (bottom) from a normally distributed simulated dataset with a mean of zero and a standard deviation of one (μ = 0; σ = 1)." />
<p class="caption">
Number of potential outliers detected using a univariate boxplot (top) and inter-quartile range as a function of sample size (bottom) from a normally distributed simulated dataset with a mean of zero and a standard deviation of one (μ = 0; σ = 1).
</p>
</div>
<p>Bottom line, a boxplot is not a suitable outlier detection test but rather an exploratory data analysis tool to understand the data. While boxplots do identify extreme values, these extreme values are not truly outliers; they are just values that fall outside a <em>distribution-less</em> metric near the extremes of the IQR. Outlier tests such as the Grubbs test, Cochran test or even the Dixon test can all be used to identify outliers. These tests and more can be found in the <code>outliers</code> <code>R</code> package. Outlier identification and culling is a tricky situation and requires a strong and rigorous justification, and validation that a data point identified as an outlier is truly an outlier; otherwise you can run afoul of type I and/or type II errors.</p>
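<p>For example, a Grubbs test could be run with the <code>outliers</code> package; a minimal sketch on a simple simulated sample (check the test assumptions, such as approximate normality, before relying on it for real data):</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Grubbs test for a single outlier using the 'outliers' package
library(outliers)

set.seed(123)
z<-c(rnorm(50),4.5)    # a normal sample with one injected extreme value
grubbs.test(z)         # tests whether the most extreme value is an outlier
# dixon.test() and cochran.test() are also available; note the Dixon test
# is intended for small samples (roughly 3 to 30 observations)</code></pre></div>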
<div id="references" class="section level2 unnumbered">
<h2>References</h2>
<div id="refs" class="references">
<div id="ref-gotelli_primer_2013">
<p>Gotelli, Nicholas J., and Aaron M. Ellison. 2013. <em>A Primer of Ecological Statistics</em>. Sunderland, MA: Sinauer Associates, Inc.</p>
</div>
<div id="ref-kampstra_beanplot:_2008">
<p>Kampstra, Peter. 2008. “Beanplot: A Boxplot Alternative for Visual Comparison of Distributions.” <em>Journal of Statistical Software</em> 28 (Code Snippet 1). <a href="https://doi.org/10.18637/jss.v028.c01">https://doi.org/10.18637/jss.v028.c01</a>.</p>
</div>
<div id="ref-krzywinski_visualizing_2014">
<p>Krzywinski, Martin, and Naomi Altman. 2014. “Visualizing Samples with Box Plots.” <em>Nature Methods</em> 11 (2): 119–20. <a href="https://doi.org/10.1038/nmeth.2813">https://doi.org/10.1038/nmeth.2813</a>.</p>
</div>
<div id="ref-tukey_exploratory_1977">
<p>Tukey, John Wilder. 1977. “Exploratory Data Analysis.” In <em>Statistics and Public Policy</em>, edited by Frederick Mosteller, 1st ed. Addison-Wesley Series in Behavioral Science. Quantitative Methods. Addison-Wesley.</p>
</div>
</div>
</div>
</section>PCA basics in #Rstats2019-12-10T00:00:00+00:002019-12-10T00:00:00+00:00https://swampthingecology.org/blog/pca-basics-in-#rstats<script src="https://swampthingecology.org/blog/knitr_files/2019-12-10-PCA_files/htmlwidgets-1.3/htmlwidgets.js"></script>
<script src="https://swampthingecology.org/blog/knitr_files/2019-12-10-PCA_files/jquery-1.12.4/jquery.min.js"></script>
<script src="https://swampthingecology.org/blog/knitr_files/2019-12-10-PCA_files/leaflet-1.3.1/leaflet.js"></script>
<script src="https://swampthingecology.org/blog/knitr_files/2019-12-10-PCA_files/Proj4Leaflet-1.0.1/proj4-compressed.js"></script>
<script src="https://swampthingecology.org/blog/knitr_files/2019-12-10-PCA_files/Proj4Leaflet-1.0.1/proj4leaflet.js"></script>
<script src="https://swampthingecology.org/blog/knitr_files/2019-12-10-PCA_files/leaflet-binding-2.0.2/leaflet.js"></script>
<script src="https://swampthingecology.org/blog/knitr_files/2019-12-10-PCA_files/leaflet-providers-1.1.17/leaflet-providers.js"></script>
<script src="https://swampthingecology.org/blog/knitr_files/2019-12-10-PCA_files/leaflet-providers-plugin-2.0.2/leaflet-providers-plugin.js"></script>
<section class="main-content">
<p><strong>Keywords:</strong> ordination, R, PCA</p>
<hr />
<p>The masses have spoken!!</p>
<p><img src="https://swampthingecology.org/blog\images\20191210_PCA\twitterpoll.png" width="50%" style="display: block; margin: auto;" /></p>
<p>Also I got a wise piece of advice from <a href="https://twitter.com/coolbutuseless" target="_blank">mikefc</a> regarding <code>R</code> blog posts.</p>
<p><img src="https://swampthingecology.org/blog\images\20191210_PCA\GhostBuster_meme.png" width="75%" style="display: block; margin: auto;" /></p>
<hr />
<p>This post was partly motivated by an article by the <a href="https://medium.com/@bioturing" target="_blank">BioTuring Team</a> regarding <a href="https://medium.com/@bioturing/how-to-read-pca-biplots-and-scree-plots-186246aae063?" target="_blank">PCA</a>. In their article the authors provide the basic concepts behind interpreting a Principal Component Analysis (PCA) plot. Before rehashing PCA plots in <code>R</code> I would like to cover some basics.</p>
<p>Ordination analysis, which PCA is part of, is used to order (or ordinate…hence the name) multivariate data. Ultimately ordination makes new variables called principal axes along which samples are scored and/or ordered <span class="citation">(Gotelli and Ellison <a href="#ref-gotelli_primer_2004" role="doc-biblioref">2004</a>)</span>. There are at least five routinely used ordination analyses; here I intend to cover just PCA. Maybe in the future I will cover the other four as they relate to ecological data analysis.</p>
<div id="principal-component-analysis" class="section level2">
<h2>Principal Component Analysis</h2>
<p>I have heard PCA called lots of things in my day, including but not limited to magic, statistical hand waving, mass plotting, statistical guesstimate, etc. When you have a multivariate dataset (data with more than one variable) it can be tough to figure out what matters. Think water quality data with a whole suite of nutrients, or a fish study with biological, habitat and water chemistry data for several sites along a stream/river. PCA is the best way to reduce the dimensionality of multivariate data to determine what <em>statistically</em> and practically matters. But it is more than a data winnowing technique; it can also be used to demonstrate similarity (or difference) between groups and relationships between variables. A major disadvantage of PCA is that it is a data hungry analysis (see assumptions below).</p>
<div id="assumptions-of-pca" class="section level3">
<h3>Assumptions of PCA</h3>
<p>Finding a single source related to the assumptions of PCA is rare. Below is a combination of several sources including seminars, webpages, course notes, etc. Therefore this is not an exhaustive list of all assumptions and I could have missed some. I put this together for my benefit as well as yours. Proceed with caution!!</p>
<ul>
<li><p><strong>Multiple Variables:</strong> This one is obvious. Ideally, given the nature of the analysis, multiple variables are required to perform the analysis. Moreover, variables should be measured at the continuous level, although ordinal variables are frequently used.</p></li>
<li><p><strong>Sample adequacy:</strong> As with most (if not all) statistical analyses, large enough sample sizes are required to produce a reliable result. Generally a minimum of 150 cases (i.e. rows), or 5 to 10 cases per variable, is recommended for a PCA. Some have suggested performing a sampling adequacy analysis such as the Kaiser-Meyer-Olkin (KMO) Measure of Sampling Adequacy. However, KMO is less a measure of sample size adequacy than a measure of the suitability of the data for factor analysis, which leads to the next point.</p></li>
<li><p><strong>Linear relationships:</strong> It is assumed that the relationships between variables are linear. The basis of this assumption is rooted in the fact that PCA is based on Pearson correlation coefficients and therefore the assumptions of Pearson’s correlation also hold true. Generally, this assumption is somewhat relaxed…even though it shouldn’t be…with the use of ordinal data for variables.</p></li>
</ul>
<p>The <code>KMOS</code> and <code>bart_spher</code> functions in the <code>REdaS</code> <code>R</code> library can be used to check the measure of sampling adequacy and whether the data differ from an identity matrix; below is a quick example.</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb1-1" title="1"><span class="kw">library</span>(REdaS)</a>
<a class="sourceLine" id="cb1-2" title="2"><span class="kw">library</span>(vegan)</a>
<a class="sourceLine" id="cb1-3" title="3"><span class="kw">library</span>(reshape)</a>
<a class="sourceLine" id="cb1-4" title="4"></a>
<a class="sourceLine" id="cb1-5" title="5"><span class="kw">data</span>(varechem);<span class="co">#from vegan package</span></a>
<a class="sourceLine" id="cb1-6" title="6"></a>
<a class="sourceLine" id="cb1-7" title="7"><span class="co"># KMO</span></a>
<a class="sourceLine" id="cb1-8" title="8"><span class="kw">KMOS</span>(varechem)</a></code></pre></div>
<pre><code>##
## Kaiser-Meyer-Olkin Statistics
##
## Call: KMOS(x = varechem)
##
## Measures of Sampling Adequacy (MSA):
## N P K Ca Mg S Al
## 0.2770880 0.7943090 0.6772451 0.7344827 0.6002924 0.7193302 0.4727618
## Fe Mn Zn Mo Baresoil Humdepth pH
## 0.5066961 0.6029551 0.6554475 0.4362350 0.7007942 0.5760349 0.4855293
##
## KMO-Criterion: 0.6119355</code></pre>
<div class="sourceCode" id="cb3"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb3-1" title="1"><span class="co"># Bartlett's Test Of Sphericity</span></a>
<a class="sourceLine" id="cb3-2" title="2"><span class="kw">bart_spher</span>(varechem)</a></code></pre></div>
<pre><code>## Bartlett's Test of Sphericity
##
## Call: bart_spher(x = varechem)
##
## X2 = 260.217
## df = 91
## p-value < 2.22e-16</code></pre>
<p>The <code>varechem</code> dataset appears to be suitable for factor analysis. The KMO value for the entire dataset is 0.61, above the suggested 0.5 threshold. Furthermore, the data is significantly different from an identity matrix (<em>H<sub>0</sub> :</em> all off-diagonal correlations are zero). <!--http://minato.sip21c.org/swtips/factor-in-R.pdf--></p>
<ul>
<li><strong>No significant outliers:</strong> As in most statistical analyses, outliers can skew the analysis. In PCA, outliers can have a disproportionate influence on the resulting component computation. Since principal components are estimated by essentially re-scaling the data while retaining the variance, outliers could skew the estimate of each component within a PCA. Another way to visualize how PCA is performed is that it uses rotation of the original axes to derive new axes, which maximize the variance in the data set. In 2D this looks like this (a rough code sketch follows the figure below):</li>
</ul>
<p><img src="https://swampthingecology.org/blog/knitr_files/2019-12-10-PCA_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /></p>
<p>You would expect that if true outliers are present, the newly derived axes will be skewed. Outlier analysis and the issues associated with identifying outliers are a whole other ball game that I will not cover here, other than saying box-plots are not a suitable outlier identification analysis; see <span class="citation">Tukey (<a href="#ref-mosteller_exploratory_1977" role="doc-biblioref">1977</a>)</span> for more detail on boxplots (I have a manuscript <em>In Prep</em> focusing on this exact issue).</p>
</div>
</div>
<div id="terminology" class="section level2">
<h2>Terminology</h2>
<p>Before moving forward I wanted to dedicate some additional time to some terms specific to component analysis. By now we know the general gist of PCA … in case you weren’t paying attention, PCA is essentially a dimensionality reduction or data compression method to understand how multiple variables correlate in a given dataset. Typically when people discuss PCA they also use the terms loadings, eigenvectors and eigenvalues (a short code sketch relating these terms follows the list below).</p>
<ul>
<li><p><strong>Eigenvectors</strong> are unit-scaled loadings; they define the direction (the rotation) of each principal axis. Scaling an eigenvector by the square root of its eigenvalue gives the factor loadings for that component.</p></li>
<li><p><strong>Eigenvalues</strong> also called characteristic roots is the measure of variation in the total sample accounted for by each factor. Computationally, a factor’s eigenvalues are determined as the sum of its squared factor loadings for all the variables. The ratio of eigenvalues is the ratio of explanatory importance of the factors with respect to the variables (remember this for later).</p></li>
<li><p><strong>Factor Loadings</strong> are the correlations between the original variables and the factors. Analogous to Pearson’s r, the squared factor loading is the percent of variance in that variable explained by the factor (…again remember this for later).</p></li>
</ul>
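<p>These three terms can be tied together with a little base <code>R</code>. Below is a minimal sketch (using the <code>varechem</code> dataset from the example above) of how eigenvalues, eigenvectors and loadings fall out of the eigen-decomposition of the correlation matrix.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(vegan)
data(varechem)

# PCA on scaled data works from the correlation matrix
eig.decomp=eigen(cor(varechem))

eig.decomp$values        # eigenvalues (variance accounted for by each component)
eig.decomp$vectors[,1]   # eigenvector of the first component (unit-scaled loadings)

# factor loadings: eigenvector re-scaled by the square root of its eigenvalue
load.PC1=eig.decomp$vectors[,1]*sqrt(eig.decomp$values[1])
load.PC1^2               # squared loadings ~ proportion of each variable explained by PC1
</code></pre></div>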
</div>
<div id="analysis" class="section level2">
<h2>Analysis</h2>
<p>Now that we have the basic terminology laid out and we know the general assumptions, let's do an example analysis. Since I am an aquatic biogeochemist I am going to use some limnological data. Here we have a subset of long-term monitoring locations from six lakes within south Florida monitored by the <a href="https://www.sfwmd.gov/" target="_blank">South Florida Water Management District</a> (SFWMD). To retrieve the data we will use the <code>AnalystHelper</code> package (<a href="https://github.com/SwampThingPaul/AnalystHelper" target="_blank">link</a>), which has a function to retrieve data from the SFWMD online environmental database <a href="https://my.sfwmd.gov/dbhydroplsql/show_dbkey_info.main_menu" target="_blank">DBHYDRO</a>.</p>
<p>Let's retrieve and format the data for the PCA.</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb5-1" title="1"><span class="co">#Libraries/packages needed</span></a>
<a class="sourceLine" id="cb5-2" title="2"><span class="kw">library</span>(AnalystHelper)</a>
<a class="sourceLine" id="cb5-3" title="3"><span class="kw">library</span>(reshape)</a>
<a class="sourceLine" id="cb5-4" title="4"></a>
<a class="sourceLine" id="cb5-5" title="5"><span class="co">#Date Range of data</span></a>
<a class="sourceLine" id="cb5-6" title="6">sdate=<span class="kw">as.Date</span>(<span class="st">"2005-05-01"</span>)</a>
<a class="sourceLine" id="cb5-7" title="7">edate=<span class="kw">as.Date</span>(<span class="st">"2019-05-01"</span>)</a>
<a class="sourceLine" id="cb5-8" title="8"></a>
<a class="sourceLine" id="cb5-9" title="9"><span class="co">#Site list with lake name (meta-data)</span></a>
<a class="sourceLine" id="cb5-10" title="10">sites=<span class="kw">data.frame</span>(<span class="dt">Station.ID=</span><span class="kw">c</span>(<span class="st">"LZ40"</span>,<span class="st">"ISTK2S"</span>,<span class="st">"E04"</span>,<span class="st">"D02"</span>,<span class="st">"B06"</span>,<span class="st">"A03"</span>),</a>
<a class="sourceLine" id="cb5-11" title="11"> <span class="dt">LAKE=</span><span class="kw">c</span>(<span class="st">"Okeechobee"</span>,<span class="st">"Istokpoga"</span>,<span class="st">"Kissimmee"</span>,<span class="st">"Hatchineha"</span>,</a>
<a class="sourceLine" id="cb5-12" title="12"> <span class="st">"Tohopekaliga"</span>,<span class="st">"East Tohopekaliga"</span>))</a>
<a class="sourceLine" id="cb5-13" title="13"></a>
<a class="sourceLine" id="cb5-14" title="14"><span class="co">#Water Quality parameters (meta-data)</span></a>
<a class="sourceLine" id="cb5-15" title="15">parameters=<span class="kw">data.frame</span>(<span class="dt">Test.Number=</span><span class="kw">c</span>(<span class="dv">67</span>,<span class="dv">20</span>,<span class="dv">32</span>,<span class="dv">179</span>,<span class="dv">112</span>,<span class="dv">8</span>,<span class="dv">10</span>,<span class="dv">23</span>,<span class="dv">25</span>,<span class="dv">80</span>,<span class="dv">18</span>,<span class="dv">21</span>),</a>
<a class="sourceLine" id="cb5-16" title="16"> <span class="dt">param=</span><span class="kw">c</span>(<span class="st">"Alk"</span>,<span class="st">"NH4"</span>,<span class="st">"Cl"</span>,<span class="st">"Chla"</span>,<span class="st">"Chla"</span>,<span class="st">"DO"</span>,<span class="st">"pH"</span>,</a>
<a class="sourceLine" id="cb5-17" title="17"> <span class="st">"SRP"</span>,<span class="st">"TP"</span>,<span class="st">"TN"</span>,<span class="st">"NOx"</span>,<span class="st">"TKN"</span>))</a>
<a class="sourceLine" id="cb5-18" title="18"></a>
<a class="sourceLine" id="cb5-19" title="19"><span class="co"># Retrieve the data</span></a>
<a class="sourceLine" id="cb5-20" title="20">dat=<span class="kw">DBHYDRO_WQ</span>(sdate,edate,sites<span class="op">$</span>Station.ID,parameters<span class="op">$</span>Test.Number)</a>
<a class="sourceLine" id="cb5-21" title="21"></a>
<a class="sourceLine" id="cb5-22" title="22"><span class="co"># Merge metadata with dataset</span></a>
<a class="sourceLine" id="cb5-23" title="23">dat=<span class="kw">merge</span>(dat,sites,<span class="st">"Station.ID"</span>)</a>
<a class="sourceLine" id="cb5-24" title="24">dat=<span class="kw">merge</span>(dat,parameters,<span class="st">"Test.Number"</span>)</a>
<a class="sourceLine" id="cb5-25" title="25"></a>
<a class="sourceLine" id="cb5-26" title="26"><span class="co"># Cross tabulate the data based on parameter name</span></a>
<a class="sourceLine" id="cb5-27" title="27">dat.xtab=<span class="kw">cast</span>(dat,Station.ID<span class="op">+</span>LAKE<span class="op">+</span>Date.EST<span class="op">~</span>param,<span class="dt">value=</span><span class="st">"HalfMDL"</span>,mean)</a>
<a class="sourceLine" id="cb5-28" title="28"></a>
<a class="sourceLine" id="cb5-29" title="29"><span class="co"># Cleaning up/calculating parameters</span></a>
<a class="sourceLine" id="cb5-30" title="30">dat.xtab<span class="op">$</span>TN=<span class="kw">with</span>(dat.xtab,<span class="kw">TN_Combine</span>(NOx,TKN,TN))</a>
<a class="sourceLine" id="cb5-31" title="31">dat.xtab<span class="op">$</span>DIN=<span class="kw">with</span>(dat.xtab, NOx<span class="op">+</span>NH4)</a>
<a class="sourceLine" id="cb5-32" title="32"></a>
<a class="sourceLine" id="cb5-33" title="33"><span class="co"># More cleaning of the dataset </span></a>
<a class="sourceLine" id="cb5-34" title="34">vars=<span class="kw">c</span>(<span class="st">"Alk"</span>,<span class="st">"Cl"</span>,<span class="st">"Chla"</span>,<span class="st">"DO"</span>,<span class="st">"pH"</span>,<span class="st">"SRP"</span>,<span class="st">"TP"</span>,<span class="st">"TN"</span>,<span class="st">"DIN"</span>)</a>
<a class="sourceLine" id="cb5-35" title="35">dat.xtab=dat.xtab[,<span class="kw">c</span>(<span class="st">"Station.ID"</span>,<span class="st">"LAKE"</span>,<span class="st">"Date.EST"</span>,vars)]</a>
<a class="sourceLine" id="cb5-36" title="36"></a>
<a class="sourceLine" id="cb5-37" title="37"><span class="kw">head</span>(dat.xtab)</a></code></pre></div>
<pre><code>## Station.ID LAKE Date.EST Alk Cl Chla DO pH SRP
## 1 A03 East Tohopekaliga 2005-05-17 17 19.7 4.00 7.90 6.10 0.0015
## 2 A03 East Tohopekaliga 2005-06-21 22 15.4 4.70 6.90 6.40 0.0015
## 3 A03 East Tohopekaliga 2005-07-19 16 15.1 5.10 7.10 NaN 0.0015
## 4 A03 East Tohopekaliga 2005-08-16 17 14.0 3.00 6.90 6.30 0.0015
## 5 A03 East Tohopekaliga 2005-08-30 NaN NaN 6.00 7.07 7.44 NaN
## 6 A03 East Tohopekaliga 2005-09-20 17 16.3 0.65 7.30 6.70 0.0010
## TP TN DIN
## 1 0.024 0.710 0.040
## 2 0.024 0.680 0.030
## 3 0.020 0.630 0.020
## 4 0.021 0.550 0.030
## 5 NaN NA NaN
## 6 0.018 0.537 0.017</code></pre>
<p>If you are playing the home game with this dataset you’ll notice some <code>NA</code> values; this is because the data were either not collected or were removed due to fatal laboratory or field QA/QC issues. PCA doesn’t work with <code>NA</code> values; unfortunately, this means that the whole row needs to be excluded from the analysis.</p>
<p>Let's actually get down to doing a PCA. First off, you have several different flavors (functions) of PCA to choose from. Each has its own nuances and they come from different packages.</p>
<ul>
<li><p><code>prcomp()</code> and <code>princomp()</code> are from the base <code>stats</code> package. The quickest, easiest and most stable versions since they're in base.</p></li>
<li><p><code>PCA()</code> in the <code>FactoMineR</code> package.</p></li>
<li><p><code>dudi.pca()</code> in the <code>ade4</code> package.</p></li>
<li><p><code>acp()</code> in the <code>amap</code> package.</p></li>
<li><p><code>rda()</code> in the <code>vegan</code> package. More on this later.</p></li>
</ul>
<p>Personally, I only have experience working with the <code>prcomp</code>, <code>princomp</code> and <code>rda</code> functions for PCA. The information shown in this post can be extracted or calculated from any of these functions. Some are straightforward, others are more sinuous. Above I mentioned using the <code>rda</code> function for PCA. <code>rda()</code> is a function in the <code>vegan</code> <code>R</code> package for redundancy analysis (RDA) and the function I am most familiar with for performing a PCA. Redundancy analysis is a technique used to explain a dataset Y using a dataset X. Normally RDA is used for “constrained ordination” (ordination with covariates or predictors). Without predictors, RDA is the same as PCA.</p>
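<p>As a quick sanity check of that last point, here is a minimal sketch (again using the <code>varechem</code> data from above) showing that an unconstrained <code>rda()</code> and <code>prcomp()</code> on scaled data explain the variance the same way:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Unconstrained RDA (i.e. PCA) versus prcomp() on the same scaled data
pca.rda=rda(varechem,scale=TRUE)
pca.prcomp=prcomp(varechem,scale.=TRUE)

# proportion of variance explained by each component should agree
pca.rda$CA$eig/sum(pca.rda$CA$eig)
pca.prcomp$sdev^2/sum(pca.prcomp$sdev^2)
</code></pre></div>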
<p>As I mentioned above, <code>NA</code>s are a no-go in a PCA, so let's format/clean the data and see how much the data is reduced by the <code>na.omit</code> action.</p>
<div class="sourceCode" id="cb7"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb7-1" title="1">dat.xtab2=<span class="kw">na.omit</span>(dat.xtab)</a>
<a class="sourceLine" id="cb7-2" title="2"></a>
<a class="sourceLine" id="cb7-3" title="3"><span class="kw">nrow</span>(dat.xtab)</a></code></pre></div>
<pre><code>## [1] 725</code></pre>
<div class="sourceCode" id="cb9"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb9-1" title="1"><span class="kw">nrow</span>(dat.xtab2)</a></code></pre></div>
<pre><code>## [1] 515</code></pre>
<p>Also, as with most data, it's a good idea to look at your data. Granted, this gets tough when the number of variables gets really big…imagine trying to look at a combination of more than eight or nine parameters. Here we have a scatterplot of the water quality data within our six lakes. The parameters in this analysis are Alkalinity (ALK), Chloride (Cl), chlorophyll-<em>a</em> (Chl-a), dissolved oxygen (DO), pH, soluble reactive phosphorus (SRP), total phosphorus (TP), total nitrogen (TN) and dissolved inorganic nitrogen (DIN).</p>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2019-12-10-PCA_files/figure-html/unnamed-chunk-9-1.png" alt="Scatterplot of all data for the example `dat.xtab2` dataset." />
<p class="caption">
Scatterplot of all data for the example <code>dat.xtab2</code> dataset.
</p>
</div>
<p>Alright, now the data is formatted and we have done some general data exploration. Let's check the adequacy of the data for component analysis…remember the KMO analysis?</p>
<div class="sourceCode" id="cb11"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb11-1" title="1"><span class="kw">KMOS</span>(dat.xtab2[,vars])</a></code></pre></div>
<pre><code>##
## Kaiser-Meyer-Olkin Statistics
##
## Call: KMOS(x = dat.xtab2[, vars])
##
## Measures of Sampling Adequacy (MSA):
## Alk Cl Chla DO pH SRP TP
## 0.7274872 0.7238120 0.5096832 0.3118529 0.6392602 0.7777460 0.7524428
## TN DIN
## 0.6106997 0.7459682
##
## KMO-Criterion: 0.6972786</code></pre>
<p>Based on the KMO analysis, the KMO-Criterion of the dataset is 0.7, well above the suggested 0.5 threshold.</p>
<p>Let's also check whether the data is significantly different from an identity matrix.</p>
<div class="sourceCode" id="cb13"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb13-1" title="1"><span class="kw">bart_spher</span>(dat.xtab2[,vars])</a></code></pre></div>
<pre><code>## Bartlett's Test of Sphericity
##
## Call: bart_spher(x = dat.xtab2[, vars])
##
## X2 = 4616.865
## df = 36
## p-value < 2.22e-16</code></pre>
<p>Based on the sphericity test (<code>bart_spher()</code>) the results look good to move forward with a PCA. The actual PCA is pretty straightforward after the data is formatted and <em>“cleaned”</em>.</p>
<div class="sourceCode" id="cb15"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb15-1" title="1"><span class="kw">library</span>(vegan)</a>
<a class="sourceLine" id="cb15-2" title="2"></a>
<a class="sourceLine" id="cb15-3" title="3">dat.xtab2.pca=<span class="kw">rda</span>(dat.xtab2[,vars],<span class="dt">scale=</span>T)</a></code></pre></div>
<p>Before we even begin to plot the typical PCA plot…try <code>biplot()</code> if you're interested. Let's first look at the importance of each component and the variance explained by each.</p>
<div class="sourceCode" id="cb16"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb16-1" title="1"><span class="co">#Extract eigenvalues (see definition above)</span></a>
<a class="sourceLine" id="cb16-2" title="2">eig <-<span class="st"> </span>dat.xtab2.pca<span class="op">$</span>CA<span class="op">$</span>eig</a>
<a class="sourceLine" id="cb16-3" title="3"></a>
<a class="sourceLine" id="cb16-4" title="4"><span class="co"># Percent of variance explained by each compoinent</span></a>
<a class="sourceLine" id="cb16-5" title="5">variance <-<span class="st"> </span>eig<span class="op">*</span><span class="dv">100</span><span class="op">/</span><span class="kw">sum</span>(eig)</a>
<a class="sourceLine" id="cb16-6" title="6"></a>
<a class="sourceLine" id="cb16-7" title="7"><span class="co"># The cumulative variance of each component (should sum to 1)</span></a>
<a class="sourceLine" id="cb16-8" title="8">cumvar <-<span class="st"> </span><span class="kw">cumsum</span>(variance)</a>
<a class="sourceLine" id="cb16-9" title="9"></a>
<a class="sourceLine" id="cb16-10" title="10"><span class="co"># Combine all the data into one data.frame</span></a>
<a class="sourceLine" id="cb16-11" title="11">eig.pca <-<span class="st"> </span><span class="kw">data.frame</span>(<span class="dt">eig =</span> eig, <span class="dt">variance =</span> variance,<span class="dt">cumvariance =</span> cumvar)</a></code></pre></div>
<p>As with most things in <code>R</code>, there is always more than one way to do things. This same information can be extracted using <code>summary(dat.xtab2.pca)$cont</code>.</p>
<p>What do the component eigenvalues and percent variance mean…and what do they tell us? This information tells us how much variance is explained by each component. It also helps identify which components should be used moving forward.</p>
<p>Generally there are two rules:</p>
<ol style="list-style-type: decimal">
<li>Pick components with eigenvalues of at least 1.
<ul>
<li>This is called the Kaiser rule. A variation of this method has been created where the confidence interval of each eigenvalue is calculated and only factors whose entire confidence interval is greater than 1.0 are retained <span class="citation">(Beran and Srivastava <a href="#ref-beran_bootstrap_1985" role="doc-biblioref">1985</a>, <a href="#ref-beran_correction:_1987" role="doc-biblioref">1987</a>; Larsen and Warne <a href="#ref-larsen_estimating_2010" role="doc-biblioref">2010</a>)</span>. There is an <code>R</code> package that can calculate eigenvalue confidence intervals through bootstrapping; I’m not going to cover this here but below is an example if you want to explore it for yourself.</li>
</ul></li>
</ol>
<div class="sourceCode" id="cb17"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb17-1" title="1"><span class="kw">library</span>(eigenprcomp)</a>
<a class="sourceLine" id="cb17-2" title="2"></a>
<a class="sourceLine" id="cb17-3" title="3"><span class="kw">boot_pr_comp</span>(<span class="kw">as.matrix</span>(dat.xtab2[,vars]))</a></code></pre></div>
<ol start="2" style="list-style-type: decimal">
<li>The selected components should be able to describe at least 80% of the variance.</li>
</ol>
<p>If you look at <code>eig.pca</code> you’ll see that based on these criteria components 1, 2 and 3 are the ones to focus on, as they are enough to describe the data. While looking at the raw numbers is good, nice visualizations are a bonus. A scree plot displays these data and shows how much variation each component captures from the data.</p>
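<p>If you want to put a quick scree plot together yourself, a bare-bones base <code>R</code> sketch from the <code>eig.pca</code> data frame built above (the figures below add more polish) could be:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Quick-and-dirty scree plot of the eigenvalues with the Kaiser threshold
barplot(eig.pca$eig,names.arg=paste0("PC",1:nrow(eig.pca)),
        ylab="Eigenvalue",xlab="Principal Component")
abline(h=1,lty=2,col="red")   # Kaiser rule: retain components with eigenvalue of at least 1
</code></pre></div>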
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2019-12-10-PCA_files/figure-html/unnamed-chunk-15-1.png" alt="Scree plot of eigenvalues for each prinicipal component of `dat.xtab2.pca` with the Kaiser threshold identified." />
<p class="caption">
Scree plot of eigenvalues for each prinicipal component of <code>dat.xtab2.pca</code> with the Kaiser threshold identified.
</p>
</div>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2019-12-10-PCA_files/figure-html/unnamed-chunk-16-1.png" alt="Scree plot of the variance and cumulative variance for each priniciple component from `dat.xtab2.pca`." />
<p class="caption">
Scree plot of the variance and cumulative variance for each priniciple component from <code>dat.xtab2.pca</code>.
</p>
</div>
<p>Now that we know which components are important, let's put together our biplot and extract components (if needed). To extract components and specific loadings we can use the <code>scores()</code> function in the <code>vegan</code> package. It is a generic function to extract scores from <code>vegan</code> ordination objects such as RDA, CCA, etc. This function also seems to work with the <code>prcomp</code> and <code>princomp</code> PCA functions in the <code>stats</code> package.</p>
<div class="sourceCode" id="cb18"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb18-1" title="1">scrs=<span class="kw">scores</span>(dat.xtab2.pca,<span class="dt">display=</span><span class="kw">c</span>(<span class="st">"sites"</span>,<span class="st">"species"</span>),<span class="dt">choices=</span><span class="kw">c</span>(<span class="dv">1</span>,<span class="dv">2</span>,<span class="dv">3</span>));</a></code></pre></div>
<p><code>scrs</code> is a list of two items, species and sites. Species corresponds to the columns of the data and sites corresponds to the rows. Use <code>choices</code> to extract the components you want; in this case we want the first three. Now we can plot the scores.</p>
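<p>A bare-bones version of the score plot (the figures below add grouping, colors and labels) might look something like this:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Minimal sketch: plot the site (row) scores for the first two components
plot(scrs$sites[,1],scrs$sites[,2],pch=19,col=adjustcolor("grey",0.5),
     xlab="PCA 1",ylab="PCA 2")
abline(h=0,v=0,lty=3)
</code></pre></div>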
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2019-12-10-PCA_files/figure-html/unnamed-chunk-18-1.png" alt="PCA biplot of two component comparisons from the `data.xtab2.pca` analysis." />
<p class="caption">
PCA biplot of two component comparisons from the <code>data.xtab2.pca</code> analysis.
</p>
</div>
<p>Typically when you see a PCA biplot, you also see arrows for each variable. These are commonly called loadings and can be interpreted as follows (a short code sketch of adding these arrows follows the list):</p>
<ul>
<li><p>When two vectors are close, forming a small angle, the variables are typically positively correlated.</p></li>
<li><p>If two vectors are at an angle of 90<span class="math inline">\(^\circ\)</span> they are typically not correlated.</p></li>
<li><p>If two vectors are at a large angle, say in the vicinity of 180<span class="math inline">\(^\circ\)</span>, they are typically negatively correlated.</p></li>
</ul>
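<p>Continuing the minimal score plot sketch from above, the loading arrows can be overlaid like this (note that the figures below rescale the loadings for readability):</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Overlay the variable (species) scores as loading vectors on the open score plot
arrows(0,0,scrs$species[,1],scrs$species[,2],length=0.05,col="indianred1")
text(scrs$species[,1],scrs$species[,2],labels=rownames(scrs$species),
     col="indianred1",cex=0.75,pos=3)
</code></pre></div>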
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2019-12-10-PCA_files/figure-html/unnamed-chunk-19-1.png" alt="PCA biplot of two component comparisons from the `data.xtab2.pca` analysis with rescaled loadings." />
<p class="caption">
PCA biplot of two component comparisons from the <code>data.xtab2.pca</code> analysis with rescaled loadings.
</p>
</div>
<p>You can take this even further by showing how each lake falls in the ordination space by joining the <code>sites</code> scores to the original data frame. This is also how you use the derived components for further analysis.</p>
<div class="sourceCode" id="cb19"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb19-1" title="1">dat.xtab2=<span class="kw">cbind</span>(dat.xtab2,scrs<span class="op">$</span>sites)</a>
<a class="sourceLine" id="cb19-2" title="2"></a>
<a class="sourceLine" id="cb19-3" title="3"><span class="kw">head</span>(dat.xtab2)</a></code></pre></div>
<pre><code>## Station.ID LAKE Date.EST Alk Cl Chla DO pH SRP
## 1 A03 East Tohopekaliga 2005-05-17 17 19.7 4.00 7.9 6.1 0.0015
## 2 A03 East Tohopekaliga 2005-06-21 22 15.4 4.70 6.9 6.4 0.0015
## 4 A03 East Tohopekaliga 2005-08-16 17 14.0 3.00 6.9 6.3 0.0015
## 6 A03 East Tohopekaliga 2005-09-20 17 16.3 0.65 7.3 6.7 0.0010
## 8 A03 East Tohopekaliga 2005-10-19 15 14.3 2.60 7.8 6.8 0.0010
## 9 A03 East Tohopekaliga 2005-11-15 13 15.8 3.70 8.6 6.7 0.0020
## TP TN DIN PC1 PC2 PC3
## 1 0.024 0.710 0.040 -0.3901117 -0.2240239 -0.5666993
## 2 0.024 0.680 0.030 -0.3912797 -0.2083258 -0.6284024
## 4 0.021 0.550 0.030 -0.4290627 -0.2486860 -0.6599207
## 6 0.018 0.537 0.017 -0.4045084 -0.2775129 -0.4566961
## 8 0.017 0.454 0.014 -0.4194518 -0.2718903 -0.3418373
## 9 0.010 0.437 0.017 -0.4232014 -0.2807803 -0.2434219</code></pre>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2019-12-10-PCA_files/figure-html/unnamed-chunk-21-1.png" alt="PCA biplot of two component comparisons from the `data.xtab2.pca` analysis with rescaled loadings and Lakes identified." />
<p class="caption">
PCA biplot of two component comparisons from the <code>data.xtab2.pca</code> analysis with rescaled loadings and Lakes identified.
</p>
</div>
<p>You can extract a lot of great information from these plots and the underlying component data, but immediately we see how the different lakes group (i.e. Lake Okeechobee is obviously different from the other lakes) and how differently the lakes are loaded with respect to the different variables. Generally this grouping makes sense, especially for the lakes to the left of the plot (i.e. East Tohopekaliga, Tohopekaliga, Hatchineha and Kissimmee); these lakes are connected, have similar geomorphology, are managed in a similar fashion and generally have similar upstream characteristics with shared watersheds.</p>
<p>I hope this blog post has provided a better appreciation of component analysis in <code>R</code>. This is by no means a comprehensive workflow; lots of factors need to be considered during this type of analysis and this post only scratches the surface.</p>
<!--
some of the different background that motivated this post.
https://www.statisticssolutions.com/principal-component-analysis-pca/
https://statistics.laerd.com/spss-tutorials/principal-components-analysis-pca-using-spss-statistics.php
https://rpubs.com/jaelison/135029
https://medium.com/@bioturing/how-to-read-pca-biplots-and-scree-plots-186246aae063?
https://medium.com/@bioturing/principal-component-analysis-explained-simply-894e8f6f4bfb
https://ourcodingclub.github.io/2018/05/04/ordination.html
https://www.xlstat.com/en/solutions/features/redundancy-analysis-rda interesting explaination of RDA
-->
</div>
<div id="references" class="section level2 unnumbered">
<h2>References</h2>
<div id="refs" class="references">
<div id="ref-beran_bootstrap_1985">
<p>Beran, Rudolf, and Muni S. Srivastava. 1985. “Bootstrap Tests and Confidence Regions for Functions of a Covariance Matrix.” <em>The Annals of Statistics</em> 13 (1): 95–115. <a href="https://doi.org/10.1214/aos/1176346579">https://doi.org/10.1214/aos/1176346579</a>.</p>
</div>
<div id="ref-beran_correction:_1987">
<p>———. 1987. “Correction: Bootstrap Tests and Confidence Regions for Functions of a Covariance Matrix.” <em>The Annals of Statistics</em> 15 (1): 470–71. <a href="https://doi.org/10.1214/aos/1176350284">https://doi.org/10.1214/aos/1176350284</a>.</p>
</div>
<div id="ref-gotelli_primer_2004">
<p>Gotelli, Nicholas J., and Aaron M. Ellison. 2004. <em>A Primer of Ecological Statistics</em>. Sinauer Associates Publishers.</p>
</div>
<div id="ref-larsen_estimating_2010">
<p>Larsen, Ross, and Russell T. Warne. 2010. “Estimating Confidence Intervals for Eigenvalues in Exploratory Factor Analysis.” <em>Behavior Research Methods</em> 42 (3): 871–76. <a href="https://doi.org/10.3758/BRM.42.3.871">https://doi.org/10.3758/BRM.42.3.871</a>.</p>
</div>
<div id="ref-mosteller_exploratory_1977">
<p>Tukey, John Wilder. 1977. “Exploratory Data Analysis.” In <em>Statistics and Public Policy</em>, edited by Frederick Mosteller, 1st ed. Addison-Wesley Series in Behavioral Science. Quantitative Methods. Addison-Wesley.</p>
</div>
</div>
</div>
</section>July 9, 2019 Eco DataViz2019-07-09T00:00:00+00:002019-07-09T00:00:00+00:00https://swampthingecology.org/blog/july-9-2019-eco-dataviz<section class="main-content">
<p><strong>Keywords:</strong> dataviz, R, Sea Ice</p>
<p>Following the progression of my data viz journey, I decided to tackle some Arctic sea-ice data after checking out <a href="https://twitter.com/ZLabe" target="_blank">Zack Labe’s</a> Arctic Ice <a href="https://sites.uci.edu/zlabe/arctic-sea-ice-figures/" target="_blank">figures</a>. The data this week is modeled sea-ice volume and thickness from the <a href="http://psc.apl.uw.edu" target="_blank">Polar Science Center</a> Pan-Arctic Ice Ocean Modeling and Assimilation System (<a href="http://psc.apl.uw.edu/research/projects/arctic-sea-ice-volume-anomaly/" target="_blank">PIOMAS</a>). Sea ice volume is an important climate indicator. It depends on both ice thickness and extent and is therefore more directly tied to climate forcing than extent alone.</p>
<p>With each data viz endeavor I try to learn something new or explore an existing technique. Dates in <code>R</code> can be stressful to say the least; anyone who has worked with time-series data would agree. Dates can be formatted as a date using <code>as.Date()</code>, <code>format()</code>, <code>as.POSIXct()</code> or <code>as.POSIXlt()</code>…most of my time in <code>R</code> is spent formatting dates. Here is a useful <a href="https://www.stat.berkeley.edu/~s133/dates.html" target="_blank">page</a> on working with dates in <code>R</code>. The PIOMAS data has three variables…Year, Day of Year (1 - 365) and Thickness (or Volume). I downloaded the data from the <a href="http://psc.apl.uw.edu/research/projects/arctic-sea-ice-volume-anomaly/data/" target="_blank">webpage</a> and unzipped the gzipped tar file using a third-party tool to extract the data, but this can also be done in <code>R</code>. The two data sets, volume and thickness, are ASCII files.</p>
<p>Lets load our libraries/packages.</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb1-1" title="1"><span class="co">#Libraries</span></a>
<a class="sourceLine" id="cb1-2" title="2"><span class="co">#devtools::install_github("SwampThingPaul/AnalystHelper")</span></a>
<a class="sourceLine" id="cb1-3" title="3"><span class="kw">library</span>(AnalystHelper);</a>
<a class="sourceLine" id="cb1-4" title="4"><span class="kw">library</span>(plyr)</a>
<a class="sourceLine" id="cb1-5" title="5"><span class="kw">library</span>(reshape)</a></code></pre></div>
<div class="sourceCode" id="cb2"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb2-1" title="1">thick.dat=<span class="kw">read.table</span>(<span class="st">"PIOMAS.thick.daily.1979.2019.Current.v2.1.dat"</span>,</a>
<a class="sourceLine" id="cb2-2" title="2"><span class="dt">header=</span>F,<span class="dt">skip=</span><span class="dv">1</span>,<span class="dt">col.names=</span><span class="kw">c</span>(<span class="st">"Year"</span>,<span class="st">"Day"</span>,<span class="st">"Thickness_m"</span>))</a></code></pre></div>
<div class="sourceCode" id="cb3"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb3-1" title="1"><span class="kw">head</span>(thick.dat,5L)</a></code></pre></div>
<pre><code>## Year Day Thickness_m
## 1 1979 1 1.951
## 2 1979 2 1.955
## 3 1979 3 1.962
## 4 1979 4 1.965
## 5 1979 5 1.973</code></pre>
<p>The sea-ice volume data is in the same format.</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb5-1" title="1">vol.dat=<span class="kw">read.table</span>(<span class="st">"PIOMAS.vol.daily.1979.2019.Current.v2.1.dat"</span>,</a>
<a class="sourceLine" id="cb5-2" title="2"> <span class="dt">header=</span>F,<span class="dt">skip=</span><span class="dv">1</span>,<span class="dt">col.names=</span><span class="kw">c</span>(<span class="st">"Year"</span>,<span class="st">"Day"</span>,<span class="st">"Vol_km3"</span>))</a>
<a class="sourceLine" id="cb5-3" title="3">vol.dat<span class="op">$</span>Vol_km3=vol.dat<span class="op">$</span>Vol_km3<span class="op">*</span><span class="fl">1E+3</span>;<span class="co">#To convert data </span></a></code></pre></div>
<div class="sourceCode" id="cb6"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb6-1" title="1"><span class="kw">head</span>(vol.dat,5L)</a></code></pre></div>
<pre><code>## Year Day Vol_km3
## 1 1979 1 26405
## 2 1979 2 26496
## 3 1979 3 26582
## 4 1979 4 26672
## 5 1979 5 26770</code></pre>
<p>The sea-ice thickness data are expressed in meters, and the volume data in 10<sup>3</sup> km<sup>3</sup> (converted to km<sup>3</sup> above). Understanding what the data represent and how they are derived is most of the job of a scientist, especially in data visualization. Inherently all data have their limits.</p>
<p>Currently we have two different data files, <code>vol.dat</code> and <code>thick.dat</code>; let's get them into a single <code>data.frame</code> and sort the data accordingly (just in case).</p>
<div class="sourceCode" id="cb8"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb8-1" title="1">dat=<span class="kw">merge</span>(thick.dat,vol.dat,<span class="kw">c</span>(<span class="st">"Year"</span>,<span class="st">"Day"</span>))</a>
<a class="sourceLine" id="cb8-2" title="2">dat=dat[<span class="kw">order</span>(dat<span class="op">$</span>Year,dat<span class="op">$</span>Day),]</a></code></pre></div>
<p>Alright, here comes the fun part…dates in <code>R</code>. Remember the data has Year and Day of Year, which means no month or day (i.e. no Date). Essentially you have to back-calculate the day of the year to an actual date. Thankfully this is pretty easy. Check out <code>?strptime</code> and <code>?format</code>!!</p>
<div class="sourceCode" id="cb9"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb9-1" title="1">dat<span class="op">$</span>month.day=<span class="kw">format</span>(<span class="kw">strptime</span>(dat<span class="op">$</span>Day,<span class="st">"%j"</span>),<span class="st">"%m-%d"</span>)</a></code></pre></div>
<p>This gets us Month-Day from the day of the year. Now for some trickery. Let's actually make this a date by using paste and leveraging <code>date.fun()</code> from <code>AnalystHelper</code>.</p>
<div class="sourceCode" id="cb10"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb10-1" title="1">dat<span class="op">$</span>Date=<span class="kw">with</span>(dat,<span class="kw">date.fun</span>(<span class="kw">paste</span>(Year,month.day,<span class="dt">sep=</span><span class="st">"-"</span>),<span class="dt">tz=</span><span class="st">"GMT"</span>))</a></code></pre></div>
<p>Voila!! We have a <code>POSIXct</code> formatted field that has Year-Month-Day…in case you wanted to check the sea-ice volume on your birthday, wedding anniversary, etc. …no one? Just me? …OK moving on!!</p>
<p>Some more trickery that comes in handy when aggregating data is determining the month and year (for monthly summary statistics). We can also determine what decade the data is from; it wasn't used in this analysis but it is something interesting I discovered in my data musings.</p>
<div class="sourceCode" id="cb11"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb11-1" title="1">dat<span class="op">$</span>month.yr=<span class="kw">with</span>(dat,<span class="kw">date.fun</span>(<span class="kw">paste</span>(Year,<span class="kw">format</span>(Date,<span class="st">"%m"</span>),<span class="dv">01</span>,<span class="dt">sep=</span><span class="st">"-"</span>),<span class="dt">tz=</span><span class="st">"GMT"</span>))</a>
<a class="sourceLine" id="cb11-2" title="2">dat<span class="op">$</span>decade=((dat<span class="op">$</span>Year)<span class="op">%/%</span><span class="dv">10</span>)<span class="op">*</span><span class="dv">10</span></a></code></pre></div>
<p>Now that we have the data put together, let's start plotting.</p>
<p>Here we have just daily (modeled) sea-ice thickness data from PIOMAS.</p>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2019-07-09-DataViz_files/figure-html/unnamed-chunk-11-1.png" alt="Pan Arctic Sea-Ice thickness from 1979 to present. Data source: Polar Science Center - ([PIOMAS](http://psc.apl.uw.edu/research/projects/arctic-sea-ice-volume-anomaly/))." />
<p class="caption">
Pan Arctic Sea-Ice thickness from 1979 to present. Data source: Polar Science Center - (<a href="http://psc.apl.uw.edu/research/projects/arctic-sea-ice-volume-anomaly/">PIOMAS</a>).
</p>
</div>
<p>Now we can estimate the annual mean and a confidence interval around the mean…let's say 95%.</p>
<div class="sourceCode" id="cb12"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb12-1" title="1"><span class="co">#Calculate annual mean, sd and N. Excluding 2019 (partial year)</span></a>
<a class="sourceLine" id="cb12-2" title="2">period.mean=<span class="kw">ddply</span>(<span class="kw">subset</span>(dat,Year<span class="op">!=</span><span class="dv">2019</span>),<span class="st">"Year"</span>,summarise,</a>
<a class="sourceLine" id="cb12-3" title="3"> <span class="dt">mean.val=</span><span class="kw">mean</span>(Thickness_m,<span class="dt">na.rm=</span>T),</a>
<a class="sourceLine" id="cb12-4" title="4"> <span class="dt">sd.val=</span><span class="kw">sd</span>(Thickness_m,<span class="dt">na.rm=</span>T),</a>
<a class="sourceLine" id="cb12-5" title="5"> <span class="dt">N.val=</span><span class="kw">N</span>(Thickness_m))</a>
<a class="sourceLine" id="cb12-6" title="6"><span class="co">#Degrees of freedom</span></a>
<a class="sourceLine" id="cb12-7" title="7">period.mean<span class="op">$</span>Df=period.mean<span class="op">$</span>N.val<span class="dv">-1</span></a>
<a class="sourceLine" id="cb12-8" title="8"><span class="co">#Student-T statistic</span></a>
<a class="sourceLine" id="cb12-9" title="9">period.mean<span class="op">$</span>Tp=<span class="kw">abs</span>(<span class="kw">qt</span>(<span class="dv">1</span><span class="fl">-0.95</span>,period.mean<span class="op">$</span>Df))</a>
<a class="sourceLine" id="cb12-10" title="10"><span class="co">#Lower and Upper CI calculation</span></a>
<a class="sourceLine" id="cb12-11" title="11">period.mean<span class="op">$</span>LCI=<span class="kw">with</span>(period.mean,mean.val<span class="op">-</span>sd.val<span class="op">*</span>(Tp<span class="op">/</span><span class="kw">sqrt</span>(N.val)))</a>
<a class="sourceLine" id="cb12-12" title="12">period.mean<span class="op">$</span>UCI=<span class="kw">with</span>(period.mean,mean.val<span class="op">+</span>sd.val<span class="op">*</span>(Tp<span class="op">/</span><span class="kw">sqrt</span>(N.val)))</a></code></pre></div>
<p>Now let's add that to the plot with some additional trickery to plot the annual mean <span class="math inline">\(\pm\)</span> 95% CI starting on Jan 1st of every year.</p>
<div class="sourceCode" id="cb13"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb13-1" title="1"><span class="kw">with</span>(period.mean,<span class="kw">lines</span>(<span class="kw">date.fun</span>(<span class="kw">paste</span>(Year,<span class="st">"01-01"</span>,<span class="dt">sep=</span><span class="st">"-"</span>),<span class="dt">tz=</span><span class="st">"GMT"</span>),mean.val,<span class="dt">lty=</span><span class="dv">1</span>,<span class="dt">col=</span><span class="st">"red"</span>))</a>
<a class="sourceLine" id="cb13-2" title="2"><span class="kw">with</span>(period.mean,<span class="kw">lines</span>(<span class="kw">date.fun</span>(<span class="kw">paste</span>(Year,<span class="st">"01-01"</span>,<span class="dt">sep=</span><span class="st">"-"</span>),<span class="dt">tz=</span><span class="st">"GMT"</span>),LCI,<span class="dt">lty=</span><span class="dv">2</span>,<span class="dt">col=</span><span class="st">"red"</span>))</a>
<a class="sourceLine" id="cb13-3" title="3"><span class="kw">with</span>(period.mean,<span class="kw">lines</span>(<span class="kw">date.fun</span>(<span class="kw">paste</span>(Year,<span class="st">"01-01"</span>,<span class="dt">sep=</span><span class="st">"-"</span>),<span class="dt">tz=</span><span class="st">"GMT"</span>),UCI,<span class="dt">lty=</span><span class="dv">2</span>,<span class="dt">col=</span><span class="st">"red"</span>))</a></code></pre></div>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2019-07-09-DataViz_files/figure-html/unnamed-chunk-14-1.png" alt="Pan Arctic Sea-Ice thickness from 1979 to present with annual mean and 95% confidence interval. Data source: Polar Science Center - ([PIOMAS](http://psc.apl.uw.edu/research/projects/arctic-sea-ice-volume-anomaly/))." />
<p class="caption">
Pan Arctic Sea-Ice thickness from 1979 to present with annual mean and 95% confidence interval. Data source: Polar Science Center - (<a href="http://psc.apl.uw.edu/research/projects/arctic-sea-ice-volume-anomaly/">PIOMAS</a>).
</p>
</div>
<p>What does sea-ice volume look like?</p>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2019-07-09-DataViz_files/figure-html/unnamed-chunk-15-1.png" alt="Pan Arctic Sea-Ice volume from 1979 to present with annual mean and 95% confidence interval. Data source: Polar Science Center - ([PIOMAS](http://psc.apl.uw.edu/research/projects/arctic-sea-ice-volume-anomaly/))." />
<p class="caption">
Pan Arctic Sea-Ice volume from 1979 to present with annual mean and 95% confidence interval. Data source: Polar Science Center - (<a href="http://psc.apl.uw.edu/research/projects/arctic-sea-ice-volume-anomaly/">PIOMAS</a>).
</p>
</div>
<p>Some interesting and alarming trends in both thickness and volume for sure! There is an obvious seasonal trend in the data…one way to examine this is to look at the period of record daily change.</p>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2019-07-09-DataViz_files/figure-html/unnamed-chunk-16-1.png" alt="Period of record mean (1979 - 2018) daily mean and 95% confidence interval sea-ice volume and thickness. Data source: Polar Science Center - ([PIOMAS](http://psc.apl.uw.edu/research/projects/arctic-sea-ice-volume-anomaly/))." />
<p class="caption">
Period of record mean (1979 - 2018) daily mean and 95% confidence interval sea-ice volume and thickness. Data source: Polar Science Center - (<a href="http://psc.apl.uw.edu/research/projects/arctic-sea-ice-volume-anomaly/">PIOMAS</a>).
</p>
</div>
<p><br /></p>
<p>Now, how does the thickness versus volume relationship look? Since there is so much data, we can do some interesting color coding by year. Here I use a color ramp, <code>colorRampPalette(c("dodgerblue1","indianred1"))</code>, with each year getting a color along the ramp.</p>
<p>Here is how I set up the color ramp.</p>
<div class="sourceCode" id="cb14"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb14-1" title="1">N.yrs=<span class="kw">length</span>(<span class="kw">unique</span>(dat<span class="op">$</span>Year))</a>
<a class="sourceLine" id="cb14-2" title="2">cols=<span class="kw">colorRampPalette</span>(<span class="kw">c</span>(<span class="st">"dodgerblue1"</span>,<span class="st">"indianred1"</span>))(N.yrs)</a></code></pre></div>
<p>In the plot I use a loop to plot each year with a different color.</p>
<div class="sourceCode" id="cb15"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb15-1" title="1"><span class="kw">plot</span>(...)</a>
<a class="sourceLine" id="cb15-2" title="2"></a>
<a class="sourceLine" id="cb15-3" title="3"><span class="cf">for</span>(i <span class="cf">in</span> <span class="dv">1</span><span class="op">:</span>N.yrs){</a>
<a class="sourceLine" id="cb15-4" title="4"> <span class="kw">with</span>(<span class="kw">subset</span>(dat,Year<span class="op">==</span>yrs.val[i]),</a>
<a class="sourceLine" id="cb15-5" title="5"> <span class="kw">points</span>(Vol_km3,Thickness_m,<span class="dt">pch=</span><span class="dv">21</span>,</a>
<a class="sourceLine" id="cb15-6" title="6"> <span class="dt">bg=</span><span class="kw">adjustcolor</span>(cols[i],<span class="fl">0.2</span>),</a>
<a class="sourceLine" id="cb15-7" title="7"> <span class="dt">col=</span><span class="kw">adjustcolor</span>(cols[i],<span class="fl">0.4</span>),</a>
<a class="sourceLine" id="cb15-8" title="8"> <span class="dt">lwd=</span><span class="fl">0.1</span>,<span class="dt">cex=</span><span class="fl">1.25</span>))</a>
<a class="sourceLine" id="cb15-9" title="9">}</a></code></pre></div>
<p>As with most data viz, especially in base <code>R</code>, there is some degree of trickery and layering. To build the color ramp legend I used the following (adapted from <a href="https://stackoverflow.com/questions/13355176/gradient-legend-in-base/13355440#13355440" target="_blank">this Stack Overflow answer</a>).</p>
<div class="sourceCode" id="cb16"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb16-1" title="1"><span class="co"># A raster of the color ramp</span></a>
<a class="sourceLine" id="cb16-2" title="2">legend_image=<span class="kw">as.raster</span>(<span class="kw">matrix</span>(cols,<span class="dt">ncol=</span><span class="dv">1</span>))</a>
<a class="sourceLine" id="cb16-3" title="3"><span class="co"># Empty plot</span></a>
<a class="sourceLine" id="cb16-4" title="4"><span class="kw">plot</span>(<span class="kw">c</span>(<span class="dv">0</span>,<span class="dv">1</span>),<span class="kw">c</span>(<span class="dv">0</span>,<span class="dv">1</span>),<span class="dt">type =</span> <span class="st">'n'</span>, <span class="dt">axes =</span> F,<span class="dt">xlab =</span> <span class="st">''</span>, <span class="dt">ylab =</span> <span class="st">''</span>)</a>
<a class="sourceLine" id="cb16-5" title="5"><span class="co"># Gradient labels</span></a>
<a class="sourceLine" id="cb16-6" title="6"><span class="kw">text</span>(<span class="dt">x=</span><span class="fl">0.6</span>, <span class="dt">y =</span> <span class="kw">c</span>(<span class="fl">0.5</span>,<span class="fl">0.8</span>), <span class="dt">labels =</span> <span class="kw">c</span>(<span class="dv">2019</span>,<span class="dv">1979</span>),<span class="dt">cex=</span><span class="fl">0.8</span>,<span class="dt">xpd=</span><span class="ot">NA</span>,<span class="dt">adj=</span><span class="dv">0</span>)</a>
<a class="sourceLine" id="cb16-7" title="7"><span class="co"># Put the color ramp on the legend</span></a>
<a class="sourceLine" id="cb16-8" title="8"><span class="kw">rasterImage</span>(legend_image, <span class="fl">0.25</span>, <span class="fl">0.5</span>, <span class="fl">0.5</span>,<span class="fl">0.8</span>)</a>
<a class="sourceLine" id="cb16-9" title="9"><span class="co"># Label to legend</span></a>
<a class="sourceLine" id="cb16-10" title="10"><span class="kw">text</span>(<span class="fl">0.25</span><span class="op">+</span>(<span class="fl">0.5-0.25</span>)<span class="op">/</span><span class="dv">2</span>,<span class="fl">0.85</span>,<span class="st">"Year"</span>,<span class="dt">xpd=</span><span class="ot">NA</span>)</a></code></pre></div>
<p><br /></p>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2019-07-09-DataViz_files/figure-html/unnamed-chunk-20-1.png" alt="Sea-ice thickness versus volume for the 41 year period. Minimum ice thickness and volume identified for 1980, 1990, 2000 and 2010. Data source: Polar Science Center - ([PIOMAS](http://psc.apl.uw.edu/research/projects/arctic-sea-ice-volume-anomaly/))." />
<p class="caption">
Sea-ice thickness versus volume for the 41 year period. Minimum ice thickness and volume identified for 1980, 1990, 2000 and 2010. Data source: Polar Science Center - (<a href="http://psc.apl.uw.edu/research/projects/arctic-sea-ice-volume-anomaly/">PIOMAS</a>).
</p>
</div>
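<p>The minima labeled in the figure could be located with something along these lines (a sketch only; the post's actual annotation code is not shown, and the object names are mine).</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Sketch: pull the row with the minimum thickness for selected years and
# label those points on the thickness vs volume scatter plot
sel.yrs=c(1980,1990,2000,2010)
min.pts=ddply(subset(dat,Year %in% sel.yrs),"Year",
              function(x) x[which.min(x$Thickness_m),])
with(min.pts,points(Vol_km3,Thickness_m,pch=19,cex=1))
with(min.pts,text(Vol_km3,Thickness_m,Year,pos=4,cex=0.75))</code></pre></div>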
<p>Hope you found this data visualization exercise interesting and thought-provoking. Happy data trails!</p>
<hr />
</section>