Jekyll2021-08-13T21:02:41+00:00https://swampthingecology.org/blog/feed.xmlPaul Julian II, PhDEcologist, Wetland Biogeochemist, Data-scientist, lover of Rstats.South Florida Estuaries2021-08-13T00:00:00+00:002021-08-13T00:00:00+00:00https://swampthingecology.org/blog/south-florida-estuaries<script src="https://swampthingecology.org/blog/knitr_files/2021-08-13-Estuary_files/header-attrs-2.7/header-attrs.js"></script>
<section class="main-content">
<p><strong>Keywords:</strong> Okeechobee, water level, LOSOM</p>
<p>Estuaries are the transition zone from fresh to salt water, and to function properly they require a balance of freshwater inputs. These inputs create a gradient of fresh to saline waters that is important for the composition and distribution of vegetative communities and wildlife. However, too much freshwater in the estuary can be detrimental to the health of the ecosystem.</p>
<p>Here we have two estuaries and their upstream watersheds: the Caloosahatchee and St Lucie river estuaries and watersheds. Both are connected to Lake Okeechobee thanks to the landscape modifications of the early 1900s.</p>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2021-08-13-Estuary_files/figure-html/unnamed-chunk-1-1.png" alt="Caloosahatchee and St Lucie watersheds." />
<p class="caption">
Caloosahatchee and St Lucie watersheds.
</p>
</div>
<p>The Caloosahatchee estuary receives freshwater either from the upstream watershed in the form of runoff or from discharges from Lake Okeechobee. Similarly, the St Lucie Estuary (on the east coast) receives either basin runoff from a comparatively smaller watershed or discharges from the Lake. The St Lucie river has the added benefit of being able to return water to Lake Okeechobee (AKA backflow) rather than discharging it to the Estuary. While this backflow may prevent extreme or damaging discharges to the St Lucie Estuary, it is detrimental to lake ecology by adding nutrients to an already nutrient-rich system. In extreme cases some water can also be returned to the lake from the Caloosahatchee via S-77, but this is rare.</p>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2021-08-13-Estuary_files/figure-html/unnamed-chunk-3-1.png" alt="May 1978 - Apirl 2021 peroid of record annual mean discharge volume from source within the Caloosahatchee and St Lucie watersheds to each estuary. Values above bars are average discharge volume in Ac-Ft WY^-1^." />
<p class="caption">
May 1978 - April 2021 period of record annual mean discharge volume from sources within the Caloosahatchee and St Lucie watersheds to each estuary. Values above bars are average discharge volume in Ac-Ft WY<sup>-1</sup>.
</p>
</div>
<p>Estuaries rely on a balance of fresh and saline water. Here in the Caloosahatchee River Estuary the main source of freshwater is the S-79 structure. As discharges from this structure increase, the estuary becomes fresher, with discharges above 2,600 cfs (~5,157 Ac-Ft d<sup>-1</sup>) causing significant damage to marine species.</p>
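<p>For reference, the unit conversion above is straightforward to check in <code>R</code> (a quick back-of-the-envelope sketch, not part of the original analysis):</p>
<pre><code># Convert a discharge in cubic feet per second (cfs) to acre-feet per day
cfs_to_acftday <- function(q_cfs){
  q_cfs * 86400 / 43560  # 86,400 seconds per day; 43,560 cubic feet per acre-foot
}
cfs_to_acftday(2600)     # ~5,157 Ac-Ft d-1</code></pre>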
<p><img src="https://swampthingecology.org/blog\images\20210813_Estuary\Surf_bot_da_ff_GAM.gif" width="75%" style="display: block; margin: auto;" /></p>
<hr />
</section>Water Level Limbo2021-07-13T00:00:00+00:002021-07-13T00:00:00+00:00https://swampthingecology.org/blog/water-level-limbo<script src="https://swampthingecology.org/blog/knitr_files/2021-07-14-LakeLevel_files/header-attrs-2.7/header-attrs.js"></script>
<section class="main-content">
<div id="finding-lake-os-water-level-sweet-spot" class="section level2">
<h2><em>Finding Lake O’s Water Level Sweet Spot</em></h2>
<p><strong>Keywords:</strong> Okeechobee, water level, LOSOM</p>
<p>Original article published as a <a href="http://www.sccf.org/news/blog/finding-lake-os-water-level-sweet-spot" target="_blank">SCCF</a> Wednesday Update.</p>
<hr />
<p>When it comes to water, Florida has two seasons: wet and dry. This seasonality in rainfall causes water levels within lakes and wetlands to fluctuate with large seasonal and within-year (intra annual) variability that can affect ecosystem function and structure. This is especially true for Lake Okeechobee, the largest freshwater lake in the Southeast. Lake Okeechobee supports a large freshwater recreational fishery and is an integral part of South Florida’s hydroscape.</p>
<p>Lake Okeechobee is the beating heart of the Everglades. Throughout the seasons, water levels rise and fall, providing freshwater to the downstream Everglades and coastal estuaries. Historically, this large, shallow lake would overflow to the south and west during periods of high water, providing a sheetflow of water to the south into the Everglades. To the west, Lake Okeechobee’s littoral zone (nearshore area from the high-water line occupied by wetlands) adapted to high and low water events by expanding and contracting across the landscape. As extensive levee, canal, and lock/gate systems were constructed around the lake in the early and mid-1900s, the lake became encircled by infrastructure. This infrastructure ultimately isolated the lake’s littoral zone to a fraction of its historic size.</p>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog\images\20210714_Lake\cochran.jpg" alt="Photo Credit: Charles Hanlon, South Florida Water Management District Edge of the littoral zone at Cochrans Pass (April 2021)." width="50%" />
<p class="caption">
Edge of the littoral zone at Cochrans Pass (April 2021). Photo Credit: Charles Hanlon, South Florida Water Management District.
</p>
</div>
<p>Since being impounded more than 60 years ago, water levels within Lake Okeechobee have been managed in part by the U.S. Army Corps of Engineers through a series of regulation schedules. Under these regulation schedules, Lake Okeechobee has experienced periods of extreme high and low water conditions, which have resulted in detrimental effects to the near-shore ecosystems.</p>
<p>Across the lake, high water levels can negatively impact plants and animals due to deep water, flooding of the littoral zone, erosion, nutrient transport, and algae blooms. Low water levels can dry out portions of the littoral zone, spread exotic species, impact the fishery, and reduce prey for snail kites and alligators. Moderate water levels—“the sweet spot”—can provide optimal recruitment of the fishery, provide ample forage for wading birds and snail kites, and allow for plants to proliferate across the lake.</p>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog\images\20210714_Lake\combo_ZachWelch_labels.png" alt="The littoral zone of Lake Okeechobee is composed of a variety of different plant species all adapted for different growing conditions. From left to right: an upper broadleaf marsh system, mid-level floating leaf marsh, and an outer submerged and emergent vegetation marsh. Photo Credit: Zach Welch" width="100%" />
<p class="caption">
The littoral zone of Lake Okeechobee is composed of a variety of different plant species all adapted for different growing conditions. From left to right: an upper broadleaf marsh system, mid-level floating leaf marsh, and an outer submerged and emergent vegetation marsh. Photo Credit: Zach Welch
</p>
</div>
<p>Efforts are underway to revisit the Lake Okeechobee System Operating Manual and revise how water is managed for Lake Okeechobee. With the recent completion of the Herbert Hoover Dike improvements, a greater range of water levels can be exercised in Lake Okeechobee. Currently, all of the alternative plans proposed by the U.S. Army Corps of Engineers, which manages the lake, intend to keep the lake higher than existing conditions. (However, as previously noted, high water levels can have harmful effects on the lake’s overall health.) Some plans add an extra foot to water levels and reduce the amount of time the lake spends within the moderate “sweet spot” range.</p>
<p>High and low water levels can be tolerated for short periods of time, and sometimes are needed for the ecology of the lake, but prolonged and extreme events can cause significant ecological consequences and impact water quality within the lake. How a prospective plan manages these events is important. Ultimately, a plan that optimizes water levels for the ecology of the lake is needed—and this has to be done in a way that minimizes the damaging discharges to coastal estuaries.</p>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog\images\20210714_Lake\LOK_rayshader.png" alt="Lake Okeechobee at 9 and 17 feet National Geodetic Vertical Datum (NGVD) of 1929. Water levels on the lake are measured relative to sea level, NGVD 1929 is the vertical datum used by the South Florida Water Management District and U.S. Army Corps of Engineers." width="100%" />
<p class="caption">
Lake Okeechobee at 9 and 17 feet National Geodetic Vertical Datum (NGVD) of 1929. Water levels on the lake are measured relative to sea level, NGVD 1929 is the vertical datum used by the South Florida Water Management District and U.S. Army Corps of Engineers.
</p>
</div>
<p><strong>Hydrologic Modeler Paul Julian’s position is funded jointly by SCCF and The Conservancy of Southwest Florida.</strong></p>
<hr />
<p>Want to know more about SCCF? Visit our <a href="http://www.sccf.org/" target="_blank">webpage</a>.</p>
<p>More information on LOSOM can be found at the USACE LOSOM project <a href="https://www.saj.usace.army.mil/LOSOM/" target="_blank">webpage</a></p>
</div>
</section>Evaluating Algal Dynamics within the Okeechobee-Caloosahatchee System2021-06-30T00:00:00+00:002021-06-30T00:00:00+00:00https://swampthingecology.org/blog/evaluating-algal-dynamics-within-the-okeechobee-caloosahatchee-system<script src="https://swampthingecology.org/blog/knitr_files/2021-06-30-Algae_files/header-attrs-2.7/header-attrs.js"></script>
<section class="main-content">
<div id="its-not-easy-being-green-evaluating-algal-dynamics-within-the-okeechobee-caloosahatchee-system" class="section level2">
<h2><em>It’s Not Easy Being Green: Evaluating Algal Dynamics within the Okeechobee-Caloosahatchee System</em></h2>
<p><strong>Keywords:</strong> Okeechobee, Caloosahatchee, Algae</p>
<p>Original article published as a <a href="http://www.sccf.org/our-work/wednesday-update" target="_blank">SCCF</a> Wednesday Update.</p>
<hr />
<p><img src="https://swampthingecology.org/blog\images\20210630_Algae\Franklin Lock 5.19.21 long shot.jpg" width="50%" style="display: block; margin: auto;" /></p>
<p>As summertime temperatures begin to warm and seasonal rains sweep across Southwest Florida, you may notice a change in conditions on the waterways. During this time of the year, the occurrence of algae within Lake Okeechobee and the Caloosahatchee River becomes more noticeable.</p>
<p>Visually, algae blooms can appear as streaks of green, discolored water, or floating mats of green, blue, and white, depending on the species. Under the right conditions, blooms of some algae species, such as Microcystis (in freshwater) or Karenia brevis (in saltwater), can be classified as harmful algal blooms (HABs) that produce toxins which kill fish and other sea life. Other algae are nontoxic but can also lead to fish kills and impact benthic communities by consuming dissolved oxygen and changing the color of the water.</p>
<p>Over the past two decades, algal biomass (measured as suspended chlorophyll-a in the water) has significantly increased at Franklin Lock (S-79). This increase in algal biomass is important as the S-79 structure is fed by both Lake Okeechobee and the upstream C-43 canal as they discharge freshwater to the Caloosahatchee River estuary.</p>
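<p>As an aside, a trend like this is often screened with a non-parametric test such as Mann-Kendall. Here is a rough sketch of what that looks like in <code>R</code>, using synthetic annual chlorophyll-a values purely for illustration (this is not the SCCF analysis or the actual S-79 data):</p>
<pre><code># Synthetic annual mean chlorophyll-a (ug/L) with an upward drift, for illustration only
set.seed(1)
chla <- data.frame(year = 2000:2020)
chla$chla_ugL <- 4 + 0.15*(chla$year - 2000) + rnorm(nrow(chla), 0, 0.5)

# Kendall's tau between time and concentration is equivalent to a Mann-Kendall trend test
with(chla, cor.test(year, chla_ugL, method = "kendall"))</code></pre>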
<p>Increased nutrient (nitrogen and phosphorus) loading has been identified as a major factor contributing to an increase in algal blooms in the lake and estuaries. However, within the Okeechobee-Caloosahatchee system, no one thing can be singled out as the ultimate driver of algae; rather, it’s a combination of several factors. Algal growth and bloom proliferation can be driven by several factors: light availability (how much light travels through the water column); water temperature; nutrient concentration; and hydrology (water level and discharge).</p>
<p><img src="https://swampthingecology.org/blog\images\20210630_Algae\Algae Franklin Lock 5.19.21.png" width="50%" style="display: block; margin: auto;" /></p>
<p>Currently underway, the Lake Okeechobee System Operating Manual (LOSOM) planning effort intends to change how water is managed for Lake Okeechobee. A specific topic of interest is understanding how the different water management schemes will affect the risk of algal bloom formation and transport within the Caloosahatchee and St Lucie estuaries. This metric is important to reduce the potential risk of HABs within our local waters which can lead to primary effects—fish kills and human health impacts—and secondary issues, such as environmental degradation and negative impacts on the local economy.</p>
<p>To evaluate algal bloom risk to the estuaries, the U.S. Army Corps of Engineers (USACE) will compare discharges from Lake Okeechobee during the time of the year where algal bloom potential is highest (June – August). This evaluation is based on the concept of moving water with algae from Lake Okeechobee along the C-43 canal to the Caloosahatchee estuary. Based on the available data, an algal biomass transport hypothesis from the lake to the estuary does not paint the entire picture. Other processes contribute to algae bloom formation and transport within the Okeechobee-Caloosahatchee system.</p>
<p>As part of the LOSOM planning effort, SCCF provided these recommendations: developing a more robust monitoring network to assess changes in algae; evaluating algal bloom potential relative to the amount of time water moves from the lake to the estuary; and including other factors, such as temperature and light availability. Ultimately, our goal is to develop an operations plan that reduces the risk of algal blooms in the estuaries and balances the needs of the Caloosahatchee and St Lucie estuaries, Lake Okeechobee, and the Southern Everglades to improve the ecology and sustainability of our system.</p>
<p>By evaluating the existing science, assessing the LOSOM alternatives, and studying nutrient loading from Lake Okeechobee and the upstream basin and the resulting loads to the estuary, we are gaining a better understanding of algal dynamics within the Okeechobee-Caloosahatchee system. As water management changes for Lake Okeechobee, we continue to develop our understanding of algal and nutrient dynamics to inform management and policy decisions.</p>
<p><strong>Hydrologic Modeler Paul Julian’s position is funded jointly by SCCF and The Conservancy of Southwest Florida.</strong></p>
<hr />
<p>More information on LOSOM can be found at the USACE LOSOM project <a href="https://www.saj.usace.army.mil/LOSOM/" target="_blank">webpage</a></p>
</div>
</section>Legacy Phosphorus In Lake Okeechobee2021-04-22T00:00:00+00:002021-04-22T00:00:00+00:00https://swampthingecology.org/blog/legacy-phosphorus-in-lake-okeechobee<script src="https://swampthingecology.org/blog/knitr_files/2021-04-22-GEER_files/header-attrs-2.7/header-attrs.js"></script>
<section class="main-content">
<p><strong>Keywords:</strong> Okeechobee, Phosphorus, Sediments</p>
<p><img src="https://swampthingecology.org/blog\images\20210422_GEER\GEER.jpg" width="50%" style="display: block; margin: auto;" /></p>
<p>As presented at the Greater Everglades Ecosystem Restoration Conference 2021.</p>
<p>Here’s the almost 15 minute presentation I gave at the GEER conference on Thursday April 22, 2021 in the <em>Ecological Processes in Lake Okeechobee</em> session, moderated by Todd Z Osborne (University of Florida Whitney Lab) and Paul Jones (South Florida Water Management District).</p>
<p>Enjoy!!</p>
<p><img src="https://swampthingecology.org/blog\images\20210422_GEER\Julian_titleslide.png" width="50%" style="display: block; margin: auto;" /></p>
<p><a href="https://youtu.be/avazGnAAPco" target="_blank">GEER 2021 Presentation - Click to Watch!</a></p>
<center>
I’m Calling To You Like A Long Lost Friend: Legacy Phosphorus In Lake Okeechobee <em>Click to watch (it will redirect to YouTube)</em>.
</center>
<ul>
<li><p>Nutrient inputs are highly <em>variable</em>, driven by upstream sources, with potential impacts to downstream systems</p></li>
<li><p>Within-lake trends in nutrients are <em>variable</em>, driven by ecosystem-specific factors and water-sediment feedback mechanisms.</p></li>
</ul>
<p><img src="https://swampthingecology.org/blog\images\20210422_GEER\TP_SW_GAM.gif" width="25%" style="display: block; margin: auto;" /></p>
<ul>
<li>Spatial and temporal trends in lake sediment TP concentrations are apparent with spatial trends mirroring the water column.</li>
</ul>
<p><img src="https://swampthingecology.org/blog\images\20210422_GEER\LakeOSed_GAM.gif" width="25%" style="display: block; margin: auto;" /></p>
<ul>
<li>The difference from input to output and water column to sediment is largely due to (high) internal loading.</li>
</ul>
<p>The words of the talk were <em>variable</em> and <em>dynamic</em>; whilst these words are often overused, they adequately describe the water quality and overall biogeochemical cycling of nutrients in Lake Okeechobee.</p>
<hr />
<p><strong>Abstract:</strong> Lake Okeechobee displays many features of a shallow, polymictic lake including frequent mixing of the water column and resuspension of unconsolidated sediments, and internal loading of nutrients to name a few. Additionally, the Lake has characteristically high phosphorus (P) loading due to changes in land use and drainage patterns upstream. The lake provides essential ecosystem services in the form of water supply, flood protection, navigation, and recreation, as well as vital habitat for south Florida’s flora and fauna. However, these values are threatened by current and historic excessive inputs of P influencing endo- and exogenic processes leading to fish-kills, hypoxic events, algal blooms, and degraded aquatic habitat.</p>
<p>Over the last decade and a half, nutrient loading to the lake has significantly increased. Utilizing the long-term ambient monitoring network, this study evaluated water column total nitrogen (TN), total P (TP), and chlorophyll-a (Chl-a) concentrations over 23 years (May 1996 – April 2020). Water quality trends across Lake Okeechobee varied spatially, with significantly declining trends in TN and Chl-a and increasing trends in TP. Coupled with these trends, the lake has notable water column nutrient gradients. Lake sediments are a long-term integrator of ecosystem conditions; over the last 30 years, four lake sediment surveys have been completed. Using data from these surveys, sediment N and P concentrations were evaluated both spatially and temporally to assess the change in sediment nutrients throughout the Lake. Despite the lake’s shallow bathymetry and the occurrence of frequent mixing events (i.e. high winds, hurricanes, drought), lake sediments have remained relatively stable, although notable shifts in sediment TP and TN concentrations have been observed.</p>
<p>The nutrient balance of Lake Okeechobee and the understanding of endo- and exogenic drivers of nutrient mobilization are important to aid in the restoration of the Lake and the Greater Everglades. As restoration activities progress, it is expected that nutrient inputs to the lake will decline. However, given the volume of N and P stored in the lake’s sediments, internal loading could result in delayed improvements to nutrient concentrations within the Lake. Despite the potential for delayed results, continued study and restoration activities are crucial to preserving our long-lost friend.</p>
<hr />
</section>Changes to CRS in R2021-01-21T00:00:00+00:002021-01-21T00:00:00+00:00https://swampthingecology.org/blog/changes-to-crs-in-r<script src="https://swampthingecology.org/blog/knitr_files/2021-01-21-CRS_files/header-attrs-2.6/header-attrs.js"></script>
<section class="main-content">
<p><strong>Keywords:</strong> R, spatial data, coordinates</p>
<pre><code>There is nothing more deceptive than an obvious fact.
- Arthur Conan Doyle, The Boscombe Valley Mystery</code></pre>
<p>If you haven’t heard already, big changes are afoot in the <a href="https://rspatial.org/" target="_blank">R-spatial</a> community. <img src="https://swampthingecology.org/blog\images\20210121_CRS\sherlock_afoot.gif" width="50%" style="display: block; margin: auto;" /></p>
<p>…if you were/are like me, you experienced a mix of emotions. But not to worry, there are loads of resources and a lot of really smart people working the issues right now.</p>
<p><img src="https://swampthingecology.org/blog\images\20210121_CRS\sherlock_shcoked.gif" width="40%" style="display: block; margin: auto;" /></p>
<p>…so expect lots of blog posts and resources.</p>
<hr />
<p>The CliffsNotes version (the short, short version) is that changes in how coordinate reference systems (CRS) are represented have finally caught up with how spatial data is handled in R packages (or maybe it’s the other way around). In a vignette titled <a href="https://rgdal.r-forge.r-project.org/articles/CRS_projections_transformations.html" target="_blank"><em>“Why have CRS, projections and transformations”</em></a>, <a href="https://twitter.com/RogerBivand" target="_blank">Roger Bivand</a> explains the nitty gritty.</p>
<p>Here are some more resources:</p>
<ul>
<li>YouTube lecture by Roger Bivand (<a href="https://youtu.be/2H1Tn4oN32M" target="_blank">link</a>)</li>
<li>Associated material (<a href="https://rsbivand.github.io/ECS530_h20/ECS530_III.html" target="_blank">link</a>)</li>
<li>Bivand, R.S. Progress in the R ecosystem for representing and handling spatial data. J Geogr Syst (2020). <a href="https://doi.org/10.1007/s10109-020-00336-0" target="_blank">https://doi.org/10.1007/s10109-020-00336-0</a></li>
</ul>
<p>Roger also penned a post explaining the migration specifics for the <code>rgdal</code>, <code>sp</code> and <code>raster</code> packages with respect to reading, writing, projecting, and transforming objects using PROJ strings (<a href="https://cran.r-project.org/web/packages/rgdal/vignettes/PROJ6_GDAL3.html#Migration_to_PROJ6GDAL3" target="_blank"><em>“Migration to PROJ6/GDAL3”</em></a>). It gets rather complex but is a good resource.</p>
<p>Another resource I came across in my sleuthing and troubleshooting is a post by Edzer Pebesma and Roger Bivand, titled <a href="https://www.r-spatial.org/r/2020/03/17/wkt.html" target="_blank"><em>“R spatial follows GDAL and PROJ development”</em></a>, discussing how <a href="https://gdal.org/" target="_blank">GDAL</a> and <a href="https://proj.org" target="_blank">PROJ</a> (formerly proj.4) relate to geospatial tools, including several <code>R</code> packages. As an example, they outline the dependencies for the <code>sf</code> package, pictured here:</p>
<p><img src="https://keen-swartz-3146c4.netlify.com/images/sf_deps.png" width="75%" style="display: block; margin: auto;" /></p>
<p>Also something worth reiterating here, briefly:</p>
<ul>
<li><p>PROJ provides methods for coordinate representation, conversion (projection) and transformation, and</p></li>
<li><p>GDAL allows reading and writing of spatial raster and vector data in a standardized form, and provides a high-level interface to PROJ for these data structures, including the representation of coordinate reference systems (CRS)</p></li>
</ul>
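<p>A quick way to check which GDAL and PROJ versions your own installation is linked against (not covered in the resources above, but handy when troubleshooting these warnings):</p>
<pre><code># External library versions used by the rgdal/sp stack
rgdal::rgdal_extSoftVersion()

# ...and by sf (GEOS, GDAL and PROJ)
sf::sf_extSoftVersion()</code></pre>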
<hr />
<p>We are ultimately dealing with coordinate reference systems (or CRS), but they also go by another name…spatial reference system (SRS). This will make more sense soon. As summarized by <a href="https://github.com/inbo" target="_blank">INBO</a>, a CRS is defined by several elements:</p>
<ul>
<li>a coordinate system,</li>
<li>a ‘datum’; it localizes the geodetic coordinate system relative to the Earth and needs a geometric definition of the ellipsoid, and</li>
<li>for projected CRSes only, the coordinate conversion parameters that determine the conversion from geodetic to projected coordinates.</li>
</ul>
<p><a href="https://github.com/inbo" target="_blank">INBO</a> did a fantastic tutorial (<a href="https://inbo.github.io/tutorials/tutorials/spatial_crs_coding/" target="_blank">https://inbo.github.io/tutorials/tutorials/spatial_crs_coding/</a>) briefly discussing on the changes and walking through the how-to for <code>sp</code>, <code>sf</code> and <code>raster</code> packages. The <code>rgdal</code> package leans heavily on the <code>sp</code> package…incase you were worried.</p>
<hr />
<p>Here are some examples and things that I have learned dealing with this issue. Nothing special, and I suggest visiting the resources identified above (especially <a href="https://inbo.github.io/tutorials/tutorials/spatial_crs_coding/" target="_blank">https://inbo.github.io/tutorials/tutorials/spatial_crs_coding/</a>). I am partial to the <code>sp</code> and <code>rgdal</code> packages, as these are what I initially learned and got comfortable using. So let’s load <code>rgdal</code>.</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="fu">library</span>(rgdal)</span></code></pre></div>
<p>In the “good ol’ days” you could define a CRS with this:</p>
<pre><code>utm17 <- CRS("+proj=utm +zone=17 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs")</code></pre>
<p>Do this now and you get…</p>
<pre><code>## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO", prefer_proj =
## prefer_proj): Discarded datum Unknown based on GRS80 ellipsoid in CRS definition</code></pre>
<p>Fast-forward to now. There might be several ways to do this but the easiest I found is</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a>utm17 <span class="ot"><-</span> <span class="fu">CRS</span>(<span class="at">SRS_string=</span><span class="st">"EPSG:4326"</span>)</span></code></pre></div>
<p>Notice the argument <code>SRS_string</code> … as in spatial reference system! (I just picked that up writing this post).</p>
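<p>One side note of my own: <code>EPSG:4326</code> is actually the geographic WGS 84 CRS (longitude/latitude), so the <code>utm17</code> name above is a bit of a misnomer. If you truly want a projected UTM zone 17N definition, the equivalent call would use a code like <code>EPSG:32617</code> (WGS 84 / UTM zone 17N):</p>
<pre><code># Geographic WGS 84 (longitude/latitude)
wgs84 <- CRS(SRS_string = "EPSG:4326")

# WGS 84 / UTM zone 17N (projected, metres)
utm17n <- CRS(SRS_string = "EPSG:32617")</code></pre>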
<p>Another thing in the update is the use of WKT (well-known text) over that of PROJ strings. WKT strings are interesting and provide lots of good information on the CRS (or SRS), if you’re into that kind of thing. To make a WKT you use the <code>wkt()</code> function.</p>
<div class="sourceCode" id="cb6"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a>utm17 <span class="ot"><-</span> <span class="fu">CRS</span>(<span class="at">SRS_string=</span><span class="st">"EPSG:4326"</span>)</span>
<span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a>utm17.wkt<span class="ot">=</span><span class="fu">wkt</span>(utm17)</span>
<span id="cb6-4"><a href="#cb6-4" aria-hidden="true" tabindex="-1"></a>utm17.wkt</span></code></pre></div>
<pre><code>## [1] "GEOGCRS[\"WGS 84 (with axis order normalized for visualization)\",\n DATUM[\"World Geodetic System 1984\",\n ELLIPSOID[\"WGS 84\",6378137,298.257223563,\n LENGTHUNIT[\"metre\",1]],\n ID[\"EPSG\",6326]],\n PRIMEM[\"Greenwich\",0,\n ANGLEUNIT[\"degree\",0.0174532925199433],\n ID[\"EPSG\",8901]],\n CS[ellipsoidal,2],\n AXIS[\"geodetic longitude (Lon)\",east,\n ORDER[1],\n ANGLEUNIT[\"degree\",0.0174532925199433,\n ID[\"EPSG\",9122]]],\n AXIS[\"geodetic latitude (Lat)\",north,\n ORDER[2],\n ANGLEUNIT[\"degree\",0.0174532925199433,\n ID[\"EPSG\",9122]]]]"</code></pre>
<p>or you can print the WKT to be more readable/organized with:</p>
<div class="sourceCode" id="cb8"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="fu">cat</span>(utm17.wkt)</span></code></pre></div>
<pre><code>## GEOGCRS["WGS 84 (with axis order normalized for visualization)",
## DATUM["World Geodetic System 1984",
## ELLIPSOID["WGS 84",6378137,298.257223563,
## LENGTHUNIT["metre",1]],
## ID["EPSG",6326]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8901]],
## CS[ellipsoidal,2],
## AXIS["geodetic longitude (Lon)",east,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433,
## ID["EPSG",9122]]],
## AXIS["geodetic latitude (Lat)",north,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433,
## ID["EPSG",9122]]]]</code></pre>
<p>Further down the road when you are doing analyses or even plotting in some packages (e.g. <code>tmap</code>) you might get a bunch of warnings like:</p>
<pre><code>Warning message:
In sp::proj4string(obj) : CRS object has comment, which is lost in output</code></pre>
<p>This shouldn’t stop any of the operations but you can “mute” the warnings by running <code>options("rgdal_show_exportToProj4_warnings"="none")</code> in your console. I keep mine “un-muted” to make sure I don’t inadvertently miss something.</p>
<p>If you want to transform a dataset from one datum to another, you will need to use the WKT string. For instance, I use several different state agency spatial datasets, one of which uses <code>NAD83 HARN</code> (which is a discarded datum…I’m still learning about what this means), and I usually work in <code>UTM</code>. I find UTM CRSes easier to work with in general. Going back to the example dataset…if I read the file into <code>R</code> I get:</p>
<pre><code>dat<-readOGR(shapefile) #just as an example
Warning message:
In OGRSpatialRef(dsn, layer, morphFromESRI = morphFromESRI, dumpSRS = dumpSRS, :
Discarded datum NAD83_High_Accuracy_Reference_Network in CRS definition: +proj=tmerc +lat_0=24.3333333333333 +lon_0=-81 +k=0.999941177 +x_0=200000.0001016 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=us-ft +no_defs</code></pre>
<p>That was enough to make my head spin…but if you notice, it’s just a warning message and the file is still read into the <code>R</code> environment. Now to transform the CRS:</p>
<pre><code>dat.tran<-spTransform(dat,utm17.wkt)</code></pre>
<p>But let’s say you are making a <code>SpatialPointsDataFrame</code>; one of the arguments is <code>proj4string</code> (which we are moving away from, and the motivation for this whole post!).</p>
<p>Here is some data…</p>
<div class="sourceCode" id="cb13"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a>dat2<span class="ot"><-</span><span class="fu">data.frame</span>(<span class="at">SITE=</span><span class="fu">c</span>(<span class="dv">1</span>,<span class="dv">2</span>,<span class="dv">3</span>),</span>
<span id="cb13-2"><a href="#cb13-2" aria-hidden="true" tabindex="-1"></a> <span class="at">UTMX=</span><span class="fu">c</span>(<span class="dv">590382</span>,<span class="dv">583910</span>,<span class="dv">585419</span>),</span>
<span id="cb13-3"><a href="#cb13-3" aria-hidden="true" tabindex="-1"></a> <span class="at">UTMY=</span><span class="fu">c</span>(<span class="dv">2830587</span>,<span class="dv">2821685</span>,<span class="dv">2819900</span>))</span>
<span id="cb13-4"><a href="#cb13-4" aria-hidden="true" tabindex="-1"></a>dat2</span></code></pre></div>
<pre><code>## SITE UTMX UTMY
## 1 1 590382 2830587
## 2 2 583910 2821685
## 3 3 585419 2819900</code></pre>
<div class="sourceCode" id="cb15"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb15-1"><a href="#cb15-1" aria-hidden="true" tabindex="-1"></a>dat2.shp<span class="ot"><-</span><span class="fu">SpatialPointsDataFrame</span>(dat2[,<span class="fu">c</span>(<span class="st">"UTMX"</span>,<span class="st">"UTMY"</span>)],</span>
<span id="cb15-2"><a href="#cb15-2" aria-hidden="true" tabindex="-1"></a> <span class="at">data=</span>dat2,</span>
<span id="cb15-3"><a href="#cb15-3" aria-hidden="true" tabindex="-1"></a> <span class="at">proj4string=</span>utm17)</span></code></pre></div>
<p>This is as far as I have been able to work through these changes. They aren’t huge-scale changes to existing workflows, but they’re enough to cause some heartburn.</p>
<p><img src="https://swampthingecology.org/blog\images\20210121_CRS\sherlock_drugged.gif" width="50%" style="display: block; margin: auto;" /></p>
<p>Hope this was helpful (sorry for all the Sherlock gifs)…keep coding, friends.</p>
<p><img src="https://swampthingecology.org/blog\images\20210121_CRS\sherlock_smile.gif" width="40%" style="display: block; margin: auto;" /></p>
<hr />
</section>Nearest Neighbor and Hot Spot Analysis - Geospatial data analysis in #rstats. Part 3b2020-10-08T00:00:00+00:002020-10-08T00:00:00+00:00https://swampthingecology.org/blog/nearest-neighbor-and-hot-spot-analysis---geospatial-data-analysis-in-#rstats.-part-3b<section class="main-content">
<p><strong>Keywords:</strong> geostatistics, R, nearest neighbor, Getis-Ord</p>
<p>As promised, here is another follow-up to our geospatial data analysis blog series. So far we have covered interpolation, spatial autocorrelation and the basics of Hot-Spot (Getis-Ord) analysis.</p>
<ul>
<li><p>Part I: <a href="https://swampthingecology.org/blog/geospatial-data-analysis-in-rstats.-part-1/" target="_blank">Interpolation</a></p></li>
<li><p>Part 2: <a href="https://swampthingecology.org/blog/geospatial-data-analysis-in-rstats.-part-2/" target="_blank">Spatial Autocorrelation</a></p></li>
<li><p>Part 3: <a href="https://swampthingecology.org/blog/hot-spot-analysis-geospatial-data-analysis-in-rstats.-part-3/" target="_blank">Hot Spot Analysis</a></p></li>
</ul>
<p>In this post we will discuss nearest neighbor estimates and how they can affect hot spot detection. In essence this is <strong>“Getis-Ord Strikes Back”</strong> (sorry, my Star Wars nerd is showing).</p>
<hr />
<p>Let’s take a step back before jumping back into nearest neighbor (see my post on <a href="https://swampthingecology.org/blog/geospatial-data-analysis-in-rstats.-part-2/" target="_blank">Moran’s <em>I</em></a>). Most spatial statistics compare a test statistic estimated from the data to an expected value given the null hypothesis of complete spatial randomness (CSR; <span class="citation">Fortin and Dale (<a href="#ref-fortin_spatial_2005" role="doc-biblioref">2005</a>)</span>; <em>not to be confused with <code>CRS(...)</code>, coordinate reference system</em>). CSR is a point process model whose expectation can be derived from a particular distribution, in most cases a Poisson <span class="citation">(Diggle <a href="#ref-diggle_spatio-temporal_2006" role="doc-biblioref">2006</a>)</span>. A common theme in the analysis of spatial point patterns, whether with Moran’s <em>I</em>, Getis-Ord <em>G</em> or Ripley’s <em>K</em>, is that CSR serves as a dividing hypothesis <span class="citation">(Cox <a href="#ref-cox_role_1977" role="doc-biblioref">1977</a>)</span>, which leads to classification of patterns as random (complete spatial randomness), under-dispersed (clumped or aggregated), or over-dispersed (spaced or regular).</p>
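<p>If you want to see what CSR looks (and tests) like in practice, here is a small illustration of my own (not part of the analysis below) that simulates a homogeneous Poisson pattern and a clustered pattern with the <code>spatstat</code> package and screens each with a quadrat count test:</p>
<pre><code>library(spatstat)

set.seed(42)
csr.pp   <- rpoispp(lambda = 100)      # complete spatial randomness (Poisson) on the unit square
clust.pp <- rThomas(10, 0.05, 10)      # a clustered (Thomas) process for comparison

quadrat.test(csr.pp, nx = 4, ny = 4)   # should be consistent with CSR
quadrat.test(clust.pp, nx = 4, ny = 4) # typically rejects CSR (under-dispersed/clumped)</code></pre>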
<!-- resource:
https://joparga3.github.io/spatial_point_pattern/
https://www.seas.upenn.edu/~ese502/NOTEBOOK/Part_I/2_Models_of_Spatial_Randomness.pdf
https://training.fws.gov/courses/references/tutorials/geospatial/CSP7304/documents/PointPatterTutorial.pdf
### Models of Spatial Randomness
The _Principle of Insufficient Reason_ or Laplace Principle asserts that if there is no information to indicate that either of two events is more likely than others, then they should be treated as equally likely. Translating this into a graphical explanation, if we have an area divided in equal areas, there is no reason to believe that this point is more likely to appear in either left half or the (identical) right half. If we look at the image below, for the first case, any given point should have the same probability (1/2) of appearing in either half of the area. If we divide the areas again by half, then points should have the same probability (1/4) of appearing in any of the 4 squares and so on.
<img src="https://swampthingecology.org/blog/knitr_files/2020-10-08-NN_HotSpot_files/figure-html/Laplace-1.png" style="display: block; margin: auto;" />
Therefore the assumptions of spatially random models are:
1. Without any given information on the likelihood of events occurring being different across the dataset (study area), the probability should be the same for all events across the study area (Laplace Principal).
2. Locations of points have no influence on one another (i.e. spatial autocorrelation)
-->
<p>Below we are going to import some data, use different techniques to estimate nearest neighbor and see how that affects Hot spot detection.</p>
<div id="lets-get-started" class="section level3">
<h3>Let’s get started</h3>
<p>Before we get too deep into things here are the necessary packages we will be using.</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb1-1" title="1"><span class="co">## Libraries</span></a>
<a class="sourceLine" id="cb1-2" title="2"><span class="co"># read xlsx files</span></a>
<a class="sourceLine" id="cb1-3" title="3"><span class="kw">library</span>(readxl)</a>
<a class="sourceLine" id="cb1-4" title="4"></a>
<a class="sourceLine" id="cb1-5" title="5"><span class="co"># Geospatial </span></a>
<a class="sourceLine" id="cb1-6" title="6"><span class="kw">library</span>(rgdal)</a>
<a class="sourceLine" id="cb1-7" title="7"><span class="kw">library</span>(rgeos)</a>
<a class="sourceLine" id="cb1-8" title="8"><span class="kw">library</span>(raster)</a>
<a class="sourceLine" id="cb1-9" title="9"><span class="kw">library</span>(spdep)</a></code></pre></div>
<p>Same data and links from last post.</p>
<ul>
<li><p>Download the data (as a zip file) <a href="https://www.epa.gov/sites/production/files/2014-03/sf1data.zip" target="_blank">here</a>!</p></li>
<li><p>Download the Water Conservation Areas shapefile <a href="https://www.swampthingecology.org/blog/data/hotspot/WCAs.zip" target="_blank">here</a>!</p></li>
</ul>
<div class="sourceCode" id="cb2"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb2-1" title="1"><span class="co"># Define spatial datum</span></a>
<a class="sourceLine" id="cb2-2" title="2">utm17<-<span class="kw">CRS</span>(<span class="st">"+proj=utm +zone=17 +datum=WGS84 +units=m"</span>)</a>
<a class="sourceLine" id="cb2-3" title="3">wgs84<-<span class="kw">CRS</span>(<span class="st">"+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"</span>)</a>
<a class="sourceLine" id="cb2-4" title="4"></a>
<a class="sourceLine" id="cb2-5" title="5"><span class="co"># Read shapefile</span></a>
<a class="sourceLine" id="cb2-6" title="6">wcas<-<span class="kw">readOGR</span>(GISdata,<span class="st">"WCAs"</span>)</a>
<a class="sourceLine" id="cb2-7" title="7">wcas<-<span class="kw">spTransform</span>(wcas,utm17)</a>
<a class="sourceLine" id="cb2-8" title="8"></a>
<a class="sourceLine" id="cb2-9" title="9"><span class="co"># Read the spreadsheet</span></a>
<a class="sourceLine" id="cb2-10" title="10">p12<-readxl<span class="op">::</span><span class="kw">read_xls</span>(<span class="st">"data/P12join7FINAL.xls"</span>,<span class="dt">sheet=</span><span class="dv">2</span>)</a>
<a class="sourceLine" id="cb2-11" title="11"></a>
<a class="sourceLine" id="cb2-12" title="12"><span class="co"># Clean up the headers</span></a>
<a class="sourceLine" id="cb2-13" title="13"><span class="kw">colnames</span>(p12)<-<span class="kw">sapply</span>(<span class="kw">strsplit</span>(<span class="kw">names</span>(p12),<span class="st">"</span><span class="ch">\\</span><span class="st">$"</span>),<span class="st">"["</span>,<span class="dv">1</span>)</a>
<a class="sourceLine" id="cb2-14" title="14">p12<-<span class="kw">data.frame</span>(p12)</a>
<a class="sourceLine" id="cb2-15" title="15">p12[p12<span class="op">==-</span><span class="dv">9999</span>]<-<span class="ot">NA</span></a>
<a class="sourceLine" id="cb2-16" title="16">p12[p12<span class="op">==-</span><span class="fl">3047.6952</span>]<-<span class="ot">NA</span></a>
<a class="sourceLine" id="cb2-17" title="17"></a>
<a class="sourceLine" id="cb2-18" title="18"><span class="co"># Convert the data.frame() to SpatialPointsDataFrame</span></a>
<a class="sourceLine" id="cb2-19" title="19">vars<-<span class="kw">c</span>(<span class="st">"STA_ID"</span>,<span class="st">"CYCLE"</span>,<span class="st">"SUBAREA"</span>,<span class="st">"DECLONG"</span>,<span class="st">"DECLAT"</span>,<span class="st">"DATE"</span>,<span class="st">"TPSDF"</span>)</a>
<a class="sourceLine" id="cb2-20" title="20">p12.shp<-<span class="kw">SpatialPointsDataFrame</span>(<span class="dt">coords=</span>p12[,<span class="kw">c</span>(<span class="st">"DECLONG"</span>,<span class="st">"DECLAT"</span>)],</a>
<a class="sourceLine" id="cb2-21" title="21"> <span class="dt">data=</span>p12[,vars],<span class="dt">proj4string =</span>wgs84)</a>
<a class="sourceLine" id="cb2-22" title="22"><span class="co"># transform to UTM (something I like to do...but not necessary)</span></a>
<a class="sourceLine" id="cb2-23" title="23">p12.shp<-<span class="kw">spTransform</span>(p12.shp,utm17)</a>
<a class="sourceLine" id="cb2-24" title="24"></a>
<a class="sourceLine" id="cb2-25" title="25"><span class="co"># Subset the data for wet season data only and only WCA sites</span></a>
<a class="sourceLine" id="cb2-26" title="26">p12.shp2<-<span class="kw">subset</span>(p12.shp,CYCLE<span class="op">%in%</span><span class="kw">c</span>(<span class="dv">0</span>,<span class="dv">2</span>))</a>
<a class="sourceLine" id="cb2-27" title="27">p12.shp.wca<-p12.shp2[wcas,]</a>
<a class="sourceLine" id="cb2-28" title="28"></a>
<a class="sourceLine" id="cb2-29" title="29"><span class="co"># Double check for NAs in the dataset</span></a>
<a class="sourceLine" id="cb2-30" title="30"><span class="kw">subset</span>(p12.shp.wca<span class="op">@</span>data,<span class="kw">is.na</span>(TPSDF)<span class="op">==</span>T)</a>
<a class="sourceLine" id="cb2-31" title="31"></a>
<a class="sourceLine" id="cb2-32" title="32"><span class="co"># Remove NA sample</span></a>
<a class="sourceLine" id="cb2-33" title="33">p12.shp.wca<-<span class="kw">subset</span>(p12.shp.wca,<span class="kw">is.na</span>(TPSDF)<span class="op">==</span>F)</a></code></pre></div>
<p>Here is a quick map of the subsetted data</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb3-1" title="1"><span class="kw">par</span>(<span class="dt">mar=</span><span class="kw">c</span>(<span class="fl">0.1</span>,<span class="fl">0.1</span>,<span class="fl">0.1</span>,<span class="fl">0.1</span>),<span class="dt">oma=</span><span class="kw">c</span>(<span class="dv">0</span>,<span class="dv">0</span>,<span class="dv">0</span>,<span class="dv">0</span>))</a>
<a class="sourceLine" id="cb3-2" title="2"><span class="kw">plot</span>(wcas)</a>
<a class="sourceLine" id="cb3-3" title="3"><span class="kw">plot</span>(p12.shp.wca,<span class="dt">pch=</span><span class="dv">21</span>,<span class="dt">bg=</span><span class="kw">adjustcolor</span>(<span class="st">"dodgerblue1"</span>,<span class="fl">0.5</span>),<span class="dt">cex=</span><span class="dv">1</span>,<span class="dt">add=</span>T)</a>
<a class="sourceLine" id="cb3-4" title="4">mapmisc<span class="op">::</span><span class="kw">scaleBar</span>(utm17,<span class="st">"bottomright"</span>,<span class="dt">bty=</span><span class="st">"n"</span>,<span class="dt">cex=</span><span class="dv">1</span>,<span class="dt">seg.len=</span><span class="dv">4</span>)</a></code></pre></div>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2020-10-08-NN_HotSpot_files/figure-html/unnamed-chunk-3-1.png" alt="Monitoring location from R-EMAP Phase I, wet season sampling (cycles 0 and 2) within the Water Conservation Areas." />
<p class="caption">
Monitoring locations from R-EMAP Phase I, wet season sampling (cycles 0 and 2) within the Water Conservation Areas.
</p>
</div>
</div>
<div id="nearest-neighbor" class="section level3">
<h3>Nearest Neighbor</h3>
<p>As discussed in our prior blog post, average nearest neighbor (ANN) analysis measures the average distance from each point in the study area to its nearest point. In some cases, this method can be sensitive to which distance bands are identified, and that sensitivity can be carried forward into other analyses that rely on nearest neighbor spatial weighting. However, the ANN statistic is one of many distance-based point pattern statistics that can be used to spatially weight a dataset for spatial statistical evaluation. Others include the K, L and pair correlation functions (g; not to be confused with Getis-Ord <em>G</em>) <span class="citation">(Gimond <a href="#ref-gimond_intro_2020" role="doc-biblioref">2020</a>)</span>.</p>
<!-- https://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/h-how-average-nearest-neighbor-distance-spatial-st.htm#:~:text=The%20average%20nearest%20neighbor%20ratio,covering%20the%20same%20total%20area). -->
<p>One way to spatially weight the data is by using the <code>dnearneigh()</code> function, which identifies neighbors within the lower and upper bounds (provided to the function) by Euclidean distance. Here is where the selection of “distance bands” matters. This function was used in the initial <a href="https://swampthingecology.org/blog/hot-spot-analysis-geospatial-data-analysis-in-rstats.-part-3/" target="_blank">Hot-Spot</a> blog post. Let’s see how changing the upper bound in <code>dnearneigh()</code> can affect the outcome.</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb4-1" title="1"><span class="co"># Find distance range</span></a>
<a class="sourceLine" id="cb4-2" title="2">ptdist=<span class="kw">pointDistance</span>(p12.shp.wca)</a>
<a class="sourceLine" id="cb4-3" title="3"></a>
<a class="sourceLine" id="cb4-4" title="4">min.dist<-<span class="kw">min</span>(ptdist); <span class="co"># Minimum</span></a>
<a class="sourceLine" id="cb4-5" title="5"></a>
<a class="sourceLine" id="cb4-6" title="6">q10.dist<-<span class="kw">as.numeric</span>(<span class="kw">quantile</span>(ptdist,<span class="dt">probs=</span><span class="fl">0.10</span>)); <span class="co"># Q10</span></a>
<a class="sourceLine" id="cb4-7" title="7">q25.dist<-<span class="kw">as.numeric</span>(<span class="kw">quantile</span>(ptdist,<span class="dt">probs=</span><span class="fl">0.25</span>)); <span class="co"># Q25</span></a>
<a class="sourceLine" id="cb4-8" title="8">q75.dist<-<span class="kw">as.numeric</span>(<span class="kw">quantile</span>(ptdist,<span class="dt">probs=</span><span class="fl">0.75</span>)); <span class="co"># Q75</span></a>
<a class="sourceLine" id="cb4-9" title="9"></a>
<a class="sourceLine" id="cb4-10" title="10"><span class="co"># Using 25th percentile distance for upper bound</span></a>
<a class="sourceLine" id="cb4-11" title="11">nb.q10<-<span class="kw">dnearneigh</span>(<span class="kw">coordinates</span>(p12.shp.wca),min.dist,q10.dist)</a>
<a class="sourceLine" id="cb4-12" title="12"></a>
<a class="sourceLine" id="cb4-13" title="13"><span class="co"># Using 25th percentile distance for upper bound</span></a>
<a class="sourceLine" id="cb4-14" title="14">nb.q25<-<span class="kw">dnearneigh</span>(<span class="kw">coordinates</span>(p12.shp.wca),min.dist,q25.dist)</a>
<a class="sourceLine" id="cb4-15" title="15"></a>
<a class="sourceLine" id="cb4-16" title="16"><span class="co"># Using 75th percentile distance for upper bound</span></a>
<a class="sourceLine" id="cb4-17" title="17">nb.q75<-<span class="kw">dnearneigh</span>(<span class="kw">coordinates</span>(p12.shp.wca),min.dist,q75.dist)</a></code></pre></div>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2020-10-08-NN_HotSpot_files/figure-html/unnamed-chunk-5-1.png" alt="Neighborhood network with different upper bound values" />
<p class="caption">
Neighborhood network with different upper bound values
</p>
</div>
<p>As you can see, the number of links between locations increases as the upper bound is expanded, thereby increasing the average number of links within the network. How would this potentially influence the detection of clusters within the dataset? Remember the last <a href="https://swampthingecology.org/blog/hot-spot-analysis-geospatial-data-analysis-in-rstats.-part-3/" target="_blank">Hot-Spot</a> blog post? Well, let’s run through the code; below, the 10th percentile is used as the upper bound as an example.</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb5-1" title="1"><span class="co"># Convert nearest neighbor to a list</span></a>
<a class="sourceLine" id="cb5-2" title="2">nb_lw<-<span class="kw">nb2listw</span>(nb.q10)</a>
<a class="sourceLine" id="cb5-3" title="3"></a>
<a class="sourceLine" id="cb5-4" title="4"><span class="co"># local G</span></a>
<a class="sourceLine" id="cb5-5" title="5">local_g<-<span class="kw">localG</span>(p12.shp.wca<span class="op">$</span>TPSDF,nb_lw)</a>
<a class="sourceLine" id="cb5-6" title="6"></a>
<a class="sourceLine" id="cb5-7" title="7"><span class="co"># convert to matrix</span></a>
<a class="sourceLine" id="cb5-8" title="8">local_g.ma=<span class="kw">as.matrix</span>(local_g)</a>
<a class="sourceLine" id="cb5-9" title="9"></a>
<a class="sourceLine" id="cb5-10" title="10"><span class="co"># column-bind the local_g data</span></a>
<a class="sourceLine" id="cb5-11" title="11">p12.shp.wca<-<span class="kw">cbind</span>(p12.shp.wca,local_g.ma)</a>
<a class="sourceLine" id="cb5-12" title="12"></a>
<a class="sourceLine" id="cb5-13" title="13"><span class="co"># change the names of the new column</span></a>
<a class="sourceLine" id="cb5-14" title="14"><span class="kw">names</span>(p12.shp.wca)[<span class="kw">ncol</span>(p12.shp.wca)]=<span class="st">"localg.Q10"</span></a>
<a class="sourceLine" id="cb5-15" title="15"></a>
<a class="sourceLine" id="cb5-16" title="16"><span class="co"># determine p-value of z-score</span></a>
<a class="sourceLine" id="cb5-17" title="17">p12.shp.wca<span class="op">$</span>pval.q10<-<span class="st"> </span><span class="dv">2</span><span class="op">*</span><span class="kw">pnorm</span>(<span class="op">-</span><span class="kw">abs</span>(p12.shp.wca<span class="op">$</span>localg.Q10))</a>
<a class="sourceLine" id="cb5-18" title="18"></a>
<a class="sourceLine" id="cb5-19" title="19"><span class="co"># See if any site is a "Hot-Spot"</span></a>
<a class="sourceLine" id="cb5-20" title="20"><span class="kw">subset</span>(p12.shp.wca<span class="op">@</span>data,localg.Q10<span class="op">></span><span class="dv">0</span><span class="op">&</span>pval.q10<span class="op"><</span><span class="fl">0.05</span>)<span class="op">$</span>STA_ID</a></code></pre></div>
<pre><code>## [1] "M009" "M011" "M012" "M014" "M015" "M024" "M025" "M027" "M028" "M029"
## [11] "M032" "M033" "M034" "M260" "M261" "M262" "M274" "M276" "M278" "M280"
## [21] "M282"</code></pre>
<p>Looks like a couple of sites are considered Hot-Spots. Now do the same thing for <code>nb.q25</code> and <code>nb.q75</code> (a small helper for repeating those steps is sketched below) and this is what you get.</p>
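<p>This wrapper (my own convenience function, not from the original post) bundles the weight-list conversion, local <em>G</em> calculation and p-value step so the same procedure can be applied to each neighbour object:</p>
<pre><code># Local Getis-Ord G* and two-sided p-values for a given neighbour object
# (assumes p12.shp.wca and the nb.* objects created above already exist)
local_g_stats <- function(nb, values){
  lw <- nb2listw(nb)                 # neighbours -> spatial weights list
  g  <- as.numeric(localG(values, lw))
  data.frame(localg = g, pval = 2*pnorm(-abs(g)))
}

q25.res <- local_g_stats(nb.q25, p12.shp.wca$TPSDF)
q75.res <- local_g_stats(nb.q75, p12.shp.wca$TPSDF)

# Hot-Spot sites under each distance band
p12.shp.wca$STA_ID[q25.res$localg > 0 & q25.res$pval < 0.05]
p12.shp.wca$STA_ID[q75.res$localg > 0 & q75.res$pval < 0.05]</code></pre>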
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2020-10-08-NN_HotSpot_files/figure-html/unnamed-chunk-8-1.png" alt="Soil total phosphorus hot-spots identified using the Getis-Ord $G_{i}^{*}$ spatial statistic based on different nearest neighbor bands." />
<p class="caption">
Soil total phosphorus hot-spots identified using the Getis-Ord <span class="math inline">\(G_{i}^{*}\)</span> spatial statistic based on different nearest neighbor bands.
</p>
</div>
<p>Hot-Spots are identified with <span class="math inline">\(G_{i}^{*}\)</span> > 0 and associated with significant <span class="math inline">\(\rho\)</span> values (in this case our <span class="math inline">\(\alpha\)</span> is 0.05). Alternatively, “Cold-Spots”, or areas associated with clustering of relatively low values, are identified with <span class="math inline">\(G_{i}^{*}\)</span> < 0 (and significant <span class="math inline">\(\rho\)</span> values). Across the three different distance bands, you can see a potential shift in Hot-Spots and the occurrence (and shift) of Cold-Spots across the study area.</p>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2020-10-08-NN_HotSpot_files/figure-html/unnamed-chunk-9-1.png" alt="Number of sites identified as Hot-Spots across the study by Nearest Neigbhor upper bound band." />
<p class="caption">
Number of sites identified as Hot-Spots across the study by Nearest Neighbor upper bound band.
</p>
</div>
<p>An alternative to selecting distance bands is to use a different approach, such as the K-function or k nearest neighbors. The K-function summarizes the distances between points for <em>all</em> distances <span class="citation">(Gimond <a href="#ref-gimond_intro_2020" role="doc-biblioref">2020</a>)</span>. This approach can also be sensitive to its tuning (here, the number of neighbors, k), but less so than the distance bands above. With k nearest neighbors via <code>knearneigh()</code>, if k gets large the function will give a warning letting you know, but will still compute the values anyway.</p>
<pre><code>Warning messages:
1: In knearneigh(p12.shp.wca, k = 45) :
k greater than one-third of the number of data points</code></pre>
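<p>For what it’s worth, the curve in the figure below can be generated with a few lines (a sketch of my own, looping over candidate k values and averaging the neighbour distances):</p>
<pre><code># Average distance to the k nearest neighbours for a range of k values
k.vals <- 1:45
ann.k <- sapply(k.vals, function(k){
  knn <- knearneigh(coordinates(p12.shp.wca), k = k)
  mean(unlist(nbdists(knn2nb(knn), coordinates(p12.shp.wca))))
})
plot(k.vals, ann.k, type = "b", las = 1,
     xlab = "Number of neighbours (k)", ylab = "Mean neighbour distance (m)")</code></pre>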
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2020-10-08-NN_HotSpot_files/figure-html/unnamed-chunk-10-1.png" alt="The affect of the number of nearest neighbors on average nearest neighbor distance." />
<p class="caption">
The effect of the number of nearest neighbors on average nearest neighbor distance.
</p>
</div>
<p>Based on the plot above, a <code>k=6</code> seems to be conservative enough. As suggested in the last blog post this could be done by…</p>
<div class="sourceCode" id="cb8"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb8-1" title="1">k1<-<span class="kw">knn2nb</span>(<span class="kw">knearneigh</span>(p12.shp.wca,<span class="dt">k=</span><span class="dv">6</span>))</a></code></pre></div>
<!-- Some resources
https://daviddalpiaz.github.io/r4sl/knn-class.html
-->
<div class="sourceCode" id="cb9"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb9-1" title="1"><span class="co"># Convert nearest neighbor to a list</span></a>
<a class="sourceLine" id="cb9-2" title="2">nb_lw<-<span class="kw">nb2listw</span>(k1)</a>
<a class="sourceLine" id="cb9-3" title="3"></a>
<a class="sourceLine" id="cb9-4" title="4"><span class="co"># local G</span></a>
<a class="sourceLine" id="cb9-5" title="5">local_g<-<span class="kw">localG</span>(p12.shp.wca<span class="op">$</span>TPSDF,nb_lw)</a>
<a class="sourceLine" id="cb9-6" title="6"></a>
<a class="sourceLine" id="cb9-7" title="7"><span class="co"># convert to matrix</span></a>
<a class="sourceLine" id="cb9-8" title="8">local_g.ma=<span class="kw">as.matrix</span>(local_g)</a>
<a class="sourceLine" id="cb9-9" title="9"></a>
<a class="sourceLine" id="cb9-10" title="10"><span class="co"># column-bind the local_g data</span></a>
<a class="sourceLine" id="cb9-11" title="11">p12.shp.wca<-<span class="kw">cbind</span>(p12.shp.wca,local_g.ma)</a>
<a class="sourceLine" id="cb9-12" title="12"></a>
<a class="sourceLine" id="cb9-13" title="13"><span class="co"># change the names of the new column</span></a>
<a class="sourceLine" id="cb9-14" title="14"><span class="kw">names</span>(p12.shp.wca)[<span class="kw">ncol</span>(p12.shp.wca)]=<span class="st">"localg.k"</span></a>
<a class="sourceLine" id="cb9-15" title="15"></a>
<a class="sourceLine" id="cb9-16" title="16"><span class="co"># determine p-value of z-score</span></a>
<a class="sourceLine" id="cb9-17" title="17">p12.shp.wca<span class="op">$</span>pval.k<-<span class="st"> </span><span class="dv">2</span><span class="op">*</span><span class="kw">pnorm</span>(<span class="op">-</span><span class="kw">abs</span>(p12.shp.wca<span class="op">$</span>localg.k))</a></code></pre></div>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2020-10-08-NN_HotSpot_files/figure-html/unnamed-chunk-13-1.png" alt="Soil total phosphorus hot-spots identified using the Getis-Ord $G_{i}^{*}$ spatial statistic with k-function nearest neighbor spatial weighting." />
<p class="caption">
Soil total phosphorus hot-spots identified using the Getis-Ord <span class="math inline">\(G_{i}^{*}\)</span> spatial statistic with k-function nearest neighbor spatial weighting.
</p>
</div>
<p>Using k-nearest neighbor weighting, hot-spots occur in the same general area as in the other evaluations presented above. As suggested in the original Hot-Spot blog, the selection of spatial weights is important and the test is sensitive to the weights assigned.</p>
<p>The next post will cover how spatial aggregation can play a role in Hot-Spot detection. Until then I’ll leave you with this quote that helps put spatial statistical analysis into perspective.</p>
<blockquote>
<p>“The first law of geography: Everything is related to everything else, but near things are more related than distant things.” <span class="citation">(Tobler <a href="#ref-tobler_computer_1970" role="doc-biblioref">1970</a>)</span></p>
</blockquote>
<hr />
<div id="refs" class="references">
<div id="ref-cox_role_1977">
<p>Cox, D. R. 1977. “The Role of Significance Tests.” <em>Scandinavian Journal of Statistics</em> 4 (2): 49–63. <a href="https://www.jstor.org/stable/4615652">https://www.jstor.org/stable/4615652</a>.</p>
</div>
<div id="ref-diggle_spatio-temporal_2006">
<p>Diggle, Peter J. 2006. “Spatio-Temporal Point Processes: Methods and Applications.” <em>Monographs on Statistics and Applied Probability</em> 107: 1–45.</p>
</div>
<div id="ref-fortin_spatial_2005">
<p>Fortin, Marie-Josée, and Mark R. T. Dale. 2005. <em>Spatial Analysis: A Guide for Ecologists</em>. Cambridge University Press.</p>
</div>
<div id="ref-gimond_intro_2020">
<p>Gimond, Manuel. 2020. <em>Intro to GIS and Spatial Analysis</em>. <a href="https://mgimond.github.io/Spatial/index.html">https://mgimond.github.io/Spatial/index.html</a>.</p>
</div>
<div id="ref-tobler_computer_1970">
<p>Tobler, W. R. 1970. “A Computer Movie Simulating Urban Growth in the Detroit Region.” <em>Economic Geography</em> 46 (June): 234. <a href="https://doi.org/10.2307/143141">https://doi.org/10.2307/143141</a>.</p>
</div>
</div>
</div>
</section>Hot Spot Analysis - Geospatial data analysis in #rstats. Part 32020-09-18T00:00:00+00:002020-09-18T00:00:00+00:00https://swampthingecology.org/blog/hot-spot-analysis---geospatial-data-analysis-in-#rstats.-part-3<section class="main-content">
<p><strong>Keywords:</strong> geostatistics, R, hot-spot, Getis-Ord</p>
<hr />
<p>Continuing our series on geospatial analysis we are diving deeper into spatial statistics Hot-spot analysis. In my prior posts I presented spatial interpolation techniques such as <a href="https://swampthingpaul.github.io/blog/geospatial-data-analysis-in-rstats.-part-1/" target="_blank">kriging</a> and spatial auto-correlation with <a href="https://swampthingpaul.github.io/blog/geospatial-data-analysis-in-rstats.-part-2/" target="_blank">Moran’s <em>I</em></a>.</p>
<p>Kriging is a valuable tool to detect spatial structure and patterns across a particular area. These spatial models rely on understanding spatial correlation and auto-correlation. A common feature of spatial correlation/auto-correlation analyses is that they are applied on a global scale (the entire dataset). In some cases, it may be warranted to examine patterns at a more local (fine) scale. The Getis-Ord <em>G</em> statistic provides information on local spatial structure and can identify areas of high (or low) clustering. This clustering is operationally defined as <strong>hot-spots</strong> and is identified by comparing the sum of a particular variable within a local neighborhood network to the global sum across the area-of-interest extent <span class="citation">(Getis and Ord <a href="#ref-getis_analysis_2010" role="doc-biblioref">2010</a>)</span>.</p>
<p><span class="citation">Getis and Ord (<a href="#ref-getis_analysis_2010" role="doc-biblioref">2010</a>)</span> introduced a family of measures of spatial associated called <em>G</em> statistics. When used with spatial auto-correlation statistics such as Moran’s <em>I</em>, the <em>G</em> family of statistics can expand our understanding of processes that give rise to spatial association, in detecting local hot-spots (in their original paper they used the term “pockets”). The Getis-Ord statistic can be used in the global (<span class="math inline">\(G\)</span>) and local (<span class="math inline">\(G_{i}^{*}\)</span>) scales. The global statistic (<span class="math inline">\(G\)</span>) identifies high or low values across the entire study area (i.e. forest, wetland, city, etc.), meanwhile the local (<span class="math inline">\(G_{i}^{*}\)</span>) statistic evaluates the data for each feature within the dataset and determining where features with high or low values (“pockets” or hot/cold) cluster spatially.</p>
<p>At this point I would probably throw some equations around and give you the mathematical nitty gritty. Given I am not a maths wiz and <span class="citation">Getis and Ord (<a href="#ref-getis_analysis_2010" role="doc-biblioref">2010</a>)</span> provides all the detail (lots of nitty and a little gritty) in such an eloquent fashion I’ll leave it up to you if you want to peruse the manuscript. The Getis-Ord statistic has been applied across several different fields including crime analysis, epidemiology and a couple of forays into biogeochemistry and ecology.</p>
<div id="play-time" class="section level2">
<h2>Play Time</h2>
<p>For this example I will be using a dataset from the United States Environmental Protection Agency (USEPA) as part of the Everglades Regional Environmental Monitoring Program (<a href="https://www.epa.gov/everglades/environmental-monitoring-everglades" target="_blank">R-EMAP</a>).</p>
<div id="some-on-the-dataset" class="section level3">
<h3>Some on the dataset</h3>
<p>The Everglades R-EMAP program has been monitoring the Everglades ecosystem since 1993 using a probability-based sampling approach covering ~5000 km<sup>2</sup> from a multi-media perspective (water, sediment, fish, etc.). This large-scale sampling has occurred in four phases: Phase I (1995 - 1996), Phase II (1999), Phase III (2005) and Phase IV (2013 - 2014). For the purposes of this post, we will focus on sediment/soil total phosphorus concentrations collected during the wet-season sampling of Phase I (April 1995 & May 1996).</p>
</div>
<div id="analysis-time" class="section level3">
<h3>Analysis time!!</h3>
<p>Here are the necessary packages.</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb1-1" title="1"><span class="co">## Libraries</span></a>
<a class="sourceLine" id="cb1-2" title="2"><span class="co"># read xlsx files</span></a>
<a class="sourceLine" id="cb1-3" title="3"><span class="kw">library</span>(readxl)</a>
<a class="sourceLine" id="cb1-4" title="4"></a>
<a class="sourceLine" id="cb1-5" title="5"><span class="co"># Geospatial </span></a>
<a class="sourceLine" id="cb1-6" title="6"><span class="kw">library</span>(rgdal)</a>
<a class="sourceLine" id="cb1-7" title="7"><span class="kw">library</span>(rgeos)</a>
<a class="sourceLine" id="cb1-8" title="8"><span class="kw">library</span>(raster)</a>
<a class="sourceLine" id="cb1-9" title="9"><span class="kw">library</span>(spdep)</a></code></pre></div>
<p>In case you are not sure whether you have these packages installed, here is a quick function that will check for the packages and, if needed, install them from CRAN.</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb2-1" title="1"><span class="co"># Function</span></a>
<a class="sourceLine" id="cb2-2" title="2">check.packages <-<span class="st"> </span><span class="cf">function</span>(pkg){</a>
<a class="sourceLine" id="cb2-3" title="3"> new.pkg <-<span class="st"> </span>pkg[<span class="op">!</span>(pkg <span class="op">%in%</span><span class="st"> </span><span class="kw">installed.packages</span>()[, <span class="st">"Package"</span>])]</a>
<a class="sourceLine" id="cb2-4" title="4"> <span class="cf">if</span> (<span class="kw">length</span>(new.pkg)) </a>
<a class="sourceLine" id="cb2-5" title="5"> <span class="kw">install.packages</span>(new.pkg, <span class="dt">dependencies =</span> <span class="ot">TRUE</span>)</a>
<a class="sourceLine" id="cb2-6" title="6"> <span class="kw">sapply</span>(pkg, require, <span class="dt">character.only =</span> <span class="ot">TRUE</span>)</a>
<a class="sourceLine" id="cb2-7" title="7">}</a>
<a class="sourceLine" id="cb2-8" title="8"></a>
<a class="sourceLine" id="cb2-9" title="9">pkg<-<span class="kw">c</span>(<span class="st">"openxlsx"</span>,<span class="st">"readxl"</span>,<span class="st">"rgdal"</span>,<span class="st">"rgeos"</span>,<span class="st">"raster"</span>,<span class="st">"spdep"</span>)</a>
<a class="sourceLine" id="cb2-10" title="10"><span class="kw">check.packages</span>(pkg)</a></code></pre></div>
<p>Download the data (as a zip file) <a href="https://www.epa.gov/sites/production/files/2014-03/sf1data.zip" target="_blank">here</a>!</p>
<p>Download the Water Conservation Area shapefile <a href="https://www.swampthingecology.org/blog/data/hotspot/WCAs.zip" target="_blank">here</a>!</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb3-1" title="1"><span class="co"># Define spatial datum</span></a>
<a class="sourceLine" id="cb3-2" title="2">utm17<-<span class="kw">CRS</span>(<span class="st">"+proj=utm +zone=17 +datum=WGS84 +units=m"</span>)</a>
<a class="sourceLine" id="cb3-3" title="3">wgs84<-<span class="kw">CRS</span>(<span class="st">"+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"</span>)</a>
<a class="sourceLine" id="cb3-4" title="4"></a>
<a class="sourceLine" id="cb3-5" title="5"><span class="co"># Read shapefile</span></a>
<a class="sourceLine" id="cb3-6" title="6">wcas<-<span class="kw">readOGR</span>(GISdata,<span class="st">"WCAs"</span>)</a>
<a class="sourceLine" id="cb3-7" title="7">wcas<-<span class="kw">spTransform</span>(wcas,utm17)</a>
<a class="sourceLine" id="cb3-8" title="8"></a>
<a class="sourceLine" id="cb3-9" title="9"><span class="co"># Read the spreadsheet</span></a>
<a class="sourceLine" id="cb3-10" title="10">p12<-readxl<span class="op">::</span><span class="kw">read_xls</span>(<span class="st">"data/P12join7FINAL.xls"</span>,<span class="dt">sheet=</span><span class="dv">2</span>)</a>
<a class="sourceLine" id="cb3-11" title="11"></a>
<a class="sourceLine" id="cb3-12" title="12"><span class="co"># Clean up the headers</span></a>
<a class="sourceLine" id="cb3-13" title="13"><span class="kw">colnames</span>(p12)<-<span class="kw">sapply</span>(<span class="kw">strsplit</span>(<span class="kw">names</span>(p12),<span class="st">"</span><span class="ch">\\</span><span class="st">$"</span>),<span class="st">"["</span>,<span class="dv">1</span>)</a>
<a class="sourceLine" id="cb3-14" title="14">p12<-<span class="kw">data.frame</span>(p12)</a>
<a class="sourceLine" id="cb3-15" title="15">p12[p12<span class="op">==-</span><span class="dv">9999</span>]<-<span class="ot">NA</span></a>
<a class="sourceLine" id="cb3-16" title="16">p12[p12<span class="op">==-</span><span class="fl">3047.6952</span>]<-<span class="ot">NA</span></a>
<a class="sourceLine" id="cb3-17" title="17"></a>
<a class="sourceLine" id="cb3-18" title="18"><span class="co"># Convert the data.frame() to SpatialPointsDataFrame</span></a>
<a class="sourceLine" id="cb3-19" title="19">vars<-<span class="kw">c</span>(<span class="st">"STA_ID"</span>,<span class="st">"CYCLE"</span>,<span class="st">"SUBAREA"</span>,<span class="st">"DECLONG"</span>,<span class="st">"DECLAT"</span>,<span class="st">"DATE"</span>,<span class="st">"TPSDF"</span>)</a>
<a class="sourceLine" id="cb3-20" title="20">p12.shp<-<span class="kw">SpatialPointsDataFrame</span>(<span class="dt">coords=</span>p12[,<span class="kw">c</span>(<span class="st">"DECLONG"</span>,<span class="st">"DECLAT"</span>)],</a>
<a class="sourceLine" id="cb3-21" title="21"> <span class="dt">data=</span>p12[,vars],<span class="dt">proj4string =</span>wgs84)</a>
<a class="sourceLine" id="cb3-22" title="22"><span class="co"># transform to UTM (something I like to do...but not necessary)</span></a>
<a class="sourceLine" id="cb3-23" title="23">p12.shp<-<span class="kw">spTransform</span>(p12.shp,utm17)</a>
<a class="sourceLine" id="cb3-24" title="24"></a>
<a class="sourceLine" id="cb3-25" title="25"><span class="co"># Subset the data for wet season data only</span></a>
<a class="sourceLine" id="cb3-26" title="26">p12.shp.wca2<-<span class="kw">subset</span>(p12.shp,CYCLE<span class="op">%in%</span><span class="kw">c</span>(<span class="dv">0</span>,<span class="dv">2</span>))</a>
<a class="sourceLine" id="cb3-27" title="27">p12.shp.wca2<-p12.shp.wca2[<span class="kw">subset</span>(wcas,Name<span class="op">==</span><span class="st">"WCA 2A"</span>),]</a></code></pre></div>
<p>Here is a quick map of the data.</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb4-1" title="1"><span class="kw">par</span>(<span class="dt">mar=</span><span class="kw">c</span>(<span class="fl">0.1</span>,<span class="fl">0.1</span>,<span class="fl">0.1</span>,<span class="fl">0.1</span>),<span class="dt">oma=</span><span class="kw">c</span>(<span class="dv">0</span>,<span class="dv">0</span>,<span class="dv">0</span>,<span class="dv">0</span>))</a>
<a class="sourceLine" id="cb4-2" title="2"><span class="kw">plot</span>(p12.shp,<span class="dt">pch=</span><span class="dv">21</span>,<span class="dt">bg=</span><span class="st">"grey"</span>,<span class="dt">cex=</span><span class="fl">0.5</span>)</a>
<a class="sourceLine" id="cb4-3" title="3"><span class="kw">plot</span>(wcas,<span class="dt">add=</span>T)</a></code></pre></div>
<p><img src="https://swampthingecology.org/blog/knitr_files/2020-09-17-HotSpot_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /></p>
<p>Much like other spatial statistics (e.g. Moran’s <em>I</em>), the <em>G</em> statistics rely on spatially weighting the data. In my last <a href="https://swampthingpaul.github.io/blog/geospatial-data-analysis-in-rstats.-part-2/" target="_blank">post</a> we discussed average nearest neighbor (ANN). Average nearest neighbor analysis measures the average distance from each point in the area of interest to its nearest point. As a quick reminder, here is how ANN changes with the degree of clustering.</p>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2020-09-17-HotSpot_files/figure-html/f11-diff-patterns-1.png" alt="Three different point patterns: a single cluster (top left), a dual cluster (top center) and a randomly scattered pattern (top right). Three different ANN vs. neighbor order plots. The black ANN line is for the first point pattern (single cluster); the blue line is for the second point pattern (double cluster) and the red line is for the third point pattern." />
<p class="caption">
Three different point patterns: a single cluster (top left), a dual cluster (top center) and a randomly scattered pattern (top right). Three different ANN vs. neighbor order plots. The black ANN line is for the first point pattern (single cluster); the blue line is for the second point pattern (double cluster) and the red line is for the third point pattern.
</p>
</div>
<p>For demonstration purposes we are going to look at a subset of the entire dataset: the data within Water Conservation Area 2A.</p>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2020-09-17-HotSpot_files/figure-html/unnamed-chunk-5-1.png" alt="Soil total phosphorus concentration within Water Conservation Area 2A during Phase I sampling." />
<p class="caption">
Soil total phosphorus concentration within Water Conservation Area 2A during Phase I sampling.
</p>
</div>
<p>Most examples of Getis-Ord analysis across the internet look at polygon-type data (e.g. city blocks, counties, watersheds, etc.). For this example, we are evaluating point data.</p>
<p>Let’s determine the spatial weights (nearest neighbor distances). Since we are looking at point data, we need to do something slightly different than what was done with Moran’s <em>I</em> in the prior post. The <code>dnearneigh()</code> function uses a matrix of point coordinates combined with distance thresholds. To work with the function, coordinates need to be extracted from the data using <code>coordinates()</code>. To find the distance range in the data we can use the <code>pointDistance()</code> function. We don’t want to include all possible connections, so we set the upper distance bound in <code>dnearneigh()</code> to the mean distance across the site.</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb5-1" title="1"><span class="co"># Find distance range</span></a>
<a class="sourceLine" id="cb5-2" title="2">ptdist=<span class="kw">pointDistance</span>(p12.shp.wca2)</a>
<a class="sourceLine" id="cb5-3" title="3"></a>
<a class="sourceLine" id="cb5-4" title="4">min.dist<-<span class="kw">min</span>(ptdist); <span class="co"># Minimum</span></a>
<a class="sourceLine" id="cb5-5" title="5">mean.dist<-<span class="kw">mean</span>(ptdist); <span class="co"># Mean</span></a>
<a class="sourceLine" id="cb5-6" title="6"></a>
<a class="sourceLine" id="cb5-7" title="7">nb<-<span class="kw">dnearneigh</span>(<span class="kw">coordinates</span>(p12.shp.wca2),min.dist,mean.dist)</a></code></pre></div>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2020-09-17-HotSpot_files/figure-html/unnamed-chunk-7-1.png" alt="Neighborhood network for WCA-2A sites" />
<p class="caption">
Neighborhood network for WCA-2A sites
</p>
</div>
<p>Another spatial weighting approach is to apply k-nearest neighbor distances, which can also be passed to <code>nb2listw()</code>. In general there are minor differences in how these spatial weights are calculated, and the choice can be data specific. For the purposes of our example we will use the distance-based (Euclidean) neighbors above, but for completeness below is the k-nearest neighbor approach.</p>
<div class="sourceCode" id="cb6"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb6-1" title="1">k1<-<span class="kw">knn2nb</span>(<span class="kw">knearneigh</span>(p12.shp.wca2))</a></code></pre></div>
<div id="global-g" class="section level4">
<h4>Global <em>G</em></h4>
<p>Now that we have the nearest neighbor values we need to convert the data into a list for both the global and local <em>G</em> statistics. For the global <em>G</em> (<code>globalG.test(...)</code>), it is recommended that the spatial weights be binary; therefore, in the <code>nb2listw()</code> function we need to use <code>style="B"</code>.</p>
<div class="sourceCode" id="cb7"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb7-1" title="1">nb_lw<-<span class="kw">nb2listw</span>(nb,<span class="dt">style=</span><span class="st">"B"</span>)</a></code></pre></div>
<p>Now to evaluate the dataset from the Global <em>G</em> statistic.</p>
<div class="sourceCode" id="cb8"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb8-1" title="1"><span class="kw">globalG.test</span>(p12.shp.wca2<span class="op">$</span>TPSDF,nb_lw,<span class="dt">alternative=</span><span class="st">"two.sided"</span>)</a></code></pre></div>
<pre><code>##
## Getis-Ord global G statistic
##
## data: p12.shp.wca2$TPSDF
## weights: nb_lw
##
## standard deviate = 0.092001, p-value = 0.9267
## alternative hypothesis: two.sided
## sample estimates:
## Global G statistic Expectation Variance
## 0.48775147 0.48333333 0.00230619</code></pre>
<p>In the output, <code>standard deviate</code> is the <span class="math inline">\(z_{G}\)</span>-score of the Global <em>G</em> statistic (the observed statistic relative to its expectation, in standard deviation units), reported together with the <em>p</em>-value of the test. Other information in the output includes the observed statistic, its expectation and variance.</p>
<p>The Global <em>G</em> results suggest that there is no significant high/low clustering globally across the dataset.</p>
<p>If you want more information on the Global test, <a href="https://www.esri.com/en-us/home" target="_blank">ESRI</a> provides a <a href="https://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/h-how-high-low-clustering-getis-ord-general-g-spat.htm" target="_blank">robust review</a> including all the additional <a href="https://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/h-general-g-additional-math.htm" target="_blank">maths</a> that is behind the statistic.</p>
</div>
<div id="local-g" class="section level4">
<h4>Local <em>G</em></h4>
<p>Similar to the Global test, the local <em>G</em> test uses nearest neighbors. Unlike the Global test, the spatial weights can be row standardized (the default setting in <code>nb2listw()</code>).</p>
<div class="sourceCode" id="cb10"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb10-1" title="1">nb_lw<-<span class="kw">nb2listw</span>(nb)</a>
<a class="sourceLine" id="cb10-2" title="2"></a>
<a class="sourceLine" id="cb10-3" title="3"><span class="co"># local G</span></a>
<a class="sourceLine" id="cb10-4" title="4">local_g<-<span class="kw">localG</span>(p12.shp.wca2<span class="op">$</span>TPSDF,nb_lw)</a></code></pre></div>
<p>The output of the function is a list of <span class="math inline">\(z_{G_{i}^{*}}\)</span>-scores, one for each site. A little extra coding is needed to determine <em>p</em>-values and identify hot/cold spots. Essentially, the values need to be extracted from the <code>local_g</code> object and a <em>p</em>-value calculated from each z-score.</p>
<div class="sourceCode" id="cb11"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb11-1" title="1"><span class="co"># convert to matrix</span></a>
<a class="sourceLine" id="cb11-2" title="2">local_g.ma=<span class="kw">as.matrix</span>(local_g)</a>
<a class="sourceLine" id="cb11-3" title="3"></a>
<a class="sourceLine" id="cb11-4" title="4"><span class="co"># column-bind the local_g data</span></a>
<a class="sourceLine" id="cb11-5" title="5">p12.shp.wca2<-<span class="kw">cbind</span>(p12.shp.wca2,local_g.ma)</a>
<a class="sourceLine" id="cb11-6" title="6"></a>
<a class="sourceLine" id="cb11-7" title="7"><span class="co"># change the names of the new column</span></a>
<a class="sourceLine" id="cb11-8" title="8"><span class="kw">names</span>(p12.shp.wca2)[<span class="kw">ncol</span>(p12.shp.wca2)]=<span class="st">"localg"</span></a></code></pre></div>
<p>Let’s determine the <code>two.sided</code> <em>p</em>-value.</p>
<div class="sourceCode" id="cb12"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb12-1" title="1">p12.shp.wca2<span class="op">$</span>pval<-<span class="st"> </span><span class="dv">2</span><span class="op">*</span><span class="kw">pnorm</span>(<span class="op">-</span><span class="kw">abs</span>(p12.shp.wca2<span class="op">$</span>localg))</a></code></pre></div>
<p>Based on the <span class="math inline">\(z_{G_{i}^{*}}\)</span>-scores and <span class="math inline">\(\rho\)</span>-value we operationally define a hot-spot as <span class="math inline">\(z_{G_{i}^{*}}\)</span>-scores > 0 and <span class="math inline">\(\rho\)</span>-value < <span class="math inline">\(\alpha\)</span> (usually 0.05). Let see if we have hot-spots.</p>
<div class="sourceCode" id="cb13"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb13-1" title="1"><span class="kw">subset</span>(p12.shp.wca2<span class="op">@</span>data,localg<span class="op">></span><span class="dv">0</span><span class="op">&</span>pval<span class="op"><</span><span class="fl">0.05</span>)<span class="op">$</span>STA_ID</a></code></pre></div>
<pre><code>## [1] "M258"</code></pre>
<p>We have one site identified as a hot-spot. Let’s map it out too.</p>
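<p>A minimal sketch of how such a map could be drawn, assuming the objects created above (the published figure may differ):</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Color sites by hot-spot status (z-score > 0 and p-value < 0.05)
hot<-with(p12.shp.wca2@data,localg>0&pval<0.05)

par(mar=c(0.1,0.1,0.1,0.1))
plot(subset(wcas,Name=="WCA 2A"))
plot(p12.shp.wca2,pch=21,bg=ifelse(hot,"red","grey"),add=TRUE)</code></pre></div>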
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2020-09-17-HotSpot_files/figure-html/unnamed-chunk-15-1.png" alt="Soil total phosphorus hot-spots identified using the Getis-Ord $G_{i}^{*}$ spatial statistic." />
<p class="caption">
Soil total phosphorus hot-spots identified using the Getis-Ord <span class="math inline">\(G_{i}^{*}\)</span> spatial statistic.
</p>
</div>
<p>For context, this soil TP hot-spot occurs near discharge locations into Water Conservation Area 2A. Historically, run-off from the upstream agricultural area was diverted to this area to protect both the agricultural area and the downstream urban areas. Restoration activities have since eliminated these direct discharges and water quality has improved. However, we still see the legacy effect of past water management. If you’re interested in how the system is responding, check out the <a href="https://www.sfwmd.gov/science-data/scientific-publications-sfer" target="_blank">South Florida Environmental Report</a>; here is last year’s <a href="https://apps.sfwmd.gov/sfwmd/SFER/2020_sfer_final/v1/chapters/v1_ch3a.pdf" target="_blank">Everglades Water Quality</a> chapter.</p>
<p>If you would like more background on hot-spot analysis, ESRI produces a pretty good resource on <a href="https://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/h-how-hot-spot-analysis-getis-ord-gi-spatial-stati.htm" target="_blank">Getis-Ord <span class="math inline">\(G_{i}^{*}\)</span></a>.</p>
<p>This analysis can also be spatially aggregated (illustrated by the ESRI graphic below) in R by creating a grid, aggregating the data, estimating the nearest neighbors and evaluating at a local or global scale (maybe we will get to that another time).</p>
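<p>As a rough sketch of what that workflow could look like in <code>R</code> (an illustration only, not the ESRI implementation; the cell size, the name of the attribute column and the handling of cells with no neighbors would all need thought for real data):</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># 1. Create a grid over the study area and aggregate points to cells
grd<-raster(extent(p12.shp.wca2))
res(grd)<-1000                                   # 1-km cells (arbitrary choice)
projection(grd)<-projection(p12.shp.wca2)
grd.tp<-rasterize(p12.shp.wca2,grd,field="TPSDF",fun=mean)

# 2. Convert occupied cells to polygons and build contiguity-based neighbors
grd.poly<-rasterToPolygons(grd.tp)               # empty (NA) cells are dropped
nb.grd<-poly2nb(grd.poly)

# 3. Evaluate the local G statistic on the gridded values
grd.lw<-nb2listw(nb.grd,style="B",zero.policy=TRUE)
grd.poly$localg<-as.vector(localG(grd.poly@data[,1],grd.lw,zero.policy=TRUE))</code></pre></div>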
<p><img src="https://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/GUID-D66FFAA9-4DA8-4883-960F-A807F32CF89D-web.png" width="75%" style="display: block; margin: auto;" /></p>
<hr />
<div id="refs" class="references">
<div id="ref-getis_analysis_2010">
<p>Getis, Arthur, and J. K. Ord. 2010. “The Analysis of Spatial Association by Use of Distance Statistics.” <em>Geographical Analysis</em> 24 (3): 189–206. <a href="https://doi.org/10.1111/j.1538-4632.1992.tb00261.x">https://doi.org/10.1111/j.1538-4632.1992.tb00261.x</a>.</p>
</div>
</div>
</div>
</div>
</div>
</section>Too much outside the box - Outliers and Boxplots2020-01-24T00:00:00+00:002020-01-24T00:00:00+00:00https://swampthingecology.org/blog/too-much-outside-the-box---outliers-and-boxplots<section class="main-content">
<p><strong>Keywords:</strong> boxplots, outlier, data analysis</p>
<hr />
<p>In a recent commentary due out in <a href="https://www.springer.com/journal/227" target="_blank">Marine Biology</a> soon (hopefully) I argue against the use of boxplots as a method of outlier detection. It also seems that boxplots are very popular with people holding strong opinions …</p>
<p><img src="https://swampthingecology.org/blog\images\20200124_Boxplot\tweet.png" width="50%" style="display: block; margin: auto;" /></p>
<p>Before we get too far into the weeds, let’s present the classical definition of what an outlier is. Here I use <span class="citation">Gotelli and Ellison (<a href="#ref-gotelli_primer_2013" role="doc-biblioref">2013</a>)</span>, but across the statistical literature outliers are generally defined/described similarly.</p>
<blockquote>
<p>“…extreme data points that are not characteristic of the distribution they were sampled…” <span class="citation">(Gotelli and Ellison <a href="#ref-gotelli_primer_2013" role="doc-biblioref">2013</a>)</span>.</p>
</blockquote>
<p>What would a classic example of this definition look like in “real data” (below is generated data…technically not real data)?</p>
<p>Here is how the data was generated for demonstration purposes</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb1-1" title="1"><span class="kw">set.seed</span>(<span class="dv">123</span>)</a>
<a class="sourceLine" id="cb1-2" title="2"><span class="co"># "Data</span></a>
<a class="sourceLine" id="cb1-3" title="3">N.val<-<span class="dv">100</span></a>
<a class="sourceLine" id="cb1-4" title="4">x.val<-<span class="kw">seq</span>(<span class="dv">0</span>,<span class="dv">1</span>,<span class="dt">length.out=</span>N.val)</a>
<a class="sourceLine" id="cb1-5" title="5">m<-<span class="dv">5</span></a>
<a class="sourceLine" id="cb1-6" title="6">b<-<span class="dv">1</span></a>
<a class="sourceLine" id="cb1-7" title="7">error.val<-<span class="dv">1</span></a>
<a class="sourceLine" id="cb1-8" title="8">y.val<-((m<span class="op">*</span>x.val)<span class="op">+</span>b)<span class="op">+</span><span class="kw">rnorm</span>(N.val,<span class="dv">0</span>,error.val)</a>
<a class="sourceLine" id="cb1-9" title="9"></a>
<a class="sourceLine" id="cb1-10" title="10"><span class="co"># Outlier</span></a>
<a class="sourceLine" id="cb1-11" title="11">y.val.out<-y.val[<span class="dv">95</span>]<span class="op">+</span><span class="fl">2.5</span></a></code></pre></div>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2020-01-24-Boxplot_files/figure-html/unnamed-chunk-4-1.png" alt="Visual example of an outlier based on the definition above." />
<p class="caption">
Visual example of an outlier based on the definition above.
</p>
</div>
<p>Clearly, based on the example above, the <span style="color:red">red</span> point in the plot to the left doesn’t really look like it belongs. A quick density plot of the data with and without the point (use <code>plot(density(...))</code>) gives you a sense of whether the extreme data point is outside of the data distribution. The plot to the right shows the data distribution and mean (dashed) without the extreme value, relative to the extreme value (<span style="color:red">red</span> line).</p>
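<p>Using the simulated data from the code above, a quick sketch of that density comparison might look something like this:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Density of the data without the extreme value, with the extreme value marked
plot(density(y.val),main="",xlab="y")
abline(v=mean(y.val),lty=2)      # mean of the data without the extreme value
abline(v=y.val.out,col="red")    # the extreme value</code></pre></div>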
<p>The next step to really determine if it’s an outlier would be to conduct an outlier test on your data. Outliers can distort the data distribution, affect predictions (if used in a model) and affect the overall accuracy of estimates if they are not detected and handled, especially in bi-variate analyses (such as linear modeling). Most of the information you will see on the internet and in some textbooks says that boxplots are a good way to identify outliers. I fully endorse using boxplots as a first look at the data, just to get a sense of things, as they were intended by <span class="citation">Tukey (<a href="#ref-tukey_exploratory_1977" role="doc-biblioref">1977</a>)</span>. That’s right, <a href="https://en.wikipedia.org/wiki/John_Tukey" target="_blank">Dr. John W Tukey</a> was the mastermind behind the boxplot…you may remember him from such statistical analyses as <a href="https://en.wikipedia.org/wiki/Tukey%27s_range_test" target="_blank">Tukey’s range test/HSD</a> or the <a href="https://en.wikipedia.org/wiki/Tukey_lambda_distribution" target="_blank">Tukey lambda distribution</a>.</p>
<p>Overall, boxplots are extremely helpful for quickly visualizing the central tendency and spread of the data. Don’t confuse the central tendency and spread with the mean and standard deviation, as these values are not usually displayed in boxplots.</p>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2020-01-24-Boxplot_files/figure-html/unnamed-chunk-5-1.png" alt="Components of a classic Tukey boxplot." />
<p class="caption">
Components of a classic Tukey boxplot.
</p>
</div>
<p>At their root, boxplots provide no information on the underlying data distribution and offer a somewhat arbitrary detection of extreme values, especially for non-normal data distributions <span class="citation">(Kampstra <a href="#ref-kampstra_beanplot:_2008" role="doc-biblioref">2008</a>; Krzywinski and Altman <a href="#ref-krzywinski_visualizing_2014" role="doc-biblioref">2014</a>)</span>. A univariate boxplot simply flags as extreme any values that fall more than 1.5 times the inter-quartile range (IQR) below the first or above the third quartile <span class="citation">(Tukey <a href="#ref-tukey_exploratory_1977" role="doc-biblioref">1977</a>)</span>. As discussed above, outliers are extreme values outside the distribution of the data. Since the IQR statistics (i.e. median, 25th quantile, 75th quantile, etc.) are distribution-free calculations, values falling outside the IQR fences are not defined with reference to any distribution. Below are four examples of data pulled from different distributions with a mean of zero (<span class="math inline">\(\mu = 0\)</span>) and standard deviation of one (<span class="math inline">\(\sigma = 1\)</span>). In these cases, especially for the normal and skewed normal distributions, the median, 25<sup>th</sup> quantile and 75<sup>th</sup> quantile values do not differ greatly, but the number of flagged “outliers” does differ.</p>
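<p>To make the 1.5 times IQR rule concrete, here is a small illustration on simulated normal data (not the data behind the figure below):</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># How many values does the 1.5*IQR rule flag in a normal sample?
set.seed(123)
x<-rnorm(10000)

q<-quantile(x,c(0.25,0.75))
fences<-c(q[1]-1.5*IQR(x),q[2]+1.5*IQR(x))
sum(x<fences[1]|x>fences[2])    # values outside the fences

length(boxplot.stats(x)$out)    # similar count using boxplot.stats()</code></pre></div>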
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2020-01-24-Boxplot_files/figure-html/unnamed-chunk-7-1.png" alt="Boxplot and distribution plots of uniform, normal and skewed normal distributions with μ = 0 and σ = 1 (mean and standard deviation) and an N = 10,000." />
<p class="caption">
Boxplot and distribution plots of uniform, normal and skewed normal distributions with μ = 0 and σ = 1 (mean and standard deviation) and an N = 10,000.
</p>
</div>
<p>The boxplot examples above show the span of over 10,000 values pulled from uniform, normal and skewed normal distributions. An immediately obvious observation is that the uniform distribution does not generate any extreme values, while the others generate some depending on the skewness of the distribution. <span class="citation">Kampstra (<a href="#ref-kampstra_beanplot:_2008" role="doc-biblioref">2008</a>)</span> suggests that even for normal distributions the number of extreme values identified will increase with sample size. This is demonstrated below where, as sample size increases, the number of extreme values identified also increases. Furthermore, as sample size increases the IQR estimate narrows, which you would expect given the central limit theorem. This sample-size dependence ultimately makes individual “outlier” detection problematic.</p>
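<p>A quick way to see this sample-size dependence yourself (simulated data, not the exact code behind the figure below):</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Number of boxplot-flagged values as sample size increases (normal data)
set.seed(123)
n.vals<-c(50,100,500,1000,5000,10000)
n.out<-sapply(n.vals,function(n) length(boxplot.stats(rnorm(n))$out))
data.frame(n=n.vals,flagged=n.out)</code></pre></div>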
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2020-01-24-Boxplot_files/figure-html/unnamed-chunk-9-1.png" alt="Number of potential outliers detected using a univariate boxplot (top) and inter-quartile range as a function of sample size (bottom) from a normally distributed simulated dataset with a mean of zero and a standard deviation of one (μ = 0; σ = 1)." />
<p class="caption">
Number of potential outliers detected using a univariate boxplot (top) and inter-quartile range as a function of sample size (bottom) from a normally distributed simulated dataset with a mean of zero and a standard deviation of one (μ = 0; σ = 1).
</p>
</div>
<p>Bottom line, a boxplot is not a suitable outlier detection test but rather an exploratory data analysis tool to understand the data. While boxplots do identify extreme values, these extreme values are not truly outliers; they are just values that fall outside a <em>distribution-less</em> metric near the extremes of the IQR. Outlier tests such as the Grubbs test, Cochran test or even the Dixon test can all be used to identify outliers. These tests and more can be found in the <code>outliers</code> <code>R</code> package. Outlier identification and culling is a tricky situation and requires a strong and rigorous justification, and validation that a data point identified as an outlier is truly an outlier; otherwise you can run afoul of type I and/or type II errors.</p>
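<p>For example, a Grubbs test could be run with the <code>outliers</code> package; a minimal sketch on a simple simulated sample (check the test assumptions, such as approximate normality, before relying on it for real data):</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Grubbs test for a single outlier using the 'outliers' package
library(outliers)

set.seed(123)
z<-c(rnorm(50),4.5)    # a normal sample with one injected extreme value
grubbs.test(z)         # tests whether the most extreme value is an outlier
# dixon.test() and cochran.test() are also available; note the Dixon test
# is intended for small samples (roughly 3 to 30 observations)</code></pre></div>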
<div id="references" class="section level2 unnumbered">
<h2>References</h2>
<div id="refs" class="references">
<div id="ref-gotelli_primer_2013">
<p>Gotelli, Nicholas J., and Aaron M. Ellison. 2013. <em>A Primer of Ecological Statistics</em>. Sunderland, MA: Sinauer Associates, Inc.</p>
</div>
<div id="ref-kampstra_beanplot:_2008">
<p>Kampstra, Peter. 2008. “Beanplot: A Boxplot Alternative for Visual Comparison of Distributions.” <em>Journal of Statistical Software</em> 28 (Code Snippet 1). <a href="https://doi.org/10.18637/jss.v028.c01">https://doi.org/10.18637/jss.v028.c01</a>.</p>
</div>
<div id="ref-krzywinski_visualizing_2014">
<p>Krzywinski, Martin, and Naomi Altman. 2014. “Visualizing Samples with Box Plots.” <em>Nature Methods</em> 11 (2): 119–20. <a href="https://doi.org/10.1038/nmeth.2813">https://doi.org/10.1038/nmeth.2813</a>.</p>
</div>
<div id="ref-tukey_exploratory_1977">
<p>Tukey, John Wilder. 1977. “Exploratory Data Analysis.” In <em>Statistics and Public Policy</em>, edited by Frederick Mosteller, 1st ed. Addison-Wesley Series in Behavioral Science. Quantitative Methods. Addison-Wesley.</p>
</div>
</div>
</div>
</section>PCA basics in #Rstats2019-12-10T00:00:00+00:002019-12-10T00:00:00+00:00https://swampthingecology.org/blog/pca-basics-in-#rstats<script src="https://swampthingecology.org/blog/knitr_files/2019-12-10-PCA_files/htmlwidgets-1.3/htmlwidgets.js"></script>
<script src="https://swampthingecology.org/blog/knitr_files/2019-12-10-PCA_files/jquery-1.12.4/jquery.min.js"></script>
<script src="https://swampthingecology.org/blog/knitr_files/2019-12-10-PCA_files/leaflet-1.3.1/leaflet.js"></script>
<script src="https://swampthingecology.org/blog/knitr_files/2019-12-10-PCA_files/Proj4Leaflet-1.0.1/proj4-compressed.js"></script>
<script src="https://swampthingecology.org/blog/knitr_files/2019-12-10-PCA_files/Proj4Leaflet-1.0.1/proj4leaflet.js"></script>
<script src="https://swampthingecology.org/blog/knitr_files/2019-12-10-PCA_files/leaflet-binding-2.0.2/leaflet.js"></script>
<script src="https://swampthingecology.org/blog/knitr_files/2019-12-10-PCA_files/leaflet-providers-1.1.17/leaflet-providers.js"></script>
<script src="https://swampthingecology.org/blog/knitr_files/2019-12-10-PCA_files/leaflet-providers-plugin-2.0.2/leaflet-providers-plugin.js"></script>
<section class="main-content">
<p><strong>Keywords:</strong> ordination, R, PCA</p>
<hr />
<p>The masses have spoken!!</p>
<p><img src="https://swampthingecology.org/blog\images\20191210_PCA\twitterpoll.png" width="50%" style="display: block; margin: auto;" /></p>
<p>Also I got a wise piece of advice from <a href="https://twitter.com/coolbutuseless" target="_blank">mikefc</a> regarding <code>R</code> blog posts.</p>
<p><img src="https://swampthingecology.org/blog\images\20191210_PCA\GhostBuster_meme.png" width="75%" style="display: block; margin: auto;" /></p>
<hr />
<p>This post was partly motivated by an article by the <a href="https://medium.com/@bioturing" target="_blank">BioTuring Team</a> regarding <a href="https://medium.com/@bioturing/how-to-read-pca-biplots-and-scree-plots-186246aae063?" target="_blank">PCA</a>. In their article the authors provide the basic concepts behind interpreting a Principal Component Analysis (PCA) plot. Before rehashing PCA plots in <code>R</code> I would like to cover some basics.</p>
<p>Ordination analysis, which PCA is part of, is used to order (or ordinate…hence the name) multivariate data. Ultimately ordination makes new variables called principal axes along which samples are scored and/or ordered <span class="citation">(Gotelli and Ellison <a href="#ref-gotelli_primer_2004" role="doc-biblioref">2004</a>)</span>. There are at least five routinely used ordination analyses; here I intend to cover just PCA. Maybe in the future I will cover the other four as they relate to ecological data analysis.</p>
<div id="principal-component-analysis" class="section level2">
<h2>Principal Component Analysis</h2>
<p>I have heard PCA called lots of things in my day, including but not limited to magic, statistical hand waving, mass plotting, statistical guesstimate, etc. When you have a multivariate dataset (data with more than one variable) it can be tough to figure out what matters. Think water quality data with a whole suite of nutrients, or a fish study with biological, habitat and water chemistry data for several sites along a stream/river. PCA is the best way to reduce the dimensionality of multivariate data to determine what <em>statistically</em> and practically matters. But it is more than a data winnowing technique; it can also be used to demonstrate similarity (or difference) between groups and relationships between variables. A major disadvantage of PCA is that it is a data hungry analysis (see assumptions below).</p>
<div id="assumptions-of-pca" class="section level3">
<h3>Assumptions of PCA</h3>
<p>Finding a single source related to the assumptions of PCA is rare. Below is a combination of several sources including seminars, webpages, course notes, etc. Therefore this is not an exhaustive list of all assumptions and I could have missed some. I put this together for my benefit as well as yours. Proceed with caution!!</p>
<ul>
<li><p><strong>Multiple Variables:</strong> This one is obvious. Ideally, given the nature of the analysis, multiple variables are required to perform the analysis. Moreover, variables should be measured at the continuous level, although ordinal variables are frequently used.</p></li>
<li><p><strong>Sample adequacy:</strong> As with most (if not all) statistical analyses, large enough sample sizes are required to produce a reliable result. Generally a minimum of 150 cases (i.e. rows), or 5 to 10 cases per variable, is recommended for a PCA. Some have suggested performing a sampling adequacy analysis such as the Kaiser-Meyer-Olkin (KMO) Measure of Sampling Adequacy. However, KMO is less a measure of sample size adequacy than a measure of the suitability of the data for factor analysis, which leads to the next point.</p></li>
<li><p><strong>Linear relationships:</strong> It is assumed that the relationships between variables are linear. The basis of this assumption is rooted in the fact that PCA is based on Pearson correlation coefficients and therefore the assumptions of Pearson’s correlation also hold true. Generally, this assumption is somewhat relaxed…even though it shouldn’t be…with the use of ordinal data for variables.</p></li>
</ul>
<p>The <code>KMOS</code> and <code>bart_spher</code> functions in the <code>REdaS</code> <code>R</code> library can be used to check the measure of sampling adequacy and whether the data differ from an identity matrix; below is a quick example.</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb1-1" title="1"><span class="kw">library</span>(REdaS)</a>
<a class="sourceLine" id="cb1-2" title="2"><span class="kw">library</span>(vegan)</a>
<a class="sourceLine" id="cb1-3" title="3"><span class="kw">library</span>(reshape)</a>
<a class="sourceLine" id="cb1-4" title="4"></a>
<a class="sourceLine" id="cb1-5" title="5"><span class="kw">data</span>(varechem);<span class="co">#from vegan package</span></a>
<a class="sourceLine" id="cb1-6" title="6"></a>
<a class="sourceLine" id="cb1-7" title="7"><span class="co"># KMO</span></a>
<a class="sourceLine" id="cb1-8" title="8"><span class="kw">KMOS</span>(varechem)</a></code></pre></div>
<pre><code>##
## Kaiser-Meyer-Olkin Statistics
##
## Call: KMOS(x = varechem)
##
## Measures of Sampling Adequacy (MSA):
## N P K Ca Mg S Al
## 0.2770880 0.7943090 0.6772451 0.7344827 0.6002924 0.7193302 0.4727618
## Fe Mn Zn Mo Baresoil Humdepth pH
## 0.5066961 0.6029551 0.6554475 0.4362350 0.7007942 0.5760349 0.4855293
##
## KMO-Criterion: 0.6119355</code></pre>
<div class="sourceCode" id="cb3"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb3-1" title="1"><span class="co"># Bartlett's Test Of Sphericity</span></a>
<a class="sourceLine" id="cb3-2" title="2"><span class="kw">bart_spher</span>(varechem)</a></code></pre></div>
<pre><code>## Bartlett's Test of Sphericity
##
## Call: bart_spher(x = varechem)
##
## X2 = 260.217
## df = 91
## p-value < 2.22e-16</code></pre>
<p>The <code>varechem</code> dataset appears to be suitable for factor analysis. The KMO value for the entire dataset is 0.61, above the suggested 0.5 threshold. Furthermore, the data is significantly different from an identity matrix (<em>H<sub>0</sub> :</em> all off-diagonal correlations are zero). <!--http://minato.sip21c.org/swtips/factor-in-R.pdf--></p>
<ul>
<li><strong>No significant outliers:</strong> As in most statistical analyses, outliers can skew the analysis. In PCA, outliers can have a disproportionate influence on the resulting component computation. Since principal components are estimated by essentially re-scaling the data while retaining the variance, outliers could skew the estimate of each component within a PCA. Another way to visualize how PCA is performed is that it uses rotation of the original axes to derive new axes, which maximize the variance in the data set. In 2D this looks like this (a rough code sketch follows the figure below):</li>
</ul>
<p><img src="https://swampthingecology.org/blog/knitr_files/2019-12-10-PCA_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /></p>
<p>You would expect that if true outliers are present, the newly derived axes will be skewed. Outlier analysis and the issues associated with identifying outliers are a whole other ball game that I will not cover here, other than saying box-plots are not a suitable outlier identification analysis; see <span class="citation">Tukey (<a href="#ref-mosteller_exploratory_1977" role="doc-biblioref">1977</a>)</span> for more detail on boxplots (I have a manuscript <em>In Prep</em> focusing on this exact issue).</p>
</div>
</div>
<div id="terminology" class="section level2">
<h2>Terminology</h2>
<p>Before moving forward I wanted to dedicate some additional time to some terms specific to component analysis. By now we know the general gist of PCA … in case you weren’t paying attention, PCA is essentially a dimensionality reduction or data compression method to understand how multiple variables correlate in a given dataset. Typically when people discuss PCA they also use the terms loadings, eigenvectors and eigenvalues (a short code sketch relating these terms follows the list below).</p>
<ul>
<li><p><strong>Eigenvectors</strong> are unit-scaled loadings; they define the direction (the rotation) of each principal axis. Scaling an eigenvector by the square root of its eigenvalue gives the factor loadings for that component.</p></li>
<li><p><strong>Eigenvalues</strong> also called characteristic roots is the measure of variation in the total sample accounted for by each factor. Computationally, a factor’s eigenvalues are determined as the sum of its squared factor loadings for all the variables. The ratio of eigenvalues is the ratio of explanatory importance of the factors with respect to the variables (remember this for later).</p></li>
<li><p><strong>Factor Loadings</strong> are the correlations between the original variables and the factors. Analogous to Pearson’s r, the squared factor loading is the percent of variance in that variable explained by the factor (…again remember this for later).</p></li>
</ul>
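<p>These three terms can be tied together with a little base <code>R</code>. Below is a minimal sketch (using the <code>varechem</code> dataset from the example above) of how eigenvalues, eigenvectors and loadings fall out of the eigen-decomposition of the correlation matrix.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(vegan)
data(varechem)

# PCA on scaled data works from the correlation matrix
eig.decomp=eigen(cor(varechem))

eig.decomp$values        # eigenvalues (variance accounted for by each component)
eig.decomp$vectors[,1]   # eigenvector of the first component (unit-scaled loadings)

# factor loadings: eigenvector re-scaled by the square root of its eigenvalue
load.PC1=eig.decomp$vectors[,1]*sqrt(eig.decomp$values[1])
load.PC1^2               # squared loadings ~ proportion of each variable explained by PC1
</code></pre></div>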
</div>
<div id="analysis" class="section level2">
<h2>Analysis</h2>
<p>Now that we have the basic terminology laid out and we know the general assumptions, let's do an example analysis. Since I am an aquatic biogeochemist I am going to use some limnological data. Here we have a subset of long-term monitoring locations from six lakes within south Florida monitored by the <a href="https://www.sfwmd.gov/" target="_blank">South Florida Water Management District</a> (SFWMD). To retrieve the data we will use the <code>AnalystHelper</code> package (<a href="https://github.com/SwampThingPaul/AnalystHelper" target="_blank">link</a>), which has a function to retrieve data from the SFWMD online environmental database <a href="https://my.sfwmd.gov/dbhydroplsql/show_dbkey_info.main_menu" target="_blank">DBHYDRO</a>.</p>
<p>Let's retrieve and format the data for the PCA.</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb5-1" title="1"><span class="co">#Libraries/packages needed</span></a>
<a class="sourceLine" id="cb5-2" title="2"><span class="kw">library</span>(AnalystHelper)</a>
<a class="sourceLine" id="cb5-3" title="3"><span class="kw">library</span>(reshape)</a>
<a class="sourceLine" id="cb5-4" title="4"></a>
<a class="sourceLine" id="cb5-5" title="5"><span class="co">#Date Range of data</span></a>
<a class="sourceLine" id="cb5-6" title="6">sdate=<span class="kw">as.Date</span>(<span class="st">"2005-05-01"</span>)</a>
<a class="sourceLine" id="cb5-7" title="7">edate=<span class="kw">as.Date</span>(<span class="st">"2019-05-01"</span>)</a>
<a class="sourceLine" id="cb5-8" title="8"></a>
<a class="sourceLine" id="cb5-9" title="9"><span class="co">#Site list with lake name (meta-data)</span></a>
<a class="sourceLine" id="cb5-10" title="10">sites=<span class="kw">data.frame</span>(<span class="dt">Station.ID=</span><span class="kw">c</span>(<span class="st">"LZ40"</span>,<span class="st">"ISTK2S"</span>,<span class="st">"E04"</span>,<span class="st">"D02"</span>,<span class="st">"B06"</span>,<span class="st">"A03"</span>),</a>
<a class="sourceLine" id="cb5-11" title="11"> <span class="dt">LAKE=</span><span class="kw">c</span>(<span class="st">"Okeechobee"</span>,<span class="st">"Istokpoga"</span>,<span class="st">"Kissimmee"</span>,<span class="st">"Hatchineha"</span>,</a>
<a class="sourceLine" id="cb5-12" title="12"> <span class="st">"Tohopekaliga"</span>,<span class="st">"East Tohopekaliga"</span>))</a>
<a class="sourceLine" id="cb5-13" title="13"></a>
<a class="sourceLine" id="cb5-14" title="14"><span class="co">#Water Quality parameters (meta-data)</span></a>
<a class="sourceLine" id="cb5-15" title="15">parameters=<span class="kw">data.frame</span>(<span class="dt">Test.Number=</span><span class="kw">c</span>(<span class="dv">67</span>,<span class="dv">20</span>,<span class="dv">32</span>,<span class="dv">179</span>,<span class="dv">112</span>,<span class="dv">8</span>,<span class="dv">10</span>,<span class="dv">23</span>,<span class="dv">25</span>,<span class="dv">80</span>,<span class="dv">18</span>,<span class="dv">21</span>),</a>
<a class="sourceLine" id="cb5-16" title="16"> <span class="dt">param=</span><span class="kw">c</span>(<span class="st">"Alk"</span>,<span class="st">"NH4"</span>,<span class="st">"Cl"</span>,<span class="st">"Chla"</span>,<span class="st">"Chla"</span>,<span class="st">"DO"</span>,<span class="st">"pH"</span>,</a>
<a class="sourceLine" id="cb5-17" title="17"> <span class="st">"SRP"</span>,<span class="st">"TP"</span>,<span class="st">"TN"</span>,<span class="st">"NOx"</span>,<span class="st">"TKN"</span>))</a>
<a class="sourceLine" id="cb5-18" title="18"></a>
<a class="sourceLine" id="cb5-19" title="19"><span class="co"># Retrieve the data</span></a>
<a class="sourceLine" id="cb5-20" title="20">dat=<span class="kw">DBHYDRO_WQ</span>(sdate,edate,sites<span class="op">$</span>Station.ID,parameters<span class="op">$</span>Test.Number)</a>
<a class="sourceLine" id="cb5-21" title="21"></a>
<a class="sourceLine" id="cb5-22" title="22"><span class="co"># Merge metadata with dataset</span></a>
<a class="sourceLine" id="cb5-23" title="23">dat=<span class="kw">merge</span>(dat,sites,<span class="st">"Station.ID"</span>)</a>
<a class="sourceLine" id="cb5-24" title="24">dat=<span class="kw">merge</span>(dat,parameters,<span class="st">"Test.Number"</span>)</a>
<a class="sourceLine" id="cb5-25" title="25"></a>
<a class="sourceLine" id="cb5-26" title="26"><span class="co"># Cross tabulate the data based on parameter name</span></a>
<a class="sourceLine" id="cb5-27" title="27">dat.xtab=<span class="kw">cast</span>(dat,Station.ID<span class="op">+</span>LAKE<span class="op">+</span>Date.EST<span class="op">~</span>param,<span class="dt">value=</span><span class="st">"HalfMDL"</span>,mean)</a>
<a class="sourceLine" id="cb5-28" title="28"></a>
<a class="sourceLine" id="cb5-29" title="29"><span class="co"># Cleaning up/calculating parameters</span></a>
<a class="sourceLine" id="cb5-30" title="30">dat.xtab<span class="op">$</span>TN=<span class="kw">with</span>(dat.xtab,<span class="kw">TN_Combine</span>(NOx,TKN,TN))</a>
<a class="sourceLine" id="cb5-31" title="31">dat.xtab<span class="op">$</span>DIN=<span class="kw">with</span>(dat.xtab, NOx<span class="op">+</span>NH4)</a>
<a class="sourceLine" id="cb5-32" title="32"></a>
<a class="sourceLine" id="cb5-33" title="33"><span class="co"># More cleaning of the dataset </span></a>
<a class="sourceLine" id="cb5-34" title="34">vars=<span class="kw">c</span>(<span class="st">"Alk"</span>,<span class="st">"Cl"</span>,<span class="st">"Chla"</span>,<span class="st">"DO"</span>,<span class="st">"pH"</span>,<span class="st">"SRP"</span>,<span class="st">"TP"</span>,<span class="st">"TN"</span>,<span class="st">"DIN"</span>)</a>
<a class="sourceLine" id="cb5-35" title="35">dat.xtab=dat.xtab[,<span class="kw">c</span>(<span class="st">"Station.ID"</span>,<span class="st">"LAKE"</span>,<span class="st">"Date.EST"</span>,vars)]</a>
<a class="sourceLine" id="cb5-36" title="36"></a>
<a class="sourceLine" id="cb5-37" title="37"><span class="kw">head</span>(dat.xtab)</a></code></pre></div>
<pre><code>## Station.ID LAKE Date.EST Alk Cl Chla DO pH SRP
## 1 A03 East Tohopekaliga 2005-05-17 17 19.7 4.00 7.90 6.10 0.0015
## 2 A03 East Tohopekaliga 2005-06-21 22 15.4 4.70 6.90 6.40 0.0015
## 3 A03 East Tohopekaliga 2005-07-19 16 15.1 5.10 7.10 NaN 0.0015
## 4 A03 East Tohopekaliga 2005-08-16 17 14.0 3.00 6.90 6.30 0.0015
## 5 A03 East Tohopekaliga 2005-08-30 NaN NaN 6.00 7.07 7.44 NaN
## 6 A03 East Tohopekaliga 2005-09-20 17 16.3 0.65 7.30 6.70 0.0010
## TP TN DIN
## 1 0.024 0.710 0.040
## 2 0.024 0.680 0.030
## 3 0.020 0.630 0.020
## 4 0.021 0.550 0.030
## 5 NaN NA NaN
## 6 0.018 0.537 0.017</code></pre>
<p>If you are playing the home game with this dataset you’ll notice some <code>NA</code> values; this is because the data were either not collected or were removed due to fatal laboratory or field QA/QC issues. PCA doesn’t work with <code>NA</code> values; unfortunately, this means that the whole row needs to be excluded from the analysis.</p>
<p>Let's actually get down to doing a PCA. First off, you have several different flavors (functions) of PCA to choose from. Each has its own nuances and they come from different packages.</p>
<ul>
<li><p><code>prcomp()</code> and <code>princomp()</code> are from the base <code>stats</code> package. The quickest, easiest and most stable versions since they're in base.</p></li>
<li><p><code>PCA()</code> in the <code>FactoMineR</code> package.</p></li>
<li><p><code>dudi.pca()</code> in the <code>ade4</code> package.</p></li>
<li><p><code>acp()</code> in the <code>amap</code> package.</p></li>
<li><p><code>rda()</code> in the <code>vegan</code> package. More on this later.</p></li>
</ul>
<p>Personally, I only have experience working with the <code>prcomp</code>, <code>princomp</code> and <code>rda</code> functions for PCA. The information shown in this post can be extracted or calculated from any of these functions. Some are straightforward, others are more sinuous. Above I mentioned using the <code>rda</code> function for PCA. <code>rda()</code> is a function in the <code>vegan</code> <code>R</code> package for redundancy analysis (RDA) and the function I am most familiar with for performing a PCA. Redundancy analysis is a technique used to explain a dataset Y using a dataset X. Normally RDA is used for “constrained ordination” (ordination with covariates or predictors). Without predictors, RDA is the same as PCA.</p>
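<p>As a quick sanity check of that last point, here is a minimal sketch (again using the <code>varechem</code> data from above) showing that an unconstrained <code>rda()</code> and <code>prcomp()</code> on scaled data explain the variance the same way:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Unconstrained RDA (i.e. PCA) versus prcomp() on the same scaled data
pca.rda=rda(varechem,scale=TRUE)
pca.prcomp=prcomp(varechem,scale.=TRUE)

# proportion of variance explained by each component should agree
pca.rda$CA$eig/sum(pca.rda$CA$eig)
pca.prcomp$sdev^2/sum(pca.prcomp$sdev^2)
</code></pre></div>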
<p>As I mentioned above, <code>NA</code>s are a no-go in a PCA, so let's format/clean the data and see how much the data is reduced by the <code>na.omit</code> action.</p>
<div class="sourceCode" id="cb7"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb7-1" title="1">dat.xtab2=<span class="kw">na.omit</span>(dat.xtab)</a>
<a class="sourceLine" id="cb7-2" title="2"></a>
<a class="sourceLine" id="cb7-3" title="3"><span class="kw">nrow</span>(dat.xtab)</a></code></pre></div>
<pre><code>## [1] 725</code></pre>
<div class="sourceCode" id="cb9"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb9-1" title="1"><span class="kw">nrow</span>(dat.xtab2)</a></code></pre></div>
<pre><code>## [1] 515</code></pre>
<p>Also, as with most data, it's a good idea to look at your data. Granted, this gets tough when the number of variables gets really big…imagine trying to look at a combination of more than eight or nine parameters. Here we have a scatterplot of the water quality data within our six lakes. The parameters in this analysis are Alkalinity (ALK), Chloride (Cl), chlorophyll-<em>a</em> (Chl-a), dissolved oxygen (DO), pH, soluble reactive phosphorus (SRP), total phosphorus (TP), total nitrogen (TN) and dissolved inorganic nitrogen (DIN).</p>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2019-12-10-PCA_files/figure-html/unnamed-chunk-9-1.png" alt="Scatterplot of all data for the example `dat.xtab2` dataset." />
<p class="caption">
Scatterplot of all data for the example <code>dat.xtab2</code> dataset.
</p>
</div>
<p>Alright, now the data is formatted and we have done some general data exploration. Let's check the adequacy of the data for component analysis…remember the KMO analysis?</p>
<div class="sourceCode" id="cb11"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb11-1" title="1"><span class="kw">KMOS</span>(dat.xtab2[,vars])</a></code></pre></div>
<pre><code>##
## Kaiser-Meyer-Olkin Statistics
##
## Call: KMOS(x = dat.xtab2[, vars])
##
## Measures of Sampling Adequacy (MSA):
## Alk Cl Chla DO pH SRP TP
## 0.7274872 0.7238120 0.5096832 0.3118529 0.6392602 0.7777460 0.7524428
## TN DIN
## 0.6106997 0.7459682
##
## KMO-Criterion: 0.6972786</code></pre>
<p>Based on the KMO analysis, the KMO-Criterion of the dataset is 0.7, well above the suggested 0.5 threshold.</p>
<p>Let's also check whether the data is significantly different from an identity matrix.</p>
<div class="sourceCode" id="cb13"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb13-1" title="1"><span class="kw">bart_spher</span>(dat.xtab2[,vars])</a></code></pre></div>
<pre><code>## Bartlett's Test of Sphericity
##
## Call: bart_spher(x = dat.xtab2[, vars])
##
## X2 = 4616.865
## df = 36
## p-value < 2.22e-16</code></pre>
<p>Based on the sphericity test (<code>bart_spher()</code>) the results look good to move forward with a PCA. The actual PCA is pretty straightforward after the data is formatted and <em>“cleaned”</em>.</p>
<div class="sourceCode" id="cb15"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb15-1" title="1"><span class="kw">library</span>(vegan)</a>
<a class="sourceLine" id="cb15-2" title="2"></a>
<a class="sourceLine" id="cb15-3" title="3">dat.xtab2.pca=<span class="kw">rda</span>(dat.xtab2[,vars],<span class="dt">scale=</span>T)</a></code></pre></div>
<p>Before we even begin to plot the typical PCA plot…try <code>biplot()</code> if you're interested. Let's first look at the importance of each component and the variance explained by each.</p>
<div class="sourceCode" id="cb16"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb16-1" title="1"><span class="co">#Extract eigenvalues (see definition above)</span></a>
<a class="sourceLine" id="cb16-2" title="2">eig <-<span class="st"> </span>dat.xtab2.pca<span class="op">$</span>CA<span class="op">$</span>eig</a>
<a class="sourceLine" id="cb16-3" title="3"></a>
<a class="sourceLine" id="cb16-4" title="4"><span class="co"># Percent of variance explained by each compoinent</span></a>
<a class="sourceLine" id="cb16-5" title="5">variance <-<span class="st"> </span>eig<span class="op">*</span><span class="dv">100</span><span class="op">/</span><span class="kw">sum</span>(eig)</a>
<a class="sourceLine" id="cb16-6" title="6"></a>
<a class="sourceLine" id="cb16-7" title="7"><span class="co"># The cumulative variance of each component (should sum to 1)</span></a>
<a class="sourceLine" id="cb16-8" title="8">cumvar <-<span class="st"> </span><span class="kw">cumsum</span>(variance)</a>
<a class="sourceLine" id="cb16-9" title="9"></a>
<a class="sourceLine" id="cb16-10" title="10"><span class="co"># Combine all the data into one data.frame</span></a>
<a class="sourceLine" id="cb16-11" title="11">eig.pca <-<span class="st"> </span><span class="kw">data.frame</span>(<span class="dt">eig =</span> eig, <span class="dt">variance =</span> variance,<span class="dt">cumvariance =</span> cumvar)</a></code></pre></div>
<p>As with most things in <code>R</code>, there is always more than one way to do things. This same information can be extracted using <code>summary(dat.xtab2.pca)$cont</code>.</p>
<p>What do the component eigenvalues and percent variance mean…and what do they tell us? This information tells us how much variance is explained by each component. It also helps identify which components should be used moving forward.</p>
<p>Generally there are two rules:</p>
<ol style="list-style-type: decimal">
<li>Pick components with eigenvalues of at least 1.
<ul>
<li>This is called the Kaiser rule. A variation of this method has been created where the confidence interval of each eigenvalue is calculated and only factors whose entire confidence interval is greater than 1.0 are retained <span class="citation">(Beran and Srivastava <a href="#ref-beran_bootstrap_1985" role="doc-biblioref">1985</a>, <a href="#ref-beran_correction:_1987" role="doc-biblioref">1987</a>; Larsen and Warne <a href="#ref-larsen_estimating_2010" role="doc-biblioref">2010</a>)</span>. There is an <code>R</code> package that can calculate eigenvalue confidence intervals through bootstrapping; I’m not going to cover this here but below is an example if you want to explore it for yourself.</li>
</ul></li>
</ol>
<div class="sourceCode" id="cb17"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb17-1" title="1"><span class="kw">library</span>(eigenprcomp)</a>
<a class="sourceLine" id="cb17-2" title="2"></a>
<a class="sourceLine" id="cb17-3" title="3"><span class="kw">boot_pr_comp</span>(<span class="kw">as.matrix</span>(dat.xtab2[,vars]))</a></code></pre></div>
<ol start="2" style="list-style-type: decimal">
<li>The selected components should be able to describe at least 80% of the variance.</li>
</ol>
<p>If you look at <code>eig.pca</code> you’ll see that based on these criteria components 1, 2 and 3 are the ones to focus on, as they are enough to describe the data. While looking at the raw numbers is good, nice visualizations are a bonus. A scree plot displays these data and shows how much variation each component captures from the data.</p>
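<p>If you want to put a quick scree plot together yourself, a bare-bones base <code>R</code> sketch from the <code>eig.pca</code> data frame built above (the figures below add more polish) could be:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Quick-and-dirty scree plot of the eigenvalues with the Kaiser threshold
barplot(eig.pca$eig,names.arg=paste0("PC",1:nrow(eig.pca)),
        ylab="Eigenvalue",xlab="Principal Component")
abline(h=1,lty=2,col="red")   # Kaiser rule: retain components with eigenvalue of at least 1
</code></pre></div>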
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2019-12-10-PCA_files/figure-html/unnamed-chunk-15-1.png" alt="Scree plot of eigenvalues for each prinicipal component of `dat.xtab2.pca` with the Kaiser threshold identified." />
<p class="caption">
Scree plot of eigenvalues for each prinicipal component of <code>dat.xtab2.pca</code> with the Kaiser threshold identified.
</p>
</div>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2019-12-10-PCA_files/figure-html/unnamed-chunk-16-1.png" alt="Scree plot of the variance and cumulative variance for each priniciple component from `dat.xtab2.pca`." />
<p class="caption">
Scree plot of the variance and cumulative variance for each priniciple component from <code>dat.xtab2.pca</code>.
</p>
</div>
<p>Now that we know which components are important, let's put together our biplot and extract components (if needed). To extract components and specific loadings we can use the <code>scores()</code> function in the <code>vegan</code> package. It is a generic function to extract scores from <code>vegan</code> ordination objects such as RDA, CCA, etc. This function also seems to work with the <code>prcomp</code> and <code>princomp</code> PCA functions in the <code>stats</code> package.</p>
<div class="sourceCode" id="cb18"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb18-1" title="1">scrs=<span class="kw">scores</span>(dat.xtab2.pca,<span class="dt">display=</span><span class="kw">c</span>(<span class="st">"sites"</span>,<span class="st">"species"</span>),<span class="dt">choices=</span><span class="kw">c</span>(<span class="dv">1</span>,<span class="dv">2</span>,<span class="dv">3</span>));</a></code></pre></div>
<p><code>scrs</code> is a list of two items, species and sites. Species corresponds to the columns of the data and sites corresponds to the rows. Use <code>choices</code> to extract the components you want; in this case we want the first three. Now we can plot the scores.</p>
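<p>A bare-bones version of the score plot (the figures below add grouping, colors and labels) might look something like this:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Minimal sketch: plot the site (row) scores for the first two components
plot(scrs$sites[,1],scrs$sites[,2],pch=19,col=adjustcolor("grey",0.5),
     xlab="PCA 1",ylab="PCA 2")
abline(h=0,v=0,lty=3)
</code></pre></div>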
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2019-12-10-PCA_files/figure-html/unnamed-chunk-18-1.png" alt="PCA biplot of two component comparisons from the `data.xtab2.pca` analysis." />
<p class="caption">
PCA biplot of two component comparisons from the <code>data.xtab2.pca</code> analysis.
</p>
</div>
<p>Typically when you see a PCA biplot, you also see arrows for each variable. These are commonly called loadings and can be interpreted as follows (a short code sketch of adding these arrows follows the list):</p>
<ul>
<li><p>When two vectors are close, forming a small angle, the variables are typically positively correlated.</p></li>
<li><p>If two vectors are at an angle of 90<span class="math inline">\(^\circ\)</span> they are typically not correlated.</p></li>
<li><p>If two vectors are at a large angle, say in the vicinity of 180<span class="math inline">\(^\circ\)</span>, they are typically negatively correlated.</p></li>
</ul>
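<p>Continuing the minimal score plot sketch from above, the loading arrows can be overlaid like this (note that the figures below rescale the loadings for readability):</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Overlay the variable (species) scores as loading vectors on the open score plot
arrows(0,0,scrs$species[,1],scrs$species[,2],length=0.05,col="indianred1")
text(scrs$species[,1],scrs$species[,2],labels=rownames(scrs$species),
     col="indianred1",cex=0.75,pos=3)
</code></pre></div>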
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2019-12-10-PCA_files/figure-html/unnamed-chunk-19-1.png" alt="PCA biplot of two component comparisons from the `data.xtab2.pca` analysis with rescaled loadings." />
<p class="caption">
PCA biplot of two component comparisons from the <code>data.xtab2.pca</code> analysis with rescaled loadings.
</p>
</div>
<p>You can take this even further by showing how each lake falls in the ordination space by joining the <code>sites</code> scores to the original data frame. This is also how you use the derived components for further analysis.</p>
<div class="sourceCode" id="cb19"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb19-1" title="1">dat.xtab2=<span class="kw">cbind</span>(dat.xtab2,scrs<span class="op">$</span>sites)</a>
<a class="sourceLine" id="cb19-2" title="2"></a>
<a class="sourceLine" id="cb19-3" title="3"><span class="kw">head</span>(dat.xtab2)</a></code></pre></div>
<pre><code>## Station.ID LAKE Date.EST Alk Cl Chla DO pH SRP
## 1 A03 East Tohopekaliga 2005-05-17 17 19.7 4.00 7.9 6.1 0.0015
## 2 A03 East Tohopekaliga 2005-06-21 22 15.4 4.70 6.9 6.4 0.0015
## 4 A03 East Tohopekaliga 2005-08-16 17 14.0 3.00 6.9 6.3 0.0015
## 6 A03 East Tohopekaliga 2005-09-20 17 16.3 0.65 7.3 6.7 0.0010
## 8 A03 East Tohopekaliga 2005-10-19 15 14.3 2.60 7.8 6.8 0.0010
## 9 A03 East Tohopekaliga 2005-11-15 13 15.8 3.70 8.6 6.7 0.0020
## TP TN DIN PC1 PC2 PC3
## 1 0.024 0.710 0.040 -0.3901117 -0.2240239 -0.5666993
## 2 0.024 0.680 0.030 -0.3912797 -0.2083258 -0.6284024
## 4 0.021 0.550 0.030 -0.4290627 -0.2486860 -0.6599207
## 6 0.018 0.537 0.017 -0.4045084 -0.2775129 -0.4566961
## 8 0.017 0.454 0.014 -0.4194518 -0.2718903 -0.3418373
## 9 0.010 0.437 0.017 -0.4232014 -0.2807803 -0.2434219</code></pre>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2019-12-10-PCA_files/figure-html/unnamed-chunk-21-1.png" alt="PCA biplot of two component comparisons from the `data.xtab2.pca` analysis with rescaled loadings and Lakes identified." />
<p class="caption">
PCA biplot of two component comparisons from the <code>data.xtab2.pca</code> analysis with rescaled loadings and Lakes identified.
</p>
</div>
<p>You can extract a lot of great information from these plots and the underlying component data, but immediately we see how the different lakes group (i.e. Lake Okeechobee is obviously different from the other lakes) and how differently the lakes are loaded with respect to the different variables. Generally this grouping makes sense, especially for the lakes to the left of the plot (i.e. East Tohopekaliga, Tohopekaliga, Hatchineha and Kissimmee); these lakes are connected, have similar geomorphology, are managed in a similar fashion and generally have similar upstream characteristics with shared watersheds.</p>
<p>I hope this blog post has provided a better appreciation of component analysis in <code>R</code>. This is by no means a comprehensive workflow; lots of factors need to be considered during this type of analysis and this post only scratches the surface.</p>
<!--
some of the different background that motivated this post.
https://www.statisticssolutions.com/principal-component-analysis-pca/
https://statistics.laerd.com/spss-tutorials/principal-components-analysis-pca-using-spss-statistics.php
https://rpubs.com/jaelison/135029
https://medium.com/@bioturing/how-to-read-pca-biplots-and-scree-plots-186246aae063?
https://medium.com/@bioturing/principal-component-analysis-explained-simply-894e8f6f4bfb
https://ourcodingclub.github.io/2018/05/04/ordination.html
https://www.xlstat.com/en/solutions/features/redundancy-analysis-rda interesting explaination of RDA
-->
</div>
<div id="references" class="section level2 unnumbered">
<h2>References</h2>
<div id="refs" class="references">
<div id="ref-beran_bootstrap_1985">
<p>Beran, Rudolf, and Muni S. Srivastava. 1985. “Bootstrap Tests and Confidence Regions for Functions of a Covariance Matrix.” <em>The Annals of Statistics</em> 13 (1): 95–115. <a href="https://doi.org/10.1214/aos/1176346579">https://doi.org/10.1214/aos/1176346579</a>.</p>
</div>
<div id="ref-beran_correction:_1987">
<p>———. 1987. “Correction: Bootstrap Tests and Confidence Regions for Functions of a Covariance Matrix.” <em>The Annals of Statistics</em> 15 (1): 470–71. <a href="https://doi.org/10.1214/aos/1176350284">https://doi.org/10.1214/aos/1176350284</a>.</p>
</div>
<div id="ref-gotelli_primer_2004">
<p>Gotelli, Nicholas J., and Aaron M. Ellison. 2004. <em>A Primer of Ecological Statistics</em>. Sinauer Associates Publishers.</p>
</div>
<div id="ref-larsen_estimating_2010">
<p>Larsen, Ross, and Russell T. Warne. 2010. “Estimating Confidence Intervals for Eigenvalues in Exploratory Factor Analysis.” <em>Behavior Research Methods</em> 42 (3): 871–76. <a href="https://doi.org/10.3758/BRM.42.3.871">https://doi.org/10.3758/BRM.42.3.871</a>.</p>
</div>
<div id="ref-mosteller_exploratory_1977">
<p>Tukey, John Wilder. 1977. “Exploratory Data Analysis.” In <em>Statistics and Public Policy</em>, edited by Frederick Mosteller, 1st ed. Addison-Wesley Series in Behavioral Science. Quantitative Methods. Addison-Wesley.</p>
</div>
</div>
</div>
</section>July 9, 2019 Eco DataViz2019-07-09T00:00:00+00:002019-07-09T00:00:00+00:00https://swampthingecology.org/blog/july-9-2019-eco-dataviz<section class="main-content">
<p><strong>Keywords:</strong> dataviz, R, Sea Ice</p>
<p>Following the progression of my data viz journey, I decided to tackle some Arctic sea-ice data after checking out <a href="https://twitter.com/ZLabe" target="_blank">Zack Labe’s</a> Arctic Ice <a href="https://sites.uci.edu/zlabe/arctic-sea-ice-figures/" target="_blank">figures</a>. The data this week is modeled sea-ice volume and thickness from the <a href="http://psc.apl.uw.edu" target="_blank">Polar Science Center</a> Pan-Arctic Ice Ocean Modeling and Assimilation System (<a href="http://psc.apl.uw.edu/research/projects/arctic-sea-ice-volume-anomaly/" target="_blank">PIOMAS</a>). Sea ice volume is an important climate indicator. It depends on both ice thickness and extent and is therefore more directly tied to climate forcing than extent alone.</p>
<p>With each data viz endeavor I try to learn something new or explore an existing technique. Dates in <code>R</code> can be stressful to say the least; anyone who has worked with time-series data would agree. Dates can be formatted as a date using <code>as.Date()</code>, <code>format()</code>, <code>as.POSIXct()</code> or <code>as.POSIXlt()</code>…most of my time in <code>R</code> is spent formatting dates. Here is a useful <a href="https://www.stat.berkeley.edu/~s133/dates.html" target="_blank">page</a> on working with dates in <code>R</code>. The PIOMAS data has three variables…Year, Day of Year (1 - 365) and Thickness (or Volume). I downloaded the data from the <a href="http://psc.apl.uw.edu/research/projects/arctic-sea-ice-volume-anomaly/data/" target="_blank">webpage</a> and unzipped the gzipped tar file using a third-party tool to extract the data, but this can also be done in <code>R</code>. The two data sets, volume and thickness, are ASCII files.</p>
<p>Lets load our libraries/packages.</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb1-1" title="1"><span class="co">#Libraries</span></a>
<a class="sourceLine" id="cb1-2" title="2"><span class="co">#devtools::install_github("SwampThingPaul/AnalystHelper")</span></a>
<a class="sourceLine" id="cb1-3" title="3"><span class="kw">library</span>(AnalystHelper);</a>
<a class="sourceLine" id="cb1-4" title="4"><span class="kw">library</span>(plyr)</a>
<a class="sourceLine" id="cb1-5" title="5"><span class="kw">library</span>(reshape)</a></code></pre></div>
<div class="sourceCode" id="cb2"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb2-1" title="1">thick.dat=<span class="kw">read.table</span>(<span class="st">"PIOMAS.thick.daily.1979.2019.Current.v2.1.dat"</span>,</a>
<a class="sourceLine" id="cb2-2" title="2"><span class="dt">header=</span>F,<span class="dt">skip=</span><span class="dv">1</span>,<span class="dt">col.names=</span><span class="kw">c</span>(<span class="st">"Year"</span>,<span class="st">"Day"</span>,<span class="st">"Thickness_m"</span>))</a></code></pre></div>
<div class="sourceCode" id="cb3"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb3-1" title="1"><span class="kw">head</span>(thick.dat,5L)</a></code></pre></div>
<pre><code>## Year Day Thickness_m
## 1 1979 1 1.951
## 2 1979 2 1.955
## 3 1979 3 1.962
## 4 1979 4 1.965
## 5 1979 5 1.973</code></pre>
<p>The sea-ice volume data is in the same format.</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb5-1" title="1">vol.dat=<span class="kw">read.table</span>(<span class="st">"PIOMAS.vol.daily.1979.2019.Current.v2.1.dat"</span>,</a>
<a class="sourceLine" id="cb5-2" title="2"> <span class="dt">header=</span>F,<span class="dt">skip=</span><span class="dv">1</span>,<span class="dt">col.names=</span><span class="kw">c</span>(<span class="st">"Year"</span>,<span class="st">"Day"</span>,<span class="st">"Vol_km3"</span>))</a>
<a class="sourceLine" id="cb5-3" title="3">vol.dat<span class="op">$</span>Vol_km3=vol.dat<span class="op">$</span>Vol_km3<span class="op">*</span><span class="fl">1E+3</span>;<span class="co">#To convert data </span></a></code></pre></div>
<div class="sourceCode" id="cb6"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb6-1" title="1"><span class="kw">head</span>(vol.dat,5L)</a></code></pre></div>
<pre><code>## Year Day Vol_km3
## 1 1979 1 26405
## 2 1979 2 26496
## 3 1979 3 26582
## 4 1979 4 26672
## 5 1979 5 26770</code></pre>
<p>The sea-ice thickness data are expressed in meters, and the volume data in 10<sup>3</sup> km<sup>3</sup> (converted to km<sup>3</sup> above). Understanding what the data represent and how they are derived is most of the job of a scientist, especially in data visualization. Inherently all data have their limits.</p>
<p>Currently we have two different data files, <code>vol.dat</code> and <code>thick.dat</code>; let's get them into a single <code>data.frame</code> and sort the data accordingly (just in case).</p>
<div class="sourceCode" id="cb8"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb8-1" title="1">dat=<span class="kw">merge</span>(thick.dat,vol.dat,<span class="kw">c</span>(<span class="st">"Year"</span>,<span class="st">"Day"</span>))</a>
<a class="sourceLine" id="cb8-2" title="2">dat=dat[<span class="kw">order</span>(dat<span class="op">$</span>Year,dat<span class="op">$</span>Day),]</a></code></pre></div>
<p>Alright, here comes the fun part…dates in <code>R</code>. Remember the data has Year and Day of Year, which means no month or day (i.e. no Date). Essentially you have to back-calculate the day of the year to an actual date. Thankfully this is pretty easy. Check out <code>?strptime</code> and <code>?format</code>!!</p>
<div class="sourceCode" id="cb9"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb9-1" title="1">dat<span class="op">$</span>month.day=<span class="kw">format</span>(<span class="kw">strptime</span>(dat<span class="op">$</span>Day,<span class="st">"%j"</span>),<span class="st">"%m-%d"</span>)</a></code></pre></div>
<p>This gets us Month-Day from the day of the year. Now for some trickery. Let's actually make this a date by using paste and leveraging <code>date.fun()</code> from <code>AnalystHelper</code>.</p>
<div class="sourceCode" id="cb10"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb10-1" title="1">dat<span class="op">$</span>Date=<span class="kw">with</span>(dat,<span class="kw">date.fun</span>(<span class="kw">paste</span>(Year,month.day,<span class="dt">sep=</span><span class="st">"-"</span>),<span class="dt">tz=</span><span class="st">"GMT"</span>))</a></code></pre></div>
<p>Voila!! We have a <code>POSIXct</code> formatted field that has Year-Month-Day…in case you wanted to check the sea-ice volume on your birthday, wedding anniversary, etc. …no one? Just me? …OK moving on!!</p>
<p>Some more trickery that comes in handy when aggregating data is determining the month and year (for monthly summary statistics). We can also determine what decade the data is from; it wasn't used in this analysis but it is something interesting I discovered in my data musings.</p>
<div class="sourceCode" id="cb11"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb11-1" title="1">dat<span class="op">$</span>month.yr=<span class="kw">with</span>(dat,<span class="kw">date.fun</span>(<span class="kw">paste</span>(Year,<span class="kw">format</span>(Date,<span class="st">"%m"</span>),<span class="dv">01</span>,<span class="dt">sep=</span><span class="st">"-"</span>),<span class="dt">tz=</span><span class="st">"GMT"</span>))</a>
<a class="sourceLine" id="cb11-2" title="2">dat<span class="op">$</span>decade=((dat<span class="op">$</span>Year)<span class="op">%/%</span><span class="dv">10</span>)<span class="op">*</span><span class="dv">10</span></a></code></pre></div>
<p>Now that we have the data put together, let's start plotting.</p>
<p>Here we have just daily (modeled) sea-ice thickness data from PIOMAS.</p>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2019-07-09-DataViz_files/figure-html/unnamed-chunk-11-1.png" alt="Pan Arctic Sea-Ice thickness from 1979 to present. Data source: Polar Science Center - ([PIOMAS](http://psc.apl.uw.edu/research/projects/arctic-sea-ice-volume-anomaly/))." />
<p class="caption">
Pan Arctic Sea-Ice thickness from 1979 to present. Data source: Polar Science Center - (<a href="http://psc.apl.uw.edu/research/projects/arctic-sea-ice-volume-anomaly/">PIOMAS</a>).
</p>
</div>
<p>Now we can estimate the annual mean and a confidence interval around the mean…let's say 95%.</p>
<div class="sourceCode" id="cb12"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb12-1" title="1"><span class="co">#Calculate annual mean, sd and N. Excluding 2019 (partial year)</span></a>
<a class="sourceLine" id="cb12-2" title="2">period.mean=<span class="kw">ddply</span>(<span class="kw">subset</span>(dat,Year<span class="op">!=</span><span class="dv">2019</span>),<span class="st">"Year"</span>,summarise,</a>
<a class="sourceLine" id="cb12-3" title="3"> <span class="dt">mean.val=</span><span class="kw">mean</span>(Thickness_m,<span class="dt">na.rm=</span>T),</a>
<a class="sourceLine" id="cb12-4" title="4"> <span class="dt">sd.val=</span><span class="kw">sd</span>(Thickness_m,<span class="dt">na.rm=</span>T),</a>
<a class="sourceLine" id="cb12-5" title="5"> <span class="dt">N.val=</span><span class="kw">N</span>(Thickness_m))</a>
<a class="sourceLine" id="cb12-6" title="6"><span class="co">#Degrees of freedom</span></a>
<a class="sourceLine" id="cb12-7" title="7">period.mean<span class="op">$</span>Df=period.mean<span class="op">$</span>N.val<span class="dv">-1</span></a>
<a class="sourceLine" id="cb12-8" title="8"><span class="co">#Student-T statistic</span></a>
<a class="sourceLine" id="cb12-9" title="9">period.mean<span class="op">$</span>Tp=<span class="kw">abs</span>(<span class="kw">qt</span>(<span class="dv">1</span><span class="fl">-0.95</span>,period.mean<span class="op">$</span>Df))</a>
<a class="sourceLine" id="cb12-10" title="10"><span class="co">#Lower and Upper CI calculation</span></a>
<a class="sourceLine" id="cb12-11" title="11">period.mean<span class="op">$</span>LCI=<span class="kw">with</span>(period.mean,mean.val<span class="op">-</span>sd.val<span class="op">*</span>(Tp<span class="op">/</span><span class="kw">sqrt</span>(N.val)))</a>
<a class="sourceLine" id="cb12-12" title="12">period.mean<span class="op">$</span>UCI=<span class="kw">with</span>(period.mean,mean.val<span class="op">+</span>sd.val<span class="op">*</span>(Tp<span class="op">/</span><span class="kw">sqrt</span>(N.val)))</a></code></pre></div>
<p>Now let's add that to the plot with some additional trickery to plot the annual mean <span class="math inline">\(\pm\)</span> 95% CI starting on Jan 1st of every year.</p>
<div class="sourceCode" id="cb13"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb13-1" title="1"><span class="kw">with</span>(period.mean,<span class="kw">lines</span>(<span class="kw">date.fun</span>(<span class="kw">paste</span>(Year,<span class="st">"01-01"</span>,<span class="dt">sep=</span><span class="st">"-"</span>),<span class="dt">tz=</span><span class="st">"GMT"</span>),mean.val,<span class="dt">lty=</span><span class="dv">1</span>,<span class="dt">col=</span><span class="st">"red"</span>))</a>
<a class="sourceLine" id="cb13-2" title="2"><span class="kw">with</span>(period.mean,<span class="kw">lines</span>(<span class="kw">date.fun</span>(<span class="kw">paste</span>(Year,<span class="st">"01-01"</span>,<span class="dt">sep=</span><span class="st">"-"</span>),<span class="dt">tz=</span><span class="st">"GMT"</span>),LCI,<span class="dt">lty=</span><span class="dv">2</span>,<span class="dt">col=</span><span class="st">"red"</span>))</a>
<a class="sourceLine" id="cb13-3" title="3"><span class="kw">with</span>(period.mean,<span class="kw">lines</span>(<span class="kw">date.fun</span>(<span class="kw">paste</span>(Year,<span class="st">"01-01"</span>,<span class="dt">sep=</span><span class="st">"-"</span>),<span class="dt">tz=</span><span class="st">"GMT"</span>),UCI,<span class="dt">lty=</span><span class="dv">2</span>,<span class="dt">col=</span><span class="st">"red"</span>))</a></code></pre></div>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2019-07-09-DataViz_files/figure-html/unnamed-chunk-14-1.png" alt="Pan Arctic Sea-Ice thickness from 1979 to present with annual mean and 95% confidence interval. Data source: Polar Science Center - ([PIOMAS](http://psc.apl.uw.edu/research/projects/arctic-sea-ice-volume-anomaly/))." />
<p class="caption">
Pan Arctic Sea-Ice thickness from 1979 to present with annual mean and 95% confidence interval. Data source: Polar Science Center - (<a href="http://psc.apl.uw.edu/research/projects/arctic-sea-ice-volume-anomaly/">PIOMAS</a>).
</p>
</div>
<p>What does sea-ice volume look like?</p>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2019-07-09-DataViz_files/figure-html/unnamed-chunk-15-1.png" alt="Pan Arctic Sea-Ice volume from 1979 to present with annual mean and 95% confidence interval. Data source: Polar Science Center - ([PIOMAS](http://psc.apl.uw.edu/research/projects/arctic-sea-ice-volume-anomaly/))." />
<p class="caption">
Pan Arctic Sea-Ice volume from 1979 to present with annual mean and 95% confidence interval. Data source: Polar Science Center - (<a href="http://psc.apl.uw.edu/research/projects/arctic-sea-ice-volume-anomaly/">PIOMAS</a>).
</p>
</div>
<p>Some interesting and alarming trends in both thickness and volume for sure! There is an obvious seasonal trend in the data…one way to examine this is to look at the period of record daily change.</p>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2019-07-09-DataViz_files/figure-html/unnamed-chunk-16-1.png" alt="Period of record mean (1979 - 2018) daily mean and 95% confidence interval sea-ice volume and thickness. Data source: Polar Science Center - ([PIOMAS](http://psc.apl.uw.edu/research/projects/arctic-sea-ice-volume-anomaly/))." />
<p class="caption">
Period of record mean (1979 - 2018) daily mean and 95% confidence interval sea-ice volume and thickness. Data source: Polar Science Center - (<a href="http://psc.apl.uw.edu/research/projects/arctic-sea-ice-volume-anomaly/">PIOMAS</a>).
</p>
</div>
<p><br /></p>
<p>Now, how does the thickness versus volume relationship look? Since there is so much data, we can do some interesting color coding by year. Here I use a color ramp, <code>colorRampPalette(c("dodgerblue1","indianred1"))</code>, with each year getting a color along the ramp.</p>
<p>Here is how I set up the color ramp.</p>
<div class="sourceCode" id="cb14"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb14-1" title="1">N.yrs=<span class="kw">length</span>(<span class="kw">unique</span>(dat<span class="op">$</span>Year))</a>
<a class="sourceLine" id="cb14-2" title="2">cols=<span class="kw">colorRampPalette</span>(<span class="kw">c</span>(<span class="st">"dodgerblue1"</span>,<span class="st">"indianred1"</span>))(N.yrs)</a></code></pre></div>
<p>In the plot I use a loop to plot each year with a different color.</p>
<div class="sourceCode" id="cb15"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb15-1" title="1"><span class="kw">plot</span>(...)</a>
<a class="sourceLine" id="cb15-2" title="2"></a>
<a class="sourceLine" id="cb15-3" title="3"><span class="cf">for</span>(i <span class="cf">in</span> <span class="dv">1</span><span class="op">:</span>N.yrs){</a>
<a class="sourceLine" id="cb15-4" title="4"> <span class="kw">with</span>(<span class="kw">subset</span>(dat,Year<span class="op">==</span>yrs.val[i]),</a>
<a class="sourceLine" id="cb15-5" title="5"> <span class="kw">points</span>(Vol_km3,Thickness_m,<span class="dt">pch=</span><span class="dv">21</span>,</a>
<a class="sourceLine" id="cb15-6" title="6"> <span class="dt">bg=</span><span class="kw">adjustcolor</span>(cols[i],<span class="fl">0.2</span>),</a>
<a class="sourceLine" id="cb15-7" title="7"> <span class="dt">col=</span><span class="kw">adjustcolor</span>(cols[i],<span class="fl">0.4</span>),</a>
<a class="sourceLine" id="cb15-8" title="8"> <span class="dt">lwd=</span><span class="fl">0.1</span>,<span class="dt">cex=</span><span class="fl">1.25</span>))</a>
<a class="sourceLine" id="cb15-9" title="9">}</a></code></pre></div>
<p>As with most data viz, especially in base <code>R</code>, there is some degree of trickery and layering. To build the color ramp legend I used the following (adapted from <a href="https://stackoverflow.com/questions/13355176/gradient-legend-in-base/13355440#13355440" target="_blank">this Stack Overflow answer</a>).</p>
<div class="sourceCode" id="cb16"><pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb16-1" title="1"><span class="co"># A raster of the color ramp</span></a>
<a class="sourceLine" id="cb16-2" title="2">legend_image=<span class="kw">as.raster</span>(<span class="kw">matrix</span>(cols,<span class="dt">ncol=</span><span class="dv">1</span>))</a>
<a class="sourceLine" id="cb16-3" title="3"><span class="co"># Empty plot</span></a>
<a class="sourceLine" id="cb16-4" title="4"><span class="kw">plot</span>(<span class="kw">c</span>(<span class="dv">0</span>,<span class="dv">1</span>),<span class="kw">c</span>(<span class="dv">0</span>,<span class="dv">1</span>),<span class="dt">type =</span> <span class="st">'n'</span>, <span class="dt">axes =</span> F,<span class="dt">xlab =</span> <span class="st">''</span>, <span class="dt">ylab =</span> <span class="st">''</span>)</a>
<a class="sourceLine" id="cb16-5" title="5"><span class="co"># Gradient labels</span></a>
<a class="sourceLine" id="cb16-6" title="6"><span class="kw">text</span>(<span class="dt">x=</span><span class="fl">0.6</span>, <span class="dt">y =</span> <span class="kw">c</span>(<span class="fl">0.5</span>,<span class="fl">0.8</span>), <span class="dt">labels =</span> <span class="kw">c</span>(<span class="dv">2019</span>,<span class="dv">1979</span>),<span class="dt">cex=</span><span class="fl">0.8</span>,<span class="dt">xpd=</span><span class="ot">NA</span>,<span class="dt">adj=</span><span class="dv">0</span>)</a>
<a class="sourceLine" id="cb16-7" title="7"><span class="co"># Put the color ramp on the legend</span></a>
<a class="sourceLine" id="cb16-8" title="8"><span class="kw">rasterImage</span>(legend_image, <span class="fl">0.25</span>, <span class="fl">0.5</span>, <span class="fl">0.5</span>,<span class="fl">0.8</span>)</a>
<a class="sourceLine" id="cb16-9" title="9"><span class="co"># Label to legend</span></a>
<a class="sourceLine" id="cb16-10" title="10"><span class="kw">text</span>(<span class="fl">0.25</span><span class="op">+</span>(<span class="fl">0.5-0.25</span>)<span class="op">/</span><span class="dv">2</span>,<span class="fl">0.85</span>,<span class="st">"Year"</span>,<span class="dt">xpd=</span><span class="ot">NA</span>)</a></code></pre></div>
<p><br /></p>
<div class="figure" style="text-align: center">
<img src="https://swampthingecology.org/blog/knitr_files/2019-07-09-DataViz_files/figure-html/unnamed-chunk-20-1.png" alt="Sea-ice thickness versus volume for the 41 year period. Minimum ice thickness and volume identified for 1980, 1990, 2000 and 2010. Data source: Polar Science Center - ([PIOMAS](http://psc.apl.uw.edu/research/projects/arctic-sea-ice-volume-anomaly/))." />
<p class="caption">
Sea-ice thickness versus volume for the 41 year period. Minimum ice thickness and volume identified for 1980, 1990, 2000 and 2010. Data source: Polar Science Center - (<a href="http://psc.apl.uw.edu/research/projects/arctic-sea-ice-volume-anomaly/">PIOMAS</a>).
</p>
</div>
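<p>The minima labeled in the figure could be located with something along these lines (a sketch only; the post's actual annotation code is not shown, and the object names are mine).</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Sketch: pull the row with the minimum thickness for selected years and
# label those points on the thickness vs volume scatter plot
sel.yrs=c(1980,1990,2000,2010)
min.pts=ddply(subset(dat,Year %in% sel.yrs),"Year",
              function(x) x[which.min(x$Thickness_m),])
with(min.pts,points(Vol_km3,Thickness_m,pch=19,cex=1))
with(min.pts,text(Vol_km3,Thickness_m,Year,pos=4,cex=0.75))</code></pre></div>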
<p>Hope you found this data visualization exercise interesting and thought-provoking. Happy data trails!</p>
<hr />
</section>