This vignette showcases the ability to perform principal component analysis (PCA, also known as empirical orthogonal function (EOF) analysis. Data are from the Arctic 2k compilation, which we load here:
FD <- lipdR::readLipd("http://lipdverse.org/geoChronR-examples/arc2k/Arctic2k.zip")
## [1] "Loading 19 datasets from /var/folders/py/lcjn3y352g1106vf1rqk521r0000gn/T//Rtmpg7iwEF/lpdDownload..."
## [1] "reading: Arc-Agassiz.Vinther.2008.lpd"
## [1] "reading: Arc-Austfonna.Isaksson.2005.lpd"
## [1] "reading: Arc-CampCentury.Fisher.1969.lpd"
## [1] "reading: Arc-Crete.Vinther.2010.lpd"
## [1] "reading: Arc-DevonIceCap.Fisher.1983.lpd"
## [1] "reading: Arc-Dye.Vinther.2010.lpd"
## [1] "reading: Arc-GISP2.Grootes.1997.lpd"
## [1] "reading: Arc-GRIP.Vinther.2010.lpd"
## [1] "reading: Arc-Hvtrvatn.Larsen.2011.lpd"
## [1] "reading: Arc-LakeC2.Lamoureux.1996.lpd"
## [1] "reading: Arc-LakeDonardBaffinIsland.Moore.2001.lpd"
## [1] "reading: Arc-LakeLehmilampi.Haltia-Hovi.2007.lpd"
## [1] "reading: Arc-LakeNataujrvi.Ojala.2005.lpd"
## [1] "reading: Arc-LowerMurrayLake.Cook.2008.lpd"
## [1] "reading: Arc-NGRIP1.Vinther.2006.lpd"
## [1] "reading: Arc-NGTB16.Schwager.1998.lpd"
## [1] "reading: Arc-NGTB18.Schwager.1998.lpd"
## [1] "reading: Arc-NGTB21.Schwager.1998.lpd"
## [1] "reading: Arc-Renland.Vinther.2008.lpd"
##
##
## [1] "Successfully loaded 19 datasets, with 0 failure(s)."
First, let’s take a quick look at where these records are located.
geoChronR’s mapLipd
function can create quick maps:
mapLipd(FD,map.type = "line",projection = "stereo",f = 0.1)
More map projections are available too. A list is available here:
?mapproject
Now we need to “map” (I know, a different kind of mapping) the age ensembles to paleo for all of these datasets. We’ll use purrr::map for this, but you could also do it with sapply(). In this case we’re going to specify that all of the age ensembles are named “ageEnsemble”, and that they don’t have a depth variable because they’re layer counted.
FD2 = purrr::map(FD,
mapAgeEnsembleToPaleoData,
strict.search = TRUE,
age.var = "ageEnsemble",
paleo.age.var = "year",
chron.depth.var = NULL,
paleo.depth.var = NULL)
Now extract all the “timeseries” into at “TS object” that will facilitate working with multiple records.
TS <- extractTs(FD2)
and filter the TS object to only include variables that have been
interpreted as temperature. Here we’ll use lipdR::filterTs
to filter the TS object.
TS.filtered <- filterTs(TS,"interpretation1_variable == T")
OK, let’s make a quick plot stack to see what we’re dealing with.
The lipdR::tidyTs
function will convert a TS into a
long, “tidy” data.frame, where each observation has a row in a
data.frame. This can be verbose, but is also useful for data analysis in
the tidyverse framework.
tidyDf <- lipdR::tidyTs(TS.filtered,age.var = "year")
A tidy data.frame is also the input for the
plotTimeseriesStack
function in geoChronR.
plotTimeseriesStack(tidyDf,
color.var = "paleoData_variableName",
color.ramp = c("DarkBlue","Orange","Black","Dark Green"),
line.size = .1,
fill.alpha = .05,
lab.size = 2,
lab.space = 3)
Now bin all the data in the TS from 1400 to 2000, an interval of pretty good data coverage, into 5 year bins.
We’re now ready to calculate the PCA!
Calculate PCA on each ensemble member:
pcout <- pcaEns(binned.TS)
That was easy (because of all the work we did beforehand). But before we look at the results let’s take a look at a scree plot to get a sense of how many significant components we should expect.
plotScreeEns(pcout)
It looks like the first two components, shown in black with gray uncertainty shading, stand out above the null model (in red), but the third and beyond look marginal to insignficant. Let’s focus on the first two components.
Now let’s visualize the results. The plotPcaEns
function
will create multiple diagnostic figures of the results, and stitch them
together.
plotPCA <- plotPcaEns(pcout,TS = TS.filtered,map.type = "line",projection = "stereo",bound.circ = T,restrict.map.range = T,f=.1,legend.position = c(0.5,.6),which.pcs = 1:2,which.leg = 2)
Nice! A summary plot that combines the major features is produced, but all of the components, are included in the “plotPCA” list that was exported.
For comparison with other datasets it can be useful to export
quantile timeseries shown in the figures.
plotTimeseriesEnsRibbons()
can optionally be used to export
the data rather than plotting them. The following will export the PC1
timeseries:
quantileData <- plotTimeseriesEnsRibbons(X = pcout$age,Y = pcout$PCs[,1,],export.quantiles = TRUE)
print(quantileData)
## # A tibble: 200 × 6
## ages `0.025` `0.25` `0.5` `0.75` `0.975`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1408. -0.0517 0.0477 0.103 0.155 0.259
## 2 1410. 0.0241 0.0934 0.132 0.167 0.241
## 3 1413. 0.0493 0.123 0.163 0.207 0.274
## 4 1416. 0.100 0.178 0.208 0.233 0.294
## 5 1419. 0.0828 0.148 0.176 0.201 0.248
## 6 1422. -0.0246 0.0531 0.0946 0.132 0.212
## 7 1425. -0.0394 0.0170 0.0455 0.0729 0.126
## 8 1428. -0.0937 -0.0268 0.00969 0.0442 0.117
## 9 1431. -0.0710 -0.00951 0.0267 0.0610 0.135
## 10 1434. -0.0529 0.00741 0.0427 0.0832 0.159
## # ℹ 190 more rows
Let’s repeat much of this analysis, but this time we’re only going to analyze the data, keep them in their native values, and use a covariance matrix in the PCA analysis.
First, it looks like let’s look at all the names in the TS
var.names <- pullTsVariable(TS, "variableName")
Oops - looks like we didn’t use quite the correct name. Next time use:
var.names <- pullTsVariable(TS, "paleoData_variableName")
and take a look at the unique variableNames in the TS
unique(var.names)
## [1] "d18O" "year"
## [3] "ageEnsemble" "ageMedian"
## [5] "thickness" "X_radiograph_dark_layer"
## [7] "massacum"
OK. Let’s filter the timeseries again, this time pulling all the data.
d18OTS = filterTs(TS,"paleoData_variableName == d18O")
And we’ll tidy up the data for a plotStack
tidyd18O <- tidyTs(d18OTS,age.var = "year")
Before plotting, let’s do some tidyverse gymnastics to reorder the data by the length of the observations
#arrange the tidy dataframe by record length
tidyd18O <- tidyd18O %>% #use the magrittr pipe for clarity
group_by(paleoData_TSid) %>% #group the data by column
mutate(duration = max(year) - min(year)) %>% #create a new column for the duration
arrange(duration) # and arrange the data by duration
Again we’ll use plotTimeseriesStack
to show all the
data, now nicely arranged.
plotTimeseriesStack(tidyd18O,
color.var = "paleoData_variableName", # Color the data by the variable name (all the same in the case)
color.ramp = "DarkBlue", #colors to use
line.size = .1,
fill.alpha = .05,
lab.size = 2,
lab.space = 2,
lab.buff = 0.03)
Well that sure is nice and tidy. Again, it looks like binning from 1400-2000 will give us good data coverage.
And calculate the ensemble PCA, this time using a covariance matrix. By using a covariance, we’ll allow records that have larger variability in to influence the PCA more, and those that have little variability will have little impact. This may or may not be a good idea, but it’s an important option to consider when possible.
pcout2 <- pcaEns(binned.TS2,pca.type = "cov")
Once again, let’s take a look at the scree plot:
plotScreeEns(pcout2)
Once again, the first two components look good, but the third also looks to be likely above the 95% significance level, so let's include the third PC in our plot this time as well.
``` r
plotPCA2 <- plotPcaEns(pcout2,
TS = d18OTS,
which.pcs = 1:3,
map.type = "line",
projection = "stereo",
bound.circ = T,
restrict.map.range = T,
f=.2)
Using only the data and a covariance matrix, we get somewhat different results. The first PC timeseries looks similar to the first one, but the spatial pattern is somewhat difference, with stronger loadings in the northeast. The second PC looks to be a new pattern, although the third resembles PC2 from the first analysis.