LiPD Utilities - R


Welcome to LiPD Utilities in R. This guide provides everything you need to get up and running with the LiPD Utilities in R, and shows you how to use the core functions in the lipdR package.


Requirements

R, the language, is available from CRAN. R Studio is an IDE that works with R and makes your workflow much easier. Software version numbers are listed at the top of this README file. If you do not have any datasets in LiPD format yet, feel free to experiment with the one below.

R

https://cran.r-project.org

R Studio

https://www.rstudio.com

LiPD file

ODP1098B13.lpd


Installation

Create a new project in R Studio and start with a fresh workspace:

[Screenshot: Workspace]

Install remotes in the console window:

install.packages("remotes")

Use remotes to install the LiPD Utilities package from github:

remotes::install_github("nickmckay/LiPD-Utilities", subdir = "R", quiet = TRUE)

Load the lipdR package:

library(lipdR)

And that’s it! You have successfully installed the LiPD utilities and are ready to start working.
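If you want to confirm the installation from the console, a minimal check along these lines may help. It only uses base R and the `utils` package; the message wording is illustrative.

```r
# Check whether lipdR can be loaded, without attaching it to the session.
lipdr_available <- requireNamespace("lipdR", quietly = TRUE)

if (lipdr_available) {
  # Report the installed version when the package is present.
  message("lipdR is installed, version ",
          as.character(utils::packageVersion("lipdR")))
} else {
  message("lipdR was not found; repeat the installation steps above.")
}
```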

Core functions

Notation:

This guide uses notation that may be new to you. In case you are unfamiliar with these terms, the list below provides an explanation for each. Please feel free to name your own variables as you move through the guide.

D

Represents multiple datasets read into a single variable. Each dataset is organized by its dataset name. D[["ODP1098B13"]][["paleoData"]]

L

Represents a single dataset. The dataset does not need to be organized by name. L[["paleoData"]]

ts

A time series. The ts notation is used both in variable names and in time series related functions, like extractTs.
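To make the difference between D and L concrete, the sketch below builds two small mock lists in base R. The second dataset name and all field values are made up for illustration and are not real LiPD data; only the `D[["ODP1098B13"]][["paleoData"]]` and `L[["paleoData"]]` access patterns come from this guide.

```r
# Mock single dataset (L): the data sits at the root of the list.
# Field values are placeholders, not real LiPD content.
L <- list(
  dataSetName = "ODP1098B13",
  paleoData   = list("mock paleo table")
)

# Mock multi-dataset variable (D): one entry per dataset, keyed by name.
D <- list(
  ODP1098B13  = L,
  MockDataset = list(dataSetName = "MockDataset",
                     paleoData   = list("another mock table"))
)

# L is accessed directly; D needs the dataset name first.
paleo_from_L <- L[["paleoData"]]
paleo_from_D <- D[["ODP1098B13"]][["paleoData"]]

print(names(D))
print(identical(paleo_from_L, paleo_from_D))
```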


readLipd(path = "")

Purpose:
Reads LiPD files from a source path into the environment. Read in a single file, or a directory of multiple files.

Parameters:

path (optional)

The path to the locally stored file or directory that you would like to read into the workspace

Example: Provide a path to a file

L = readLipd("/Users/bobsmith/Downloads/filename.lpd")

Example: Provide a path to a directory

D = readLipd("/Users/bobsmith/Desktop")

Example: Browse for file or directory

D = readLipd()

Returns:

D

Multiple datasets D or one dataset L
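Before reading a directory, it can be useful to see which LiPD files it holds. The helper below is a base R sketch and not part of lipdR; `list_lpd_files` is a hypothetical name, and matching only on a lowercase `.lpd` extension in the top level of the directory is an assumption about what a directory read would pick up.

```r
# Return the .lpd files that sit directly inside a directory.
list_lpd_files <- function(dir_path) {
  list.files(dir_path, pattern = "\\.lpd$", full.names = TRUE)
}

# Demonstration with a temporary directory and empty placeholder files.
demo_dir <- file.path(tempdir(), "lipd_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("a.lpd", "b.lpd", "notes.txt"))))

lpd_files <- list_lpd_files(demo_dir)
print(basename(lpd_files))
```

If the list is what you expect, the same directory path can be passed to readLipd.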

Example 1: Browse for file

Call readLipd as shown below:

[Screenshot: readlipd_browse_file_prompt]

Leave the path empty in this example. A prompt will ask you to choose to read a single file or a directory. Choose s and read a single file:

[Screenshot: readlipd_browse_file_dialog]

A browse dialog opens and asks you to choose the LiPD file that you want. Here, I have selected the file and clicked “Open”:

[Screenshot: readlipd_browse_file_done]

The console shows the name of the current file being read. When the file is finished reading, the > indicator appears again and the process is finished. (shown on the left)

The LiPD file loads into the environment under variable L. The L variable in the environment allows you to preview some of the LiPD data with the dropdown arrow (shown on the right).

[Screenshot: readlipd_browse_file_L]

A quick look at L shows that the data is at the root of the variable, as expected.


Example 2: Browse for directory

NOTE:

Reading a directory is most commonly used for reading multiple files. I have added LiPD files to my source folder and will load them into variable D.

Call readLipd as shown below:

[Screenshot: readlipd_browse_dir_prompt]

Leave the path empty in this example. A prompt will ask you to choose to read a single file or a directory. Choose d and read a directory:

[Screenshot: readlipd_browse_dir_dialog]

NOTE:

Due to a bug in R, we are not able to use the module for choosing a directory with the GUI. It causes R Studio to crash and that’s not an experience we want to give you. The instructions below are a workaround that will provide the same result.

A browse dialog opens. Please choose any LiPD file within the directory that you want. For example, I want to load all the LiPD files in the quickstart directory, so I will choose ODP1098B13.lpd. Choosing either of the other two LiPD files has the same outcome.

[Screenshot: readlipd_browse_dir_done]

The console shows that 3 files have been read, and processing is finished.

The LiPD files load into the environment under variable D. The D variable shows that it is a list of 3, which matches the number of LiPD files in the source directory. Success!

[Screenshot: readlipd_browse_dir_D]

A quick look at D shows that the datasets are sorted by dataset name, as expected. If you look one more level down, you can see the data.

[Screenshot: readlipd_browse_dir_D2]

REMEMBER

Since D contains multiple datasets, we organize the data by dataSetName. Since L only holds one dataset, we do not use this dataSetName layer, and instead link directly to the data.
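Because D is keyed by dataset name, you can pull out one dataset or loop over all of them with ordinary list tools. The following base R sketch uses mock data; the dataset names and the `archiveType` values are placeholders chosen for illustration.

```r
# Mock data: names and values are placeholders, not real LiPD content.
D <- list(
  DatasetA = list(archiveType = "marine sediment", paleoData = list()),
  DatasetB = list(archiveType = "lake sediment",   paleoData = list())
)

# Pull one dataset out of D; the result has the same shape as an L variable.
L <- D[["DatasetA"]]

# Visit every dataset by name and collect one root-level field from each.
archive_types <- vapply(D, function(dataset) dataset[["archiveType"]],
                        character(1))
print(archive_types)
```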

Example 3: Provide a path

If you have the path to a specific file or directory that you would like to read, you can read in data in fewer steps. I’ll use a path to a file on my desktop.

[Screenshot: readlipd_path_file]

NOTE:

Relative paths to other directories are not supported here. If the file you want to read is located in your ‘current working directory’ (check it with getwd()), then you can load it directly using the filename.

readLipd("ODP1098B13.lpd")

If a file is not in your current working directory, then you must give an explicit path to the file.

readLipd("/Users/bobsmith/Desktop/ODP1098B13.lpd")
or
readLipd("~/Desktop/ODP1098B13.lpd")
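When a path does not resolve, it often helps to build and check it in base R before passing it to readLipd. This is a small sketch, assuming the same Desktop location and filename used in the examples above; adjust both for your machine.

```r
# Where will a bare filename be looked up?
print(getwd())

# Build an explicit path from parts instead of pasting strings together.
# "Desktop" and the filename follow the examples in this guide.
explicit_path <- file.path(path.expand("~"), "Desktop", "ODP1098B13.lpd")
print(explicit_path)

# Check that the file exists before handing the path to a reader.
if (file.exists(explicit_path)) {
  message("Found: ", explicit_path)
} else {
  message("No file at: ", explicit_path)
}
```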


View Lipd

Purpose:
Browsing LiPD data structure

Returns: None

This is not a LiPD function. It is an R Studio feature that allows you to browse nested data.

After using readLipd() to load the LiPD file into the environment, you can browse through the data via the R Studio 'magnifying glass' icon.

[Screenshot: viewLipd]

NOTE:

Matlab and R have built-in features to view data structures. Python has a custom viewLipd() function to mimic these features. If your R Studio does not have this feature, make sure you are on version 1.1.453 or newer.


writeLipd(D, path = "")

Purpose:
Writes LiPD data from the environment as a LiPD file.

Parameters:

D

Multiple datasets D or one dataset L


path (optional)

The directory path that you would like to write the LiPD file(s) to.

Provide a destination directory path:

writeLipd(D, "/Users/bobsmith/Desktop")

Or, omit the path to browse for a destination:

writeLipd(D)

Returns:

This function does not return data

Call writeLipd as shown below. Pass your LiPD data. In this case, I pass L to the function, which represents one LiPD dataset.

[Screenshot: writelipd_file_call]

[Screenshot: writelipd_file_choose]

A dialog opens and asks you to choose a directory. Choose a file within the directory that you want to write to. (See Example 2 for readLipd for further explanation.)

[Screenshot: writelipd_file_done]

The console window shows each data file as it is compressed into the LiPD file being written. The list should contain four .txt files, at least one .csv file, and one .jsonld file.
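The expected makeup of the file list can be checked mechanically. The helper below is a base R sketch, not a lipdR function; `check_lipd_contents` is a hypothetical name, and the file names in the demonstration are illustrative stand-ins for whatever your console prints.

```r
# Count files by extension and compare against the expected makeup:
# four .txt files, at least one .csv file, and one .jsonld file.
check_lipd_contents <- function(file_names) {
  ext <- tolower(tools::file_ext(file_names))
  sum(ext == "txt") == 4 && sum(ext == "csv") >= 1 && sum(ext == "jsonld") == 1
}

# Hypothetical file names; the real names depend on your dataset.
written <- c("bagit.txt", "bag-info.txt", "manifest-md5.txt",
             "tagmanifest-md5.txt", "paleo0measurement0.csv",
             "metadata.jsonld")
print(check_lipd_contents(written))
```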


extractTs(D)

Purpose:
Creates a time series from LiPD datasets.
What is a time series?

Parameters:

D

Multiple datasets D or one dataset L

Returns:

ts

A time series

Call extractTs as shown below:

[Screenshot: extractts_call]

The time series is created and placed in the ts variable. Click the arrow next to the ts variable in the environment to see what the contents look like.

[Screenshot: extractts_done]
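The same inspection can be done from the console. The sketch below uses a small mock time series built in base R so that it runs without any LiPD files; the dataset names are placeholders, and `paleoData_variableName` is the field name used in the query examples of this guide.

```r
# Mock time series: each element stands for one time series object.
ts <- list(
  list(dataSetName = "DatasetA", paleoData_variableName = "d18O"),
  list(dataSetName = "DatasetA", paleoData_variableName = "depth"),
  list(dataSetName = "DatasetB", paleoData_variableName = "age")
)

# Number of time series objects.
print(length(ts))

# Field names available on the first object.
print(names(ts[[1]]))

# Variable name carried by each object.
variable_names <- vapply(ts, function(obj) obj[["paleoData_variableName"]],
                         character(1))
print(variable_names)
```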


collapseTs(ts)

Purpose:
Collapses a time series back into LiPD datasets. This function is lossless and returns the data to its original form. Any changes or edits you made to the time series will persist. (This is the inverse of extractTs.)
What is a time series?

Parameters:

ts

A time series

Returns:

D

Multiple datasets D or one dataset L

Call collapseTs as shown below:

[Screenshot: collapsets_start]

The goal of collapseTs is to recreate the same data (without losing anything) that you had before calling extractTs. This is most useful if you have edited the time series in some way.

[Screenshot: collapsets_done]

In this example, L represents your original dataset, ts represents the time series, and L2 represents the new dataset. Note that L and L2 have the same number of elements and the same size.

[Screenshot: collapsets_expand]

Expanding L2 in the environment shows that the data has returned to the LiPD dataset hierarchy as before.
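A round trip can also be verified in code instead of by eye. The sketch below is base R only and uses mock lists as stand-ins: L plays the original dataset and L2 plays the collapsed result. `same_shape` is a hypothetical helper, and the field names are placeholders.

```r
# Compare an original dataset with the one rebuilt after a round trip.
same_shape <- function(original, rebuilt) {
  length(original) == length(rebuilt) &&
    setequal(names(original), names(rebuilt))
}

# Mock stand-ins: same content, different element order.
L  <- list(dataSetName = "DatasetA", paleoData = list(1, 2), geo = list())
L2 <- L[c("geo", "dataSetName", "paleoData")]

# Same number of elements and same element names.
print(same_shape(L, L2))

# Same content once both lists are put in a common order.
print(isTRUE(all.equal(L[order(names(L))], L2[order(names(L2))])))
```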


queryTs(ts, expression)

Purpose:
Find all the objects in a time series that match your specified criteria. For example, you may want only data located at a certain latitude, or only data of a certain archive type.

How are queryTs and filterTs different?

Parameters:

ts

A time series

expression

The criteria that you choose to find data. Only data that matches this expression will be returned.

Returns:

matches

The index numbers for all time series objects that match the expression.

For this example, I’ve read and created a time series using multiple datasets to emphasize the point of querying. All the steps are shown below:

[Screenshot: filterts_prestart]

I’ll use the expression paleoData_variableName == d18O to get the index numbers of all time series objects that represent d18O measurements.

Call queryTs as shown below:

[Screenshot: queryts_call]

A list of index numbers is returned as matches and shows you all the time series object index numbers that match the paleoData_variableName == d18O expression. The ts time series variable has 9 objects (denoted as elements in the R Studio environment), and 2 of those objects are d18O entries: index number 1 and index number 3.

[Screenshot: queryts_done]
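To show what an index-based result looks like, here is a base R stand-in that mimics the idea on mock data. It is not how queryTs is implemented; `query_mock` is a hypothetical helper, and the mock time series simply places d18O at positions 1 and 3 of 9 objects to mirror the example.

```r
# Mock time series of 9 objects; positions 1 and 3 hold d18O.
variable_names <- c("d18O", "depth", "d18O", "age", "depth",
                    "age", "temperature", "depth", "age")
ts <- lapply(variable_names, function(v) list(paleoData_variableName = v))

# Stand-in for a query: indices of objects whose field equals the value.
query_mock <- function(ts, field, value) {
  hits <- vapply(ts, function(obj) identical(obj[[field]], value), logical(1))
  which(hits)
}

matches <- query_mock(ts, "paleoData_variableName", "d18O")
print(matches)
```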


filterTs(ts, expression)

Purpose:
Create a new, smaller time series that contains only the objects that match your specified criteria. For example, you may want only data located at a certain latitude, or only data of a certain archive type.

How are queryTs and filterTs different?

Parameters:

ts

A time series

expression

The criteria that you choose to find data. Only data that matches this expression will be returned.

Returns:

ts

A time series

For this example, I’ve read and created a time series using multiple datasets to emphasize the point of filtering. All the steps are shown below:

[Screenshot: filterts_prestart]

NOTE:
There are many different criteria to search by, too many to list here. However, if you look at the time series variable in your environment (ts), you will see the searchable terms. The image below also shows a small sample of the ts variable that I’m using now.

I’ll use the expression paleoData_variableName == d18O to get all time series objects that represent d18O measurements.

Call filterTs as shown below:

[Screenshot: filterts_call]

A new, filtered time series is created as new_ts and the original time series remains as ts.

[Screenshot: filterts_done]
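The effect of filtering can be imitated on mock data with base R's `Filter`. This is only a sketch of the idea, not the lipdR implementation; the mock time series has 9 objects, two of which carry d18O.

```r
# Mock time series of 9 objects; two of them are d18O.
variable_names <- c("d18O", "depth", "d18O", "age", "depth",
                    "age", "temperature", "depth", "age")
ts <- lapply(variable_names, function(v) list(paleoData_variableName = v))

# Stand-in for filtering: keep only the matching objects.
new_ts <- Filter(
  function(obj) identical(obj[["paleoData_variableName"]], "d18O"),
  ts
)

print(length(ts))      # the original time series is unchanged
print(length(new_ts))  # the filtered copy is smaller
```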


Help

What is a time series?

The LiPD dataset hierarchy is great for organization and giving context to data, but can be more difficult to sift through to find relevant information since it can often go 10+ levels deep.

A time series is a flattened set of data that makes data more approachable and is used to perform data analysis. A time series is a collection of time series objects.

1-to-1 ratio
1 time series object = 1 measurement table column

Each object within a time series is made from one column of data in a measurement table. Note that this applies only to measurement table data: model data (ensemble, distribution, summary) is not included when creating a time series.

Example 1: One dataset

extractTs creates a time series (ts) of 5 objects

Example 2: Multiple datasets

extractTs creates a time series (ts) of 9 objects
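The one-column-to-one-object rule can be illustrated with a toy flattening step in base R. This is a simplified sketch and not the extractTs implementation: the mock dataset has a much shallower layout than a real LiPD file, and `flatten_mock`, `columns`, and `paleoData_values` are names assumed for illustration.

```r
# Mock dataset with one measurement table of three columns.
L <- list(
  dataSetName = "DatasetA",
  columns = list(
    depth = c(1, 2, 3),
    age   = c(100, 200, 300),
    d18O  = c(0.5, 0.7, 0.6)
  )
)

# Flatten: one time series object per column, each carrying the dataset name.
flatten_mock <- function(dataset) {
  cols <- dataset[["columns"]]
  lapply(names(cols), function(col_name) {
    list(
      dataSetName            = dataset[["dataSetName"]],
      paleoData_variableName = col_name,
      paleoData_values       = cols[[col_name]]
    )
  })
}

ts <- flatten_mock(L)
print(length(ts))  # one object per column
```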


How are queryTs and filterTs different?

It’s easy to confuse these two functions as they are almost identical in purpose. Here’s what you need to know:

queryTs:

This function returns the index numbers of objects that match your expression.

filterTs:

This function returns the actual data of objects that match your expression.
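Put another way, indexing a time series with a query-style result gives the same objects that a filter-style call returns. The base R sketch below shows this on mock data; it imitates the two behaviours with `which` and `Filter` and does not call lipdR.

```r
# Mock time series of three objects.
ts <- lapply(c("d18O", "depth", "d18O"),
             function(v) list(paleoData_variableName = v))

# Matching rule shared by both approaches.
is_match <- function(obj) identical(obj[["paleoData_variableName"]], "d18O")

idx     <- which(vapply(ts, is_match, logical(1)))  # query style: positions
objects <- Filter(is_match, ts)                     # filter style: the data

print(idx)
print(identical(ts[idx], objects))
```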