Title: | Automated Cleaning of Fossil Occurrence Data |
---|---|
Description: | Functions to automate the detection and resolution of taxonomic and stratigraphic errors in fossil occurrence datasets. Functions were developed using data from the Paleobiology Database. |
Authors: | Joe Flannery-Sutherland [aut, cre], Nussaïbah Raja-Schoob [aut, ctb], Ádam Kocsis [aut, ctb], Wolfgang Kiessling [aut] |
Maintainer: | Joe Flannery-Sutherland <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.0.4 |
Built: | 2024-10-31 21:11:58 UTC |
Source: | https://github.com/jf15558/fossilbrush |
Function to add detected peaks to an existing density plot, using the output of threshold_ranges.
add_itp(x, taxon, legend.pos = "topright", exit = TRUE)
x |
The list output of threshold_ranges |
taxon |
A character vector of length one, specifying one of the taxon names in x to be plotted |
legend.pos |
One of topleft, bottomleft, topright or bottomright, or a vector of length two, giving the xy coordinates of the legend. A convenience parameter so that the plot detail can remain unobscured. |
exit |
Restore base plotting parameters on function exit (default as a requirement for CRAN). Can be set to FALSE to allow other elements to be added to a plot |
None, the detected peaks are added to an existing density plot
See threshold_ranges. This function should be used to add information to an existing plot from densify, ensuring that the same taxon name is used
# load dataset
data("brachios")
# subsample brachios to make for a short example runtime
set.seed(1)
brachios <- brachios[sample(1:nrow(brachios), 1000),]
# densify ranges
dens <- densify(brachios)
# interpeak thresholding
itp <- threshold_ranges(brachios, win = 8, thresh = 10, rank = "genus", srt = "max_ma", end = "min_ma")
# plot the taxon, now identifying the peaks
plot_dprofile(dens, "Atrypa", exit = FALSE)
add_itp(itp, "Atrypa")
Convenience function to add a kingdom column to a PBDB dataset. This relies on the dataset having a column of phylum-level assignments for occurrences. The kingdom column is a useful addition for filtering very large, taxonomically diverse datasets, and adds a further level of data which can inform taxonomic cleaning routines like those called by check_taxonomy
add_kingdoms(x, phylum = "phylum", insert.left = TRUE)
x |
A dataframe containing, minimally, phylum-level assignments of the data |
phylum |
A character of length 1 specifying the column in x with the phylum level assignments |
insert.left |
A convenience argument which ensures that the kingdom column is inserted immediately to the left of the phylum column |
The dataframe x, with the kingdom column inserted
# load dataset
data("brachios")
# add kingdoms to dataset
brachios <- add_kingdoms(brachios)
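The lookup-and-insert behaviour described for add_kingdoms can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the `kingdom_lookup` vector and the toy `occ` table are hypothetical stand-ins for the package's internal phylum list and a PBDB download.

```r
# Hypothetical phylum-to-kingdom lookup (illustrative only)
kingdom_lookup <- c(Brachiopoda = "Animalia",
                    Mollusca    = "Animalia",
                    Bryophyta   = "Plantae")

# Toy dataset with a phylum column (hypothetical values)
occ <- data.frame(genus  = c("Atrypa", "Nucula", "Sphagnum"),
                  phylum = c("Brachiopoda", "Mollusca", "Bryophyta"),
                  stringsAsFactors = FALSE)

# Look up the kingdom for each row; phyla missing from the lookup give NA
kingdom <- unname(kingdom_lookup[occ$phylum])

# Insert the new column immediately to the left of the phylum column
pos  <- match("phylum", names(occ))
occ2 <- data.frame(occ[seq_len(pos - 1)],
                   kingdom = kingdom,
                   occ[pos:ncol(occ)],
                   stringsAsFactors = FALSE)
names(occ2)
```

Matching by name keeps the row order of the input unchanged, which is why a lookup is used here rather than a merge.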
age_ranges
age_ranges( data, taxonomy = "genus", srt = "max_ma", end = "min_ma", mode = "max" )
data |
A dataframe containing, minimally, one or more character columns of taxonomic names, a numeric column of FADs and a numeric column of LADs |
taxonomy |
A character vector corresponding to one or more of the taxonomic name columns in data |
srt |
A character vector of length one specifying the FAD column in data |
end |
A character vector of length one specifying the LAD column in data |
mode |
A character vector of length one specifying the type of range table to return: one of max, min or bounds. If not specified by the user, the function behaviour will default to max |
Function to derive a range table of taxon names from a stratigraphic occurrence dataset. The default behaviour is to return a total range table - the oldest FAD and youngest LAD for each taxon (max), but the function can also return the minimum range - youngest FAD and oldest LAD (min), or the uncertainty bounds on each FAD and LAD - the two oldest FADs and two youngest LADs (bounds). The names for which ranges are derived are specified by the taxonomy argument, but multiple elements can be given here, allowing taxonomic ranges for higher clades to also be returned.
A dataframe containing at least four columns: taxon name, FAD, LAD and the taxonomic rank. If taxonomy is of length one, taxonomic rank will be a vector of identical names. If mode = "bounds", there will be two pairs of age columns, denoting the upper and lower bounds on the FAD and LAD for each taxon name
# load dataset
data("brachios")
# derive age ranges
rng <- age_ranges(brachios)
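As a rough illustration of the max and min modes described for age_ranges, the same summaries can be computed with tapply. This is a sketch on a hypothetical toy table (`occ`, with invented ages), not the function's actual code. Ages are in Ma, so the oldest age is the largest number.

```r
# Toy occurrence table (hypothetical values, ages in Ma)
occ <- data.frame(
  genus  = c("Atrypa", "Atrypa", "Atrypa", "Lingula", "Lingula"),
  max_ma = c(430, 410, 395, 480, 300),
  min_ma = c(420, 400, 385, 470, 290),
  stringsAsFactors = FALSE
)

# mode = "max": oldest FAD and youngest LAD per taxon
fad_max <- tapply(occ$max_ma, occ$genus, max)
lad_max <- tapply(occ$min_ma, occ$genus, min)
range_max <- data.frame(taxon = names(fad_max),
                        FAD = as.numeric(fad_max),
                        LAD = as.numeric(lad_max),
                        stringsAsFactors = FALSE)

# mode = "min": youngest FAD and oldest LAD per taxon
fad_min <- tapply(occ$max_ma, occ$genus, min)
lad_min <- tapply(occ$min_ma, occ$genus, max)
range_min <- data.frame(taxon = names(fad_min),
                        FAD = as.numeric(fad_min),
                        LAD = as.numeric(lad_min),
                        stringsAsFactors = FALSE)

range_max
range_min
```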
Function to assess and resolve elements with multiple higher classifications in a tgraph object. Assessment is based on the topology of the graph they form. Linear paths (i.e. two totally separate paths diverging from a shared node), rings (divergent paths which only reunite at the highest rank in the tgraph) and cases with more than two divergent paths are treated as distinct. In all other cases, the distance between the focal element and the reunion of the divergent paths, along with their subtopologies, is assessed. Either a consensus or preferred path is returned, based on the frequency of each path in the tgraph or their completeness, or the element is judged as having multiple distinct classifications
assess_duplicates( x, node, mode = c("frequency", "completeness"), jump = 3, plot = FALSE )
x |
A tgraph object |
node |
A character vector of elements with multiple higher classifications in x, or a tvertseq object with those same elements as focal |
mode |
The rule to be used in selecting between multiple higher classifications. It is possible for the most complete pathway to also be the most frequent |
jump |
The maximum number of levels between the point of divergence and the point of reunion (if present) for a given path, below which the divergence will be taken as conflicting |
plot |
A logical specifying if the divergent paths should be plotted |
A list with as many items as elements with multiple classifications, each recording the assessment for a given element
An example dataset of Palaeozoic brachiopods downloaded from the Paleobiology Database.
brachios
An object of class data.frame with 151473 rows and 10 columns.
Wrapper function to implement a multi-step cleaning routine for hierarchically structured taxonomic data.
The first part of the routine calls format_check to perform a few presumptive checks on all columns, scanning for non-letter characters and checking the number of words in each string. If clean_name is TRUE, clean_name is also called to ensure correct formatting, as this improves downstream checking.
The second part of the routine calls spell_check to flag spelling discrepancies between names within a given taxonomic group. If chosen, the function can automatically impose the more frequent spelling.
The third part of the routine calls discrete_ranks to flag name re-use at different taxonomic levels. Some of these cases may arise when a name has been unfortunately (although permissibly) used to refer to groups at different taxonomic levels, or where a higher classification has been inserted as a placeholder for a missing lower classification.
The fourth part of the routine calls find_duplicates to flag variable higher classifications for a given taxon, including cases where a higher classification is missing for one instance of a taxon, but present for the others. If chosen, resolve_duplicates is called to ensure a consistent classification is imposed. For cases where a name has been re-used at the same rank for genuinely different taxa (not permissible, unlike name re-use at different ranks), suffixes are added as capital letters, e.g. TaxonA, TaxonB.
If any of the automatic cleaning routines are employed, the function will return a cleaned version of the dataset. If the use of suffixes from resolve_duplicates is not desirable, the function behaviour can be altered so that any suffixes are dropped before returning.
check_taxonomy( x, ranks = c("phylum", "class", "order", "family", "genus"), species = FALSE, species_sep = NULL, routine = c("format_check", "spell_check", "discrete_ranks", "find_duplicates"), report = TRUE, verbose = TRUE, clean_name = FALSE, clean_spell = FALSE, thresh = NULL, resolve_duplicates = FALSE, append = TRUE, term_set = NULL, collapse_set = NULL, jw = 0.1, str = 1, str2 = NULL, alternative = "jaccard", q = 1, pref_set = NULL, suff_set = NULL, exclude_set = NULL, jump = 3, plot = FALSE )
x |
A dataframe with hierarchically organised taxonomic information. If x only comprises the taxonomic information, ranks does not need to be specified, but the columns must be in order of decreasing taxonomic rank |
ranks |
The column names of the taxonomic data fields in x. These must be provided in order of decreasing taxonomic rank |
species |
A logical indicating if x contains a species column. As the data must be supplied in hierarchical order, this column will naturally be the last column in x and species-specific spell checks will be performed on this column. NOTE that for the function to work, the species name must be the full species name rather than just the specific epithet, e.g., 'Tyto_alba' rather than just 'alba'. |
species_sep |
A character vector of length one specifying the separator between the genus name and specific epithet in the species column |
routine |
A character vector determining the flagging and cleaning routines to employ. Valid values are format_check (check for non-letter characters and the number of words in names), spell_check (flag potential spelling errors), discrete_ranks (check that taxonomic names are unique to their rank), find_duplicates (flag conflicting higher classifications of a given taxon) |
report |
A logical of length one determining if the flagging outputs of each cleaning routine should be returned to the user for inspection. This is different to verbose, which controls whether flagging should additionally be reported to the user on the console |
verbose |
A logical determining if function progress and flagged errors should be reported to the console |
clean_name |
If TRUE, the function will return cleaned versions of the columns in x using the routines in clean_name. These routines can be altered using the term_set and collapse_set arguments. |
clean_spell |
If TRUE, the function will return a cleaned version of the supplied taxonomic dataframe, using the supplied threshold for the similarity method given by method2, to automatically update any names in pairs of flagged synonyms to the more frequent spelling. This is not recommended, however, so the argument is FALSE by default and the threshold left as NULL |
thresh |
The threshold for the similarity method given by method2, below which flagged pairs of names will be considered synonyms and resolved automatically. See spell_check for details on method2 |
resolve_duplicates |
If TRUE, the function will return a cleaned version of the supplied taxonomic dataframe, using resolve_duplicates to resolve conflicts in the way documented by that function. clean_spell and resolve_duplicates can both be TRUE to return a dataset cleaned by both methods |
append |
If TRUE, any suffixes used during cleaning will be retained in the cleaned version of the data. This is preferable as it ensures that all taxonomic names are rank-discrete and uniquely classified |
term_set |
A character vector of terms (to be used at all ranks) or a list of rank-specific terms which will be supplied, element-wise, as the terms argument of clean_name. If a list, this should be given in descending rank order |
collapse_set |
A character vector of character strings (to be used at all ranks) or a list of rank-specific strings which will be supplied, element-wise, as the collapse argument of clean_name. If a list, this should be given in descending rank order |
jw |
Passed to spell_check |
str |
Passed to spell_check |
str2 |
Passed to spell_check |
alternative |
Passed to spell_check |
q |
Passed to spell_check |
pref_set |
A character vector of prefixes (which will be used at all ranks) or a list of rank-specific prefixes, which will be supplied, element-wise, as the pref argument of spell_check. If a list, this should be given in descending rank order. |
suff_set |
A character vector of suffixes (which will be used at all ranks) or a list of rank-specific suffixes, which will be supplied, element-wise, as the suff argument of spell_check. If a list, this should be given in descending rank order. |
exclude_set |
A character vector of terms to exclude (which will be used at all ranks) or a list of rank-specific exclusion terms, which will be supplied, element-wise, as the exclude argument of spell_check. If a list, this should be given in descending rank order. |
jump |
Passed to resolve_duplicates |
plot |
Passed to resolve_duplicates |
A list with elements corresponding to the outputs of the chosen flagging routines (four by default: $formatting, $synonyms, $ranks, $duplicates), plus a cleaned version of the data ($data) if any of clean_name, clean_spell or resolve_duplicates are TRUE. See format_check, spell_check, discrete_ranks and find_duplicates for details of the structure of the flagging outputs
# load dataset
data("brachios")
# subsample brachios to make for a short example runtime
set.seed(1)
brachios <- brachios[sample(1:nrow(brachios), 1000),]
# define the taxonomic ranks used in the dataset (re-used elsewhere)
b_ranks <- c("phylum", "class", "order", "family", "genus")
# define a list of suffixes to be used at each taxonomic level when scanning for synonyms
b_suff <- list(NULL, NULL, NULL, NULL, c("ina", "ella", "etta"))
# scan for errors
brachios <- check_taxonomy(brachios, suff_set = b_suff, ranks = b_ranks)
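The spell-checking step of the routine (flag similar names, then optionally impose the more frequent spelling) can be illustrated with a minimal base R sketch. Note the assumptions: this uses Levenshtein distance from utils::adist as a stand-in, whereas the jw argument suggests the package itself works with a Jaro-Winkler distance. The `genera` vector and the distance cut-off of 1 are hypothetical.

```r
# Hypothetical genus column containing one misspelling
genera <- c("Atrypa", "Atrypa", "Atrypa", "Atripa", "Lingula")

# Frequency of each unique spelling
freq <- table(genera)
nm   <- names(freq)

# Pairwise edit distances between unique names (stand-in similarity measure)
d <- adist(nm)

# Flag pairs differing by a single edit; upper triangle avoids duplicate pairs
pairs   <- which(d > 0 & d <= 1 & upper.tri(d), arr.ind = TRUE)
flagged <- data.frame(a = nm[pairs[, "row"]],
                      b = nm[pairs[, "col"]],
                      stringsAsFactors = FALSE)

# Optionally impose the more frequent spelling of each flagged pair
cleaned <- genera
for (k in seq_len(nrow(flagged))) {
  pair <- c(flagged$a[k], flagged$b[k])
  keep <- pair[which.max(as.integer(freq[pair]))]
  cleaned[cleaned %in% pair] <- keep
}
cleaned
```

Automatic replacement of this kind can merge genuinely distinct names that happen to be similar, which is consistent with the documentation's advice to leave clean_spell as FALSE and inspect the flagged pairs.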
Convenience function to apply user-specified chronostratigraphy to fossil datasets. The function relies on a lookup table generated from the named intervals in the PBDB in early 2021. First and last interval names in the supplied dataset are matched against this lookup table, by default using get("GTS_2020"), to get GTS 2020 numeric ages. If the dataset contains intervals which are not present in the lookup table, they will not be matched and the user will be warned. To get around this possibility, the user can also supply the original numeric ages, which will be used as default ages if an interval cannot be matched, to ensure that the returned vectors of numeric ages do not contain NAs.
chrono_scale( x, tscale = "GTS_2020", srt = "early_interval", end = "late_interval", max_ma = NULL, min_ma = NULL, verbose = TRUE )
x |
A data.frame containing, minimally, two columns corresponding respectively to the first and last intervals of the data. Values should only be present in the second column where the minimum age interval for a row is different to the maximum age interval. Otherwise the values should be NA and the ages returned will be based on the interval specified in the first column, in line with PBDB formatting. |
tscale |
A character string specifying one of the inbuilt chronostratigraphic timescales (currently GTS 2020 only) or a data.frame supplied by the user. If the latter, this must contain columns named 'Interval', 'FAD', 'LAD', specifying the interval names to be matched and their lower and upper age in Ma |
srt |
A character of length 1 specifying the column name of the first interval field in x |
end |
A character of length 1 specifying the column name of the last interval field in x |
max_ma |
If not NULL, a character of length 1 specifying the column name of the original numeric maximum age field in x, to be used as fallback values if interval names cannot all be matched |
min_ma |
If not NULL, a character of length 1 specifying the column name of the original numeric minimum age field in x, to be used as fallback values if interval names cannot all be matched |
verbose |
A logical indicating if warning messages should be displayed or otherwise |
The dataframe, x, with two additional columns containing the revised first and last numeric ages of the data, with column names GTS_FAD and GTS_LAD respectively
# example dataset
data("brachios")
# add GTS_2020 dates
brachios <- chrono_scale(brachios, srt = "early_interval", end = "late_interval", max_ma = "max_ma", min_ma = "min_ma")
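The matching-with-fallback logic described for chrono_scale can be sketched as follows. This is an assumption-laden illustration rather than the function's real code: the interval names (IntervalA and so on) and all ages are invented placeholders, not GTS 2020 values, and `tscale` simply follows the documented 'Interval', 'FAD', 'LAD' layout for user-supplied timescales.

```r
# Hypothetical timescale in the documented Interval/FAD/LAD layout
tscale <- data.frame(Interval = c("IntervalA", "IntervalB", "IntervalC"),
                     FAD = c(30, 20, 10),
                     LAD = c(20, 10, 0),
                     stringsAsFactors = FALSE)

# Toy PBDB-style data: late_interval is NA when it equals early_interval
occ <- data.frame(early_interval = c("IntervalA", "IntervalB", "Unknown"),
                  late_interval  = c(NA, "IntervalC", NA),
                  max_ma = c(29, 19, 55),
                  min_ma = c(21, 2, 50),
                  stringsAsFactors = FALSE)

# Where the last interval is NA, use the first interval instead
late <- ifelse(is.na(occ$late_interval), occ$early_interval, occ$late_interval)

# Match interval names against the lookup table
new_fad <- tscale$FAD[match(occ$early_interval, tscale$Interval)]
new_lad <- tscale$LAD[match(late, tscale$Interval)]

# Fall back on the original numeric ages where no match was found
miss_fad <- is.na(new_fad)
miss_lad <- is.na(new_lad)
new_fad[miss_fad] <- occ$max_ma[miss_fad]
new_lad[miss_lad] <- occ$min_ma[miss_lad]

cbind(occ, new_fad, new_lad)
```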
clean_name
clean_name(x, terms = NULL, collapse = NULL, verbose = FALSE)
x |
a vector of names to clean. This will be coerced to class character internally |
terms |
a character vector of terms to remove from elements of x. Terms are only removed as whole words, rather than if they also happen to occur as strings within elements of x |
collapse |
a character vector of strings which should be collapsed (i.e. replaced by "", rather than the default " "). If one of the collapse terms is a special regex character, it will need to be escaped, e.g. "\-" |
verbose |
A logical of length 1 determining if function progress should be reported to the console |
Function which bundles a series of cleaning routines into a single process. First, any words in brackets are removed, followed by a series of user-defined terms if given. Next, Roman and Arabic numerals are removed, then abbreviations up to five letters (abbreviations are matched by the following dot, e.g. ABFS.). By default, characters for removal are replaced by a white space to prevent accidental collapse of strings. However, there may be specific cases where a collapse is required, so terms given in collapse are dealt with next. After collapsing, all rogue punctuation is removed, then isolated lowercase letters, then isolated groups of capitals up to 5 characters long. Finally, runs of white space longer than 1 are reduced and trailing white space is removed, any remaining strings longer than 2 words are subsetted to the first word, the first letter of each string is capitalised and zero-length strings are converted to NA
a character vector of the same length as x. Elements which were reduced to zero characters during cleaning are returned as NA
# load dataset
data("brachios")
# clean genus names
gen_clean <- clean_name(brachios$genus)
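A heavily simplified version of the cleaning sequence described for clean_name can be written as a chain of regular expression substitutions. This is a sketch only: it covers a subset of the documented steps, the patterns are assumptions rather than the package's own, and the input vector `x` is hypothetical.

```r
# Hypothetical messy names
x <- c("atrypa (Atrypa) 12", "  Lingula   sp.", "(?)")

out <- gsub("\\([^)]*\\)", " ", x, perl = TRUE)               # words in brackets
out <- gsub("[0-9]+", " ", out, perl = TRUE)                  # Arabic numerals
out <- gsub("\\b[A-Za-z]{1,5}\\.", " ", out, perl = TRUE)     # short abbreviations ending in a dot
out <- gsub("[[:punct:]]", " ", out, perl = TRUE)             # remaining punctuation
out <- gsub("\\s+", " ", out, perl = TRUE)                    # collapse runs of white space
out <- trimws(out)                                            # leading and trailing white space
out <- sub("^([a-z])", "\\U\\1", out, perl = TRUE)            # capitalise first letter
out[nchar(out) == 0] <- NA                                    # zero-length strings become NA
out
```

Replacing removed text with a space, then collapsing white space at the end, mirrors the documented rationale of avoiding accidental joins between neighbouring words.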
Function to create a matrix of occurrence record densities through geological time from an occurrence dataset. Each column represents a taxon. Each row represents a user-defined window of time, with the first row starting at the oldest FAD in the dataset and spanning to the youngest LAD stepwise by the user-defined window (default of 1 Ma). Occurrence records are densified by generating a vector of time points from occurrence FAD to occurrence LAD (default step of 0.1 Ma), then tallied in two ways. The first way is a simple histogram count of points-per-window, with the same number of histogram bins as time steps between the overall taxon FAD and LAD. The second way is a kernel density estimate, using a Gaussian kernel with equally spaced estimation points equal in number to the time steps between the overall taxon FAD and LAD
densify( x, rank = "genus", srt = "max_ma", end = "min_ma", step = 1, density = 0.1, method = c("histogram", "kernel"), ..., verbose = TRUE )
x |
An occurrence dataset |
rank |
The column name in x containing the taxon names for which densified columns will be generated |
srt |
A column name in x denoting the occurrence FADs |
end |
A column name in x denoting the occurrence LADs |
step |
A positive integer specifying the time window size (i.e. the duration represented by each row in the output matrix) |
density |
A positive numeric specifying the step size for densifying records. This should ideally be smaller than step |
method |
The method for quantifying occurrence density. By default both histogram and kernel density will be used |
... |
additional arguments passed to density |
verbose |
A logical determining if function progress should be reported |
A list of two sparse matrices, the first containing the histogram counts, the second the kernel density estimates
# load dataset
data("brachios")
# subsample brachios to make for a short example runtime
set.seed(1)
brachios <- brachios[sample(1:nrow(brachios), 1000),]
# densify ranges
dens <- densify(brachios)
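The densification idea (expand each occurrence into closely spaced time points, then tally them per window) can be sketched in base R. This is an illustrative approximation under stated assumptions, not the package's implementation: the toy table `occ` and the helper `densify_pts` are hypothetical, windows are fixed 1 Ma bins over the whole dataset, and a dense matrix is returned rather than the sparse matrices the function documents.

```r
# Toy occurrence data for two hypothetical taxa (ages in Ma)
occ <- data.frame(genus  = c("A", "A", "B"),
                  max_ma = c(10, 8, 4),
                  min_ma = c(7, 6, 2),
                  stringsAsFactors = FALSE)

# Hypothetical helper: points every 0.1 Ma from each occurrence FAD to its LAD
densify_pts <- function(d, density = 0.1) {
  unlist(Map(function(f, l) seq(f, l, by = -density), d$max_ma, d$min_ma))
}

# 1 Ma windows spanning the youngest LAD to the oldest FAD in the dataset
breaks <- seq(floor(min(occ$min_ma)), ceiling(max(occ$max_ma)), by = 1)

# Histogram counts of points per window: one row per window, one column per taxon
dens_counts <- sapply(split(occ, occ$genus), function(d) {
  hist(densify_pts(d), breaks = breaks, plot = FALSE)$counts
})
dens_counts

# Gaussian kernel density for taxon A, evaluated at one point per window
pts_A <- densify_pts(occ[occ$genus == "A", ])
kd_A  <- density(pts_A, n = nrow(dens_counts),
                 from = min(breaks), to = max(breaks))
kd_A$y
```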
discrete_ranks
discrete_ranks(x, ranks = NULL)
x |
A dataframe containing hierarchically structured information, for example a table of genus names and their higher taxonomic classifications |
ranks |
If not NULL, a vector of column names of x, given in rank order. This is useful if x contains columns which are not rank relevant or if columns are not in hierarchical order. If not supplied, the column order in x is used directly and is assumed to be in rank order |
Function for checking whether names in one column of a hierarchically organised dataframe re-occur at other levels. Two checks are performed. The first checks for names shared between adjacent columns, assuming that accidental reuse of names at other levels is most likely to occur at an adjacent rank. The second compares across all columns.
A list of two lists. The first list contains names which reoccur at adjacent ranks. The second list contains names that reoccur at any rank
# load dataset
data("brachios")
# define ranks
b_ranks <- c("phylum", "class", "order", "family", "genus")
# run function
flag <- discrete_ranks(brachios, ranks = b_ranks)
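The two checks described for discrete_ranks (adjacent columns, then all columns) amount to set intersections between columns. A minimal sketch on a hypothetical table `tax`, which deliberately re-uses "Atrypa" at family and genus level and "Lingula" at order and genus level, might look like this. It illustrates the idea and is not the function's actual code.

```r
# Hypothetical classification table with deliberate name re-use
tax <- data.frame(order  = c("Atrypida", "Lingulida", "Lingula"),
                  family = c("Atrypidae", "Lingulidae", "Atrypa"),
                  genus  = c("Atrypa", "Lingula", "Lingulella"),
                  stringsAsFactors = FALSE)
ranks <- c("order", "family", "genus")

# Check 1: names shared between adjacent ranks
adjacent <- lapply(seq_len(length(ranks) - 1), function(i) {
  intersect(tax[[ranks[i]]], tax[[ranks[i + 1]]])
})
names(adjacent) <- paste(ranks[-length(ranks)], ranks[-1], sep = "_")

# Check 2: names occurring at more than one rank anywhere in the table
all_names <- unlist(lapply(ranks, function(r) unique(tax[[r]])))
any_rank  <- unique(all_names[duplicated(all_names)])

adjacent
any_rank
```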
Function to detect and report elements with multiple higher assignments in a hierarchically structured dataframe
find_duplicates(x, ranks = NULL)
x |
A hierarchically organised dataframe |
ranks |
The ranks in the dataframe in which to check for elements with multiple higher classifications. The top rank is ignored by default |
A dataframe of elements with multiple higher classifications and their ranks
# load dataset
data("brachios")
b_ranks <- c("phylum", "class", "order", "family", "genus")
# run function
flag <- find_duplicates(brachios, ranks = b_ranks)
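Detecting elements with multiple higher assignments reduces to counting the distinct parents of each name. The sketch below does this for a single pair of ranks on a hypothetical table `tax`. It treats a missing parent (NA) as a distinct classification, on the assumption that this matches the missing-classification case mentioned under check_taxonomy. It is not the package's implementation.

```r
# Hypothetical family and genus columns; Atrypa has one missing family
tax <- data.frame(
  family = c("Atrypidae", "Atrypidae", NA, "Lingulidae", "Obolidae", "Spiriferidae"),
  genus  = c("Atrypa", "Atrypa", "Atrypa", "Lingula", "Lingula", "Spirifer"),
  stringsAsFactors = FALSE
)

# Number of distinct parent families per genus (NA counts as its own value)
n_parents <- tapply(tax$family, tax$genus, function(p) length(unique(p)))

# Genera with more than one higher classification
flagged <- names(n_parents)[as.vector(n_parents) > 1]
flagged
```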
Function to scan, column-wise, a matrix of per-taxon observation density time series. This can be applied to either the histogram or the kernel density output of densify, but the latter is recommended. Peaks are detected as local maxima, then smoothed within a local window and tested to distinguish if they are noise or significant. The strict threshold is that the peak is greater than the mean + sd of the window
find_peaks(x, win = 5, verbose = TRUE)
x |
A matrix as output by densify |
win |
A positive integer specifying the window length on either side of a peak (i.e. win = 5 will give a total window of 11: 5 indices before the peak, the peak index and 5 indices after) |
verbose |
A logical determining if function progress should be reported |
A list of four, the first three positions containing lists of the peak indices for each taxon, under raw, mean + sd and mean detection regimes. The fourth item is a dataframe of counts of peaks per taxon, 1 row per taxon, 1 column per detection regime
# load dataset
data("brachios")
# subsample brachios to make for a short example runtime
set.seed(1)
brachios <- brachios[sample(1:nrow(brachios), 1000),]
# get density matrix
dens <- densify(brachios)
# run function, using kernel density matrix
pk <- find_peaks(dens$kdensity)
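The mean + sd test described for find_peaks can be illustrated on a single density series. This sketch makes several simplifying assumptions: the series `y` is hypothetical, local maxima are taken as points strictly greater than both neighbours, and the smoothing step mentioned in the description is omitted. It is not the package's algorithm.

```r
# Hypothetical density series for one taxon: two small bumps and one large peak
y   <- c(0, 1, 0, 0, 0, 0, 9, 0, 0, 0, 0, 1, 0)
win <- 5
n   <- length(y)

# Local maxima: the slope changes from positive to negative
is_max <- c(FALSE, diff(sign(diff(y))) == -2, FALSE)
peaks  <- which(is_max)

# Keep a peak only if it exceeds mean + sd of the window centred on it
keep <- vapply(peaks, function(i) {
  w <- y[max(1, i - win):min(n, i + win)]
  y[i] > mean(w) + sd(w)
}, logical(1))

peaks        # raw local maxima
peaks[keep]  # maxima passing the mean + sd threshold
```

With win = 5, the two small bumps fall inside windows that also contain the large peak, so their window means exceed their own height and they are treated as noise.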
Function to compare stratigraphic ranges in x to a set of reference ranges from y. A list of two elements is returned. The first is a dataframe summarising the overall error status, specific error counts, FAD and LAD differences, and the 95% density distributions of the FAD and LAD errors for each unique taxon in the column of x denoted by the first element of xcols. If a taxon in x is not present in y, it is assigned the status 000 and its other entries in the returned dataframe will be NA. The second element of the returned list is the error code for every individual element of the column of x denoted by the first element of xcols - this will have the same number of rows as x. If x is a range table rather than an occurrence dataset, then the two list elements will have the same number of rows. Ranges for comparison may be supplied directly in y, or y may be another occurrence dataset.
flag_ranges( x = NULL, y = NULL, xcols = c("genus", "max_ma", "min_ma"), ycols = NULL, flag.diff = 5, verbose = TRUE )
x |
Stratigraphic range data for taxa as a whole or for individual fossil occurrences |
y |
The same as in x. This is the dataset to which ranges will be compared |
xcols |
A character vector of length three specifying, in the following order, the taxonomic name, stratigraphic base (FAD) and stratigraphic top (LAD) columns in x. |
ycols |
An optional character vector of length three for the same column types as in xcols, but for dataset y. This is useful if the column names differ between the datasets |
flag.diff |
A vector of thresholds, given in millions of years which will be used to flag discrepancies between occurrence FADs and LADs with respect to the reference range. This is a convenience parameter so that occurrences with large discrepancies can be quickly identified. Multiple thresholds can be supplied |
verbose |
A logical of length one determining if the flagging progress should be reported to the console |
A list of two data.frames, the first recording overall error statistics, the second recording error types for each element of x. In the second data.frame, FAD or LAD differences in excess of the supplied threshold(s) are marked with 1, otherwise 0
If y is an occurrence dataset, age_ranges is called internally to generate the range table for comparison.
# load the example datasets
data(brachios)
data(sepkoski)
# subsample brachios to make for a short example runtime
set.seed(1)
brachios <- brachios[sample(1:nrow(brachios), 1000),]
# update brachios to GTS2020 to match Sepkoski
brachios <- chrono_scale(brachios, srt = "early_interval", end = "late_interval",
                         max_ma = "max_ma", min_ma = "min_ma", verbose = FALSE)
brachios$max_ma <- brachios$newFAD
brachios$min_ma <- brachios$newLAD
# drop occurrences with older LADs than FADs
brachios <- brachios[brachios$max_ma > brachios$min_ma,]
# trim the Sepkoski Compendium to the relevant entries
sepkoski <- sepkoski[which(sepkoski$PHYLUM == "Brachiopoda"),]
# run flag ranges
flg <- flag_ranges(x = brachios, y = sepkoski, ycols = c("GENUS", "RANGE_BASE", "RANGE_TOP"))
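The threshold flagging controlled by flag.diff can be sketched by matching each occurrence to its reference range and measuring how far it falls outside. This is a simplified illustration on hypothetical tables (`ref`, `occ`) with invented ages. It does not reproduce the function's error codes or summary statistics, and the sign convention for the differences is an assumption.

```r
# Hypothetical reference ranges and occurrences (ages in Ma)
ref <- data.frame(genus  = c("Atrypa", "Lingula"),
                  max_ma = c(440, 480),
                  min_ma = c(370, 0),
                  stringsAsFactors = FALSE)
occ <- data.frame(genus  = c("Atrypa", "Atrypa", "Lingula", "Spirifer"),
                  max_ma = c(430, 450, 100, 400),
                  min_ma = c(420, 436, 90, 390),
                  stringsAsFactors = FALSE)

# Match each occurrence to its reference range (NA if the taxon is absent)
idx <- match(occ$genus, ref$genus)

# Positive values: the occurrence extends beyond the reference range
fad_diff <- occ$max_ma - ref$max_ma[idx]   # occurrence FAD older than reference FAD
lad_diff <- ref$min_ma[idx] - occ$min_ma   # occurrence LAD younger than reference LAD

# Flag differences in excess of the threshold with 1, otherwise 0
flag_diff <- 5
fad_flag  <- as.integer(fad_diff > flag_diff)
lad_flag  <- as.integer(lad_diff > flag_diff)

data.frame(occ, fad_diff, lad_diff, fad_flag, lad_flag)
```

Taxa missing from the reference table carry NA through every comparison, in line with the documented behaviour that unmatched taxa have NA entries.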
Function to perform a series of basic formatting checks geared towards taxonomic name data. The function very simply checks for non-letter characters in the taxonomic names, that species-level names contain two words, and that genus-level and above names contain one word.
format_check(x, ranks, species = FALSE, species_sep = " ", verbose = TRUE)
x |
A dataframe with hierarchically organised taxonomic information. If x only comprises the taxonomic information, ranks does not need to be specified, but the columns must be in order of decreasing taxonomic rank |
ranks |
The column names of the taxonomic data fields in x. These must be provided in order of decreasing taxonomic rank |
species |
A logical indicating if x contains a species column. As the data must be supplied in hierarchical order, this column will naturally be the last column in x and species-specific spell checks will be performed on this column. |
species_sep |
A character vector of length one specifying the separator between the genus name and specific epithet in the species column, if present |
verbose |
A logical determining if any flagged errors should be reported to the console |
A list of two lists. The first list flags the row indexes of columns whose elements contain non-letter characters. The second list flags the row indexes of columns whose elements do not contain the correct numbers of words
# load dataset
data("brachios")
# define ranks
b_ranks <- c("phylum", "class", "order", "family", "genus")
# run function
flag <- format_check(brachios, ranks = b_ranks)
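The two kinds of check described for format_check can be illustrated with a minimal base R sketch. This is not the package's implementation: the toy data frame, its column names and the object toy_flags are hypothetical, and spaces are deliberately tolerated in the first check so that multi-word names are reported by the second check only.

```r
# Hypothetical toy data; names and columns are illustrative only
taxa <- data.frame(
  family = c("Atrypidae", "Atrypidae", "Spiriferidae", "Spirifer1dae"),
  genus  = c("Atrypa", "Atrypa sp", "Spirifer", "Spirifer"),
  stringsAsFactors = FALSE
)

# check 1: rows whose names contain anything other than letters
# (spaces are allowed here so that word counts are reported separately)
non_letter <- lapply(taxa, function(col) which(grepl("[^A-Za-z ]", col)))

# check 2: rows whose names do not consist of exactly one word
word_count <- lapply(taxa, function(col) {
  n_words <- lengths(strsplit(trimws(col), " +"))
  which(n_words != 1)
})

# a list of two lists, loosely mirroring the structure described above
toy_flags <- list(non_letter = non_letter, word_count = word_count)
str(toy_flags)
```

In this toy case the fourth family name is flagged for a digit and the second genus name for containing two words.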
lookup table called by 'get_pbdb'
geog_lookup
An object of class data.frame with 511 rows and 2 columns.
Function for downloading Paleobiology Database (PBDB) data (saved to disk and/or imported into R) or generating PBDB API-compatible URLs. If downloading data over timespans greater than 100 Ma, the download is performed in 100 Ma chunks to better track the download progress.
get_pbdb( taxon = NULL, interval = NULL, mode = "occurrence", res = "all", fields = c("ident", "coords", "class"), ex_taxon = NULL, area = NULL, ex_area = NULL, invert_area = FALSE, litho = NULL, invert_litho = FALSE, env = NULL, ex_env = NULL, invert_env = NULL, pres = NULL, idqual = NULL, return_url = FALSE, return_data = TRUE, save_as = NULL, tscale = "ICS2013", wait = Inf )
taxon |
A character vector of taxon names. Prepending a taxon name with ^ will exclude it from the PBDB search. Alternatively, ex_taxon can be used to do this |
interval |
A numeric vector of length two with positive ages in Ma, or a character vector containing one or two ICS chronostratigraphic interval names |
mode |
A character vector of length one specifying the type of data to return: one of occurrence, collection, taxa, specimen, measurement, strata, diversity, opinion or reference |
res |
A character vector of length one specifying the taxonomic resolution of the dataset: one of all, family, genus, species, lump_genus or lump_subgen. The latter two lump multiple occurrences of genera or subgenera within collections into a single representative occurrence |
fields |
A character vector of PBDB vocabulary for additional data fields to download: see https://paleobiodb.org/data1.2/occs/list_doc.html |
ex_taxon |
A character vector of taxon names to exclude from the PBDB search |
area |
If not NULL, then a numeric vector of length four specifying, in order, the min lng, max lng, min lat and max lat of the area from which occurrences will be returned, in decimal degrees (equator = 0 lat, prime meridian = 0 lng). Alternatively, a character vector of regions from which occurrences will be returned: any valid country name or ISO2 code. Continent names and codes are also supported as follows: ATA Antarctica, AFR Africa, ASI Asia, AUS Australia, EUR Europe, IOC Indian Ocean, NOA North America, OCE Oceania, SOA South America |
ex_area |
If not NULL, then a character vector of valid country names or ISO2 codes, as in area, from which occurrences will be excluded from a PBDB search |
invert_area |
If TRUE, then regions specified in area will be excluded from a PBDB search, except for the regions specified in ex_area |
litho |
If not NULL, a character vector of PBDB vocabulary corresponding to which lithologies a PBDB search should return |
invert_litho |
If TRUE, then the lithologies specified in litho will be excluded from a PBDB search |
env |
If not NULL, a character vector of PBDB vocabulary corresponding to which environments a PBDB search should return |
ex_env |
If not NULL, a character vector of PBDB vocabulary corresponding to which environments a PBDB search should exclude |
invert_env |
If TRUE, then environments specified in env will be excluded from a PBDB search, except for the environments specified in ex_env |
pres |
A character vector of length one specifying the preservation mode of the occurrences to return: one of regular, form, ichno, or 'form,ichno' |
idqual |
A character vector of length one specifying the taxonomic certainty of the occurrences to return: one of certain, genus_certain, uncertain, new |
return_url |
If TRUE, the function will return a correctly formatted URL suitable for use with curl or similar API functions, comprising the search parameters set by the user |
return_data |
If TRUE (default), the downloaded csv will automatically be read into R (this must be assigned to an object) |
save_as |
If not NULL, the file name to which the downloaded data will be saved on the disk as a .csv |
tscale |
A character vector of length one determining what chronostratigraphic timescale will be applied to the data. "ICS2013" will retain the PBDB ICS 2013 standard. "GTS2020" will update all early and late interval ages to the GTS2020 standard, using a lookup table supplied with the function. Alternatively, the path to a custom .csv file with columns Interval, FAD and LAD, where Interval contains the names of the early and late intervals in the PBDB, and FAD and LAD are the numeric lower and upper boundaries of those intervals |
wait |
The maximum wait time for the download in milliseconds, as used by curl. By default (wait = Inf) no time limit is imposed |
Either a PBDB API-compatible URL or a PBDB dataset
# download Triassic dinosaurs (wait time set to meet CRAN example requirement)
tdinos <- fossilbrush:::get_pbdb(taxon = "Dinosauria", interval = "Triassic", wait = 499)
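To give a feel for the kind of URL that return_url = TRUE is described as producing, the sketch below assembles a PBDB data service request by hand. The helper build_pbdb_url is hypothetical and not part of fossilbrush. It assumes the base_name, interval and show query parameters of the PBDB data service 1.2 occurrence list endpoint, handles interval names only (not numeric ages), and the exact URL generated by get_pbdb may differ.

```r
# Hypothetical helper for illustration; not a fossilbrush function
build_pbdb_url <- function(taxon, interval,
                           fields = c("ident", "coords", "class")) {
  base <- "https://paleobiodb.org/data1.2/occs/list.csv"
  query <- c(
    base_name = paste(taxon, collapse = ","),     # taxon names
    interval  = paste(interval, collapse = ","),  # one or two interval names
    show      = paste(fields, collapse = ",")     # additional output blocks
  )
  # percent-encode unsafe characters such as spaces (commas are left as-is)
  query <- vapply(query, utils::URLencode, character(1))
  paste0(base, "?", paste(names(query), query, sep = "=", collapse = "&"))
}

url <- build_pbdb_url("Dinosauria", "Triassic")
print(url)
```

A URL of this form can be passed to read.csv or a curl-based downloader, which is broadly the role the return_url option plays.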
lookup table called by 'get_pbdb'
GTS_2020
An object of class data.frame with 1534 rows and 9 columns.
Changelog of periodic updates made to the GTS2020 table originally published in this package. The purpose of this changelog is to allow the user to assess how up-to-date the resource is and make any changes themselves if needed. A data.frame with date-wise rows of edits
GTS_2020_changelog
An object of class data.frame with 1 row and 2 columns.
Function to find the maximum intersection between a set of numeric ranges, in this case the first and last appearance datums of taxonomic ranges.
intersect_ranges(x, srt = NULL, end = NULL, verbose = TRUE)
x |
A numeric data.frame or matrix of ranges. If just two columns are supplied, the first column is assumed to be the srt column |
srt |
If x contains more than two columns, srt is the name of the range base column - the FAD |
end |
If x contains more than two columns, end is the name of the range top column - the LAD |
verbose |
A logical indicating whether the function should report progress to the console |
A matrix with three columns, indicating the intersection (FAD and LAD) and the number of ranges that intersection encompasses
# plot an example
df <- cbind(c(1.5, 3, 2.1, 1), c(6, 5, 3.7, 10.1))
plot(1:11, ylim = c(0, 5), col = NA)
segments(x0 = c(1.5, 3, 2.1, 1), y0 = 1:4, x1 = c(6, 5, 3.7, 10.1), y1 = 1:4)
abline(v = 3, col = "red", lty = 2)
abline(v = 3.7, col = "red", lty = 2)
# intersect function
intersect_ranges(df)
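The idea of a maximum intersection can be reproduced in a few lines of base R. The function max_overlap below is a hypothetical stand-in, not the algorithm used by intersect_ranges: it assumes closed ranges and simply counts, at each range start, how many ranges contain that point.

```r
# Hypothetical helper; a simple closed-interval sweep, not the package code
max_overlap <- function(lo, hi) {
  lower <- pmin(lo, hi)
  upper <- pmax(lo, hi)
  # the deepest overlap always begins at one of the range starts
  counts <- vapply(lower, function(s) sum(lower <= s & upper >= s), integer(1))
  best <- lower[which.max(counts)]
  covering <- lower <= best & upper >= best
  c(from = max(lower[covering]), to = min(upper[covering]), n = sum(covering))
}

# same four ranges as in the plotted example
df <- cbind(c(1.5, 3, 2.1, 1), c(6, 5, 3.7, 10.1))
overlap <- max_overlap(df[, 1], df[, 2])
print(overlap)
```

For the four example ranges, the intersection shared by the most ranges runs from 3 to 3.7, which corresponds to the two dashed red lines drawn in the example plot.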
Function to apply a modification of Pacman trimming to macrofossil data. The function generates a densified occurrence record using the same methods as densify, then trims the upper and lower ends of each range by a user-defined percentage. The full and trimmed ranges are then compared to test whether the FAD and the LAD for a taxon form a long tail in its distribution. Multiple tail thresholds can be supplied, but all test whether the portions of the range extending beyond the trimmed FAD and LAD together constitute the threshold proportion of the total range for that taxon, e.g. do the FAD and the LAD outside of the trimmed range comprise a quarter (tail.flag = 0.25) of the taxon range?
pacmacro_ranges( x, rank = "genus", srt = "max_ma", end = "min_ma", step = 1, density = 0.1, top = 5, bottom = 5, tail.flag = 0.35, method = c("histogram", "kernel") )
x |
A stratigraphic occurrence dataset |
rank |
The column name in x containing the taxon names for which trimmed ranges will be calculated |
srt |
A column name in x denoting the occurrence FADs |
end |
A column name in x denoting the occurrence LADs |
step |
A positive integer specifying the time window size (i.e. the duration represented by each row in the output matrix) |
density |
A positive numeric specifying the step size for densifying records. This should ideally be smaller than step |
top |
The percentage by which the top of the range will be trimmed |
bottom |
The percentage by which the bottom of the range will be trimmed |
tail.flag |
A numeric vector of proportions in the range 0 < x < 1 which will be used to test for long tails |
method |
The method for quantifying occurrence density. By default both histogram and kernel density will be used |
If the user specifies a specific method (e.g. method = "kernel"), the returned value will be a data.frame containing the taxa as row names, the original taxon ranges (FAD, LAD), their ranges as trimmed by the specified value (default FAD95, LAD95), and the tail status (0 = none, 1 = tail) at the user-specified tail proportions. If method is not specified, the result will be a list of 2 data.frames, one for each method
Pacman procedure modified from https://rdrr.io/github/plannapus/CONOP9companion/src/R/pacman.R.
Lazarus et al (2012) Paleobiology
# load dataset
data("brachios")
# subsample brachios to make for a short example runtime
set.seed(1)
brachios <- brachios[sample(1:nrow(brachios), 1000),]
# run pacmacro
pacm <- pacmacro_ranges(brachios, tail.flag = c(0.3, 0.35, 0.4),
                        rank = "genus", srt = "max_ma", end = "min_ma")
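The tail test can be sketched for a single taxon as follows. This is a simplified illustration rather than the package's procedure: it skips densification, trims with plain quantiles of the occurrence ages, and treats a tail as present when the excess proportion is greater than or equal to the threshold (the choice of >= is an assumption). The function tail_flags and both toy age vectors are hypothetical.

```r
# Hypothetical helper: quantile-based trimming on raw ages (in Ma)
tail_flags <- function(ages, top = 5, bottom = 5, tail.flag = 0.35) {
  fad <- max(ages)  # oldest occurrence
  lad <- min(ages)  # youngest occurrence
  if (fad == lad) return(setNames(rep(0L, length(tail.flag)), tail.flag))
  lad_trim <- quantile(ages, top / 100, names = FALSE)         # trimmed top
  fad_trim <- quantile(ages, 1 - bottom / 100, names = FALSE)  # trimmed bottom
  # portion of the full range lying outside the trimmed range
  excess <- (fad - fad_trim) + (lad_trim - lad)
  setNames(as.integer(excess / (fad - lad) >= tail.flag), tail.flag)
}

clustered    <- 410:391           # evenly spread occurrences
with_outlier <- c(410:391, 300)   # same, plus one much younger occurrence

tail_flags(clustered, tail.flag = c(0.3, 0.35, 0.4))
tail_flags(with_outlier, tail.flag = c(0.3, 0.35, 0.4))
```

The evenly spread taxon is not flagged at any threshold, whereas the single young outlier stretches the range enough to be flagged at all three.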
lookup table called by 'get_pbdb'
pbdb_fields
An object of class list of length 31.
lookup table called by 'get_pbdb'
pbdb_kingdoms
An object of class list of length 3.
Function to plot density profiles of occurrences through time using the output of densify.
plot_dprofile(x, taxon, exit = TRUE)
x |
The list output of densify |
taxon |
A character vector of length one, specifying one of the taxon names in x to be plotted |
exit |
Restore base plotting parameters on function exit (default as a requirement for CRAN). Can be set to FALSE to allow other elements to be added to a plot |
NULL, the plotted density profile
# load dataset
data("brachios")
# subsample brachios to make for a short example runtime
set.seed(1)
brachios <- brachios[sample(1:nrow(brachios), 1000),]
# densify ranges
dens <- densify(brachios)
plot_dprofile(dens, "Atrypa")
Function to plot the parent or child relationships of an element in a hierarchically organised dataframe. Multiple taxa can be plotted simultaneously
plot_taxa( x, taxon, trank, ranks, mode = c("parent", "child", "all"), step = NULL )
x |
a dataframe containing hierarchically organised data in columns |
taxon |
A character vector of element names whose relationships will be plotted (these must be of the same rank) |
trank |
A character vector of length one corresponding to the column name in x in which taxon is located |
ranks |
A character vector corresponding to the column names in x, given in hierarchical order |
mode |
The direction of the relationships to be plotted |
step |
A positive integer specifying the neighbourhood of the relationships to plot. Specifying a number greater than the number of ranks will not cause a failure, and will instead plot all relationships in the direction specified in mode |
A plot of the relationships of the specified elements
# load dataset
data("brachios")
# define ranks in dataset
b_ranks <- c("phylum", "class", "order", "family", "genus")
# plot taxon
plot_taxa(brachios, "Atrypa", trank = "genus", ranks = b_ranks, mode = "parent")
Static rip of the quantile.coef.density function and relevant internals from the BMS package as the package is archived.
quantile_coef_density_BMS( x, probs = seq(0.25, 0.75, 0.25), names = TRUE, normalize = TRUE, ... )
x |
An object of class pred.density, coef.density, density, or a list of densities |
probs |
numeric vector of probabilities with values in range 0 - 1. Elements very close to the boundaries return Inf or -Inf |
names |
logical; if TRUE, the result has a names attribute, resp. a rownames and colnames attributes. Set to FALSE for speedup with many probs |
normalize |
logical; if TRUE, the values in x$y are multiplied by a factor such that their integral is equal to one |
... |
further arguments passed to or from other methods. |
If x is of class density (or a list with exactly one element), a vector with quantiles. If x is a list of densities with more than one element (e.g. as resulting from pred.density or coef.density), then the output is a matrix of quantiles, with each matrix row corresponding to the respective density.
static rip from BMS package
Function for identifying and resolving alternative higher assignments in a hierarchically structured dataframe. Columns are checked from the lowest to the highest rank for elements with multiple higher assignments. These assignments are then assessed topologically to determine if they represent inadvertent use of the same name at a given rank for genuinely different entities, or whether the higher classifications are conflicting. In the case of the former, unique character suffixes are applied to each differently classified case (up to 26 currently supported), effectively splitting up the alternatively classified element. In the case of the latter, the alternative classifications are assessed and are either combined, or the more frequently used or the more complete classification scheme is taken (the more frequent pathway can also be the most complete).
resolve_duplicates(x, ranks = NULL, jump = 4, plot = FALSE, verbose = TRUE)
x |
A dataframe containing hierarchically structured information, for example a table of genus names and their higher taxonomic classifications |
ranks |
If not NULL, a vector of column names of x, given in rank order. This is useful if x contains columns which are not rank relevant or if columns are not in hierarchical order. If not supplied, the column order in x is used directly and is assumed to be in rank order |
jump |
The maximum number of levels between the point of divergence and the point of reunion (if present) for a given path, below which the divergence will be taken as conflicting |
plot |
A logical specifying if the divergent paths should be plotted |
verbose |
A logical of length one which determines if the function should report the detection and resolution of elements with multiple higher classifications (if any) |
The dataframe x, with any alternative higher classifications resolved, giving the classification a strict tree structure
# load dataset
data("brachios")
# define ranks
b_ranks <- c("phylum", "class", "order", "family", "genus")
# run function
res <- resolve_duplicates(brachios, ranks = b_ranks)
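The first step described for resolve_duplicates, finding elements with more than one higher assignment, can be sketched in base R. The toy table and the majority-rule resolution below are illustrative assumptions only: the package additionally assesses the topology of the divergent paths and may combine classifications or split names with suffixes, none of which is shown here.

```r
# Hypothetical toy classification with one conflicting family assignment
taxa <- data.frame(
  family = c("Atrypidae", "Atrypidae", "Spiriferidae", "Productidae"),
  genus  = c("Atrypa", "Atrypa", "Atrypa", "Productus"),
  stringsAsFactors = FALSE
)

# number of distinct parents recorded for each genus
n_parents <- tapply(taxa$family, taxa$genus, function(f) length(unique(f)))
conflicted <- names(n_parents)[n_parents > 1]

# one possible resolution: keep the most frequently used parent
majority <- vapply(conflicted, function(g) {
  names(which.max(table(taxa$family[taxa$genus == g])))
}, character(1))

print(conflicted)
print(majority)
```

Here Atrypa is detected as having two alternative families, and the more frequently used assignment is retained.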
Function to generate a consensus age for assemblages of fossil data in x, given a table of taxonomic ranges. The need for error-checking is informed by the error codes for the individual fossil occurrences within each collection: if there is no error, then the consensus age is unchanged. If errors are present, then a consensus age for a threshold proportion of taxa is searched for using the overlap of the ranges for those taxa, as given in range table y. Taxa whose occurrences lie outside this consensus age are flagged as potential taxonomic errors. If the threshold consensus partially overlaps with the assemblage age, this overlap is returned to prevent overzealous alteration of the age; otherwise the complete consensus age is returned. If a consensus age cannot be found, the original assemblage age is returned, and each occurrence in the collection is flagged as a potential taxonomic error.
revise_ranges( x, y, assemblage = "collection_no", srt = "max_ma", end = "min_ma", taxon = "genus", err = NULL, do.flag = FALSE, prop = 0.75, allow.zero = TRUE, verbose = TRUE )
x |
Fossil occurrence data grouped into spatiotemporally distinct assemblages |
y |
A stratigraphic range dataset from which consensus assemblage ages will be derived |
assemblage |
The column name of the assemblage groups in x |
srt |
The column name of stratigraphic bases for each element in both x and y - i.e. x and y must have this same name for that column |
end |
The column name of stratigraphic tops for each element in both x and y - i.e. x and y must have this same name for that column |
taxon |
The column name denoting the taxon names in both x and y - i.e. x and y must have this same name for that column |
err |
The column name flagging age errors for occurrences in x. This allows 100% valid assemblages to be skipped. Age errors can be derived using flag_ranges. All error codes must be one of: "000" - unchecked, "R1R" - valid, "0R0" - both FAD and LAD exceeded, "00R" - totally older than range, "R00" - totally younger than range, "01R" - FAD exceeded, "1R0" - LAD exceeded. If not supplied, all assemblages will be checked, even if they are already valid a priori. |
do.flag |
Rather than supplying error codes, should flag_ranges be called internally to generate error codes for supply to the rest of revise_ranges? As with err, this is useful to prefilter individual occurrences, allowing assemblages containing all valid, all unchecked or a mixture of such error codes to be skipped. This can massively speed up processing time for large datasets. |
prop |
A numeric, between 0 and 1, denoting the threshold proportion of taxa in the assemblage for which a consensus age must be found |
allow.zero |
A logical determining if, in the case of a collection LAD being equal to the consensus age FAD (i.e. a pointwise overlap), that pointwise age will be taken as the revised age. The resultant collection age will have no uncertainty as a result, which may be unrealistic. If FALSE, pointwise overlaps will be ignored and the revised age taken instead (note that the usage above gives TRUE as the default) |
verbose |
A logical determining if the progress of the redating procedure should be reported |
A list of two dataframes, the first recording the results of the consensus redating procedure for each assemblage in x, the second recording any flags (if any) for each occurrence in x
# load datasets
data("brachios")
data("sepkoski")
# subsample brachios to make for a short example runtime
set.seed(1)
brachios <- brachios[sample(1:nrow(brachios), 1000),]
# rename columns in Sepkoski to match brachios
colnames(sepkoski)[4:6] <- c("genus", "max_ma", "min_ma")
# flag and resolve against the Sepkoski Compendium, collection-wise
revrng <- revise_ranges(x = brachios, y = sepkoski, do.flag = TRUE, verbose = TRUE,
                        taxon = "genus", assemblage = "collection_no",
                        srt = "max_ma", end = "min_ma")
# append the revised occurrence ages and error codes to the dataset
brachios$newfad <- revrng$occurrence$FAD
brachios$newlad <- revrng$occurrence$LAD
brachios$errcode <- revrng$occurrence$status
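The consensus search for a single assemblage can be sketched in base R under some simplifying assumptions: ages are in Ma with FAD >= LAD, ranges are treated as closed, and the overlap with the original assemblage age and the allow.zero rule are left out. The function consensus_age and the toy range table are hypothetical and do not reproduce the package's internals.

```r
# Hypothetical helper: consensus age shared by at least `prop` of the taxa
consensus_age <- function(fad, lad, prop = 0.75) {
  # the deepest overlap always starts at one of the FADs, so test each in turn
  n_cover <- vapply(fad, function(a) sum(fad >= a & lad <= a), integer(1))
  best <- fad[which.max(n_cover)]
  inside <- fad >= best & lad <= best
  if (sum(inside) / length(fad) < prop) return(NULL)  # no consensus found
  list(age = c(FAD = min(fad[inside]), LAD = max(lad[inside])),
       outside = which(!inside))
}

# toy range table for the four taxa of one assemblage
ranges <- data.frame(genus  = c("A", "B", "C", "D"),
                     max_ma = c(450, 420, 430, 300),
                     min_ma = c(380, 360, 390, 250),
                     stringsAsFactors = FALSE)

res <- consensus_age(ranges$max_ma, ranges$min_ma, prop = 0.75)
print(res$age)
print(ranges$genus[res$outside])  # taxa falling outside the consensus
```

Three of the four ranges share the interval from 420 to 390 Ma, which meets the 0.75 threshold, so the fourth taxon would be the one flagged. Raising prop above 0.75 makes the function return NULL, the case in which the original assemblage age would be kept.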
Lookup table of chronostratigraphic stage abbreviations used in the Sepkoski Compendium, with interval boundaries updated to the GTS2020 standard
data(sep_code)
An object of class data.frame with 306 rows and 8 columns.
chronosphere (fetch), Sepkoski 2002
An example dataset. A port of the Sepkoski Compendium from the chronosphere package, with a few corrections and GTS2020 dating applied
sepkoski
An object of class data.frame with 35700 rows and 6 columns.
Function for checking for potential synonyms with alternate spellings. Synonyms are checked for within groups using a Jaro-Winkler string distance matrix. Potential synonyms are selected using the jw threshold. These can then be further filtered by the number of shared letters at the beginning and end of a synonym pair, and by prefixes or suffixes which may give erroneously high similarities.
spell_check( x, terms = NULL, groups = NULL, jw = 0.1, str = 1, str2 = NULL, alternative = "jaccard", q = 1, pref = NULL, suff = NULL, exclude = NULL, verbose = TRUE )
x |
a dataframe containing a column with terms, and a further column denoting the groups within which terms will be checked against one another. If supplying a dataframe with just these columns, terms should be column 1 |
terms |
a character vector of length 1, specifying the terms column in x. This is required if x contains more than two columns. Alternatively, if x is not provided, terms can be a character vector. If groups are not specified, all elements of terms will be treated as part of the same group |
groups |
a character vector of length 1, specifying the groups column in x. This is required if x contains more than two columns. Alternatively, if terms is supplied as a character vector, groups can also be supplied in the same way to denote their groups |
jw |
a numeric greater than 0 and less than 1. This is the distance threshold below which potential synonyms will be considered |
str |
A positive integer specifying the number of matching characters at the beginning of synonym pairs. By default 1, i.e. the first letters must match |
str2 |
If not NULL, a positive integer specifying the number of matching characters at the end of synonym pairs |
alternative |
A character string of length one corresponding to one of the methods used by afind. One of "osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "running_cosine", "jaccard", or "soundex". |
q |
q-gram size. Only used when alternative is "qgram", "cosine" or "jaccard". |
pref |
If not NULL, a character vector of prefixes which may result in erroneously low JW distances. Synonyms will only be considered if both terms share the same prefix |
suff |
If not NULL, a character vector of suffixes which may result in erroneously low JW distances. Synonyms will only be considered if both terms share the same suffix |
exclude |
If not NULL, a character vector of group names which should be skipped - useful for groups which are known to contain potentially similar terms |
verbose |
A logical determining if function progress should be reported using the pbapply progress bar |
a dataframe of synonyms (cols 1 and 2), the group in which they occur, the frequencies of each synonym in the dataset and finally the q-gram difference between the synonyms
# load dataset
data("brachios")
# define suffixes
b_suff <- c("ina", "ella", "etta")
# run function
spl <- spell_check(brachios, terms = "genus", groups = "family", suff = b_suff)
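The within-group comparison can be illustrated without extra packages by swapping the Jaro-Winkler distance for a Levenshtein distance from base R's adist, scaled by the longer name. This substitution is purely for illustration and will not give the same values as spell_check. The function find_candidates, its max_dist argument and the toy names are hypothetical.

```r
# Hypothetical helper: scaled Levenshtein distance as a stand-in for JW
find_candidates <- function(terms, groups, max_dist = 0.2, str = 1) {
  by_group <- split(terms, groups)
  hits <- lapply(names(by_group), function(gname) {
    g <- unique(by_group[[gname]])
    if (length(g) < 2) return(NULL)          # nothing to compare
    pairs <- t(utils::combn(g, 2))           # all unordered pairs in the group
    d <- mapply(function(a, b) {
      utils::adist(a, b)[1, 1] / max(nchar(a), nchar(b))
    }, pairs[, 1], pairs[, 2])
    # require the first `str` characters to match, as for the str argument
    same_start <- substr(pairs[, 1], 1, str) == substr(pairs[, 2], 1, str)
    keep <- unname(d < max_dist & same_start)
    data.frame(group = rep(gname, sum(keep)),
               term1 = pairs[keep, 1], term2 = pairs[keep, 2],
               dist = unname(d[keep]), stringsAsFactors = FALSE)
  })
  do.call(rbind, hits)
}

terms  <- c("Atrypa", "Atripa", "Spirifer", "Spinifer", "Productus")
groups <- c("Atrypidae", "Atrypidae", "Spiriferidae", "Spiriferidae",
            "Productidae")
candidates <- find_candidates(terms, groups)
print(candidates)
```

Each group is searched separately, so similar names in different families are never compared, which is the role the groups argument plays above.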
Function to create a tgraph representation of a hierarchically organised dataframe. This is the focal object of the t* functions - the complete set of hierarchical relationships between a set of elements
tgraph(x, ranks = NULL, verbose = TRUE)
x |
A dataframe containing a set of hierarchical relationships. The leftmost column contains the elements which will form the highest rank, followed rightwards by successive ranks |
ranks |
If not NULL, a vector of column names of x, given in rank order. This is useful if x contains columns which are not rank relevant or if columns are not in hierarchical order. If not supplied, the column order in x is used directly and is assumed to be in rank order |
verbose |
A logical indicating whether the progress of tgraph construction should be reported to the console |
a tgraph object
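As a rough illustration of what a complete set of hierarchical relationships means, the sketch below derives a parent-child edge list from a ranked data frame in base R. It makes no claim about the internal structure of a tgraph object. The toy table, the ranks vector and the edges data frame are all hypothetical.

```r
# Hypothetical toy classification, highest rank in the leftmost column
taxa <- data.frame(
  order  = c("Atrypida", "Atrypida", "Spiriferida"),
  family = c("Atrypidae", "Atrypidae", "Spiriferidae"),
  genus  = c("Atrypa", "Spinatrypa", "Spirifer"),
  stringsAsFactors = FALSE
)
ranks <- c("order", "family", "genus")

# one set of unique parent-child pairs per adjacent pair of ranks
edges <- do.call(rbind, lapply(seq_len(length(ranks) - 1), function(i) {
  unique(data.frame(parent = taxa[[ranks[i]]],
                    child = taxa[[ranks[i + 1]]],
                    parent_rank = ranks[i],
                    stringsAsFactors = FALSE))
}))
rownames(edges) <- NULL
print(edges)
```

An element that appears as a child of more than one parent in such a list is exactly the kind of case that resolve_duplicates is described as handling.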
Function to detect if two peaks in a density spectrum can be considered separate based on a user supplied threshold. Creates a sequence of divisions from the troughs immediately preceding any significant peaks, then bins occurrences for a given taxon name by those divisions.
threshold_peaks( x, y, ycols = c("genus", "max_ma", "min_ma"), thresh = 15, verbose = TRUE )
x |
A list of significant peaks as returned by |
y |
An occurrence dataset with taxon names corresponding to the list names of x |
ycols |
A character vector denoting, in order, the taxon, FAD and LAD columns in y |
thresh |
The threshold distance between peaks above which they will be considered distinct - given in Ma |
verbose |
A logical determining if function progress should be reported |
Function to detect if two peaks in a density spectrum can be considered separate based on a user supplied threshold. Creates a sequence of divisions from the troughs immediately preceding any significant peaks, then bins occurrences for a given taxon name by those divisions.
threshold_ranges( x, rank = "genus", srt = "max_ma", end = "min_ma", method = "kernel", step = 1, density = 0.1, use_sd = TRUE, win = 5, thresh = 5, ..., report = TRUE, verbose = TRUE )
x |
An occurrence dataset containing taxon names, maximum ages and minimum ages |
rank |
The column name in x containing the taxon names |
srt |
A column name in x denoting the occurrence maximum ages |
end |
A column name in x denoting the occurrence minimum ages |
method |
The method for quantifying occurrence density: one of histogram or kernel. Kernel is the recommended default. As called by densify |
step |
A positive integer specifying the time window size for density calculation. As called by densify |
density |
A positive numeric specifying the step size for densifying records. This should ideally be smaller than step. As called by densify |
use_sd |
A logical determining whether to use peaks detected as significant using the mean + standard deviation of their neighbourhood. If FALSE, then the peaks need only be greater than the neighbourhood mean to be significant. Thus, use_sd = TRUE is more conservative, but less prone to noise. As called by find_peaks |
win |
A positive integer specifying the neighbourhood window length on either side of a peak during significance testing (i.e. win = 5 will give a total window of 11: -5 indices + peak index + 5 indices). As called by find_peaks |
thresh |
The threshold distance between peaks above which they will be considered distinct - given in Ma |
... |
additional arguments passed to density |
report |
A logical determining if the analytical outputs of the function should be returned to the user, as well as the revised taxon names. TRUE by default |
verbose |
A logical determining if function progress should be reported |
If report = TRUE (the default), a list of five elements. $data gives the thresholded (and potentially subdivided) taxon names. $matrix is the taxon-wise matrix of occurrence densities. $peaks is a list containing three lists of peaks (all peaks, significant by mean + sd, significant by sd only) for each taxon and a dataframe of peak counts between the three treatments. $comparison
# load dataset
data("brachios")
# subsample brachios to make for a short example runtime
set.seed(1)
brachios <- brachios[sample(1:nrow(brachios), 1000),]
# interpeak thresholding
itp <- threshold_ranges(brachios, win = 8, thresh = 10,
                        rank = "genus", srt = "max_ma", end = "min_ma")
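The roles of win, use_sd and thresh can be made concrete with a small base R sketch on a made-up density profile. The functions find_sig_peaks and count_groups are hypothetical simplifications, not the package's find_peaks or thresholding code: only strict local maxima are detected, the neighbourhood window includes the peak itself, and the profile is assumed to be sampled at step-sized intervals.

```r
# Hypothetical helper: local maxima exceeding mean (+ sd) of their window
find_sig_peaks <- function(y, win = 5, use_sd = TRUE) {
  n <- length(y)
  # strict local maxima: a rise immediately followed by a fall
  # (flat-topped peaks are ignored in this simplified version)
  idx <- which(diff(sign(diff(y))) == -2) + 1
  keep <- vapply(idx, function(i) {
    nb <- y[max(1, i - win):min(n, i + win)]  # window includes the peak
    cutoff <- mean(nb) + (if (use_sd) sd(nb) else 0)
    y[i] > cutoff
  }, logical(1))
  idx[keep]
}

# Hypothetical helper: peaks further apart than thresh start a new group
count_groups <- function(peaks, thresh = 5, step = 1) {
  if (length(peaks) == 0) return(0L)
  1L + sum(diff(peaks) * step > thresh)
}

dens_toy <- c(0, 1, 5, 1, 0, 0, 0, 1, 4, 1, 0)  # made-up density profile
peaks <- find_sig_peaks(dens_toy, win = 2)
print(peaks)
count_groups(peaks, thresh = 5)   # peaks treated as distinct
count_groups(peaks, thresh = 10)  # peaks treated as one group
```

With a 1 Ma step the two toy peaks lie 6 Ma apart, so they are separated at thresh = 5 but merged at thresh = 10, which is the trade-off the thresh argument controls.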
Function to update the structure of a graph, given a set of modifications as returned by assess_duplicates
update_graph(x, del = NULL, add = NULL, changes = NULL)
x |
a tgraph object to modify |
del |
A vector of element names or numbers to delete |
add |
An edge sequence of edges to add to the graph |
changes |
Alternatively, the output of assess_duplicates, containing proposed deletions and additions |
An updated tgraph object