Skip to contents

This vignette covers the first goal of distionary: to provide a framework for creating a distribution, either one of the built-in distributions, or your own.

Built-In Distribution Families

All distribution families found in the stats package are made available with distionary, along with a few others. All families are shown in the following table.

Distribution distionary Function Has counterpart in stats
Bernoulli dst_bern() Yes
Beta dst_beta() Yes
Binomial dst_binom() Yes
Cauchy dst_cauchy() Yes
Chi Squared dst_chisq() Yes
Degenerate dst_degenerate() No
Exponential dst_exp() Yes
F dst_f() Yes
Gamma dst_gamma() Yes
Geometric dst_geom() Yes
Generalised Extreme Value (GEV) dst_gev() No
Generalised Pareto (GPD) dst_gpd() No
Hypergeometric dst_hyper() Yes
Log Normal dst_lnorm() Yes
Log Pearson Type III dst_lp3() No
Negative Binomial dst_nbinom() Yes
Normal dst_norm() Yes
Pearson Type III dst_pearson3() No
Poisson dst_pois() Yes
Student t dst_t() Yes
Uniform dst_unif() Yes
Weibull dst_weibull() Yes

In addition, there is a special “Null” distribution object akin to a missing or unknown distribution. This is useful, for example, if an algorithm fails to return a distribution: instead of throwing an error, a Null distribution can be returned.

# Make a Null distribution.
null <- dst_null()
# Inspect
null
#> Null distribution (NA)

A Null distribution always evaluates to NA:

mean(null)
#> [1] NA
eval_pmf(null, at = 1:4)
#> [1] NA NA NA NA

Distributions with a counterpart in the stats package use the same functions from the package and the same parameter names. For instance, take a look at the source code defining the Normal distribution to see that the representations wrap the stats::*norm() family of functions.

dst_norm
#> function (mean, sd) 
#> {
#>     checkmate::assert_numeric(mean, len = 1)
#>     checkmate::assert_numeric(sd, 0, len = 1)
#>     if (is.na(mean) || is.na(sd)) {
#>         return(dst_null())
#>     }
#>     distribution(.parameters = list(mean = mean, sd = sd), density = function(x) stats::dnorm(x, 
#>         mean = mean, sd = sd), cdf = function(x) stats::pnorm(x, 
#>         mean = mean, sd = sd), quantile = function(p) stats::qnorm(p, 
#>         mean = mean, sd = sd), realise = function(n) stats::rnorm(n, 
#>         mean = mean, sd = sd), survival = function(x) stats::pnorm(x, 
#>         mean = mean, sd = sd, lower.tail = FALSE), mean = mean, 
#>         median = mean, variance = sd^2, stdev = sd, skewness = 0, 
#>         kurtosis_exc = 0, range = c(-Inf, Inf), .name = "Normal", 
#>         .vtype = "continuous")
#> }
#> <bytecode: 0x5555b6a0c7b8>
#> <environment: namespace:distionary>

User-Defined Distributions

You can make your own distribution using the distribution() function. Provide it with name-value pairs that will be stored with the object in a list. Some names have a special interpretation in distionary in that they may be leveraged to compute other properties that aren’t specified in distribution(). These special names are:

  • Any distributional representation * invoked by eval_*() (e.g., quantile for eval_quantile()).
  • Any property (e.g., mean for mean())

See the [Evaluate a Distribution] vignette for more details on these evaluation functions.

For this version of distionary, it is mandatory to specify at least cdf in order for non-specified properties to be evaluated. For continuous or discrete distributions, density or pmf may also be needed.

Optionally, you can also specify an entry for .parameters, which should be a named list of parameters that define the distribution. These parameters are never used for distribution calculations, but are sometimes useful to keep track of for analysis. It is intended that a future version of distionary will refer to these parameters for its calculations, but will not break existing code.

Here is an example distribution.

# Make a continuous distribution
linear <- distribution(
  density = function(x) {
    d <- 2 * (1 - x)
    d[x < 0 | x > 1] <- 0
    d
  },
  cdf = function(x) {
    p <- 2 * x * (1 - x / 2)
    p[x < 0] <- 0
    p[x > 1] <- 1
    p
  },
  g = 9.81,
  another_representation = function(x) x^2,
  .vtype = "continuous",
  .name = "My Linear",
  .parameters = list(a = 1)
)
# Inspect
linear
#> My Linear distribution (continuous)
#> --Parameters--
#> a 
#> 1

The usual evaluation framework can now be accessed. For example, the CDF and mean can be evaluated, even though the mean has not been specified.

eval_cdf(linear, at = c(0.2, 0.5, 0.7))
#> [1] 0.36 0.75 0.91
mean(linear)
#> [1] 0.3333333

These quantities can be invoked by the more general function eval_property().

eval_property(linear, "cdf", c(0.2, 0.5, 0.7))
#> [1] 0.36 0.75 0.91
eval_property(linear, "mean")
#> [1] 0.3333333

eval_property() is a useful function for accessing quantities that are not built-in to distionary, like g and another_representation in this example.

eval_property(linear, "another_representation", 1:4)
#> [1]  1  4  9 16
eval_property(linear, "g")
#> [1] 9.81