Evaluate a Distribution • distionary

library(distionary)

This vignette covers the second goal of distionary: to evaluate properties of probability distributions, even when a property is not specified in the distribution’s definition.

Distributional Representations

A distributional representation is a function that fully describes the distribution, such that any property can be calculated from it. Here is a list of representations recognised by distionary, and the functions for accessing them.

Representation	`distionary` Functions
Cumulative Distribution Function	`eval_cdf()`, `enframe_cdf()`
Survival Function	`eval_survival()`, `enframe_survival()`
Quantile Function	`eval_quantile()`, `enframe_quantile()`
Hazard Function	`eval_hazard()`, `enframe_hazard()`
Cumulative Hazard Function	`eval_chf()`, `enframe_chf()`
Probability Density Function (PDF)	`eval_density()`, `enframe_density()`
Probability Mass Function (PMF)	`eval_pmf()`, `enframe_pmf()`
Odds Function	`eval_odds()`, `enframe_odds()`
Return Level Function	`eval_return()`, `enframe_return()`
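To illustrate why a single representation is enough to recover the others, here is a minimal sketch using base R's stats functions rather than distionary itself. The exponential distribution and the rate of 2 are arbitrary choices for illustration, and the identities shown are the standard ones for continuous distributions.

```r
# Base R sketch (stats package only); the rate of 2 is an arbitrary choice.
rate <- 2
x <- c(0.1, 0.5, 1, 2)

cdf      <- pexp(x, rate = rate)   # P(X <= x)
survival <- 1 - cdf                # P(X > x)
density  <- dexp(x, rate = rate)   # derivative of the CDF
hazard   <- density / survival     # constant (equal to rate) for the exponential
chf      <- -log(survival)         # cumulative hazard, here rate * x

# The quantile function inverts the CDF, recovering the original inputs.
recovered_x <- qexp(cdf, rate = rate)
```

Starting from the CDF alone, the survival function, cumulative hazard, and quantile function all follow by simple transformations.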

Every representation can be accessed through the eval_*() family of functions, which return a vector of the evaluated representation.

d1 <- dst_geom(0.6)
eval_pmf(d1, at = 0:5)
#> [1] 0.600000 0.240000 0.096000 0.038400 0.015360 0.006144
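As a cross-check, the same values can be reproduced without distionary. The sketch below assumes the geometric distribution counts the number of failures before the first success, which is the parameterisation used by dgeom() in the stats package and is consistent with the output above.

```r
# Geometric PMF: P(X = k) = p * (1 - p)^k for k = 0, 1, 2, ...
p <- 0.6
k <- 0:5

pmf_closed_form <- p * (1 - p)^k      # closed-form expression
pmf_stats       <- dgeom(k, prob = p) # base R equivalent

pmf_closed_form
#> [1] 0.600000 0.240000 0.096000 0.038400 0.015360 0.006144
```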

Alternatively, the enframe_*() family of functions provides the results in a tibble or data frame paired with the inputs, useful in a data wrangling workflow.

enframe_pmf(d1, at = 0:5)
#> # A tibble: 6 × 2
#>    .arg     pmf
#>   <int>   <dbl>
#> 1     0 0.6    
#> 2     1 0.24   
#> 3     2 0.096  
#> 4     3 0.0384 
#> 5     4 0.0154 
#> 6     5 0.00614

The enframe_*() functions accept multiple distributions and return one column per distribution. The column names can be changed in three ways:

The input column .arg can be renamed with the arg_name argument.
The pmf prefix on the evaluation columns can be changed with the fn_prefix argument.
The distribution names can be changed by assigning name-value pairs for the input distributions.

Let’s practice this with the addition of a second distribution.

d2 <- dst_geom(0.4)
enframe_pmf(
  model1 = d1, model2 = d2, at = 0:5,
  arg_name = "num_failures", fn_prefix = "probability"
)
#> # A tibble: 6 × 3
#>   num_failures probability_model1 probability_model2
#>          <int>              <dbl>              <dbl>
#> 1            0            0.6                 0.4   
#> 2            1            0.24                0.24  
#> 3            2            0.096               0.144 
#> 4            3            0.0384              0.0864
#> 5            4            0.0154              0.0518
#> 6            5            0.00614             0.0311

Drawing a random sample

To draw a random sample from a distribution, use the realise() or realize() function:

set.seed(42)
realise(d1, n = 5)
#> [1] 0 0 0 0 0

You can read this call as “realise distribution d1 five times”. By default, n is set to 1, so that realising converts a distribution to a single numeric draw:

realise(d1)
#> [1] 0

While random sampling falls into the same family as the p*/d*/q*/r* functions from the stats package (e.g., rnorm()), it is not a distributional representation, and so has no eval_*() or enframe_*() counterpart. This is because it’s impossible to perfectly describe a distribution based on a sample.
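The gap between a sample and the distribution it came from can be seen with a short base R sketch. The seed and the sample size of 10,000 are arbitrary choices for illustration, and rgeom() stands in for realise() here.

```r
# Compare empirical frequencies from a sample against the exact PMF.
set.seed(1)
p <- 0.6
draws <- rgeom(10000, prob = p)

# Proportion of draws equal to 0, 1, ..., 5.
empirical <- as.numeric(table(factor(draws, levels = 0:5))) / length(draws)
exact     <- dgeom(0:5, prob = p)

# Largest discrepancy between the sample and the true probabilities.
max_gap <- max(abs(empirical - exact))
```

Even with a large sample, the empirical frequencies only approximate the exact probabilities, so the sample cannot stand in for a representation such as the PMF.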

Properties of Distributions

distionary refers to a distribution property as any value that can be calculated from a distribution, such as the mean and variance. Whereas a distributional representation must fully define a distribution, a property need not.

Below is a table of the properties incorporated in distionary, and the corresponding functions for accessing them.

Property	`distionary` Function
Mean	`mean()`
Median	`median()`
Variance	`variance()`
Standard Deviation	`sd()`
Skewness	`skewness()`
Excess Kurtosis	`kurtosis_exc()`
Kurtosis	`kurtosis()`

Here’s the mean and variance of our original distribution.

mean(d1)
#> [1] 0.6666667
variance(d1)
#> [1] 1.111111
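These values agree with the textbook formulas for a geometric distribution that counts failures before the first success: mean (1 - p) / p and variance (1 - p) / p^2. The base R sketch below also recomputes both directly from the PMF; truncating the sum at 1000 is an assumption that is harmless here because the tail probability is negligible.

```r
p <- 0.6

# Closed-form mean and variance.
mean_formula <- (1 - p) / p
var_formula  <- (1 - p) / p^2

# The same quantities computed from the PMF by (truncated) summation.
k   <- 0:1000
pmf <- dgeom(k, prob = p)
mean_sum <- sum(k * pmf)
var_sum  <- sum((k - mean_sum)^2 * pmf)

c(mean_formula, var_formula)
#> [1] 0.6666667 1.1111111
```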

Some properties are easy to make yourself. Here is an example of a function that calculates the interquartile range.

# Make a function that takes a distribution as input, and returns the
# interquartile range.
iqr <- function(distribution) {
  diff(eval_quantile(distribution, at = c(0.25, 0.75)))
}

Apply the function.

iqr(d2)
#> [1] 2
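The same idea extends to any pair of quantile levels. The sketch below is written against base R quantile functions so it runs without distionary; quantile_range() is a hypothetical helper, not part of either package.

```r
# Hypothetical helper: the distance between two quantiles of a distribution.
# `qfun` is any vectorised quantile function, e.g. qgeom or qnorm; extra
# arguments in `...` are passed on as distribution parameters.
quantile_range <- function(qfun, lower = 0.25, upper = 0.75, ...) {
  diff(qfun(c(lower, upper), ...))
}

# Interquartile range of a geometric distribution with success probability 0.4.
quantile_range(qgeom, prob = 0.4)
#> [1] 2
```

This matches the value returned by iqr(d2), since the 0.25 and 0.75 quantiles of that distribution are 0 and 2.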

For properties that are not handled by distionary (e.g., extreme value index, or moment generating function), one option is to build these properties into your own distribution. A future version of distionary will make user-defined properties easier to work with.