library(distionary)
One purpose of distplyr is to handle the menial distribution-related calculations for you. Just specify a distribution once, and there is no need to manage its components anymore.
Example: suppose you want to compute the variance of a Uniform(-1, 1) distribution, get its 0.25- and 0.75-quantiles, and generate a sample of size 10.
Without distplyr:
a <- -1
b <- 1
# Look up formula for variance:
(b - a) ^ 2 / 12
#> [1] 0.3333333
# Get quantiles:
qunif(c(0.25, 0.75), min = a, max = b)
#> [1] -0.5 0.5
# Get sample of size 10:
runif(10, min = a, max = b)
#> [1] -0.48651330 0.08763024 -0.49995474 0.70864176 -0.39992650 0.44163056
#> [7] -0.22744847 -0.78707611 -0.23053899 -0.94330538
With distplyr:
d <- dst_unif(-1, 1)
variance(d)
#> [1] 0.3333333
eval_quantile(d, at = c(0.25, 0.75))
#> [1] -0.5 0.5
realise(d, 10)
#> [1] 0.04504622 -0.19644052 0.96917160 -0.17264643 0.38818268 -0.55088825
#> [7] 0.07600935 -0.90234684 0.99492568 0.81887116
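The closed-form results in the "without" example can be sanity-checked numerically. The following is a minimal base R sketch, independent of distplyr: it integrates against the Uniform(-1, 1) density to recover the variance, and confirms that `qunif()` inverts `punif()` at the two requested probabilities. The variable names are illustrative only.

```r
# Parameters of the Uniform(a, b) distribution used above
a <- -1
b <- 1

# Mean and variance by numerical integration of the density
mu <- integrate(function(x) x * dunif(x, min = a, max = b), a, b)$value
v <- integrate(function(x) (x - mu)^2 * dunif(x, min = a, max = b), a, b)$value

# Should agree with the formula (b - a)^2 / 12
all.equal(v, (b - a)^2 / 12)

# Quantiles should map back to their probabilities under the CDF
q <- qunif(c(0.25, 0.75), min = a, max = b)
all.equal(punif(q, min = a, max = b), c(0.25, 0.75))
```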
A distribution can be represented by different functions, such as a density function, a cumulative distribution function, and others. In distplyr, each representation is available through three families of functions: eval_*, enframe_*, and get_*.
Here are the representations and the corresponding distplyr functions:
Quantity | distplyr Functions |
---|---|
Cumulative Distribution Function | eval_cdf(), get_cdf(), enframe_cdf() |
Survival Function | eval_survival(), get_survival(), enframe_survival() |
Quantile Function | eval_quantile(), get_quantile(), enframe_quantile() |
Hazard Function | eval_hazard(), get_hazard(), enframe_hazard() |
Cumulative Hazard Function | eval_chf(), get_chf(), enframe_chf() |
Probability Density Function | eval_density(), get_density(), enframe_density() |
Probability Mass Function | eval_pmf(), get_pmf(), enframe_pmf() |
These functions all take a distribution object as their first argument, and eval_* and enframe_* have a second argument named at indicating where to evaluate the function. The at argument is vectorized.
Here is an example of evaluating the hazard function of a Uniform(-1, 1) distribution, and enframing its density:
eval_hazard(d, at = 0:10)
#> [1] 1 Inf NaN NaN NaN NaN NaN NaN NaN NaN NaN
enframe_density(d, at = 0:10)
#> # A tibble: 11 × 2
#> .arg density
#> <int> <dbl>
#> 1 0 0.5
#> 2 1 0.5
#> 3 2 0
#> 4 3 0
#> 5 4 0
#> 6 5 0
#> 7 6 0
#> 8 7 0
#> 9 8 0
#> 10 9 0
#> 11 10 0
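The pattern of `1`, `Inf`, and `NaN` in the hazard output above follows from the textbook definition of the hazard as density divided by survival. Here is a small base R sketch of that ratio. Whether distplyr computes the hazard this way internally is an assumption; the sketch only shows that the definition reproduces the same values.

```r
a <- -1
b <- 1
x <- 0:10

# Density and survival function of Uniform(a, b) at each point
dens <- dunif(x, min = a, max = b)
surv <- punif(x, min = a, max = b, lower.tail = FALSE)

# Hazard = density / survival:
#   x = 0: 0.5 / 0.5 = 1
#   x = 1: 0.5 / 0   = Inf (density is still positive at the endpoint)
#   x > 1: 0 / 0     = NaN (no probability mass remains)
hazard <- dens / surv
hazard
```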
set.seed(10)
enframe() works particularly well with tibbles and tidyr::unnest():
# half_marathon <- tribble(
# ~ person, ~ race_time_min,
# "Vincenzo", dst_norm(130, 25),
# "Colleen", dst_norm(110, 13),
# "Regina", dst_norm(115, 20)
# )
# half_marathon %>%
# mutate(quartiles = map(race_time_min, enframe_quantile, at = 1:3 / 4)) %>%
# unnest(quartiles)
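For comparison, here is a rough base R sketch of the same idea without distribution objects: one row per person and quartile. It assumes the second argument of `dst_norm()` is the standard deviation, which the text above does not state, and the column names `mean`, `sd`, `prob`, and `quantile` are made up for illustration.

```r
# Parameters kept as separate columns instead of a distribution column.
# Assumption: the second dst_norm() argument is a standard deviation.
half_marathon <- data.frame(
  person = c("Vincenzo", "Colleen", "Regina"),
  mean = c(130, 110, 115),
  sd = c(25, 13, 20)
)

probs <- 1:3 / 4

# Build one small data frame per person, then stack them
quartiles <- do.call(rbind, lapply(seq_len(nrow(half_marathon)), function(i) {
  data.frame(
    person = half_marathon$person[i],
    prob = probs,
    quantile = qnorm(probs, mean = half_marathon$mean[i], sd = half_marathon$sd[i])
  )
}))

quartiles
```

Keeping the parameters in separate columns means every downstream step has to know which family and parametrization is in use, which is the bookkeeping a distribution column is meant to remove.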
To draw a random sample from a distribution, use the realise() or realize() function:
realise(d, n = 5)
#> [1] 0.01495641 -0.38646299 -0.14618467 0.38620416 -0.82972806
You can read this call as “realise distribution d five times”. By default, n is set to 1, so that realising a distribution converts it to a numeric draw:
realise(d)
#> [1] -0.5491268
This default is especially useful when working with distributions in a tibble:
# half_marathon %>%
# mutate(actual_time_min = map_dbl(race_time_min, realise))
Perhaps surprisingly, distplyr does not consider realise() a functional representation of a distribution, even though random sampling falls into the same family as the stats::p*/d*/q*/r* functions. This is because it’s impossible to perfectly describe a distribution based on a finite sample.
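To illustrate the point about samples, the following base R sketch compares the empirical CDF of ten uniform draws with the true CDF of Uniform(-1, 1). It does not use distplyr; the seed and the grid are arbitrary choices made for the illustration.

```r
set.seed(1)

# Ten draws from Uniform(-1, 1)
x <- runif(10, min = -1, max = 1)

# Step-function estimate of the CDF built from the sample
empirical_cdf <- ecdf(x)

# Largest gap between the estimate and the true CDF over a grid
grid <- seq(-1, 1, by = 0.01)
max_gap <- max(abs(empirical_cdf(grid) - punif(grid, min = -1, max = 1)))
max_gap
```

The gap is strictly positive: a step function built from ten points cannot match a continuous, strictly increasing CDF everywhere, so the sample determines the distribution only approximately.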
Distributions have various numeric properties. Common examples are the mean and variance, but there are many others as well.
Below is a table of the properties incorporated in distplyr:
Property | distplyr Function |
---|---|
Mean | mean() |
Median | median() |
Variance | variance() |
Standard Deviation | sd() |
Skewness | skewness() |
Excess Kurtosis | kurtosis_exc() |
Kurtosis | kurtosis_raw() |
Extreme Value (Tail) Index | evi() |
Here are some properties of our original Uniform(-1, 1) distribution:
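As a cross-check that does not rely on distplyr, the standard closed-form values for a Uniform(a, b) distribution can be computed directly in base R. This is a sketch from textbook formulas, not the package's output; the vector names merely mirror the function names in the table and are not calls to those functions.

```r
a <- -1
b <- 1

# Closed-form properties of Uniform(a, b)
props <- c(
  mean = (a + b) / 2,
  median = (a + b) / 2,
  variance = (b - a)^2 / 12,
  sd = (b - a) / sqrt(12),
  skewness = 0,           # the distribution is symmetric
  kurtosis_exc = -6 / 5,  # excess kurtosis of any uniform distribution
  kurtosis_raw = 3 - 6 / 5
)

props
```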