
You can make a wide range of probability distributions using distplyr’s manipulation functions, but you’ll need to start with a standard, basic distribution first. There are typically three ways to build a basic distribution:

  1. Parametric families
  2. Empirical distributions
  3. Manually specified distributions

1. Parametric Families

These include distributions like Normal, Exponential, Poisson, etc.

distplyr includes the distributions represented in base R’s d*/p*/q*/r* families of functions. For example, the Normal distribution in base R has the associated functions dnorm(), pnorm(), qnorm(), and rnorm(). In distplyr:

dst_norm(0, 1)
#> [1] "norm"       "parametric" "dst"       
#> 
#>  name :
#> [1] "norm"

distplyr also includes other common distributions not present in base R, such as a generalized Pareto distribution:

dst_gpd(0, 1, 1)
#> [1] "gpd"        "parametric" "dst"       
#> 
#>  name :
#> [1] "gpd"

November 2020: Until this package gains some stability in its structure, only a limited number of these distributions are available, but plenty more will be added in the not-too-distant future.

2. Empirical Distributions

Whereas base R only has the ecdf() function for handling empirical distributions, distplyr provides full distribution functionality with dst_empirical(). Here is the empirical distribution of the hp values in the mtcars dataset:

(hp <- dst_empirical(hp, data = mtcars))
#> [1] "finite" "dst"   
#> 
#>  probabilities :
#> # A tibble: 22 × 2
#>    location   size
#>       <dbl>  <dbl>
#>  1       52 0.0312
#>  2       62 0.0312
#>  3       65 0.0312
#>  4       66 0.0625
#>  5       91 0.0312
#>  6       93 0.0312
#>  7       95 0.0312
#>  8       97 0.0312
#>  9      105 0.0312
#> 10      109 0.0312
#> # ℹ 12 more rows

The cdf of an empirical distribution is a step function:

plot(hp, "cdf", n = 501)

You can also weight the outcomes differently. This is useful for explicitly specifying a probability mass function (see the sketch at the end of this article), as well as for other applications such as using kernel smoothing to find a conditional distribution. Here is an estimate of the conditional distribution of hp given disp = 150, with its cdf depicted as the dashed line, compared to the marginal cdf as the solid line:

K <- function(x) dnorm(x, sd = 25)
hp2 <- dst_empirical(hp, data = mtcars, weights = K(disp - 150))
plot(hp, "cdf", n = 1001)
plot(hp2, "cdf", n = 1001, lty = 2, add = TRUE)
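
To see what the kernel is doing, it may help to look at the (unnormalised) weights themselves; cars with disp near 150 get the largest weight, and dst_empirical() presumably rescales the weights to sum to one:

# Inspect the kernel weights: larger for cars whose disp is close to 150
head(data.frame(disp = mtcars$disp, weight = K(mtcars$disp - 150)))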

The weighting provides us with a far more informative prediction of hp when disp = 150 than a regression method like loess, which only gives us the mean:

mean(hp2)
#> [1] 109.961

With a distribution, you can get much more, such as this 90% prediction interval:

eval_quantile(hp2, at = c(0.05, 0.95))
#> [1]  62 175

Here’s the proportion by which the variance is reduced compared to the marginal distribution:

1 - variance(hp2) / variance(hp)
#> [1] 0.8031741
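
As mentioned above, the weights argument can also be used to specify a probability mass function explicitly. Here is a minimal sketch with made-up outcomes and probabilities; it assumes dst_empirical() treats the (normalised) weights as probabilities, as in the kernel example above:

# A hypothetical three-point distribution specified directly by its pmf
pmf_data <- data.frame(x = c(1, 2, 5))
d3 <- dst_empirical(x, data = pmf_data, weights = c(0.2, 0.3, 0.5))
mean(d3)  # should be 1*0.2 + 2*0.3 + 5*0.5 = 3.3 if the weights are used as probabilities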