You’re able to make a wide range of probability distributions using distplyr’s manipulation functions, but you’ll need to start with more standard, basic distributions first. There are typically three use cases for building a basic distribution. The first is parametric distributions: familiar families like the Normal, Exponential, and Poisson.
distplyr includes the distributions available through base R’s `d`/`p`/`q`/`r` selection of functions. For example, a Normal distribution in base R has the associated functions `dnorm()`, `pnorm()`, `qnorm()`, and `rnorm()`. In distplyr, the same distribution is made with `dst_norm()`:
```r
dst_norm(0, 1)
#> norm parametric dst
#>
#> name :
#> "norm"
```
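For comparison, here’s how the same N(0, 1) quantities come out of base R’s one-function-per-task interface (the evaluation points below are arbitrary, chosen just for illustration):

```r
# Base R spreads one distribution across four separate functions:
dnorm(1.5, mean = 0, sd = 1)   # density at 1.5
pnorm(1.5, mean = 0, sd = 1)   # cdf at 1.5
qnorm(0.95, mean = 0, sd = 1)  # 0.95-quantile
rnorm(3, mean = 0, sd = 1)     # three random draws
```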
distplyr also includes other common distributions not present in base R, such as a generalized Pareto distribution:
```r
dst_gpd(0, 1, 1)
#> gpd parametric dst
#>
#> name :
#> "gpd"
```
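If the generalized Pareto is unfamiliar, here’s a rough sketch of its cdf in plain R. The helper `my_pgpd()` is made up for illustration only, and the guess that `dst_gpd()`’s arguments are location, scale, and shape (in that order) is an assumption, not something confirmed above:

```r
# Generalized Pareto cdf: F(x) = 1 - (1 + shape * (x - location) / scale)^(-1 / shape)
# Illustrative helper only; assumes shape != 0.
my_pgpd <- function(x, location = 0, scale = 1, shape = 1) {
  z <- pmax((x - location) / scale, 0)
  1 - pmax(1 + shape * z, 0)^(-1 / shape)
}
my_pgpd(c(0.5, 1, 2))
```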
November 2020: Until this package gains some stability in its structure, there will be a limited number of these distributions – but there will be plenty available in the not-too-distant future.
Whereas base R only has the `ecdf()` function to handle empirical distributions, distplyr provides full functionality with `dst_empirical()`. Here is the empirical distribution of the `hp` values in the `mtcars` dataset:
```r
(hp <- dst_empirical(hp, data = mtcars))
#> finite dst
#>
#> probabilities :
#> # A tibble: 22 × 2
#>    location   size
#>       <dbl>  <dbl>
#>  1       52 0.0312
#>  2       62 0.0312
#>  3       65 0.0312
#>  4       66 0.0625
#>  5       91 0.0312
#>  6       93 0.0312
#>  7       95 0.0312
#>  8       97 0.0312
#>  9      105 0.0312
#> 10      109 0.0312
#> # … with 12 more rows
```
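For contrast, base R’s `ecdf()` hands back only the cdf as a function; the evaluation points below are arbitrary, chosen just to illustrate:

```r
# ecdf() gives just the empirical cdf, not a full distribution object
F_hp <- ecdf(mtcars$hp)
F_hp(c(100, 200))
```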
Distributions like this are sometimes called “step” distributions; the name comes from the shape of the cdf:
plot(hp, "cdf", n = 501)
You can also weight the outcomes differently. This is useful for explicitly specifying a probability mass function, as well as for other applications such as using kernel smoothing to find a conditional distribution. Here is an estimate of the conditional distribution of `hp` given `disp = 150`, with its cdf depicted as the dashed line and the marginal as the solid line:
```r
# Gaussian kernel with bandwidth 25: weights observations by how close
# their disp value is to 150
K <- function(x) dnorm(x, sd = 25)
hp2 <- dst_empirical(hp, data = mtcars, weights = K(disp - 150))
plot(hp, "cdf", n = 1001)
plot(hp2, "cdf", n = 1001, lty = 2, add = TRUE)
```
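To see what the kernel is doing, you can peek at the weights it assigns to individual observations (purely illustrative, using the `K()` defined above):

```r
# Observations with disp near 150 get the largest weights
head(data.frame(disp = mtcars$disp, weight = K(mtcars$disp - 150)))
```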
The weighting provides us with a far more informative prediction of `hp` at `disp = 150` compared to a loess fit, which just gives us the mean:
```r
mean(hp2)
#> 109.961
```
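For reference, here’s a sketch of the loess point prediction alluded to above; the formula and default settings are just one illustrative choice:

```r
# A loess fit of hp on disp yields only a conditional mean at disp = 150
fit <- loess(hp ~ disp, data = mtcars)
predict(fit, newdata = data.frame(disp = 150))
```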
With a distribution, you can get much more, such as this 90% prediction interval:
```r
eval_quantile(hp2, at = c(0.05, 0.95))
#> 62 175
```
Here’s the proportion of variance that’s reduced compared to the marginal: