`specify.Rmd`

`library(distionary)`

You’re able to make a wide range of probability distributions using distplyr’s manipulation functions, but you’ll need to start with more standard, basic distributions first. There are typically three use cases for building a basic distribution:

- Parametric families
- Empirical distributions
- Manually specified distributions

Parametric families include distributions like Normal, Exponential, Poisson, etc.

distplyr includes distributions present in base R’s `r*`/`p*`/`d*`/`q*` selection of distributions. For example, a Normal distribution in base R has associated functions `rnorm()`, etc. In distplyr:

```
dst_norm(0, 1)
#> norm parametric dst
#>
#> name :
#> [1] "norm"
```
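For comparison, here is a minimal sketch of the base R convention mentioned above, using the standard Normal distribution. It relies only on base R's `stats` functions; the variable names are illustrative and not part of distplyr.

```r
# Base R's d/p/q/r convention, shown for the standard Normal distribution.
density_at_0 <- dnorm(0, mean = 0, sd = 1)    # d*: density function
cdf_at_0     <- pnorm(0, mean = 0, sd = 1)    # p*: cumulative distribution function
median_value <- qnorm(0.5, mean = 0, sd = 1)  # q*: quantile function
set.seed(1)                                   # make the random draws reproducible
draws        <- rnorm(5, mean = 0, sd = 1)    # r*: random number generation
```

Each quantity needs its own function call with the parameters repeated, whereas a distribution object such as `dst_norm(0, 1)` carries its parameters with it.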

distplyr also includes other common distributions not present in base R, such as a generalized Pareto distribution:

```
dst_gpd(0, 1, 1)
#> gpd parametric dst
#>
#> name :
#> [1] "gpd"
```

**November 2020**: Until this package gains some stability in its structure, there will be a limited number of these distributions – but there will be plenty available in the not-too-distant future.

Whereas base R only has the `ecdf()` function to handle empirical distributions, distplyr provides full functionality with `dst_empirical()`. Here is the empirical distribution of `hp` values in the `mtcars` dataset:

```
(hp <- dst_empirical(hp, data = mtcars))
#> finite dst
#>
#> probabilities :
#> # A tibble: 22 × 2
#> location size
#> <dbl> <dbl>
#> 1 52 0.0312
#> 2 62 0.0312
#> 3 65 0.0312
#> 4 66 0.0625
#> 5 91 0.0312
#> 6 93 0.0312
#> 7 95 0.0312
#> 8 97 0.0312
#> 9 105 0.0312
#> 10 109 0.0312
#> # … with 12 more rows
```
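The `size` column above is the relative frequency of each distinct `hp` value. As a rough cross-check, the same quantities can be computed with base R alone; this is a sketch of the idea, not how `dst_empirical()` is implemented, and the variable names are illustrative.

```r
# Relative frequency of each distinct hp value, using base R only.
hp_values <- mtcars$hp
pmf <- table(hp_values) / length(hp_values)
head(pmf, 4)

# Base R's empirical cdf is a step function that jumps by these amounts.
hp_ecdf <- ecdf(hp_values)
hp_ecdf(66)  # proportion of cars with hp <= 66
```

Base R stops at the cdf here: `ecdf()` returns a function, not an object that also exposes quantiles, means, or plotting of other representations.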

The “step” in the name comes from the cdf:

`plot(hp, "cdf", n = 501)`

You can also weight the outcomes differently. This is useful for explicitly specifying a probability mass function, as well as for other applications such as using kernel smoothing to find a conditional distribution. Here is an estimate of the conditional distribution of `hp` given `disp = 150`, with its cdf depicted as the dashed line, compared to the marginal cdf as the solid line:

```
K <- function(x) dnorm(x, sd = 25)
hp2 <- dst_empirical(hp, data = mtcars, weights = K(disp - 150))
plot(hp, "cdf", n = 1001)
plot(hp2, "cdf", n = 1001, lty = 2, add = TRUE)
```
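To make the role of the weights concrete, the following base R sketch builds the weighted cdf by hand: each observation contributes its normalized kernel weight instead of an equal share `1/n`. It assumes the weights are simply rescaled to sum to 1; the helper `weighted_cdf()` is hypothetical and not part of distplyr.

```r
# Gaussian kernel with standard deviation 25, as in the example above.
K <- function(x) dnorm(x, sd = 25)

# Kernel weights centred at disp = 150, rescaled to sum to 1.
w <- K(mtcars$disp - 150)
w <- w / sum(w)

# Weighted cdf: total weight of the observations with hp <= x.
weighted_cdf <- function(x) {
  vapply(x, function(xi) sum(w[mtcars$hp <= xi]), numeric(1))
}

weighted_cdf(c(100, 150, 200))
```

Cars whose `disp` is far from 150 receive weights close to zero, so they barely move the cdf.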

The weighting provides us with a far more informative prediction of `hp` when `disp = 150` than a loess fit, which gives us only the mean:

```
mean(hp2)
#> [1] 109.961
```

With a distribution, you can get much more, such as this 90% prediction interval:

```
eval_quantile(hp2, at = c(0.05, 0.95))
#> [1] 62 175
```
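The mean and quantiles of a weighted empirical distribution can also be sketched in base R. The version below assumes the quantile is taken as the smallest observed value whose cumulative weight reaches the requested probability; whether distplyr uses exactly this convention is an assumption, and `weighted_quantile()` is a hypothetical helper.

```r
# Normalized Gaussian kernel weights centred at disp = 150.
K <- function(x) dnorm(x, sd = 25)
w <- K(mtcars$disp - 150)
w <- w / sum(w)

# Weighted mean of hp.
cond_mean <- sum(w * mtcars$hp)

# Sort hp and accumulate the weights in that order.
ord <- order(mtcars$hp)
hp_sorted <- mtcars$hp[ord]
cum_w <- cumsum(w[ord])

# Weighted quantile: smallest hp whose cumulative weight is at least p,
# intended for probabilities strictly between 0 and 1.
weighted_quantile <- function(p) {
  vapply(p, function(pj) hp_sorted[which(cum_w >= pj)[1]], numeric(1))
}

cond_mean
weighted_quantile(c(0.05, 0.95))
```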

Here’s the proportion of variance that’s reduced compared to the marginal: