Pseudocode • distionary

This document outlines the core functionalities of distionary for other programming languages. Developers can add extra checks as needed.

Distribution Objects

Distribution objects store user-defined properties and distribution metadata. The key metadata field is vtype (e.g., continuous, discrete, mixed). Properties and metadata are bundled in a dictionary-like structure ({}).

distribution = function(properties, vtype = Null):
  return {properties = properties, vtype = vtype}

Currently, distionary supports only numeric, univariate distributions (i.e., single variable). Future versions will support multivariate distributions.

Common Distribution Families

Wrappers allow access to common distributions like Normal and Poisson, prefixed with dst_. For example:

dst_norm = function(mean, sd):
  return distribution(
    properties = [
      density = function(x) dnorm(x, mean = mean, sd = sd),
      realise = function(n) rnorm(n, mean = mean, sd = sd),
      ...,
      range = [-Inf, Inf],
      mean = mean,
      stdev = sd,
      skewness = 0,
      ...
    ],
    vtype = "continuous"
  )

Here, [] is a list-like structure that can be used to pass arguments to the function. dnorm and rnorm are assumed to be available or should be implemented manually.

In addition to common families, a Null distribution that always evaluates to NA should be created. This object is useful when external algorithms fail, where a Null distribution can be output instead of an error.

Evaluating

Start with thin wrappers for evaluating specific properties, which invoke the general function eval_property.

Properties that are not functions can be accessed by functions of the same name:

mean = function(distribution):
  eval_property(distribution, name = "mean")

range = function(distribution):
  eval_property(distribution, name = "range")

Properties that are functions are prefixed with eval_ to indicate they require an argument, passed with at. All such properties currently in distionary only have one argument, but in principle could accept others; these all go in [].

eval_density = function(distribution, at):
  eval_property(distribution, name = "density", args = [at])

eval_survival = function(distribution, at):
  eval_property(distribution, name = "cdf", args = [at])

The at parameter can accept a vector of values. The current version supports a single distribution input, but future versions might allow recycling both arguments.

The main function eval_property evaluates properties if present or delegates network evaluation. If the property is a function, it will evaluate it using the provided arguments in args. By default, args is Null, meaning arguments are not relevant (therefore, this argument can be ignored for numeric properties).

eval_property = function(distribution, property_name, args = Null):
  properties = subset(distribution, "properties")
  property = subset(properties, property_name)
  if is_null(property):
    return eval_from_network(distribution, name = property_name, args = args)
  else if is_function(property):
    return property(args)
  else:
    return property

Use subset to retrieve entries from dictionary-like structures. property(args) means to call the function property with arguments.

The absence of a property is evaluated using a network of properties, specifying how one property can be retrieved from others. For example, the cumulative hazard function (CHF) can be calculated using the survival function:

eval_from_network = function(distribution, name, args):
  if name == "chf":
    return -log(eval_survival(distribution, args))
  if name == "survival":
    return 1 - eval_cdf(distribution, args)
  if name == "mean":
  ...

Network evaluation is still robust against whether the survival function has been specified for the distribution due to the use of eval_survival: if the survival function can’t be retrieved from the distribution, it will evaluate it from the network, this time from the cumulative distribution function (CDF). The network has been set up to avoid getting caught in a loop, so that the survival function doesn’t look back to the CHF for evaluation – it’s (mostly) a directed acyclic graph.

Network computation failures throw errors, though future versions might allow user-specified error-handling strategies.

The defined network is not unique. For example, the CHF could be retrieved from the hazard function rather than the survival function. The network was minimized for numerical efficiency.

Currently, the CDF and density/PMF anchor the property network, meaning that they are assumed available. Future improvements may implement networks as their own objects, customized to each distribution to be seeded from the properties specified in each distribution.

Enframing

Instead of evaluating properties that are functions, enframe these to place inputs and outputs in tabular form. This is useful for table-centric data analyses. Start with simple wrappers:

enframe_density = function(distributions, at):
  enframe_general(distributions, at = at, eval_fn = eval_density)

enframe_survival = function(distributions, at):
  enframe_general(distributions, at = at, eval_fn = eval_survival)

distributions might contain multiple distributions. The main function loops over each distribution, adding results to new columns:

enframe_general = function(distributions, at, eval_fn):
  tbl = [x = at]
  for d in distributions:
    y = eval_fn(d, at)
    tbl = append(tbl, [y])
  return tbl

The developer may wish to consider features related to column naming.