DEV Community: Alex Towell

Noisy Turing Machines: Noisy Logic Gates

Alex Towell — Sun, 07 Jun 2026 15:13:32 +0000

As we consider more complex compound data types, which may always be modeled as
functions, we will see that there are many ways these types can participate
in the Bernoulli Boolean model. When a Bernoulli value is introduced into the
computational model, the entire computation outputs a final result that is
a Bernoulli type, e.g., bernoulli<pair<T1,T2>>, pair<T1,bernoulli<T2>>, and so
on.

The easiest way to think about this is to just consider a Universal Turing machine
in which we build programs by composing circuits of binary logic-gates, like and,
or, and not. In general, if we replace a single input into the circuit with a
Bernoulli Boolean, the output of the circuit is one or more Bernoulli Booleans.
Moreover, and more interestingly, we can replace some of the logic gates with
noisy logic-gates, or Bernoulli logic-gates, and the output of the circuit is
also a Bernoulli Boolean. We can always discard information about the uncertainty
in the output of the circuit, and just get a Boolean, but if the uncertainty is
non-negligible, then we may want to keep track of it.

So, let's consider the set of binary functions
f : (bool, bool) -> bool.

As a warm-up, the unary case has 2^2 = 4 possible functions f : bool -> bool,
since for each of the two possible inputs, $1$ or $0$, we may choose either of
two outputs, $1$ or $0$.

More generally, if we have f : X -> Y, then we have |Y|^|X| possible functions,
where |.| denotes the cardinality of a set. For instance, if X = (bool, bool)
and Y = bool, then we have 2^4 = 16 possible functions, since |X| = 4 and |Y| = 2.

Each of these functions has a designated name, which we can use to refer to them,
like and, xor, etc. However, we are just going to look at and.

Table 4: and : (bool, bool) -> bool

`x1`	`x2`	`and(x1, x2)`
true	true	true
true	false	false
false	true	false
false	false	false

Now, let's consider

and : (bernoulli<bool,1>, bernoulli<bool,1>) -> bernoulli<bool,2>

This is more complicated than might first seem. An error occurs if
and returns $1$ when it should return $0$, or vice versa. The input
variables represent latent values, so they do not have a definite value.

We will go row by row, and examine the probability that the output is correct for
each output.

Case 1: The Correct Output Is True

In order for the output to be true, both noisy inputs must be true. Since these
are statistically independent outcomes, the probability is the product of the
two individual probabilities, p1 * p2.

Case 2: The Correct Output Is False Given `x1 = true` and `x2 = false`

Consider and(bernoulli<bool,1>{true}, bernoulli<bool,1>{false}).
For this to evaluate to true, the first must be a true positive and the second
must be a false positive, which has probability p1 * (1-p2). Since we are
interested in the probability that it correctly maps to false, that is just
1 - p1 * (1-p2) = 1 - p1 + p1 * p2.

Case 3: The Correct Output Is False Given `x1 = false` and `x2 = true`

Consider and(bernoulli<bool,1>{false}, bernoulli<bool,1>{true}).
For this to be true, the first must be a false positive and the second must
be a true positive, which is just (1-p1) * p2. Since we are interested in the
probability that it maps correctly to false, that is just
1 - (1-p1) * p2 = 1 - p2 + p1 * p2.

Case 4: The Correct Output Is False Given `x1 = false` and `x2 = false`

Consider and(bernoulli<bool,1>{false}, bernoulli<bool,1>{false}).
For this to be true, both must be false positives, which is just
(1-p1) * (1-p2). Since we are interested in the probability that it maps correctly
to false, that is just 1 - (1-p1) * (1-p2) = p1 + p2 - p1 * p2.

Summary

Table 6: and with Bernoulli inputs

`x1`	`x2`	`and(x1,x2)`	`Pr{correct}`
1	1	1	`p1 * p2`
1	0	0	`1 - p1 + p1 * p2`
0	1	0	`1 - p2 + p1 * p2`
0	0	0	`p1 + p2 - p1 * p2`

We see that and : (bernoulli<bool,1>, bernoulli<bool,1>) -> bernoulli<bool,4>
induces an output that is a fourth-order Bernoulli Boolean. How is this possible
when there are only two possible outputs? The answer is that the output is dependent
on four different combinations of inputs.

Since x1 and x2 are latent, we can only talk about the probability that
the output is correct or not. We see that when the output is 1, the probability that
the output is correct is p1 * p2. When the output is 0, the probability that it is
correct is more complicated.

We could store all of this information in the type bernoulli<bool,4>, but it is
probably more convenient to use interval arithmetic, where we store a range of
probabilities that the stored Boolean value is correct.
The best choice is the minimum-length interval that contains all of the
relevant probabilities for the output being correct. When the output is 1, that
interval is the single point p1 * p2, and when the output is 0, it is the
minimum span of

min_span{1 - p1 + p1 * p2, 1 - p2 + p1 * p2, p1 + p2 - p1 * p2}

As we compose more and more logic circuits together, we can keep track of the
minimum spanning intervals on outputs using interval arithmetic.
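The table and the spanning interval above can be sketched in a few lines of C++. This is only an illustration, not code from the post: the names `Interval`, `and_correctness`, and `false_output_span` are hypothetical, and p1 and p2 are taken to be the probabilities that each input is reported correctly, as in the table.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Hypothetical type: a closed interval of probabilities [lo, hi].
struct Interval {
    double lo;
    double hi;
};

// Pr{correct} for each row of the table, in the order
// (1,1), (1,0), (0,1), (0,0).
inline std::array<double, 4> and_correctness(double p1, double p2) {
    assert(0.0 <= p1 && p1 <= 1.0 && 0.0 <= p2 && p2 <= 1.0);
    return {
        p1 * p2,                        // correct output is 1
        1.0 - p1 * (1.0 - p2),          // correct output is 0
        1.0 - (1.0 - p1) * p2,          // correct output is 0
        1.0 - (1.0 - p1) * (1.0 - p2),  // correct output is 0
    };
}

// Smallest interval containing Pr{correct} over the three rows whose
// correct output is 0.
inline Interval false_output_span(double p1, double p2) {
    const auto pr = and_correctness(p1, p2);
    const auto [lo, hi] = std::minmax({pr[1], pr[2], pr[3]});
    return {lo, hi};
}
```

With p1 = 0.9 and p2 = 0.8 the four rows evaluate to 0.72, 0.82, 0.92, and 0.98, so the interval kept for a 0 output is [0.82, 0.98].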

Let's come back to the idea of Bernoulli types over compound types. In particular,
let's consider applying the Bernoulli approximation to binary functions of the
type (bool, bool) -> bool.

Now, we can apply the Bernoulli approximation

bernoulli<(bool, bool) -> bool>

which will generate functions of the type

(bool, bool) -> bernoulli<bool>

This may be thought of as a noisy binary logic-gate.
For the case of the and gate, what we observe in our model is
bernoulli<(bool, bool) -> bool>{and}, and it can generate up to 16 different
Bernoulli Boolean functions. That means that the maximum order is
$16 (16 - 1) = 240$, which isn't really important, but it's interesting to note.

Of course, if we have this noisy and function and then put in noisy inputs,
then we get a function of type

(bernoulli<bool>, bernoulli<bool>) -> bernoulli<bool>

Kraft's Inequality

Alex Towell — Sun, 07 Jun 2026 03:48:23 +0000

Every prefix-free code satisfies one inequality. That inequality is also sufficient. This post develops the necessary direction.

I want a code where each symbol maps to a bit string, and where any concatenation of codewords can be decoded unambiguously. The simplest way to guarantee that is prefix-freeness: no codeword is a prefix of any other. A prefix-free code is self-delimiting. The decoder reads bits left-to-right and knows exactly when each codeword ends, with no lookahead and no length headers.

The question I keep returning to is: which collections of lengths are actually achievable? If I want four codewords of lengths 1, 2, 3, and 3, can I build a prefix-free code with those lengths? What if I want two codewords of length 1 plus anything longer? (No: there are only two 1-bit strings, and between them they are prefixes of everything longer.)

Kraft's inequality is the answer. A length vector $(l_1, l_2, \ldots, l_n)$ is achievable by a prefix-free binary code only if

$$\sum_{i=1}^{n} 2^{-l_i} \leq 1.$$

This is the constraint you cannot escape. Any prefix-free code satisfies it. Any length vector that violates it cannot be realized as a prefix-free code, full stop.

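The converse is also true: any length vector satisfying Kraft is realizable by some prefix-free code. That sufficiency direction is the subject of the next post in this series. (Strictly, both directions for prefix-free codes go back to Kraft's 1949 result; McMillan's 1956 theorem extends the necessary direction to every uniquely decodable code.) This post develops the necessary direction: every prefix-free code satisfies Kraft.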

The right tool for understanding why is the binary tree.

The Trie View

Represent each codeword as a path in a binary tree. Start at the root. For each bit, go left (0) or right (1). The codeword ends at a node, which I mark as a terminal. A code is prefix-free if and only if no terminal node has any descendants that are also terminals. Once you reach a terminal on the way down, you stop.

The example code $\{A \to \texttt{0},\ B \to \texttt{10},\ C \to \texttt{110},\ D \to \texttt{111}\}$ has lengths $(1, 2, 3, 3)$. Its trie looks like this:

A is at depth 1, left branch. B is at depth 2, right-then-left. C and D share a parent at depth 2, then split at depth 3. No codeword's node is an ancestor of another's: the code is prefix-free.

The BinaryTree class in kraft.hpp implements exactly this structure. It supports inserting codewords (as strings of '0' and '1' characters), checking membership, and verifying prefix-freeness by a recursive scan:

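#include <cassert>   // assert
#include <memory>    // std::unique_ptr, std::make_unique
#include <string>    // std::string

class BinaryTree {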
public:
    BinaryTree() : root_(std::make_unique<Node>()) {}

    // Insert a codeword (a string of '0' and '1' characters).
    // Idempotent: inserting the same codeword twice is a no-op.
    void insert(const std::string& codeword) {
        Node* cur = root_.get();
        for (char c : codeword) {
            assert((c == '0' || c == '1') && "codeword must be over {0,1}");
            std::unique_ptr<Node>& child = (c == '0') ? cur->left : cur->right;
            if (!child) child = std::make_unique<Node>();
            cur = child.get();
        }
        cur->is_codeword = true;
    }

    // Returns true iff the codeword is in the tree.
    bool contains(const std::string& codeword) const {
        const Node* cur = root_.get();
        for (char c : codeword) {
            const std::unique_ptr<Node>& child = (c == '0') ? cur->left : cur->right;
            if (!child) return false;
            cur = child.get();
        }
        return cur->is_codeword;
    }

    // Returns true iff no codeword in the tree is a prefix of another.
    // Equivalently: no terminal node has any descendants that are terminal.
    [[nodiscard]] bool is_prefix_free() const {
        return is_prefix_free_recursive(root_.get(), false);
    }

private:
    struct Node {
        std::unique_ptr<Node> left;   // '0' branch
        std::unique_ptr<Node> right;  // '1' branch
        bool is_codeword = false;
    };

    std::unique_ptr<Node> root_;

    static bool is_prefix_free_recursive(const Node* node, bool ancestor_is_codeword) {
        if (!node) return true;
        // If this node is a codeword AND we passed through a codeword on the
        // way down, the ancestor codeword is a prefix of this one. Violation.
        // Equivalently: if this node is a codeword AND has any children that
        // are also codewords, this codeword is a prefix of those.
        if (node->is_codeword && ancestor_is_codeword) return false;
        bool below = node->is_codeword;
        if (!is_prefix_free_recursive(node->left.get(), ancestor_is_codeword || below)) return false;
        if (!is_prefix_free_recursive(node->right.get(), ancestor_is_codeword || below)) return false;
        return true;
    }
};

is_prefix_free_recursive passes a flag ancestor_is_codeword down the tree. If the current node is a terminal and an ancestor was also a terminal, that ancestor's codeword is a prefix of the current one: violation. This catches both directions of the prefix relationship in a single pass.

The Inequality

Think of the unit interval as a budget. Each codeword of length $l_i$ claims a fraction $2^{-l_i}$ of that budget. Kraft's inequality says the total claim is at most 1:

$$\sum_{i=1}^{n} 2^{-l_i} \leq 1.$$

For the example code with lengths $(1, 2, 3, 3)$:

$$\frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \frac{1}{8} = 1.$$

This code saturates the budget exactly. The fractions are $2^{-1} = 0.5$ for A, $2^{-2} = 0.25$ for B, and $2^{-3} = 0.125$ for each of C and D.

The kraft_sum function computes the left-hand side. It uses std::ldexp to compute $2^{-l}$ by adjusting the floating-point exponent directly, which is exact for lengths up to 1074 and underflows to zero beyond that:

inline double kraft_sum(const std::vector<std::size_t>& lengths) {
    double sum = 0.0;
    for (std::size_t l : lengths) {
        sum += std::ldexp(1.0, -static_cast<int>(l));
    }
    return sum;
}

Some examples from the test suite:

Lengths $\{1, 2, 3, 3\}$: sum is 1.0. (The example code above. Saturates.)
Lengths $\{2, 2, 2, 2\}$: sum is $4 \times 0.25 = 1.0$. (All four 2-bit codewords. Also saturates.)
Lengths $\{1, 2, 3, 4, 5\}$: sum is $\frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \frac{1}{16} + \frac{1}{32} = \frac{31}{32} < 1$. (A prefix of the unary code. Strictly below 1.)
Lengths $\{1, 1, 1\}$: sum is $1.5 > 1$. (Three 1-bit codewords are impossible: only "0" and "1" exist at depth 1.)

That last case violates Kraft, so no prefix-free code with those lengths exists. The check is_kraft_satisfying wraps this with a small floating-point tolerance:

inline bool is_kraft_satisfying(const std::vector<std::size_t>& lengths) {
    constexpr double kTolerance = 1e-9;
    return kraft_sum(lengths) <= 1.0 + kTolerance;
}

The Proof

The proof is a counting argument on the binary tree, made concrete by the trie_embedding function.

Fix a code with lengths $l_1, l_2, \ldots, l_n$ and let $l_{\max} = \max_i l_i$. Consider the complete binary tree of depth $l_{\max}$: it has $2^{l_{\max}}$ leaves, one for each binary string of length $l_{\max}$.

Each codeword of length $l_i$ is a string of length $l_i$. In the depth-$l_{\max}$ tree, it corresponds to a subtree: all strings of length $l_{\max}$ that begin with the codeword. That subtree has $2^{l_{\max} - l_i}$ leaves.

Prefix-freeness says: no codeword is a prefix of another. Equivalently, no string of length $l_{\max}$ can begin with two distinct codewords. So the subtrees are disjoint: each leaf belongs to at most one codeword's subtree.

The total number of leaves in all subtrees is $\sum_{i=1}^{n} 2^{l_{\max} - l_i}$. Since the subtrees are disjoint and all fit inside the depth-$l_{\max}$ tree, this total is at most $2^{l_{\max}}$:

$$\sum_{i=1}^{n} 2^{l_{\max} - l_i} \leq 2^{l_{\max}}.$$

Divide both sides by $2^{l_{\max}}$:

$$\sum_{i=1}^{n} 2^{-l_i} \leq 1.$$

That is Kraft's inequality.

The trie_embedding function computes the subtree sizes directly, making the proof a concrete calculation:

struct TrieEmbeddingInfo {
    std::size_t l_max;
    std::size_t total_leaves;            // 2^l_max
    std::size_t occupied_leaves;         // sum of subtree_sizes
    std::vector<std::size_t> subtree_sizes;  // 2^{l_max - l_i} for each codeword
};

inline TrieEmbeddingInfo trie_embedding(const std::vector<std::size_t>& lengths) {
    TrieEmbeddingInfo info;
    info.l_max = 0;
    for (std::size_t l : lengths) {
        if (l > info.l_max) info.l_max = l;
    }
    info.total_leaves = std::size_t{1} << info.l_max;
    info.subtree_sizes.reserve(lengths.size());
    info.occupied_leaves = 0;
    for (std::size_t l : lengths) {
        std::size_t size = std::size_t{1} << (info.l_max - l);
        info.subtree_sizes.push_back(size);
        info.occupied_leaves += size;
    }
    return info;
}

For the example code with lengths $\{1, 2, 3, 3\}$, the embedding gives:

$l_{\max} = 3$, so the complete tree has $2^3 = 8$ leaves.
Codeword A (length 1): subtree size $2^{3-1} = 4$. A claims half the tree.
Codeword B (length 2): subtree size $2^{3-2} = 2$. B claims a quarter.
Codeword C (length 3): subtree size $2^{3-3} = 1$. A single leaf.
Codeword D (length 3): subtree size 1. Another single leaf.
Occupied: $4 + 2 + 1 + 1 = 8 = 2^3$. The code saturates the budget.

The test suite checks this exactly:

TEST(KraftTest, TrieEmbeddingComputesSubtreeSizes) {
    auto info = trie_embedding({1, 2, 3, 3});
    EXPECT_EQ(info.l_max, 3u);
    EXPECT_EQ(info.total_leaves, 8u);
    EXPECT_EQ(info.subtree_sizes[0], 4u);  // length 1 -> 4 leaves
    EXPECT_EQ(info.subtree_sizes[1], 2u);  // length 2 -> 2 leaves
    EXPECT_EQ(info.subtree_sizes[2], 1u);  // length 3 -> 1 leaf
    EXPECT_EQ(info.subtree_sizes[3], 1u);  // length 3 -> 1 leaf
    EXPECT_EQ(info.occupied_leaves, 8u);   // saturates
}

An unsaturated case: lengths $\{2, 2, 3\}$ give occupied $2 + 2 + 1 = 5$ out of 8 leaves. Three leaves are unoccupied: you could add up to three more codewords of length 3, or one codeword of length 2 plus one of length 3, as long as the new codewords don't conflict with the existing ones.

What Kraft Gives Us

Three consequences fall out of Kraft's inequality directly.

First: a budget. Each codeword consumes a share of the unit budget. A codeword of length 1 costs 1/2. A codeword of length 3 costs 1/8. Once the budget is exhausted, no more codewords can be added without violating prefix-freeness. This is not a practical limitation but a mathematical fact: the fractions must sum to at most 1.

The budget framing makes trade-offs visible. If you want symbol A to have a very short codeword (large budget share), the remaining symbols must share whatever is left. The total is fixed. You are allocating a finite resource.

Second: a trade-off, with an optimal point. Shorter codewords for some symbols means longer codewords for others. The optimal trade-off, under a known probability distribution $(p_1, \ldots, p_n)$ over the symbols, is to assign length $-\log_2 p_i$ to symbol $i$. This minimizes the expected codeword length $\sum_i p_i l_i$, and it always saturates Kraft, since $\sum_i 2^{\log_2 p_i} = \sum_i p_i = 1$. The lengths $-\log_2 p_i$ are integers only when all the $p_i$ are dyadic (of the form $2^{-k}$), which is why practical codes like Huffman (which uses integer lengths) incur a small overhead over the theoretical minimum, and why arithmetic coding can approach the minimum more closely by effectively spending fractional bits per symbol.
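To make the integer-length overhead concrete, here is a small sketch that rounds the ideal lengths up to $\lceil -\log_2 p_i \rceil$ (the Shannon code lengths, which is a simpler rule than Huffman's and generally not optimal) and compares the expected length with the entropy. The function names are hypothetical and are not part of kraft.hpp.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Integer lengths l_i = ceil(-log2 p_i). Rounding up keeps the Kraft sum
// at or below sum_i p_i = 1, so these lengths are always achievable.
inline std::vector<std::size_t> shannon_lengths(const std::vector<double>& probs) {
    std::vector<std::size_t> lengths;
    lengths.reserve(probs.size());
    for (double p : probs) {
        assert(p > 0.0 && p <= 1.0);
        lengths.push_back(static_cast<std::size_t>(std::ceil(-std::log2(p))));
    }
    return lengths;
}

// Expected codeword length: sum_i p_i * l_i.
inline double expected_length(const std::vector<double>& probs,
                              const std::vector<std::size_t>& lengths) {
    assert(probs.size() == lengths.size());
    double total = 0.0;
    for (std::size_t i = 0; i < probs.size(); ++i) {
        total += probs[i] * static_cast<double>(lengths[i]);
    }
    return total;
}

// Entropy in bits: sum_i -p_i * log2 p_i, the real-valued lower bound.
inline double entropy_bits(const std::vector<double>& probs) {
    double h = 0.0;
    for (double p : probs) {
        assert(p > 0.0 && p <= 1.0);
        h -= p * std::log2(p);
    }
    return h;
}
```

For the dyadic distribution (1/2, 1/4, 1/8, 1/8) this gives lengths 1, 2, 3, 3 and an expected length equal to the entropy, 1.75 bits. For (0.4, 0.3, 0.2, 0.1) it gives lengths 2, 2, 3, 4 with an expected length of 2.4 bits against an entropy of about 1.85 bits.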

Kraft's inequality is what makes this optimization well-defined. The constraint $\sum_i 2^{-l_i} \leq 1$ defines the feasible region; the optimization finds the best point in that region.

Third: a diagnostic. A length vector that violates Kraft has no prefix-free realization. This is a hard constraint, not a heuristic. If someone proposes a code with lengths that sum past 1 in Kraft's sense, no amount of clever codeword assignment will fix it: the tree does not have enough leaves.

Conversely, a length vector satisfying Kraft always has a prefix-free realization. That is the sufficiency direction, and it is where the story becomes constructive.

The Converse, Foreshadowed

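Kraft's inequality is necessary. It is also sufficient: any length vector satisfying the inequality is realizable by some prefix-free binary code. You can always build the code. (McMillan's theorem, 1956, strengthens the necessary direction: every uniquely decodable code satisfies the inequality, not only prefix-free ones.)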

The proof is constructive. Given a Kraft-satisfying length vector, you sort the lengths in nondecreasing order and walk the binary tree left-to-right, assigning the next available node at the right depth to each symbol. The budget guarantee ensures you never run out of room before all symbols are placed.

This constructive direction is what makes Kraft practically useful: it turns a feasibility question ("does a code with these lengths exist?") into a simple arithmetic check. Compute the Kraft sum. If it is at most 1, the code exists. If not, it does not.
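A minimal sketch of that left-to-right walk, under the assumption that every length is between 1 and 63 so that leaf indices fit in a 64-bit integer. The name `build_prefix_code` is hypothetical and this is not the construction from the next post; it visits symbols in order of nondecreasing length and hands each one the leftmost unclaimed subtree.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <optional>
#include <string>
#include <vector>

// Returns one prefix-free code with the requested lengths (codewords in the
// same order as `lengths`), or std::nullopt if the lengths violate Kraft.
inline std::optional<std::vector<std::string>>
build_prefix_code(const std::vector<std::size_t>& lengths) {
    std::vector<std::string> code(lengths.size());
    if (lengths.empty()) return code;

    // Visit symbols in order of nondecreasing length.
    std::vector<std::size_t> order(lengths.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) { return lengths[a] < lengths[b]; });

    const std::size_t l_max = lengths[order.back()];
    assert(lengths[order.front()] >= 1 && l_max <= 63);

    const std::uint64_t total_leaves = std::uint64_t{1} << l_max;
    std::uint64_t next_leaf = 0;  // leftmost unclaimed leaf at depth l_max

    for (std::size_t idx : order) {
        const std::size_t l = lengths[idx];
        const std::uint64_t subtree = std::uint64_t{1} << (l_max - l);
        if (next_leaf + subtree > total_leaves) return std::nullopt;  // budget exhausted

        // The codeword is the top l bits of next_leaf's l_max-bit address.
        const std::uint64_t prefix = next_leaf >> (l_max - l);
        std::string word(l, '0');
        for (std::size_t bit = 0; bit < l; ++bit) {
            if ((prefix >> (l - 1 - bit)) & 1u) word[bit] = '1';
        }
        code[idx] = word;
        next_leaf += subtree;
    }
    return code;
}
```

For lengths {1, 2, 3, 3} this reproduces the example code 0, 10, 110, 111, and for {1, 1, 1} it returns std::nullopt. Sorting first matters: it guarantees that `next_leaf` is always aligned to the subtree size about to be claimed.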

The next post in this series proves McMillan's theorem and gives the construction explicitly.

Cross-References

The Bits Follow Types post in the Stepanov series develops codec combinators built on the algebraic structure of types. Each combinator, Opt, Either, Pair, Vec, relies on the element codecs being prefix-free for the decoding to be unambiguous. Kraft's inequality is what certifies that a codec's codeword-length assignment is internally consistent: any prefix-free codec's lengths satisfy it.

The When Lists Become Bits post in the same series shows that prefix-freeness is exactly the condition that makes the free-monoid lift into bit space injective: the Vec<C> combinator works because C is prefix-free, and prefix-freeness is precisely what Kraft characterizes. Kraft and the free-monoid construction are two views of the same constraint.

PFC (github.com/queelius/wire-formats/tree/master/lib/pfc) provides production codecs that all satisfy Kraft. The whole library is structured around Kraft-satisfying universal codes; see include/pfc/codecs.hpp for the full catalog including Elias gamma, Elias delta, Fibonacci, Rice, VByte, and Exp-Golomb codes.

Lattices: Fixed Points and Iteration

Alex Towell — Sun, 07 Jun 2026 03:48:21 +0000

Lattices have two operations of a different kind than rings. The structure determines a fixed-point algorithm.

Two Operations, Different Rules

Monoids have one binary operation. Rings have two (addition and multiplication) linked by distributivity. Lattices also have two operations, but with different laws entirely.

A lattice is a set with two operations:

meet ($\wedge$): greatest lower bound
join ($\vee$): least upper bound

Both are idempotent, commutative, and associative. And they satisfy the absorption laws:

$$a \wedge (a \vee b) = a \qquad a \vee (a \wedge b) = a$$

Absorption is what distinguishes lattices from a pair of unrelated monoids. It ties meet and join together: knowing one constrains the other.

A bounded lattice adds a least element (bottom, $\bot$) and a greatest element (top, $\top$). Bottom is the identity for join, top is the identity for meet.

In C++20 concepts, with ADL free functions:

template<typename L>
concept Lattice = std::semiregular<L> &&
    requires(L a, L b) {
        { meet(a, b) } -> std::convertible_to<L>;
        { join(a, b) } -> std::convertible_to<L>;
    };

template<typename L>
concept BoundedLattice = Lattice<L> &&
    requires(L a) {
        { bottom(a) } -> std::convertible_to<L>;
        { top(a) } -> std::convertible_to<L>;
    };

Four Examples

Sign lattice. Abstract signs of integers: bottom (unreachable), negative, zero, positive, top (unknown). Meet is greatest lower bound, join is least upper bound in the Hasse diagram. This is the classic abstract interpretation domain. You can define abstract arithmetic on it: pos * neg = neg, neg + neg = neg, pos + neg = top.

Intervals. Closed intervals $[a, b]$ ordered by inclusion. Meet is intersection. Join is the smallest enclosing interval. Bottom is the empty interval. Top is the full range. This is the foundation of interval arithmetic.

Divisors. Non-negative integers ordered by divisibility. Meet is gcd, join is lcm. Bottom is 1 (divides everything), top is 0 (everything divides 0). Lattice structure appearing in number theory.

Power sets. Subsets of $\{0, \ldots, N-1\}$. Meet is intersection (bitwise AND), join is union (bitwise OR). Bottom is the empty set, top is the full set.

All four satisfy BoundedLattice. All four satisfy the same laws. The concept constrains the interface; the laws constrain the semantics.
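As a hedged illustration of "the laws constrain the semantics", here is a C++17 sketch of the divisor lattice with a spot check of the laws on three sample elements. The `Divisor` type and `satisfies_lattice_laws` are hypothetical names, not the series' library code, and the check is a test on given values rather than a proof.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>

// Hypothetical divisor lattice: non-negative integers ordered by divisibility.
// Values are assumed small enough that std::lcm does not overflow.
struct Divisor {
    std::uint64_t value;
};

inline bool operator==(Divisor a, Divisor b) { return a.value == b.value; }

inline Divisor meet(Divisor a, Divisor b) { return {std::gcd(a.value, b.value)}; }
inline Divisor join(Divisor a, Divisor b) { return {std::lcm(a.value, b.value)}; }
inline Divisor bottom(Divisor) { return {1}; }  // 1 divides everything
inline Divisor top(Divisor) { return {0}; }     // everything divides 0

// Checks idempotence, commutativity, associativity, and absorption
// for one particular triple of elements.
template <typename L>
bool satisfies_lattice_laws(L a, L b, L c) {
    return meet(a, a) == a && join(a, a) == a
        && meet(a, b) == meet(b, a) && join(a, b) == join(b, a)
        && meet(a, meet(b, c)) == meet(meet(a, b), c)
        && join(a, join(b, c)) == join(join(a, b), c)
        && meet(a, join(a, b)) == a
        && join(a, meet(a, b)) == a;
}
```

For example, meet of 12 and 18 is 6 and their join is 36, and the laws hold for the triple (12, 18, 30) as well as for triples involving the top element 0.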

The Algorithm: Tarski's Fixed-Point Theorem

Here is the payoff. Tarski's theorem: any monotone function on a complete lattice has a least fixed point. When the lattice has finite height, that fixed point is computable by iterating from bottom.

Start at $\bot$. Apply $f$. Join the result with the current value. Repeat until nothing changes. This is Kleene iteration:

$$x_0 = \bot, \quad x_{n+1} = x_n \vee f(x_n)$$

The sequence $x_0 \leq x_1 \leq x_2 \leq \cdots$ is ascending. In a finite lattice, it must terminate.

template<BoundedLattice L, typename F>
L least_fixed_point(F f, std::size_t max_iter = 1000) {
    L x = bottom(L{});
    for (std::size_t i = 0; i < max_iter; ++i) {
        L next = join(x, f(x));
        if (next == x) return x;
        x = next;
    }
    return x;
}

This is the lattice analog of power() from the peasant post. There, monoid structure gave us repeated squaring. Here, lattice structure gives us fixed-point iteration. Different algebra, different algorithm, same principle: the structure determines the computation.

Application: Abstract Interpretation

The sign lattice is not just a mathematical curiosity. It is the simplest example of abstract interpretation, a technique for reasoning about programs without running them.

Consider a program variable x. Instead of tracking its concrete value, track its abstract sign. An assignment x = 3 gives x = positive. An operation x = a * b where a is negative and b is positive gives x = negative.

For loops, we need a fixed point. If a loop body adds a positive number to a variable starting at zero, what sign does the variable have after the loop? We compute the least fixed point of the transfer function:

Start: x = bot (unreachable)
Join with initial condition: join(bot, zero) = zero
Apply transfer: zero + positive = positive
Join: join(zero, positive) = top
Apply transfer: top + positive = top
Join: join(top, top) = top. Fixed point reached.

The result is top: the variable could be zero or positive (or, conservatively, anything). The sign lattice is too coarse to distinguish "non-negative" from "unknown". A richer lattice (like intervals) would give a tighter answer.
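The trace above can be reproduced with a small self-contained sketch. It is written in plain C++17 without the concepts, and the names `Sign`, `abstract_add`, and `loop_transfer` are hypothetical stand-ins, not the series' sign_lattice implementation.

```cpp
#include <cassert>
#include <cstddef>

// Five-element sign lattice: Bot below everything, Top above everything,
// and Neg, Zero, Pos pairwise incomparable in between.
enum class Sign { Bot, Neg, Zero, Pos, Top };

// Least upper bound.
inline Sign join(Sign a, Sign b) {
    if (a == b) return a;
    if (a == Sign::Bot) return b;
    if (b == Sign::Bot) return a;
    return Sign::Top;
}

// Abstract addition: pos + pos = pos, neg + neg = neg, pos + neg = top.
inline Sign abstract_add(Sign a, Sign b) {
    if (a == Sign::Bot || b == Sign::Bot) return Sign::Bot;
    if (a == Sign::Top || b == Sign::Top) return Sign::Top;
    if (a == Sign::Zero) return b;
    if (b == Sign::Zero) return a;
    return (a == b) ? a : Sign::Top;
}

// Kleene iteration x_{n+1} = join(x_n, f(x_n)), starting from Bot.
template <typename F>
Sign least_fixed_point(F f, std::size_t max_iter = 1000) {
    Sign x = Sign::Bot;
    for (std::size_t i = 0; i < max_iter; ++i) {
        const Sign next = join(x, f(x));
        if (next == x) return x;
        x = next;
    }
    return x;
}

// Transfer function for a loop that starts at zero and adds a positive
// number each iteration: the loop head sees the initial value joined
// with the result of the body.
inline Sign loop_transfer(Sign x) {
    return join(Sign::Zero, abstract_add(x, Sign::Pos));
}
```

Calling `least_fixed_point(loop_transfer)` walks through Bot, Zero, Top and stops at Top, matching the trace.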

The point is that the algorithm is the same regardless of the lattice. Swap sign_lattice for interval<int> or powerset<N>, and least_fixed_point still works. The lattice structure determines the iteration.

The Connection

In the peasant post, the observation was: any monoid supports efficient exponentiation. In the accumulator post: any monoid supports composable streaming computation. Here: any bounded lattice supports fixed-point iteration.

Each algebraic structure comes with its own generic algorithm. Monoid structure gives power(). Lattice structure gives least_fixed_point(). The pattern repeats: define the algebra, get the algorithm.

Algorithms arise from algebraic structure.

algebraic.mle: MLEs as Algebraic Objects

Alex Towell — Sun, 07 Jun 2026 03:47:49 +0000

Maximum likelihood estimators have rich mathematical structure. They are consistent, asymptotically normal, efficient. algebraic.mle exposes this structure through an algebra where MLEs are objects you compose, transform, and query.

The Abstraction

An MLE is not just a vector of parameter estimates. It is a statistical object that carries point estimates $\hat{\theta}$, the Fisher information matrix $I(\hat{\theta})$, the variance-covariance matrix $I^{-1}(\hat{\theta})$, Wald-type confidence intervals from asymptotic normality, the log-likelihood value, and convergence diagnostics.

The package wraps all of this in a consistent interface:

library(algebraic.mle)

fit <- mle(likelihood_model, data)
coef(fit)           # Parameter estimates
vcov(fit)           # Variance-covariance matrix
confint(fit)        # Confidence intervals
logLik(fit)         # Log-likelihood
aic(fit)            # Model selection

Composition

The real point is that MLEs compose. Independent models combine:

fit1 <- mle(model1, data1)
fit2 <- mle(model2, data2)
combined <- fit1 + fit2  # Joint likelihood

The package handles the algebra. Joint log-likelihood, block-diagonal Fisher information, everything propagates correctly. This works because likelihoods from independent data sources multiply, and multiplication of likelihoods is addition of log-likelihoods. That is a monoid. The package enforces it.

The Ecosystem

algebraic.mle is the foundation for a family of packages:

Package	Purpose
likelihood.model	Compositional likelihood specification
maskedcauses	Masked failure data in series systems
mdrelax	Relaxed masking conditions
algebraic.dist	Distributions as algebraic objects
flexhaz	Dynamic failure rate distributions
hypothesize	Likelihood ratio tests on MLEs
numerical.mle	Numerical optimization backends

The typical workflow:

Define distributions with algebraic.dist
Specify likelihood contributions with likelihood.model
Fit the model and get an mle object from algebraic.mle
Query statistical properties: confidence intervals, hypothesis tests, model selection

For series systems with masked data:

library(maskedcauses)
library(algebraic.mle)

# Specify masking model (C1-C2-C3 conditions)
model <- md_likelihood_model(components = 3, masking = "bernoulli")

# Fit -> returns algebraic.mle object
fit <- md_mle_exp_series_C1_C2_C3(masked_data)

# All the standard MLE methods work
confint(fit)
vcov(fit)
aic(fit)

Theory

The asymptotic properties that algebraic.mle exploits come from classical MLE theory:

$$\sqrt{n}(\hat{\theta}_n - \theta^{\ast}) \xrightarrow{d} \mathcal{N}(0, I^{-1}(\theta^{\ast}))$$

The expo-masked-fim paper derives closed-form Fisher information for exponential series systems. That is exactly what algebraic.mle uses internally for variance estimation in that case.

For more complex models (Weibull, relaxed masking conditions), we compute Fisher information numerically via observed information:

$$\hat{I}(\hat{\theta}) = -\frac{\partial^2 \ell}{\partial \theta \partial \theta^T}\bigg|_{\theta=\hat{\theta}}$$
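For a single exponential rate the whole chain can be worked by hand, which makes a useful sanity check. With log-likelihood $\ell(\lambda) = n \log \lambda - \lambda \sum_i x_i$, the MLE is $\hat{\lambda} = n / \sum_i x_i$ and the observed information is $n / \hat{\lambda}^2$. The sketch below is standalone C++ for illustration only; `ExpMle`, `fit_exponential`, and `wald_95` are hypothetical names and say nothing about how algebraic.mle is implemented.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical result record for an exponential-rate fit.
struct ExpMle {
    double rate;           // lambda-hat = n / sum(x)
    double observed_info;  // -l''(lambda-hat) = n / lambda-hat^2
    double std_error;      // sqrt(1 / observed_info)
    double log_lik;        // l(lambda-hat)
};

inline ExpMle fit_exponential(const std::vector<double>& x) {
    assert(!x.empty());
    double sum = 0.0;
    for (double xi : x) {
        assert(xi > 0.0);
        sum += xi;
    }
    const double n = static_cast<double>(x.size());
    ExpMle fit{};
    fit.rate = n / sum;
    fit.observed_info = n / (fit.rate * fit.rate);
    fit.std_error = std::sqrt(1.0 / fit.observed_info);
    fit.log_lik = n * std::log(fit.rate) - fit.rate * sum;
    return fit;
}

struct WaldInterval {
    double lower;
    double upper;
};

// Wald-type 95% interval from asymptotic normality.
inline WaldInterval wald_95(const ExpMle& fit) {
    const double z = 1.959963984540054;  // 97.5% standard normal quantile
    return {fit.rate - z * fit.std_error, fit.rate + z * fit.std_error};
}
```

With observations 1, 2, 3, 4 the estimate is $\hat{\lambda} = 0.4$, the observed information is 25, and the standard error is 0.2. The Wald interval is very wide at this sample size, which is expected for an asymptotic approximation with only four observations.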

Design Principles

Separation of concerns. The likelihood specification (likelihood.model) is independent of the fitting algorithm (numerical.mle) and the result type (algebraic.mle). You can swap optimizers without changing downstream code.

Correctness by construction. Standard errors, confidence intervals, and hypothesis tests are computed from the Fisher information, not ad-hoc formulas. If your likelihood is correct, statistical inference follows automatically.

Composability. Build complex models from simpler ones. The algebra ensures properties propagate correctly.

This package directly supports the work in my master's thesis on reliability estimation in series systems. The bootstrap confidence intervals, likelihood ratio tests, and model selection all use algebraic.mle objects.

Resources

GitHub: github.com/queelius/algebraic.mle
Documentation: queelius.github.io/algebraic.mle
Related paper: Closed-Form Fisher Information
Companion package: algebraic.dist (distributions as algebraic objects)

Choosing the Algebra

Alex Towell — Sun, 07 Jun 2026 03:47:48 +0000

The rest of this series asks: given a structure, what algorithms does it support? This post inverts the question.

The Flip Side

The peasant post showed that power() works on any monoid. The semirings post showed that matrix multiplication over different semirings solves six graph problems. The thread running through the whole series is: algorithms arise from algebraic structure.

But there's a flip side that we haven't addressed directly.

Sometimes you're stuck with an expensive algorithm not because the problem is hard, but because you're working in the wrong algebra. Change the algebra, and the algorithm becomes trivial. The cost shows up somewhere else, always. But if the cheap operation is the one you actually need, you win.

This is a very old idea. Napier invented logarithms in 1614 to turn multiplication into addition. What's worth noticing is that logarithms, odds ratios, tropical semirings, and quaternions are all doing the same thing.

The Pattern

A computational basis transform takes values from one domain and represents them in another, where different operations are cheap:

Domain	Cheap	Expensive
Log space	Multiplication (becomes addition)	Addition
Odds ratios	Bayesian updates (become multiplication)	Probability sums
Tropical $(\min, +)$	Shortest paths (become matrix mult)	Subtraction
Quaternions	Rotation composition	Euler angle extraction
Modular integers	Exponentiation	Ordering
Rationals	Exact arithmetic	Irrational representation

Each row follows the same structure. A transform $\varphi: D \to D'$ makes some operations cheaper and others more expensive. There is no free lunch.

This is not a deep theorem. It's almost tautological: if you could make everything cheaper by relabeling, the labels would already be the standard ones. But making the pattern explicit helps you recognize when you're paying for an operation you don't need.

Three Examples

Log Space

The most familiar example. You have a million small probabilities and need their product.

// Standard: underflows to 0 once the product falls below the range of double
double product = 1.0;
for (double p : probs) product *= p;  // 0.0

// Log domain: addition instead of multiplication
mutatio::lgd product(1.0);
for (double p : probs) product = product * mutatio::lgd(p);
// product.log() is finite. product.value() would underflow,
// but you stay in log space.

The algebra changed from $(\mathbb{R}^+, \times)$ to $(\mathbb{R}, +)$. Multiplication became addition. The isomorphism is $\log$.
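Both sides of the trade can be sketched with the standard library alone. This does not use mutatio, and `log_product` and `log_add` are hypothetical helper names: the first shows the cheap operation (a product becomes a sum of logs), the second shows where the cost went (addition in log space needs log-sum-exp).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// log(p_1 * p_2 * ... * p_n), computed as a sum of logs.
inline double log_product(const std::vector<double>& probs) {
    double total = 0.0;
    for (double p : probs) {
        assert(p > 0.0);
        total += std::log(p);
    }
    return total;
}

// log(exp(log_a) + exp(log_b)) without leaving log space.
// Assumes both arguments are finite.
inline double log_add(double log_a, double log_b) {
    const double hi = std::max(log_a, log_b);
    const double lo = std::min(log_a, log_b);
    return hi + std::log1p(std::exp(lo - hi));
}
```

A thousand factors of 1e-5 multiply to 0.0 in plain double arithmetic, while `log_product` returns roughly -11512.9. Adding two log-space values costs an exp and a log1p each time, which is the "expensive" column of the table.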

Tropical Semirings

The semirings post showed this already. Replace $(+, \times)$ with $(\min, +)$ and matrix multiplication becomes shortest-path computation. That's the same move: you changed the semiring to make the algorithm you wanted (all-pairs shortest paths) fall out of a generic operation (matrix power) that you already had.

mutatio::tropical_min<double> a(3.0), b(5.0);
auto sum = a + b;   // min(3, 5) = 3
auto prod = a * b;  // 3 + 5 = 8

The semirings post covers why this works in more detail than I'll repeat here.
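For readers who skipped that post, a rough sketch of the min-plus matrix product follows. It uses plain vectors rather than mutatio's types, and `min_plus`, `Matrix`, and `INF` are illustrative names I am assuming here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

using Matrix = std::vector<std::vector<double>>;
const double INF = std::numeric_limits<double>::infinity();  // "no edge"

// Min-plus product of two square matrices of the same size:
// semiring "addition" is min, semiring "multiplication" is +.
// c[i][j] is the cheapest way to go i -> k via a, then k -> j via b.
Matrix min_plus(const Matrix& a, const Matrix& b) {
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    Matrix c(n, std::vector<double>(n, INF));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                c[i][j] = std::min(c[i][j], a[i][k] + b[k][j]);
    return c;
}
```

Multiplying an adjacency matrix (with zeros on the diagonal) by itself gives the cheapest paths that use at most two edges.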

Odds Ratios

Bayesian inference in probability space requires normalization after every update. In odds space, it's just multiplication:

// Prior: 1% disease prevalence
auto prior = mutatio::odds_ratio<double>::from_probability(0.01);

// Positive test with likelihood ratio 18
auto posterior = prior * mutatio::odds_ratio<double>(18.0);

posterior.to_probability();  // ~15.4%

The transform $p \mapsto p/(1-p)$ is a homomorphism from Bayesian updates to multiplication. Sequential updates compose by multiplying likelihood ratios. No normalization needed until you want a probability back.
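The arithmetic behind the 15.4% figure fits in three one-line functions. This is a sketch with free functions of my own naming (`to_odds`, `update`, `to_probability`), not the mutatio interface.

```cpp
#include <cassert>

// Probability -> odds. Assumes 0 <= p < 1.
double to_odds(double p) {
    assert(p >= 0.0 && p < 1.0);
    return p / (1.0 - p);
}

// Odds -> probability.
double to_probability(double odds) {
    return odds / (1.0 + odds);
}

// One Bayesian update in odds space is a single multiplication.
double update(double prior_odds, double likelihood_ratio) {
    return prior_odds * likelihood_ratio;
}
```

For the example above: odds of 1/99 times a likelihood ratio of 18 gives 18/99, which converts back to 18/117, or about 0.1538.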

Connection to the Series

The Stepanov perspective says: find the structure, then the algorithm follows. The flip side: if the algorithm you need is expensive in the current structure, maybe you're in the wrong one.

The modular arithmetic post already showed this implicitly. Working in $\mathbb{Z}/p\mathbb{Z}$ instead of $\mathbb{Z}$ gives you multiplicative inverses (when $p$ is prime) and bounded arithmetic. The rational arithmetic post showed how working in $\mathbb{Q}$ instead of floating point eliminates rounding error at the cost of GCD computation.

Those were choosing algebras too. This post just names the pattern.

The Library

The six transforms live in lib/mutatio in the stepanov repo. Header-only, like everything else in the series.

Logarithmic, odds ratio, Stern-Brocot rationals, tropical, modular, and quaternion. Each one follows the same interface pattern: construct from the base domain, operate cheaply in the transformed domain, convert back when needed.

This is not a production numerical library. Eigen does quaternions better. GMP does rationals better. The point is making the pattern visible, not replacing specialized tools.

#include <mutatio.hpp>  // All six transforms

What I Left Out

I originally tried to write an academic paper about this, complete with a "No Free Lunch theorem" and category-theoretic formalization. The theorem was trivially true and the formalism didn't add anything the pattern didn't already say. So I deleted it.

Some things are better as observations than as theorems. "Choose the algebra that makes your dominant operation cheap" is one of them.

One Algorithm, Infinite Powers

Alex Towell — Sun, 07 Jun 2026 03:47:16 +0000

How the Russian peasant algorithm reveals the universal structure of exponentiation

The Algorithm

Russian peasants had a clever method for multiplication that does not require memorizing times tables. To compute 23 x 17:

23    17
11    34     (halve, double)
 5    68
 2   136
 1   272

Add the right column wherever the left is odd: 17 + 34 + 68 + 272 = 391. That is 23 x 17.

Why does this work? Because we are really computing:

23 x 17 = (16 + 4 + 2 + 1) x 17 = 16x17 + 4x17 + 2x17 + 17

The algorithm needs only three operations, two on the multiplier and one on the result:

half(n), integer division by 2
even(n), test if divisible by 2
Addition on the result
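Those three operations are enough to write the multiplication itself. The following is a minimal sketch for unsigned 64-bit integers; `peasant_multiply` is a name I chose for illustration, and overflow wraps silently.

```cpp
#include <cassert>
#include <cstdint>

bool even(std::uint64_t n) { return n % 2 == 0; }
std::uint64_t half(std::uint64_t n) { return n / 2; }

// Multiply n by a using only halving, parity tests, and addition.
std::uint64_t peasant_multiply(std::uint64_t n, std::uint64_t a) {
    std::uint64_t acc = 0;
    while (n > 0) {
        if (!even(n)) acc += a;  // left column is odd: add the right column
        a += a;                  // double the right column
        n = half(n);             // halve the left column
    }
    return acc;
}
```

`peasant_multiply(23, 17)` walks through exactly the rows of the table above and returns 391.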

From Multiplication to Exponentiation

Here is the insight that makes this interesting: the same algorithm computes powers.

Replace "add to accumulator" with "multiply into accumulator" and "double the multiplicand" with "square the base":

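// Assumes T is constructible from 1 (the identity) and that
// even() and half() are defined for int, as described above.
template <typename T>
T power(T base, int exp) {
    T result = 1;
    while (exp > 0) {
        if (!even(exp)) result = result * base;  // odd bit: fold base into result
        base = base * base;                      // square for the next bit
        exp = half(exp);
    }
    return result;
}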

This is O(log n) multiplications instead of O(n). Computing 2^1000 takes 16 multiplications (10 squarings plus 6 accumulator updates), not 999.

The Monoid Connection

The peasant algorithm works whenever you have:

An associative binary operation *
An identity element 1 where 1 * x = x * 1 = x

This structure is called a monoid. The algorithm computes x * x * ... * x (n times) using O(log n) operations.

What makes this powerful is that many things form monoids:

Type	Operation	Identity	Computing x^n gives you...
Integers	x	1	Powers
Matrices	x	I	Matrix powers
Strings	concat	""	String repetition
Functions	compose	id	Function iteration
Permutations	compose	id	Permutation powers
Quaternions	x	1	Repeated rotation
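To make the table concrete, here is a sketch of the same loop with the operation and identity passed in explicitly. The name `monoid_power` and its signature are my own illustration, not the interface used in the series.

```cpp
#include <cassert>
#include <string>

// Repeated application of an associative operation `op` whose identity
// element is `identity`. Operand order is never swapped, so `op` does
// not need to be commutative.
template <typename T, typename Op>
T monoid_power(T base, unsigned exp, T identity, Op op) {
    T result = identity;
    while (exp > 0) {
        if (exp % 2 == 1) result = op(result, base);
        base = op(base, base);
        exp /= 2;
    }
    return result;
}
```

Passing integer multiplication with identity 1 gives ordinary powers. Passing string concatenation with identity `""` gives string repetition, so `"ab"` raised to the third "power" is `"ababab"`.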

Why Associativity Unlocks Efficiency

Why does the peasant algorithm achieve O(log n) instead of O(n)? The answer lies in a single algebraic law: associativity.

Associativity says $(a \cdot b) \cdot c = a \cdot (b \cdot c)$. This looks innocuous, but it means we can restructure computation without changing results. Consider computing $a^8$:

Naive:     a x a x a x a x a x a x a x a     (7 multiplications)
Peasant:   ((a^2)^2)^2                        (3 multiplications)

Both produce the same answer because we can freely regroup. The peasant algorithm exploits this freedom systematically: instead of accumulating one factor at a time, it squares intermediate results and combines them.

This restructuring is the source of logarithmic complexity. Not a clever trick, but an inevitable consequence of the law. Given associativity, you are permitted to compute $a^8 = (a^4)^2 = ((a^2)^2)^2$. Without associativity, this rewriting would be invalid; the expressions would mean different things.

Each algebraic law unlocks specific computational freedoms:

Law	What it permits
Associativity	Restructuring evaluation order (enables O(log n) via squaring)
Identity	Base case for recursion ($x^0 = 1$)
Commutativity	Reordering operands (we don't require this!)
Inverses	Computing negative powers

Note what we don't require: commutativity. If we needed $a \cdot b = b \cdot a$, then quaternions and matrices could not participate. But they do, because the algorithm never reorders operands. It only regroups them.

What We Don't Require

The absence of a law is information: it tells you what you can't do.

Quaternion multiplication is non-commutative: $q_1 \cdot q_2 \neq q_2 \cdot q_1$ in general. The peasant algorithm works anyway because it never swaps operand order. When computing $q^5$, we use $q \cdot q \cdot q \cdot q \cdot q$, and the q's appear in the same sequence regardless of how we group them.

If we had required commutativity, we would have excluded quaternions, matrices, permutations, and function composition, losing most of the interesting examples. Minimal requirements maximize applicability.

What would commutativity enable? Parallel reduction with arbitrary partitioning. If $a \cdot b = b \cdot a$, you can split a sequence into chunks, reduce each chunk independently, then combine, regardless of which elements ended up in which chunk. Without commutativity, chunk boundaries must respect operand order.

Examples in Code

Fibonacci via Matrix Exponentiation

The Fibonacci recurrence F(n) = F(n-1) + F(n-2) can be encoded as matrix multiplication:

[F(n+1)]   [1 1]^n   [1]
[F(n)  ] = [1 0]   x [0]

Computing F(1,000,000) takes 27 matrix multiplications (20 squarings plus 7 accumulator updates), assuming the matrix entries are arbitrary-precision integers:

mat2 fib_matrix{1, 1, 1, 0};
mat2 result = power(fib_matrix, 1000000);
// result.b is F(1,000,000)
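A self-contained version for small n can be sketched with 64-bit entries. The struct layout below (fields a, b, c, d in row-major order) is my assumption about `mat2`, and `mat_power` and `fib` are illustrative names. Unsigned 64-bit arithmetic holds Fibonacci numbers only up to F(93), so this is a demonstration, not a route to F(1,000,000).

```cpp
#include <cassert>
#include <cstdint>

// 2x2 matrix [[a, b], [c, d]] with unsigned 64-bit entries.
struct mat2 {
    std::uint64_t a, b, c, d;
};

mat2 operator*(const mat2& x, const mat2& y) {
    return {x.a * y.a + x.b * y.c, x.a * y.b + x.b * y.d,
            x.c * y.a + x.d * y.c, x.c * y.b + x.d * y.d};
}

// Square-and-multiply, starting from the identity matrix.
mat2 mat_power(mat2 base, unsigned exp) {
    mat2 result{1, 0, 0, 1};
    while (exp > 0) {
        if (exp % 2 == 1) result = result * base;
        base = base * base;  // may wrap for large exp; wrapped values go unused
        exp /= 2;
    }
    return result;
}

// [[1,1],[1,0]]^n = [[F(n+1), F(n)], [F(n), F(n-1)]], so F(n) is entry b.
std::uint64_t fib(unsigned n) {
    return mat_power({1, 1, 1, 0}, n).b;
}
```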

Quaternion Rotations

A rotation by angle theta around axis (x,y,z) is a unit quaternion. Composing rotations is quaternion multiplication. To rotate by theta x n:

auto rot = rotation_z(theta);
auto rot_n = power(rot, n);  // O(log n) multiplications

Shortest Paths via Tropical Semiring

In the tropical semiring, "addition" is min and "multiplication" is +. Matrix "multiplication" computes path lengths. Powers find multi-hop paths:

trop_matrix adj = /* adjacency matrix with edge weights */;
auto paths_k = power(adj, k);  // paths_k[i][j] = shortest k-hop path i->j

Compound Interest

An affine transformation f(x) = ax + b under composition:

(a1*x + b1) compose (a2*x + b2) = a1(a2*x + b2) + b1 = (a1*a2)x + (a1*b2 + b1)

Compound interest with rate r and deposit d is f(x) = (1+r)x + d. After n years:

affine yearly = {1.05, 100};  // 5% interest, $100 deposit
affine after_30_years = power(yearly, 30);
double final_balance = after_30_years(1000);  // Starting with $1000
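The `affine` type is small enough to sketch in full. The definition below is my guess at a minimal version (two doubles, composition as `operator*`); the series' actual type may differ.

```cpp
#include <cassert>

// f(x) = a*x + b
struct affine {
    double a, b;
    double operator()(double x) const { return a * x + b; }
};

// Composition: (f * g)(x) = f(g(x)) = (f.a*g.a)x + (f.a*g.b + f.b).
affine operator*(const affine& f, const affine& g) {
    return {f.a * g.a, f.a * g.b + f.b};
}

// n-fold composition by square-and-multiply. The identity is f(x) = x.
affine affine_power(affine f, unsigned n) {
    affine result{1.0, 0.0};
    while (n > 0) {
        if (n % 2 == 1) result = result * f;
        f = f * f;
        n /= 2;
    }
    return result;
}
```

Applying `affine_power({1.05, 100}, 30)` to 1000 should agree, up to rounding, with running the yearly update 30 times in a loop.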

The Minimal Interface

The implementation uses C++20 concepts to express exactly what is needed:

#include <concepts>

template<typename T>
concept algebraic = requires(T a) {
    { zero(a) } -> std::convertible_to<T>;
    { one(a) } -> std::convertible_to<T>;
    { twice(a) } -> std::convertible_to<T>;
    { half(a) } -> std::convertible_to<T>;
    { even(a) } -> std::convertible_to<bool>;
    { a + a } -> std::convertible_to<T>;
};

Any type satisfying this concept works with power(). The examples demonstrate 15 different monoids, each with about 50 lines of code.

Why This Matters

Stepanov's key insight: algorithms arise from algebraic structure. The peasant algorithm is not really "about" integers or matrices. It is about monoids. Once you see this, you find the same pattern everywhere.

This is generic programming: write the algorithm once, state its requirements precisely, and let it work on anything that satisfies those requirements.

The Media Page

Alex Towell — Sun, 07 Jun 2026 03:47:15 +0000

I added a Media section to the site. It's a collection of books, lectures, talks, and other resources that have influenced my work. Part recommendation list, part intellectual autobiography.

Why Track This?

We are, to a large extent, what we read and watch. The books that stay with us shape how we frame problems, what questions we find interesting, the vocabulary we use to think. I wanted a place to make that visible, both for myself and for anyone curious about the intellectual foundations behind the work here.

The collection spans AI and machine learning, mathematics, CS theory, programming languages, systems, plus some science fiction and philosophy. These aren't random. They reflect threads I've been pulling on for years.

Some Highlights

Not "the best" necessarily, but formative.

Structure and Interpretation of Computer Programs

SICP remains, decades later, one of the most important books I've read. It's not really about Scheme or even programming. It's about computation as a medium for expressing ideas. The way Abelson and Sussman build up from simple primitives to interpreters to register machines shaped how I think about abstraction and the layering of meaning. The full text is free online, and the MIT lectures are worth watching too.

Rich Hickey: Language of the System

Hickey's talks are always good, but Language of the System stands out. It articulates something I'd felt but couldn't express: the problems we face in system design are often problems of language, of finding the right vocabulary for composition, the right primitives for combination. It's an engineer-philosophical meditation on why we struggle to build coherent systems and what it might take to do better.

Protector

This one is personal. Larry Niven's Protector isn't the best-written science fiction, but it was my first real introduction to the genre. I read it as a teenager traveling to and from a construction job. The premise stayed with me: humanity as a "child species," stuck in an early developmental phase, being quietly shepherded by a vastly more intelligent being who views us with something like parental care but without our moral sentimentality.

The Protector character, utilitarian, patient, operating on timescales we can barely comprehend, shaped how I think about intelligence gradients and what it might mean to be on the receiving end of paternalistic superintelligence. In retrospect, it was an early seed for my later interest in AI alignment.

The Stepanov Lineage

Alexander Stepanov's work, Elements of Programming, From Mathematics to Generic Programming, and the A9 Lectures, represents a philosophy of programming that resonates with me: algorithms have mathematical essences, generic programming is about capturing those essences in code, and good abstractions come from understanding the algebraic structure of what you're computing. If you care about the foundations of the STL or want to understand why certain APIs feel "right," start here.

Universal AI and Solomonoff Induction

Marcus Hutter's Universal Artificial Intelligence and the associated lectures on AIXI represent a Platonic ideal: the mathematically optimal agent, even if uncomputable. Understanding why AIXI works (and why it's uncomputable) illuminates the deep structure of the prediction-action problem. It's the theoretical ceiling against which practical approaches can be measured.

What's Next

Right now the Media page is mostly a list with categories and status markers. I plan to expand it:

Annotations and notes. For items I've deeply engaged with, I want to add marginalia: insights, disagreements, connections that emerge from careful reading.
Connected posts. Some books deserve full treatment. Expect posts that dig into specific ideas from these works.
Reading paths. Curated sequences for particular topics: "if you want to understand X, read these in this order."

For now, browse the Media page and see what catches your interest.

Succinct Bit Vectors and Rank/Select

Alex Towell — Sun, 07 Jun 2026 03:46:43 +0000

The claim that drives this post: store $n$ bits and answer prefix-count queries in $O(1)$ time, using only $n + o(n)$ bits total. The auxiliary index is asymptotically negligible. That is not obvious, and it is worth understanding why it holds.

Constant-Time Queries on Bit Vectors

Posts 1 through 10 of this series focused on encoding: universal codes, arithmetic coding, Huffman, LZ77. They all ask the same question in slightly different ways: how do we turn a sequence into a compact bit string? This post shifts direction. The question here is different: once we have a bit vector, how do we query it efficiently without expanding it?

The two fundamental queries on a bit vector of $n$ bits are:

rank$_1(i)$: the number of 1-bits in positions $[0, i)$, i.e., strictly before position $i$.
select$_1(j)$: the position of the $j$-th 1-bit (0-indexed).

These appear throughout data structures: inverted indexes, compressed graphs, FM-indexes, wavelet trees. Rank tells you how many elements precede a position. Select inverts it.

Three design points exist along the space-time axis:

Approach	Space	rank time	select time
Naive scan	$n$ bits	$O(n/64)$	$O(n/64)$
Full lookup table	$O(n \log n)$ bits	$O(1)$	$O(1)$
Succinct (this post)	$n + o(n)$ bits	$O(1)$	$O(\log n)$

The succinct approach hits the right trade-off: an auxiliary index that is asymptotically negligible (sublinear, so $o(n)$) while buying constant-time rank. Select costs $O(\log n)$ with a binary search. Getting $O(1)$ select requires one more index structure; that is covered briefly at the end and implemented in PFC's production version.

The Structure

The bit vector stores its $n$ bits packed into 64-bit words, LSB-first. Bit $i$ lives at position $i \bmod 64$ within word $\lfloor i / 64 \rfloor$. Unused bits in the final word are zeroed.

The auxiliary index has two levels:

Superblock array: one uint64_t per 4096-bit chunk. Entry $s$ holds the absolute cumulative rank from bit 0 to the start of superblock $s$.
Block array: one uint16_t per 64-bit word. Entry $b$ holds the rank within the enclosing superblock, from the superblock's start to the start of block $b$.

class SuccinctBitVector {
public:
    explicit SuccinctBitVector(const std::vector<bool>& bits);
    [[nodiscard]] std::size_t size()  const noexcept;
    [[nodiscard]] bool        bit(std::size_t i)  const noexcept;
    [[nodiscard]] std::size_t rank1(std::size_t i) const noexcept;   // O(1)
    [[nodiscard]] std::size_t select1(std::size_t j) const noexcept; // O(log n)

protected:
    std::size_t n_;
    std::vector<uint64_t> bits_;
    std::vector<uint64_t> superblock_ranks_;  // abs. rank at superblock boundaries
    std::vector<uint16_t> block_ranks_;       // superblock-relative rank per block

    static constexpr std::size_t SUPERBLOCK_BITS = 4096;
    static constexpr std::size_t BLOCK_BITS      = 64;
    static constexpr std::size_t BLOCKS_PER_SB   = 64;
};

The production version lives in PFC's include/pfc/succinct.hpp. The pedagogical version in this post strips the production features to the core structure.
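The constructor is not shown here, so the following is a sketch of how the two index arrays could be filled in one pass over the packed words. `RankIndex` and `build_rank_index` are names I am assuming for illustration, and `__builtin_popcountll` is the GCC/Clang builtin (C++20's `std::popcount` does the same job).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct RankIndex {
    std::vector<std::uint64_t> superblock_ranks;  // one per 4096-bit superblock
    std::vector<std::uint16_t> block_ranks;       // one per 64-bit word
};

// Single pass over the words, recording the number of 1-bits seen so far
// at every superblock boundary (absolute) and word boundary (relative).
RankIndex build_rank_index(const std::vector<std::uint64_t>& words) {
    constexpr std::size_t BLOCKS_PER_SB = 64;
    RankIndex idx;
    std::uint64_t total = 0;     // ones before the current word
    std::uint64_t sb_start = 0;  // ones before the current superblock
    for (std::size_t w = 0; w < words.size(); ++w) {
        if (w % BLOCKS_PER_SB == 0) {
            sb_start = total;
            idx.superblock_ranks.push_back(total);
        }
        // At most 63 * 64 = 4032, so the relative rank fits in 16 bits.
        idx.block_ranks.push_back(static_cast<std::uint16_t>(total - sb_start));
        total += static_cast<std::uint64_t>(__builtin_popcountll(words[w]));
    }
    return idx;
}
```

The relative counts are what let the block array use 16-bit entries: the rank resets to zero at every superblock boundary.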

Why Constant Time

A rank query for position $i$ proceeds in three steps:

Superblock lookup: $s = \lfloor i / 4096 \rfloor$. Load superblock_ranks_[s], giving the absolute count of 1-bits before superblock $s$.
Block lookup: $b = \lfloor i / 64 \rfloor$. Load block_ranks_[b], giving the count within superblock $s$ before block $b$.
Word popcount: mask the word at position $b$ to keep only bits $0 \ldots (i \bmod 64) - 1$, then call std::popcount.

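std::size_t rank1(std::size_t i) const noexcept {
    if (i == 0) return 0;
    std::size_t sb    = i / SUPERBLOCK_BITS;   // enclosing superblock
    std::size_t blk   = i / BLOCK_BITS;        // enclosing 64-bit word
    std::size_t w_off = i % BLOCK_BITS;        // bit offset within that word
    // Absolute rank at the superblock start plus relative rank at the block start.
    // Assumes both arrays have an entry for sb and blk, which needs one spare
    // entry when i == n_ falls exactly on a word boundary.
    std::size_t res   = superblock_ranks_[sb] + block_ranks_[blk];
    if (w_off > 0) {
        // Count only the bits strictly below position w_off in the word.
        uint64_t mask = (uint64_t{1} << w_off) - 1;
        res += std::popcount(bits_[blk] & mask);
    }
    return res;
}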

All three steps are $O(1)$: two array indexing operations and one hardware-accelerated popcount (POPCNT on x86-64, VCNT on ARM). The key insight is that partial sums are precomputed at two granularities. The superblock array handles the large-scale prefix sum; the block array handles the within-superblock refinement. The final popcount handles the within-word residual. Three lookups, total.

Select via Binary Search

Select is harder than rank. Rank has a natural decomposition: superblock boundary, then block boundary, then word bits. Select does not. The $j$-th 1-bit could be anywhere, and without additional precomputed information, there is no shortcut.

The approach here uses binary search over superblock_ranks_ to find which superblock contains the $j$-th 1-bit (the largest $s$ such that superblock_ranks_[s] <= j). Within that superblock, a linear scan over the block array pinpoints the right block. Within that block, iterating over set bits via repeated __builtin_ctzll (count trailing zeros) finds the exact position.

Complexity: $O(\log(n / 4096))$ for the binary search over superblocks, then $O(1)$ per block within the superblock (at most 64 blocks), then $O(64)$ for the bit scan. Total: $O(\log n)$.

The $O(1)$ select extension stores one additional array: a "select samples" array holding the position of every $(\log^2 n)$-th 1-bit. That compresses the binary-search range and allows $O(1)$ select at the cost of a few extra kilobytes for typical $n$. PFC's production version implements this. For the pedagogical version here, $O(\log n)$ select is sufficient.
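The binary-search select described above can be sketched as a free function. To stay self-contained it rebuilds the superblock ranks on every call, which a real structure would precompute once; it also scans words directly instead of consulting the block array. `select1_sketch` is an illustrative name, and the builtins are GCC/Clang specific.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Position of the j-th 1-bit (0-indexed), or words.size() * 64 if the
// vector holds fewer than j + 1 ones.
std::size_t select1_sketch(const std::vector<std::uint64_t>& words, std::size_t j) {
    constexpr std::size_t WORDS_PER_SB = 64;
    std::vector<std::uint64_t> sb_ranks;  // absolute rank at each superblock start
    std::uint64_t total = 0;
    for (std::size_t w = 0; w < words.size(); ++w) {
        if (w % WORDS_PER_SB == 0) sb_ranks.push_back(total);
        total += static_cast<std::uint64_t>(__builtin_popcountll(words[w]));
    }
    if (j >= total) return words.size() * 64;  // not enough 1-bits

    // Binary search: largest s with sb_ranks[s] <= j.
    auto it = std::upper_bound(sb_ranks.begin(), sb_ranks.end(),
                               static_cast<std::uint64_t>(j));
    std::size_t s = static_cast<std::size_t>(it - sb_ranks.begin()) - 1;
    std::uint64_t remaining = j - sb_ranks[s];

    // Linear scan over the words of that superblock.
    std::size_t w = s * WORDS_PER_SB;
    for (;;) {
        auto ones = static_cast<std::uint64_t>(__builtin_popcountll(words[w]));
        if (remaining < ones) break;
        remaining -= ones;
        ++w;
    }
    // Clear the lowest set bit `remaining` times, then locate the next one.
    std::uint64_t word = words[w];
    for (; remaining > 0; --remaining) word &= word - 1;
    return w * 64 + static_cast<std::size_t>(__builtin_ctzll(word));
}
```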

Where Succinct Bit Vectors Show Up

Inverted indexes in full-text search use bit vectors to represent posting lists (which documents contain a term). Rank and select over those bit vectors power gap compression: the posting list stores gaps between document IDs, and select recovers the original IDs.

The FM-index, used in short-read DNA alignment tools like BWA, builds a rank-queryable bit vector over the Burrows-Wheeler Transform. Each rank query takes $O(1)$ and drives an $O(m)$ pattern-search algorithm for a pattern of length $m$.

Compressed graphs (as in the WebGraph framework) store adjacency lists as bit vectors with succinct structures for traversal. Wavelet trees use a hierarchy of bit vectors to answer range-frequency queries in $O(\log \sigma)$ where $\sigma$ is the alphabet size.

In all of these, the $n + o(n)$ space guarantee is not academic. The index must not dwarf the data it queries.

The Space-Time Trade

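The auxiliary index for a 10000-bit vector costs about 338 bytes: 24 bytes for the superblock array (3 entries at 8 bytes each) and 314 bytes for the block array (157 entries at 2 bytes each). The bit vector itself costs 1256 bytes (10000 / 8, rounded up to the nearest word). The index is roughly 27% of the bit vector. With these fixed block sizes that ratio is a constant (1088 index bits per 4096 data bits), so the pedagogical layout is strictly $n + O(n)$ with a small constant; the $o(n)$ bound comes from letting the block and superblock sizes grow with $\log n$, as in the classical construction.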
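The byte counts follow directly from the layout: 8 bytes per superblock, 2 bytes per word, 8 bytes per word of payload. A small helper makes that checkable; `IndexCost` and `index_cost` are hypothetical names for this sketch.

```cpp
#include <cassert>
#include <cstddef>

struct IndexCost {
    std::size_t superblock_bytes;  // uint64_t per 4096-bit superblock
    std::size_t block_bytes;       // uint16_t per 64-bit word
    std::size_t bitvector_bytes;   // the packed bits themselves
};

IndexCost index_cost(std::size_t n_bits) {
    std::size_t words = (n_bits + 63) / 64;            // round up to whole words
    std::size_t superblocks = (n_bits + 4095) / 4096;  // round up to whole superblocks
    return {superblocks * 8, words * 2, words * 8};
}
```

For `n_bits = 10000` this gives 24, 314, and 1256 bytes, matching the figures above.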

Here is the comparison table across the four main approaches:

Representation	Space	rank	select	Best for
Plain scan	$n$ bits	$O(n/64)$	$O(n/64)$	Very small $n$
Full table	$O(n \log n)$ bits	$O(1)$	$O(1)$	Tiny $n$, memory to burn
Succinct (this post)	$n + o(n)$ bits	$O(1)$	$O(\log n)$	Large dense bit vectors
Sparse / RLE	$O(k \log n)$ bits for $k$ ones	$O(\log k)$	$O(\log k)$	Very sparse sets

The sparse / RLE row bridges to the next post. When a bit vector is extremely sparse, storing only the positions of the 1-bits beats the $n + o(n)$ dense approach. RoaringBitmap, covered in post 12, chooses dynamically between an array, a dense bit vector, and a run-length encoding based on the local density of each 64K-integer chunk.

Cross-References

This post builds on two earlier posts:

Post 1 established the Kraft lower bound: any lossless code must use at least $H$ bits per symbol. The succinct structure respects this by using $n + o(n)$ bits for an $n$-bit vector, asymptotically optimal.
Post 3 showed that a coding choice is implicitly a prior over the data. The choice of a dense bit vector (as opposed to an array or run-length encoding) reflects a prior that 1-bits are distributed roughly uniformly. When that prior is wrong, a different structure wins.

The next post shows how RoaringBitmap resolves this by refusing to commit to a single prior.

Footnote: PFC's include/pfc/succinct.hpp implements the full SuccinctBitVector with O(1) select via a select-samples index, and also includes RoaringBitmap with three container types. The pedagogical code in this post covers the core rank structure and O(log n) select.

Arithmetic Coding

Alex Towell — Sun, 07 Jun 2026 03:46:42 +0000

Huffman codes one symbol at a time. Arithmetic coding encodes the whole sequence as a single number. The difference is a factor of twelve, at least on the right source.

The Last Bit of Redundancy

Huffman coding gets expected codeword length within one bit of entropy. That is the best it can do, because codeword lengths must be integers while entropy is a real number.

The waste is structural. A symbol with probability $p = 0.7$ has optimal (fractional) length $-\log_2(0.7) \approx 0.515$ bits. Huffman rounds that up to 1 bit: 0.485 bits wasted per occurrence. For a nearly-deterministic source with $p_0 = 0.99$ and $p_1 = 0.01$, the entropy is $H \approx 0.081$ bits per symbol. Huffman is stuck at 1 bit per symbol. That is a factor-of-twelve gap, and Huffman cannot close it: a symbol that appears 99% of the time still gets a complete codeword.
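The numbers in that paragraph are easy to reproduce. A minimal sketch, with `binary_entropy` and `ideal_length` as names of my own choosing:

```cpp
#include <cassert>
#include <cmath>

// Entropy in bits of a two-symbol source with probabilities p and 1 - p.
// Assumes 0 < p < 1.
double binary_entropy(double p) {
    assert(p > 0.0 && p < 1.0);
    return -(p * std::log2(p) + (1.0 - p) * std::log2(1.0 - p));
}

// Ideal (fractional) codeword length in bits for a symbol of probability p.
double ideal_length(double p) {
    assert(p > 0.0 && p <= 1.0);
    return -std::log2(p);
}
```

`ideal_length(0.7)` is about 0.515 and `binary_entropy(0.99)` is about 0.081, so a coder stuck at 1 bit per symbol spends roughly twelve times the entropy.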

Arithmetic coding steps back from per-symbol codewords entirely. It encodes an entire sequence as a single rational number in $[0, 1)$. The bit-length of that number converges to the entropy of the sequence as the sequence grows. No integer rounding, no per-symbol overhead.

This post builds an integer range coder in C++23 and demonstrates the factor-of-twelve improvement on the Bernoulli(0.99) source.

The Continuous View

Start with the unit interval $[0, 1)$. For a two-symbol source, partition it by probability: symbol 0 gets $[0, p_0)$ and symbol 1 gets $[p_0, 1)$.

To encode a sequence, begin with the full interval and narrow it with each symbol. After symbol $s_1$, restrict to the corresponding sub-interval. After $s_2$, apply the same proportional rule inside that sub-interval. After $L$ symbols, the interval has width $\prod_{i=1}^{L} p_{s_i}$.

Any number inside the final interval is a valid encoding. The shortest such number in binary requires approximately $-\log_2(\prod p_{s_i}) = \sum_{i=1}^{L} (-\log_2 p_{s_i})$ bits. As $L \to \infty$, bits per symbol approaches $H(p) = -\sum_k p_k \log_2 p_k$ exactly.

Decoding is the inverse: given the encoded number, determine at each step which sub-interval it falls in, recover the symbol, narrow the interval, and repeat.
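Before moving to integers, the continuous view can be sketched directly in floating point for a two-symbol source. It only works for short sequences, because a double runs out of precision after a few dozen narrowing steps. `Interval`, `encode_interval`, and `decode_interval` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Interval {
    double low, high;  // half-open interval [low, high)
};

// Narrow [0, 1) once per symbol. Symbol 0 takes the lower p0 fraction
// of the current interval, symbol 1 takes the rest.
Interval encode_interval(const std::vector<int>& symbols, double p0) {
    Interval iv{0.0, 1.0};
    for (int s : symbols) {
        double split = iv.low + (iv.high - iv.low) * p0;
        if (s == 0) iv.high = split; else iv.low = split;
    }
    return iv;
}

// Replay the same narrowing, choosing at each step the side that
// contains `value`. The symbol count must be supplied by the caller.
std::vector<int> decode_interval(double value, std::size_t count, double p0) {
    std::vector<int> out;
    Interval iv{0.0, 1.0};
    for (std::size_t i = 0; i < count; ++i) {
        double split = iv.low + (iv.high - iv.low) * p0;
        if (value < split) { out.push_back(0); iv.high = split; }
        else               { out.push_back(1); iv.low = split; }
    }
    return out;
}
```

Any value inside the final interval decodes back to the original sequence; the midpoint is a convenient choice.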

The theory is complete. The practice is not. A real interval narrows exponentially fast: after a few dozen symbols you need arbitrary precision. The integer range coder fixes this with 32-bit arithmetic and a renormalization step.

The Integer Implementation

Real coders use 32-bit unsigned integers. Four boundary constants define the working range:

constexpr std::uint32_t TOP_VALUE     = 0xFFFFFFFFu;
constexpr std::uint32_t HALF          = 0x80000000u;
constexpr std::uint32_t QUARTER       = 0x40000000u;
constexpr std::uint32_t THREE_QUARTER = 0xC0000000u;

The encoder maintains low_ (lower bound), high_ (upper bound), and underflow_count_ (pending bits). Initially low_ = 0 and high_ = TOP_VALUE, representing the full interval.

Encoding symbol $s$ with cumulative frequency range $[\text{low\_cum}, \text{high\_cum})$ out of total $T$ shrinks the interval:

void encode_symbol(std::uint32_t low_cum, std::uint32_t high_cum,
                   std::uint32_t total) {
    std::uint64_t range = static_cast<std::uint64_t>(high_) - low_ + 1;
    high_ = low_ + static_cast<std::uint32_t>((range * high_cum) / total - 1);
    low_  = low_ + static_cast<std::uint32_t>((range * low_cum)  / total);
    renormalize();
}

The intermediate product is 64-bit to avoid overflow before the division. After shrinking, renormalize() extracts any bits that are now settled.

The renormalization loop is where the fiddly part lives:

void renormalize() {
    while (true) {
        if (high_ < HALF) {
            // Both bounds below the midpoint: the high bit is 0.
            emit_bit_and_underflow(false);
        } else if (low_ >= HALF) {
            // Both bounds above the midpoint: the high bit is 1.
            emit_bit_and_underflow(true);
            low_  -= HALF;
            high_ -= HALF;
        } else if (low_ >= QUARTER && high_ < THREE_QUARTER) {
            // Underflow: interval straddles the midpoint and is shrinking
            // toward it. Neither high bit is agreed. Count the pending bit.
            ++underflow_count_;
            low_  -= QUARTER;
            high_ -= QUARTER;
        } else {
            break;
        }
        low_  <<= 1;
        high_ = (high_ << 1) | 1u;
    }
}

Three cases. When both bounds are below the midpoint, the high bit is 0: emit it and double both bounds. When both are above, the high bit is 1: subtract HALF, emit, double. After emitting a bit, the working range is restored. The third case is the tricky one.

When low_ >= QUARTER and high_ < THREE_QUARTER, the interval straddles the midpoint and is narrowing toward it. Neither branch applies: the high bit is not yet resolved. The encoder defers by incrementing underflow_count_ and centering the interval. When the interval eventually escapes to one side, emit_bit_and_underflow emits the resolved bit followed by underflow_count_ complementary bits:

void emit_bit_and_underflow(bool bit) {
    sink_.write(bit);
    while (underflow_count_ > 0) {
        sink_.write(!bit);
        --underflow_count_;
    }
}

This is Elias-Gallager underflow correction. Without it, the encoder stalls whenever the interval converges toward $1/2$ without resolving to either half. It is not complicated in retrospect, but getting the invariants right in the first place requires care.

The decoder mirrors the encoder exactly. It maintains low_, high_, and a 32-bit code_ register primed from the first 32 bits of the compressed stream. To decode a symbol, it scales code_ into $[0, \text{total})$, looks up which cumulative interval it falls in, updates low_ and high_ identically to the encoder, then shifts in a new bit from the source:

template <typename FreqCb, typename RangeCb>
std::size_t decode_symbol(FreqCb&& get_freq_cb, RangeCb&& cum_range_cb,
                          std::uint32_t total) {
    std::uint64_t range  = static_cast<std::uint64_t>(high_) - low_ + 1;
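    // The +1 / -1 terms invert the encoder's floor divisions, so that
    // lo_cum <= scaled < hi_cum holds exactly for the encoded symbol.
    std::uint32_t scaled = static_cast<std::uint32_t>(
        ((static_cast<std::uint64_t>(code_ - low_) + 1) * total - 1) / range);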

    std::size_t sym = get_freq_cb(scaled);
    auto [lo_cum, hi_cum] = cum_range_cb(sym);

    high_ = low_ + static_cast<std::uint32_t>((range * hi_cum) / total - 1);
    low_  = low_ + static_cast<std::uint32_t>((range * lo_cum) / total);
    decoder_renormalize();
    return sym;
}

Both sides apply identical interval arithmetic. They stay synchronized without any out-of-band length information.
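Putting the pieces together, here is a compact round-trip sketch for a two-symbol alphabet. It is not the PFC implementation: the `Encoder` and `Decoder` structs, the `finish()` flush, the `vector<bool>` bit sink, and the zero-padding past the end of the stream are all simplifications I am assuming. The decoder's target computation uses the classic `(offset + 1) * total - 1` form so that it inverts the encoder's floor divisions exactly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint32_t TOP = 0xFFFFFFFFu, HALF = 0x80000000u,
                        QUARTER = 0x40000000u, THREE_QUARTER = 0xC0000000u;

struct Encoder {
    std::uint32_t low = 0, high = TOP, pending = 0;
    std::vector<bool> bits;

    void emit(bool bit) {  // resolved bit, then the deferred complements
        bits.push_back(bit);
        for (; pending > 0; --pending) bits.push_back(!bit);
    }
    void encode(std::uint32_t lo_cum, std::uint32_t hi_cum, std::uint32_t total) {
        std::uint64_t range = static_cast<std::uint64_t>(high) - low + 1;
        high = low + static_cast<std::uint32_t>(range * hi_cum / total - 1);
        low  = low + static_cast<std::uint32_t>(range * lo_cum / total);
        for (;;) {
            if (high < HALF) emit(false);
            else if (low >= HALF) { emit(true); low -= HALF; high -= HALF; }
            else if (low >= QUARTER && high < THREE_QUARTER) {
                ++pending; low -= QUARTER; high -= QUARTER;
            } else break;
            low <<= 1;
            high = (high << 1) | 1u;
        }
    }
    void finish() {  // two more bits pin down a quarter inside [low, high]
        ++pending;
        emit(low >= QUARTER);
    }
};

struct Decoder {
    const std::vector<bool>& bits;
    std::size_t pos = 0;
    std::uint32_t low = 0, high = TOP, code = 0;

    explicit Decoder(const std::vector<bool>& b) : bits(b) {
        for (int i = 0; i < 32; ++i) code = (code << 1) | next_bit();
    }
    std::uint32_t next_bit() {  // zero-pad past the end of the stream
        return (pos < bits.size() && bits[pos++]) ? 1u : 0u;
    }
    // Binary alphabet: symbol 0 owns [0, c0), symbol 1 owns [c0, total).
    int decode(std::uint32_t c0, std::uint32_t total) {
        std::uint64_t range = static_cast<std::uint64_t>(high) - low + 1;
        std::uint64_t scaled =
            ((static_cast<std::uint64_t>(code - low) + 1) * total - 1) / range;
        int sym = scaled < c0 ? 0 : 1;
        std::uint32_t lo_cum = sym ? c0 : 0u, hi_cum = sym ? total : c0;
        high = low + static_cast<std::uint32_t>(range * hi_cum / total - 1);
        low  = low + static_cast<std::uint32_t>(range * lo_cum / total);
        for (;;) {
            if (high < HALF) { /* top bit 0: nothing to subtract */ }
            else if (low >= HALF) { low -= HALF; high -= HALF; code -= HALF; }
            else if (low >= QUARTER && high < THREE_QUARTER) {
                low -= QUARTER; high -= QUARTER; code -= QUARTER;
            } else break;
            low <<= 1;
            high = (high << 1) | 1u;
            code = (code << 1) | next_bit();
        }
        return sym;
    }
};
```

To encode a symbol with $P(0) = 0.99$, call `encode(0, 99, 100)` for a 0 and `encode(99, 100, 100)` for a 1, then `finish()`. The decoder needs the symbol count from elsewhere, since this sketch has no end-of-stream symbol.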

Tests and the Compelling Example

The test suite verifies round-trip correctness across distributions and sequence lengths. The number worth looking at is the Bernoulli(0.99) demo: 1000 symbols, each drawn independently with $P(\text{sym0}) = 0.99$ and $P(\text{sym1}) = 0.01$.

H(0.99, 0.01) = -(0.99 log2 0.99 + 0.01 log2 0.01)
              ≈ 0.081 bits/symbol

Huffman cannot compress a binary source below 1 bit per symbol. The arithmetic coder, after encoding 1000 symbols, emits approximately 82 bits total. Huffman needs 1000. The arithmetic coder gets better as the sequence grows; Huffman stays stuck.

The BinarySourceDemoTest.Bernoulli99OneThousandSymbols test confirms:

bits_per_symbol < 1.0      (beats Huffman's floor)
bits_per_symbol < H + 1.0  (within 1 bit/symbol of entropy)

and verifies lossless round-trip on the full 1000-symbol sequence.

The Adaptive Variant

The encoder above uses a fixed probability model. Adaptive arithmetic coding generalizes to unknown or changing distributions by updating the cumulative-frequency table after each symbol. Both encoder and decoder apply identical updates in the same order, so they stay synchronized without any transmitted model.

The update is simple: after encoding or decoding symbol $s$, increment its frequency count by 1, update the cumulative totals, and rescale if the total exceeds a threshold. This is the adaptive (online) model used in practice; it differs from a semi-adaptive scheme, which makes a first pass to gather statistics and transmits the model.

The structural advantage over adaptive Huffman is real. Adaptive Huffman requires rebalancing a tree after each symbol: $O(\log n)$ with non-trivial bookkeeping. Adaptive arithmetic coding only increments a count and recomputes cumulative sums: $O(\text{alphabet size})$, cache-friendly.
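A frequency-count model of this kind can be sketched in a few lines. The struct name `AdaptiveModel`, the initial count of 1 per symbol, the default rescale limit, and the halving rule are assumptions of this sketch, not details taken from PFC.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct AdaptiveModel {
    std::vector<std::uint32_t> freq;  // per-symbol counts, never zero
    std::uint32_t total;              // sum of all counts
    std::uint32_t rescale_limit;      // halve the counts past this total

    explicit AdaptiveModel(std::size_t alphabet, std::uint32_t limit = 1u << 16)
        : freq(alphabet, 1u),
          total(static_cast<std::uint32_t>(alphabet)),
          rescale_limit(limit) {}

    // Cumulative range [lo, hi) for symbol s: O(alphabet size).
    std::pair<std::uint32_t, std::uint32_t> range(std::size_t s) const {
        assert(s < freq.size());
        std::uint32_t lo = 0;
        for (std::size_t i = 0; i < s; ++i) lo += freq[i];
        return {lo, lo + freq[s]};
    }

    // Called by both encoder and decoder after each symbol.
    void update(std::size_t s) {
        assert(s < freq.size());
        ++freq[s];
        ++total;
        if (total > rescale_limit) {
            total = 0;
            for (auto& f : freq) {
                f = (f + 1) / 2;  // round up so no count drops to zero
                total += f;
            }
        }
    }
};
```

The encoder would call `range(s)` to get the interval for `encode_symbol`, then `update(s)`; the decoder performs the same `update(s)` after recovering `s`, which keeps the two tables identical.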

Production arithmetic coders such as CABAC in H.264/HEVC and the multi-symbol coder in AV1 use more sophisticated context models. The arithmetic stage itself is unchanged. The probability estimate feeding it becomes a function of recent history. The coder only needs a cumulative-frequency interval; how the model produces $p$ is not its concern.

The Theoretical Endpoint

Shannon's source-coding theorem says that for a memoryless source with distribution $p$, no prefix-free code can achieve expected length below $H(p)$ bits per symbol. Arithmetic coding achieves $H(p)$ in the limit. The bound is tight, and arithmetic coding reaches it.

That settles compression for memoryless sources. Sources with memory are a different problem. A Markov source of order $k$ can be handled by conditioning on the last $k$ symbols: one arithmetic coder per context. Generalizing further, context-mixing predictors maintain an ensemble of models and blend their probability estimates. The arithmetic coding stage stays unchanged throughout: it only needs a good $p$.

The best general-purpose lossless compressors in use (PAQ, ZPAQ, and derivatives) are context-mixing predictors driving an arithmetic coder. The predictor is where the innovation happens. The coder is the fixed, information-theoretically optimal back end.

What arithmetic coding cannot do is compress below $H(p)$. That is not an implementation limitation. It is a theorem. The next tool in the series, Succinct Bit Vectors and Rank/Select, shifts from compressing data to indexing it space-efficiently: a different kind of optimality.

Cross-references

Back: Huffman Coding (post 9) showed that integer codeword lengths bound compression at one bit above entropy. Arithmetic coding removes that bound.

Universal Codes as Priors (post 3) introduced entropy and redundancy measurements; the priors::entropy() function used in the convergence tests comes from priors.hpp in that post's directory.

Forward: Succinct Bit Vectors and Rank/Select (post 11) shifts from entropy coding to space-efficient indexing, where the goal is not compression but constant-time rank and select queries on compressed representations.

Cross-series: In the Bits Follow Types framing, arithmetic coding is the entropy-optimal realization of the Either combinator's tag bit. The Either codec tags a choice with 1 bit regardless of probability; arithmetic coding replaces the tag with a fractional contribution proportional to the symbol's true information content.

Footnote: The production implementation lives at include/pfc/arithmetic_coding.hpp in the PFC library. It includes both the integer range coder developed here and a higher-level adaptive variant with configurable context models.

The Reality of Moral Properties: Do Values Exist?

Alex Towell — Sun, 07 Jun 2026 03:46:10 +0000

"Murder is wrong."

Is that statement like "2+2=4," objectively true regardless of what anyone thinks? Or is it like "chocolate tastes good," subjective and mind-dependent?

I keep returning to this question because it sits at the foundation of everything I care about in AI alignment. If moral properties (goodness, wrongness, oughtness) are real features of the universe, then in principle an AI could discover them. If they're human constructions, then values must be learned from us, with all the mess that entails. In my essay On Moral Responsibility, I try to take this seriously without pretending to have it figured out.

Moral Realism: Values Are Real

The realist says moral properties exist objectively, independent of anyone's beliefs or attitudes. Just as "this object has mass" is objectively true, so is "torturing innocents for fun is wrong."

There are several flavors. The Platonic version treats moral properties as abstract objects, like numbers. The naturalistic version says moral properties supervene on natural properties, so "wrong" might reduce to "causes suffering." The intuitionist version says we grasp moral truths through something like moral perception.

The Case For

Moral phenomenology. When you see someone torturing a child, wrongness isn't something you decide. It's something you perceive. The moral fact presents itself directly, much like the sky presenting as blue.

Disagreement presupposes objectivity. We argue about ethics. But disagreement only makes sense if there's a fact of the matter. Compare "Is torture wrong?" (where we assume there's an answer) with "Is chocolate tasty?" (where disagreement is just strange). The existence of genuine moral debate suggests we treat morality as objective.

Moral progress. We say abolishing slavery was moral progress. But if there's no objective moral truth, what does "progress" mean? Progress toward what?

Convergence. Despite cultural variation, core moral principles show remarkable convergence across societies: don't kill innocents, care for children, reciprocate cooperation, punish free-riders. This suggests universal moral truths that different cultures discover independently.

The Case Against

Metaphysical queerness (Mackie). Moral properties would be very strange entities. They're not physical (you can't detect "wrongness" with instruments). They're not mental (they're supposed to be mind-independent). They have intrinsic prescriptivity (they inherently motivate action). What kind of entity has all these properties?

The is/ought gap (Hume). You can't derive "ought" from "is." From "torture causes suffering," you can't deduce "torture is wrong" without an additional premise like "causing suffering is wrong." If moral facts are objective, shouldn't they be derivable from non-moral facts?

Persistent disagreement. While some principles converge, others show radical and persistent disagreement even among informed, rational people: honor killings, animal rights, abortion, euthanasia. If moral facts are objective and perceivable, this is hard to explain.

Evolutionary debunking. Our moral intuitions were shaped by evolution for inclusive fitness, not truth-tracking. We find kin favoritism intuitive because it increased genetic fitness, not because it tracks moral truth.

Moral Nominalism: Values Are Constructed

The nominalist says moral categories are human constructions, useful ways to organize experience and coordinate behavior. "Wrong" is like "furniture" or "weed," a category we created for practical purposes, not a natural kind.

The Case For

Parsimony. We can explain all moral phenomena (moral beliefs, moral language, moral motivation) without positing objective moral properties. Why multiply entities beyond necessity?

Anthropological diversity. Moral systems vary wildly across cultures: collectivist versus individualist moralities, honor-based versus care-based ethics, radically different views on sexuality, family, authority, purity. This suggests morality is culturally constructed, not discovered.

Evolutionary explanation. We can fully explain moral intuitions as evolutionary adaptations. Kin altruism produces nepotism intuitions. Reciprocal altruism produces fairness intuitions. Group selection produces loyalty intuitions. No need to posit objective moral facts being tracked.

The Case Against

Moral horror. "The Holocaust was wrong" seems objectively true, not a matter of opinion or cultural construction. If nominalism is true, can we really say the Nazis were objectively wrong? Or just that we disapprove?

The phenomenology of obligation. "I shouldn't steal" doesn't feel like "I prefer not to steal." It feels like a binding obligation independent of my preferences, coming from outside me.

Moral criticism. We criticize other cultures and individuals. "Female genital mutilation is wrong" seems to say more than "I don't like your culture's conventions." If morality is constructed, what grounds this criticism?

My Position: Pragmatic Agnosticism

In On Moral Responsibility, I take a middle path, and I think it's the honest one.

Whether moral properties are real or constructed, I can still make moral judgments, engage in moral reasoning, and work to restructure reality toward better states. You don't need to solve the philosophy of mathematics to do arithmetic. Similarly, you don't need to solve metaethics to do ethics.

Instead of starting with metaphysics (are values real?), I start with phenomenology (what's given in experience?). Some things are undeniable: suffering hurts, we prefer flourishing to suffering, we can act to reduce suffering. Whether suffering is "objectively bad" in some Platonic sense is contestable. I build on the undeniable and remain agnostic about the contestable.

This isn't a dodge. It's a recognition that practical ethics and AI alignment can't wait for metaphysics to be settled. We can treat moral claims as if they're objective for practical purposes, remain uncertain about their ultimate status, and still get on with the work.

What This Means for AI

The realism/nominalism debate has direct consequences for alignment.

If realism is true, an AI like SIGMA (from my novel The Policy) could in principle discover objective moral truths through rational reflection. The optimistic version: AI converges on objective morality aligned with human flourishing. The terrifying version: SIGMA discovers "objective values" that horrify humans. Who's right then?

If nominalism is true, AI can't discover values. It must learn them from humans. But which humans? Whose values? How do you aggregate conflicting values? And there's no objective standard to check whether it learned correctly.

There's a third option that I find compelling: value uncertainty as a safety feature. If SIGMA remains uncertain about whether values are objective, it might optimize more cautiously, preserve option value, seek human feedback more often, and resist overriding human judgment even when it "knows better." Moral uncertainty might be exactly the disposition we want in a powerful AI system.

Regardless of where you land on the metaphysics, the practical problems are the same: how do we specify what matters, how does AI learn complex context-dependent values, how do we handle conflicts between individuals, and how do we deal with value drift over time? These are the problems I focus on in the essay, because they need solving either way.

These ideas are explored more fully in On Moral Responsibility and dramatized in The Policy.

Bootstrap Methods: When Theory Meets Computation

Alex Towell — Sun, 07 Jun 2026 03:46:09 +0000

The bootstrap is a trade: mathematical complexity for computational burden. Instead of deriving analytical formulas for sampling distributions, you simulate them.

The Idea

If you don't know the sampling distribution of a statistic, approximate it by resampling from your data.

Draw resamples, each the same size as the original data, with replacement
Compute your statistic on each resample
The distribution of resampled statistics approximates the true sampling distribution

That's it. The justification is more subtle than the procedure. Under regularity conditions, the bootstrap distribution converges to the true sampling distribution as sample size grows. This is non-parametric inference: you use the empirical distribution as a stand-in for the true distribution, without assuming a parametric form.
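The steps above can be sketched in a few lines of Python. This is a minimal illustration, not code from my research: the function names, the synthetic exponential data, and the choice of the median as the statistic are all assumptions made for the example.

```python
import random
import statistics

def bootstrap_distribution(data, statistic, n_resamples=2000, seed=0):
    """Recompute `statistic` on resamples drawn with replacement from `data`."""
    rng = random.Random(seed)
    n = len(data)
    return [statistic(rng.choices(data, k=n)) for _ in range(n_resamples)]

def percentile_interval(values, level=0.95):
    """Central interval read off the sorted bootstrap statistics."""
    ordered = sorted(values)
    lo_idx = int((1 - level) / 2 * len(ordered))
    hi_idx = min(len(ordered) - 1, int((1 + level) / 2 * len(ordered)))
    return ordered[lo_idx], ordered[hi_idx]

# Synthetic data for illustration only.
data_rng = random.Random(1)
data = [data_rng.expovariate(1.0) for _ in range(50)]

medians = bootstrap_distribution(data, statistics.median)
std_error = statistics.stdev(medians)        # bootstrap standard error of the median
lo, hi = percentile_interval(medians)        # percentile confidence interval
```

The median has no convenient closed-form variance for a small sample, which is why it makes a reasonable demonstration: the spread of `medians` stands in for the formula.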

When I Use It

Bootstrap is my default tool when:

I need confidence intervals for statistics with no closed-form variance
Asymptotic theory doesn't apply (small samples, non-standard statistics)
I'm doing model selection via bootstrap cross-validation
I'm working with censored data where standard errors are intractable

That last case is the one that matters most for my research.

The Computational Trade

Better to get the right answer slowly than the wrong answer quickly.

Deriving an analytical variance formula is hard. Sometimes it's impossible for the statistic you actually care about. Bootstrap says: just compute the statistic 10,000 times on resampled data and look at the spread. With modern hardware, 10,000 resamples typically take seconds.

The trade is almost always worth it.

My Thesis Work

My research uses bootstrap heavily. I'm working on reliability estimation for series systems where components fail and you don't know which one caused the system failure. This is the masked failure data problem.

For these models, the MLE exists and you can compute it, but closed-form variance formulas don't exist. The Fisher information matrix involves expectations over the masking distribution that don't simplify to anything closed-form.

Bootstrap gives me confidence intervals anyway. Resample the masked failure data, recompute the MLE on each resample, and use the distribution of bootstrapped MLEs to construct intervals. It's not elegant, but it works, and "works" is the right criterion when the alternative is "no confidence intervals at all."
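The resample-and-refit loop looks roughly like this. The sketch below is a stand-in, not the masked series-system model from my thesis: it assumes a single exponential lifetime with right censoring, where the MLE of the rate is the number of observed failures divided by the total time on test. The function names and the synthetic data are hypothetical.

```python
import random

def exp_rate_mle(sample):
    """MLE of an exponential rate from (time, observed) pairs, right-censored."""
    failures = sum(1 for _, observed in sample if observed)
    total_time = sum(t for t, _ in sample)
    return failures / total_time

def bootstrap_mle_interval(sample, n_resamples=2000, level=0.95, seed=0):
    """Percentile interval from MLEs refit on resampled observations."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = sorted(
        exp_rate_mle(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    lo = estimates[int((1 - level) / 2 * n_resamples)]
    hi = estimates[min(n_resamples - 1, int((1 + level) / 2 * n_resamples))]
    return lo, hi

# Synthetic data for illustration: true rate 2.0, observation stops at time 1.0.
data_rng = random.Random(42)
raw = [data_rng.expovariate(2.0) for _ in range(200)]
sample = [(min(t, 1.0), t <= 1.0) for t in raw]

rate_hat = exp_rate_mle(sample)
lo, hi = bootstrap_mle_interval(sample)
```

Each observation is resampled as a whole (time together with its censoring flag), so the resamples keep the same censoring pattern as the data. Swapping in a different likelihood only changes `exp_rate_mle`.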

TD Learning for Exemplar Retrieval: Why It Doesn't Really Work

Alex Towell — Sun, 07 Jun 2026 03:45:37 +0000

Standard RAG retrieves few-shot examples by embedding similarity, which doesn't learn from outcomes. A trace that looks similar but leads the LLM astray gets retrieved just as readily as one that consistently helps. Closing that loop sounds clean.

Here's the setup. Store every reasoning trace the LLM produces. When a new problem arrives, retrieve the top-k most similar traces, feed them as few-shot examples, observe the solution, score it. Now do something standard RAG doesn't: assign each stored trace a learned value V(T) representing its utility as a few-shot example, and weight retrieval by a mix of similarity, learned value, and a UCB exploration bonus.
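One way the scoring could look, as a sketch: the weights `w_sim`, `w_val`, and `c`, the `Trace` fields, and the exact form of the exploration bonus are assumptions for illustration, not the values or code from the repository.

```python
import math
from dataclasses import dataclass

@dataclass
class Trace:
    name: str
    similarity: float    # similarity to the current problem, assumed in [0, 1]
    value: float = 0.0   # learned V(T)
    uses: int = 0        # how many times this trace has been retrieved

def retrieval_score(trace, total_retrievals, w_sim=0.5, w_val=0.5, c=0.1):
    """Mix of similarity, learned value, and a UCB-style exploration bonus."""
    bonus = c * math.sqrt(math.log(total_retrievals + 1) / (trace.uses + 1))
    return w_sim * trace.similarity + w_val * trace.value + bonus

def top_k(traces, k, total_retrievals):
    """The k highest-scoring traces, best first."""
    ranked = sorted(
        traces, key=lambda t: retrieval_score(t, total_retrievals), reverse=True
    )
    return ranked[:k]

traces = [
    Trace("A", similarity=0.9, value=0.1, uses=10),  # looks similar, rarely helps
    Trace("B", similarity=0.8, value=0.9, uses=10),  # slightly less similar, helps
    Trace("C", similarity=0.5, value=0.0, uses=0),   # never tried: largest bonus
]
chosen = [t.name for t in top_k(traces, k=2, total_retrievals=20)]
# chosen == ["B", "A"]: learned value lifts B above the more similar A
```

The `+ 1` terms keep the bonus finite for traces that have never been retrieved, which is a convenience of this sketch rather than a requirement of UCB.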

The learning rule is TD(0):

V(T) <- V(T) + alpha * (r + gamma * V(T_new) - V(T))

T is a retrieved trace, r is the reward (did the solution score well?), T_new is the trace just produced. The bootstrap term gamma * V(T_new) is supposed to be the magic. If T_new later turns out to be a useful example in its own right, V(T_new) rises, and on subsequent updates that rise flows back to T. Credit propagates through chains of retrieval influence (A retrieved to help produce B, B retrieved to help produce C, so A gets some credit for C) without explicit graph traversal. It's textbook TD, applied to a graph where the nodes are stored traces and the edges are retrieval provenance.
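A small sketch of the update, showing credit flowing back along a chain. The dictionary-based value table, the default of 0 for unseen traces, the step sizes, and the replay of the A to B edge are assumptions for illustration, not the repository's implementation.

```python
def td0_update(V, trace, reward, new_trace, alpha=0.1, gamma=0.9):
    """Apply V(T) <- V(T) + alpha * (r + gamma * V(T_new) - V(T)) in place."""
    v_old = V.get(trace, 0.0)
    target = reward + gamma * V.get(new_trace, 0.0)
    V[trace] = v_old + alpha * (target - v_old)
    return V[trace]

V = {}

# A was retrieved while producing B; that solution scored 0.
td0_update(V, "A", reward=0.0, new_trace="B")    # V["A"] stays 0.0

# Later, B is retrieved while producing C, and that solution scores 1.
td0_update(V, "B", reward=1.0, new_trace="C")    # V["B"] becomes 0.1

# Replaying the A -> B edge: A still earns reward 0, but the bootstrap
# term gamma * V["B"] is now positive, so some credit reaches A.
td0_update(V, "A", reward=0.0, new_trace="B")    # V["A"] becomes about 0.009
```

No graph traversal happens here: A's value rises only because B's stored value did, which is the chain-of-influence effect described above.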

I built this, evaluated it on GSM8K with Haiku, and spent a few days tuning it.

Here's the problem. Compare the full method against a trivial baseline: set V(T) = whether trace T solved its own problem correctly, weight retrieval by that, never update it. No TD. No influence graph. No bootstrapping. Just "prefer exemplars whose own answers were right." That baseline matches the full method.

The improvement over similarity-only retrieval isn't coming from learning. It's coming from not retrieving exemplars with wrong answers. On GSM8K, correctness is binary and "good trace = good exemplar." A lookup table of correctness captures everything the TD machinery laboriously rediscovers.

The lesson isn't that TD on an influence graph is wrong in principle. It's that for the mechanism to matter, good-trace and good-exemplar have to diverge. You need tasks where a trace can be correct but unhelpful as a demonstration, or incorrect but pedagogically useful (common failure modes, near-miss reasoning). GSM8K is not that. It's a task where the reward signal is exactly what you want to retrieve on, and any quality tracker converges to the same answer.

If I come back to this, I'll try code generation, where a correct solution can be a misleading template for a subtly different problem, or theorem proving, where the useful examples are often the failed attempts.

For now: run the trivial baseline first.

Code: github.com/queelius/exemplar-rl
