Matrix-Vector Calculus

Solverz provides automatic symbolic differentiation for mixed matrix-vector expressions. When an equation contains matrix-vector products (Mat_Mul), Solverz automatically computes analytical Jacobian blocks using tensor calculus with Einstein index notation.

This module is based on the approach described in:

Laue, Mitterreiter, Giesen. A Simple and Efficient Tensor Calculus. Proceedings of the AAAI Conference on Artificial Intelligence, 2020.

See also: matrixcalculus.org

Glossary

This page uses several abbreviations and design terms that recur throughout the document. They are collected here for reference; each one is hyperlinked from its first use in the body so you don’t have to scroll back. The terms describe Solverz’s code-generation strategy and the underlying numerical building blocks.

Term

Definition

SpMV

Sparse Matrix-Vector multiplication — computing \(y = M\,x\) where \(M\) is stored in a sparse format.

SpMM

Sparse Matrix-Matrix multiplication — computing \(C = M\,B\) where \(B\) has more than one column.

CSC / CSR

Compressed Sparse Column / Compressed Sparse Row — the two scipy.sparse storage formats Solverz uses. CSC stores data, indices, indptr walking columns; CSR walks rows. The legacy MatVecMul and the 0.8.1 Mat_Mul fast path both consume CSC.

@njit

The Numba decorator that compiles a Python function to native LLVM-backed code at first call. Solverz uses @njit(cache=True) so the compiled code is cached on disk and reused across processes.

scatter-add

An assembly pattern where many indexed reads are accumulated (+=) into a fixed output buffer. The mutable-matrix Jacobian kernels are pure scatter-add loops over precomputed index arrays — see Layer 2.

fancy indexing

NumPy / scipy.sparse indexing with arrays of indices, e.g. M[[r0, r1, r2], [c0, c1, c2]], returning the values at those positions. The slow fallback path uses this to extract block values from a freshly-built sparse expression.

fast path

The Solverz code-gen path used for Mat_Mul(A, x) placeholders where A is a plain sparse dim=2 Para: the matvec runs inside inner_F via the Numba SolCF.csc_matvec helper, with zero scipy.sparse calls per evaluation. Mutable-matrix Jacobian blocks have a parallel fast path of typed scatter-add kernels.

fallback path

The slower-but-correct path used whenever a Mat_Mul operand or a Jacobian block doesn’t fit the fast-path shape — for instance negated matrices, nested Mat_Mul, dense dim=2, or general matrix expressions. The wrapper falls back to scipy.sparse @ and (for Jacobian blocks) fancy indexing into a freshly-built sparse matrix. The fast and fallback paths co-exist: the runtime picks per placeholder / per Jacobian block based on what the symbolic classifier recognises.

lambdify

SymPy’s symbolic-to-callable converter. Solverz’s “inline” mode uses lambdify to turn each equation’s symbolic expression into a Python function on every call (no JIT).

hot F / hot J

Steady-state per-call wall-clock time for F_(y, p) / J_(y, p) after JIT warm-up. Distinguished from cold compile.

cold compile

The first call into a freshly-rendered module — pays the Numba @njit compile cost for every decorated helper. Cached to disk so subsequent process starts skip it.

LICM

Loop-Invariant Code Motion, an LLVM optimisation pass that hoists computations out of inner loops when their inputs do not change inside the loop.

inline mode

Solverz’s made_numerical(spf, y0) code path — uses lambdify to build callable F/J from the SymPy expressions on every call. No Numba compilation.

module printer mode

Solverz’s module_printer(spf, y0).render() code path — emits a Python module file with @njit-decorated helpers, cached to disk. The fast / fallback distinction only applies to this mode.

Supported Operations

For each supported operation the table below shows three things: the Solverz symbolic constructor, the mathematical form it represents, and the Jacobian block \(\partial f / \partial x\) that the matrix calculus engine emits when the equation is differentiated with respect to a vector variable x.

The “Jacobian block” column is not the symbolic intermediate Solverz computes for each equation. For an element-wise function like sin(x), Solverz first produces the elementwise vector cos(x) — the same shape as the input — and only at Jacobian assembly time wraps that vector in diag(...) to express the diagonal structure of the matrix block. Equation residuals stay vector-valued throughout symbolic processing; the matrix wrapper only appears in the final Jacobian. Read the third column as “the matrix you would get if you materialised \(\partial f / \partial x\) as a dense block”.

Operation

Symbolic Form

Example

Jacobian block (\(\partial f / \partial x\) for vector \(x\))

Matrix-vector multiply

Mat_Mul(A, x)

\(Ax\)

\(A\)

Element-wise multiply

x * y

\(x \odot y\)

\(\operatorname{diag}(y)\)

Addition

x + y

\(x + y\)

\(I\)

Absolute value

Abs(x)

\(|x|\)

\(\operatorname{diag}(\operatorname{sign}(x))\)

Exponential

exp(x)

\(e^x\)

\(\operatorname{diag}(e^x)\)

Sine / Cosine

sin(x) / cos(x)

\(\sin(x)\)

\(\operatorname{diag}(\cos(x))\)

Logarithm

ln(x)

\(\ln(x)\)

\(\operatorname{diag}(x^{-1})\)

Power

x**n

\(x^n\)

\(\operatorname{diag}(n x^{n-1})\)

Transpose

transpose(A)

\(A^T\)

\(A^T\)

Diagonal

Diag(x)

\(\operatorname{diag}(x)\)

\(\operatorname{diag}(\cdot)\)

Note

Mat_Mul is the recommended interface for matrix-vector products in new Solverz models. It is symbolic-first: the symbolic engine sees each Mat_Mul(A, x) as a real operator, walks it through the matrix-calculus pipeline, and emits an analytical Jacobian block automatically.

On the fast path — when A is a plain sparse dim=2 ParamMat_Mul ends up calling exactly the same Numba helper (SolCF.csc_matvec) the legacy MatVecMul interface uses. The two have the same per-call cost; the difference is that Mat_Mul survives the matrix-calculus engine and MatVecMul does not. On the fallback path (negated matrices, nested Mat_Mul, dense dim=2 parameters, or other matrix expressions) Mat_Mul falls through to scipy.sparse @ in the wrapper — see the Layer 1 discussion below.

MatVecMul is still deprecated for new models; pick Mat_Mul.

Warning

Sparse matrices MUST be immutable after modelling

All sparse matrix parameters declared with Param(..., dim=2, sparse=True) are assumed to be immutable (both sparsity pattern and numerical values) once modelling is complete and model.create_instance() has been called.

Solverz relies on this assumption for maximum runtime efficiency:

  • At code-generation time, the sparsity pattern of every mutable-matrix Jacobian block is analysed once, the output data-array layout is fixed, and a pre- computed index mapping is baked into a Numba-compiled scatter-add loop.

  • At runtime, every J_(y, p) call assembles Jacobian block data by directly indexing into Matrix.data using the pre-computed row/column mappings — no scipy.sparse matrix is constructed per call.

  • For sparse-matrix parameters used in Mat_Mul(A, x), the matrix-vector product A @ x is precomputed in the F_ wrapper and passed as a dense vector to the @njit-compiled equation functions.

If your application truly needs to update sparse-matrix values between solves (e.g., sweeping a parameter), rebuild the model with model.create_instance() each time. Mutating p_["A"].data after creation will silently produce wrong Jacobians for matrices that flow through the vectorised Mat_Mul fast path — scatter-add kernels cache M.data at analysis time and the cache becomes stale. Matrices that only land in the MutableMatJacDataModule fallback path (see Fallback path) re-evaluate the full expression on every J_() call and would reflect the mutation, but there is no way to guarantee a particular block won’t be promoted to the fast path in a future release, so the rule is still: do not mutate sparse matrix params post-build.

Dense parameters (Param(..., dim=2, sparse=False)) are fine to update via mdl.p["name"] = new_value; the restriction only applies to sparse dim=2 parameters.

All element-wise functions (exp, sin, cos, ln, Abs) can be composed with matrix operations:

from Solverz import Var, Param, Eqn, Model
from Solverz.sym_algebra.functions import Mat_Mul, exp, sin

m = Model()
m.x = Var('x', [1, 2, 3])
m.A = Param('A', [[1, 0, 0], [0, 2, 0], [0, 0, 3]], dim=2)
m.f = Eqn('f', exp(Mat_Mul(m.A, m.x)))  # exp(A@x)

Solverz will automatically compute \(\frac{\partial}{\partial x} e^{Ax} = \operatorname{diag}(e^{Ax}) \cdot A\).

How It Works

The matrix calculus engine is a four-stage pipeline: each expression is lifted into Einstein index notation, rebuilt as a computation DAG, differentiated forward through the DAG, and finally collapsed back to ordinary matrix/vector operations. We walk through each stage with a running example from power flow.

1. Einstein Index Notation

Every tensor is labelled with one index per axis:

  • Scalar: no index (e.g., \(\alpha\)). Zeroth-order tensor, no axes.

  • Vector \(x \in \mathbb{R}^n\): a single index \(x_i\), where \(i\) ranges over \(\{1, \dots, n\}\). The index names the axis; the letter used is arbitrary, only its pattern of repetition matters.

  • Matrix \(A \in \mathbb{R}^{m \times n}\): two indices \(A_{ij}\). The first index \(i\) labels rows, the second \(j\) labels columns.

A repeated index inside a product implies summation (Einstein summation convention, or contraction):

\[A_{ij}\, x_j \;\equiv\; \sum_{j=1}^{n} A_{ij}\, x_j \;=\; (Ax)_i.\]

Here \(j\) is repeated (contracted — it disappears in the result), while \(i\) is a free index that labels the resulting vector’s axis. Conventionally we write the non-contracted indices on the left of the expression, so the indices of the left-hand side \((Ax)_i\) match the remaining free indices \(i\) on the right.

The benefit of this notation is that the shape of an expression is fully encoded in its index signature. Element-wise multiplication uses repeated indices without contraction:

\[(x \odot y)_i \;=\; x_i\, y_i,\]

and the outer product uses two different free indices:

\[(x\, y^\top)_{ij} \;=\; x_i\, y_j.\]

2. Computation DAG

Solverz parses each equation’s symbolic expression into a directed acyclic graph (DAG):

  • Leaf nodes — variables (\(x\), \(y\)) and parameters (\(A\), \(B\), \(G\)). Each leaf carries its index signature (\(x_i\), \(A_{ij}\), etc.).

  • Internal nodes — operations: addition \(+\), element-wise product \(\odot\), matrix product \(@\), element-wise unary function \(f(\cdot)\).

  • Root node — the output expression whose Jacobian we want.

For the power flow active-power sub-expression \(e \odot (Ge - Bf)\):

        *[i]          ← element-wise multiply (free index i)
       /    \
     e[i]   -[i]      ← subtraction (free index i)
           /    \
       @[i]    @[i]   ← matrix-vector products, contraction on j
       / \     / \
     G[ij] e[j] B[ij] f[j]

Each node remembers its free index set. The matrix product nodes @[i] each contract index \(j\) between \(G_{ij}\) and \(e_j\), so the result has free index \(i\). The subtraction and element-wise product keep \(i\) as their only free index, so the whole expression has index signature \([i]\) — a vector.

3. Forward-Mode Automatic Differentiation

To obtain \(\partial f / \partial x\) where \(f\) is the root and \(x\) is some leaf variable, derivatives propagate from leaves upward to the root (pushforward, a.k.a. forward-mode AD). Every internal node computes its own Jacobian contribution and combines it with the Jacobians already accumulated at its children.

Throughout this section, \(A_{s_1}\) and \(B_{s_2}\) denote the free indices of the two operands (before differentiation), \(s_3\) denotes the free index of the result, and \(s_4\) is the derivative-direction index — the axis along which we differentiate, equal to the free index of the variable \(x\) we differentiate with respect to. Tilded tensors \(\tilde A, \tilde B\) carry one extra axis (labelled \(s_4\)) compared to \(A, B\): the derivative of a rank-\(k\) tensor w.r.t. a rank-1 variable is a rank-\((k{+}1)\) tensor.

  1. Leaf variables. Differentiating \(x_j\) w.r.t. \(x_{s_4}\) gives the Kronecker delta:

    \[\tilde{x}_{j\,s_4} \;=\; \frac{\partial x_j}{\partial x_{s_4}} \;=\; \delta_{j s_4}.\]

    All other variables (and all parameters) have derivative \(0\) w.r.t. \(x\).

  2. Multiplication (product rule in index notation):

\[\tilde{C}_{s_3\,s_4} \;=\; B_{s_2}\cdot\tilde{A}_{s_1\,s_4} \;+\; A_{s_1}\cdot\tilde{B}_{s_2\,s_4}\]

where \(A_{s_1} \cdot B_{s_2} = C_{s_3}\) is the original product (the specific index pattern \(s_1, s_2 \to s_3\) depends on the product type — element-wise, matrix-vector, or matrix-matrix). The two summands encode “differentiate the first factor, keep the second” plus “keep the first factor, differentiate the second”.

  1. Element-wise unary function \(f\) (e.g. \(\exp\), \(\sin\), \(\ln\)): since \(f\) acts point-wise, the chain rule gives

\[\frac{\partial f(A_{s_1})}{\partial x_{s_4}} \;=\; f'(A_{s_1})\odot\tilde{A}_{s_1\,s_4}.\]

The derivative \(f'\) is also computed element-wise and multiplied (Hadamard) with the child’s Jacobian.

  1. Addition. Linearity:

\[\tilde{(A+B)}_{s_3\,s_4} \;=\; \tilde{A}_{s_1\,s_4} + \tilde{B}_{s_2\,s_4}.\]

4. TMul2Mul: Index Resolution

After the forward pass, derivatives live as tensor expressions labelled with free indices. The TMul2Mul stage reads the index pattern of each product and emits the corresponding concrete matrix / vector operation. Each row of the table has the form \((s_1, s_2, s_3)\): the free indices of the left operand, right operand, and the result.

Index pattern \((s_1, s_2, s_3)\)

Matrix operation

Meaning

\((i,\; i,\; i)\)

\(x \odot y\)

element-wise multiply — both operands share axis \(i\), no contraction

\((i,\; j,\; ij)\)

\(x\,y^\top\)

outer product — two disjoint free indices survive into the result

\((ij,\; j,\; i)\)

\(A\,y\)

matrix-vector: index \(j\) is shared (contracted), \(i\) survives

\((ij,\; jk,\; ik)\)

\(A\,B\)

matrix-matrix: inner \(j\) is contracted, \((i,k)\) survive

\((i,\; ij,\; ij)\)

\(\operatorname{diag}(x)\,A\)

left diagonal scaling — scale each row of \(A\) by the matching entry of \(x\)

\((ij,\; j,\; ij)\)

\(A\,\operatorname{diag}(y)\)

right diagonal scaling — scale each column of \(A\) by the matching entry of \(y\)

\((i,\; ii,\; ii)\)

\(\operatorname{diag}(x \odot y)\)

diagonal-from-element-wise — used when a product has matching repeated indices \((ii)\) on one operand

The key insight is that every well-formed index pattern maps to one concrete sparse-preserving operation. Patterns like \((ij, ij, ij)\) that don’t match the table are either simplified earlier (e.g. \((ij, ij, ij) \to A \odot B\), element-wise matrix product) or emitted via a fallback that produces a dense matrix.

Application Examples

Example 1: Power Flow Equations

The active power balance equation in power systems:

\[P = e \odot (Ge - Bf) + f \odot (Be + Gf)\]
from Solverz.sym_algebra.symbols import iVar, Para
from Solverz.sym_algebra.functions import Mat_Mul
from Solverz.sym_algebra.matrix_calculus import TensorExpr

e = iVar('e')
f = iVar('f')
G = Para('G', dim=2)
B = Para('B', dim=2)

P = e * (Mat_Mul(G, e) - Mat_Mul(B, f)) + f * (Mat_Mul(B, e) + Mat_Mul(G, f))
TP = TensorExpr(P)

print(TP.diff(e))  # diag(-B@f + G@e) + diag(e)@G + diag(f)@B
print(TP.diff(f))  # diag(B@e + G@f) + diag(e)@(-B) + diag(f)@G

Example 2: Nonlinear Matrix Equations

from Solverz.sym_algebra.functions import exp

A = Para('A', dim=2)
x = iVar('x')

expr = exp(Mat_Mul(A, x))
TE = TensorExpr(expr)
print(TE.diff(x))  # diag(exp(A@x))@A

Example 3: Heat Network Equations

mL = Para('mL', dim=2)
K = Para('K')
m = iVar('m')
from Solverz.sym_algebra.functions import Abs

expr = Mat_Mul(mL, K * Abs(m) * m)
TE = TensorExpr(expr)
print(TE.diff(m))  # mL@(diag(K*Abs(m)) + diag(K*m*Sign(m)))

How numerical code is generated and optimized

When module_printer compiles an equation system that contains Mat_Mul, it applies a three-layer optimisation strategy to make the generated F_/J_ functions run as fast as possible. All three layers operate on the same insight: at modelling time we already know what the equations and their Jacobians look like — so every computation that does not depend on the current iterate \(y\) can be lifted out of the hot loop and paid for once, at code generation or module load.

Warning

Everything below assumes sparse matrix parameters are immutable after model.create_instance(). See the warning earlier in this document. Every layer described here bakes in the matrix’s sparsity pattern or its .data values as a constant; mutating them in-place will silently corrupt results without raising any error.

Layer 1 — Mat_Mul precomputes as @njit CSC matvec

The code printer walks every residual RHS looking for Mat_Mul(A, v) sub-trees, where A is a sparse parameter and v is any vector-valued expression (a variable, a slice, or a larger expression). It performs three transformations at code gen time:

  1. Hoist. Each Mat_Mul(A, v) is replaced by a fresh placeholder variable _sz_mm_N (where N is a monotonically increasing integer unique across the whole system). The _sz_mm_ prefix is reserved by Solverz — see Restrictions and reserved names below.

  2. Deduplicate. Two Mat_Mul sub-trees that are structurally equal (same matrix, same operand expression after hoisting) are collapsed into a single placeholder. The power-flow example below computes G_nr @ e once even though both the P-balance and Q-balance residuals reference it.

  3. Classify each placeholder. The classifier (_classify_matmul_placeholders, with the inner shape predicate _shape_is_fast, in Solverz/code_printer/python/module/module_printer.py) decides whether the precompute can go on the fast path (matrix is a plain sparse dim=2 Para and the operand is a clean vector expression) or the fallback path (anything else: -Para, nested Mat_Mul, dense dim=2 matrices, matrix expressions, or operands that still contain unresolved Mat_Mul / sparse dim=2 Para references).

    A fast-path candidate is also demoted to the fallback path if another fallback placeholder’s matrix_arg or operand_arg references it as a free symbol — otherwise the wrapper would emit a scipy SpMV like _sz_mm_1 = (-A) @ _sz_mm_0 whose _sz_mm_0 operand was never materialised. The classifier iterates this demotion to a fixed point so chains of nested Mat_Muls land on a topologically valid emission order.

Fast path — the matvec is computed inside inner_F via the existing SolCF.csc_matvec Numba helper (Solverz/num_api/custom_function.py). Each sparse dim=2 Param is already decomposed at model-build time into its CSC flat arrays (A_data / A_indices / A_indptr / A_shape0) by Solverz/model/basic.py; these flat arrays flow through param_list to inner_F as ordinary numpy arrays, and inner_F issues a SolCF.csc_matvec(A_data, A_indices, A_indptr, A_shape0, operand) call per placeholder. The wrapper F_() does not load the sparse matrix object A at all — the whole matvec lives in Numba land and pays zero scipy dispatch cost:

def F_(y_, p_):
    e = y_[0:29]
    f = y_[29:58]
    G_nr_data = p_["G_nr_data"]
    G_nr_indices = p_["G_nr_indices"]
    G_nr_indptr = p_["G_nr_indptr"]
    G_nr_shape0 = p_["G_nr_shape0"]
    B_nr_data = p_["B_nr_data"]
    ...   # dense vector params follow
    return inner_F(_F_, e, f, p_ref, q_ref,
                   G_nr_data, G_nr_indices, G_nr_indptr, G_nr_shape0,
                   B_nr_data, B_nr_indices, B_nr_indptr, B_nr_shape0,
                   ...)

@njit(cache=True)
def inner_F(_F_, e, f, p_ref, q_ref,
            G_nr_data, G_nr_indices, G_nr_indptr, G_nr_shape0,
            B_nr_data, B_nr_indices, B_nr_indptr, B_nr_shape0, ...):
    # Fast-path matvecs: all inside @njit, no scipy dispatch.
    _sz_mm_0 = SolCF.csc_matvec(G_nr_data, G_nr_indices, G_nr_indptr, G_nr_shape0, e)
    _sz_mm_1 = SolCF.csc_matvec(B_nr_data, B_nr_indices, B_nr_indptr, B_nr_shape0, f)
    _sz_mm_2 = SolCF.csc_matvec(B_nr_data, B_nr_indices, B_nr_indptr, B_nr_shape0, e)
    _sz_mm_3 = SolCF.csc_matvec(G_nr_data, G_nr_indices, G_nr_indptr, G_nr_shape0, f)
    _F_[0:29] = inner_F0(e, f, p_ref, q_ref, _sz_mm_0, _sz_mm_1, _sz_mm_2, _sz_mm_3)
    _F_[29:53] = inner_F1(...)
    return _F_

@njit(cache=True)
def inner_F0(e, f, p_ref, q_ref, _sz_mm_0, _sz_mm_1, _sz_mm_2, _sz_mm_3):
    return e * (p_ref - _sz_mm_1 + _sz_mm_0) + f * (q_ref + _sz_mm_2 + _sz_mm_3) - Pinj

Why this matters: scipy.sparse matrix-vector products are implemented in C and fast per se, but each call crosses Python → scipy dispatch → C → back, which on a 58×58 matrix costs ~1.5 µs of pure dispatch overhead — 10× the actual arithmetic. Eight such SpMVs in a hot F made Mat_Mul catastrophically slow on small networks (case30 hot F was 10× worse than the for-loop form). Moving the matvec into SolCF.csc_matvec inside inner_F eliminates all Python dispatch: the helper is @njit(cache=True) and inner_F calls it as an intra-Numba function call, which the LLVM inline pass typically flattens into the caller’s body. On case30, this drops hot F from ≈14 µs to ≈3 µs (~4.4× speedup) without any other code change.

Fallback path — when the matrix operand is not a plain sparse Para (e.g. -G_nr, Mat_Mul(A, B), or a dense dim=2 parameter), the old path is kept: F_() emits _sz_mm_N = matrix_expr @ operand in the wrapper (using scipy.sparse or numpy @) and passes _sz_mm_N as an extra dense argument to inner_F. These placeholders do not benefit from the fast path and pay the scipy dispatch cost per call. Dense dim=2 parameters fire a one-shot UserWarning at FormJac time — see Restrictions and reserved names.

Note

The fast path reuses infrastructure that predates the Mat_Mul interface: the same SolCF.csc_matvec helper and the same <name>_data / _indices / _indptr / _shape0 Param decomposition were introduced for the legacy MatVecMul operator. Before 0.8.1 fast path, those fields were still emitted into inner_F’s argument list but never referenced — they are now the conduit for the Layer 1 matvec. Legacy MatVecMul users already benefited from this pipeline; from 0.8.1 Mat_Mul users do as well, without any model-level changes.

When to use Mat_Mul

API-level rules of thumb. The detailed case study with full benchmark numbers, decision matrix, and the case30-scale hot-F regression analysis lives in the power-flow chapter of the Solverz Cookbook — this section just states the API guidance.

Use Mat_Mul when any of the following hold:

  • You want compact, paper-faithful equations. Mat_Mul(G, e) replaces nb scalar Eqn definitions and lets the matrix calculus engine derive the Jacobian automatically.

  • You iterate on the model. Every compile / build phase (Model construction, FormJac, made_numerical, module_printer.render, and the first-time Numba cold compile) is dramatically faster under Mat_Mul than under the equivalent element-wise for-loop form.

  • All matrix operands are plain Param(..., dim=2, sparse=True) — this unlocks the Layer 1 fast path described above.

The narrow case where the for-loop form still wins (very small network, F-dominated hot loop, single compile) and the underlying case30 numbers are documented in the Cookbook chapter linked above.

Matrix shapes that fall out of the fast path and therefore pay the scipy SpMV dispatch cost on every F_ call (~1.5 µs per SpMV — consequential at small network scale, negligible elsewhere):

  • Negated matricesMat_Mul(-G, x). G is a plain Para but -G is a Mul(-1, G) expression the classifier doesn’t recognise. Workaround: write -Mat_Mul(G, x) instead of Mat_Mul(-G, x) so the sign flows through the outer expression and G stays on the fast path.

  • Dense dim=2 paramsParam(A, dim=2, sparse=False). FormJac fires a one-shot UserWarning for these; they fall through to the MutableMatJacDataModule path. Convert to sparse with csc_array(A) and declare sparse=True.

  • Nested Mat_MulMat_Mul(A, Mat_Mul(B, x)). The classifier’s dependency-aware demotion takes care of the topological ordering, but only the inner Mat_Mul ends up on the fast path when both matrices are plain Paras; the outer pays the wrapper SpMV cost.

  • Three-or-more-arg Mat_MulMat_Mul(A, B, x). The operand of the placeholder is itself a Mat_Mul, so the classifier rejects it from the fast path.

  • Matrix expressionsMat_Mul(A + B, x) or similar. The classifier recognises only a bare Para as the matrix operand. Workaround: materialise the combined matrix as a single Param(A + B, dim=2, sparse=True) outside the equation.

Layer 2 — Mutable matrix Jacobian blocks

The Jacobian is handled by a separate, symbolic decomposition because its blocks are matrices (not vectors) whose structure depends on the full expression. Consider the power-flow block

\[\frac{\partial P}{\partial e} \;=\; \operatorname{diag}\!\bigl(G\,e - B\,f + p_\text{ref}\bigr) \;+\; \operatorname{diag}(e)\,G \;+\; \operatorname{diag}(f)\,B.\]

All three terms are mutable — they depend on \(e\) and \(f\) and must be re-evaluated on every J_(y, p) call. But their sparsity pattern is constant: the union of the diagonal, the pattern of \(G\), and the pattern of \(B\) — all fixed at modelling time.

Solverz exploits this in a pattern-match compiler:

  1. Parse the block into typed terms. The symbolic analyser walks the SymPy expression tree (implemented in Solverz/code_printer/python/module/mutable_mat_analyzer.py). Each additive term is classified as one of:

    Shape

    Generated code contribution

    Diag(u_expr) (diagonal term)

    data[out_pos] += u[src_idx]

    Mat_Mul(Diag(v), M) (row-scale)

    data[out_pos] += v[src_row] * M_data[k]

    Mat_Mul(M, Diag(v)) (col-scale)

    data[out_pos] += v[src_col] * M_data[k]

    A term is deferred to the fallback path when any of the following is true:

    • it has a non-unit numeric coefficient other than ±1 (e.g. 3 * Diag(x)) — the analyser only unwraps ±1 via _extract_sign_and_core;

    • it is Mat_Mul(Diag(v), X) or Mat_Mul(X, Diag(v)) where X is not a plain Para (or -Para) — for instance a matrix expression such as A @ B, or a user-constructed product that the matrix calculus engine hasn’t simplified down to a single leaf matrix;

    • it is a Mat_Mul with three or more arguments that the matrix-calculus engine didn’t fold into a nested binary form;

    • it is any other shape the classifier doesn’t recognise (transpose of a mutable matrix, element-wise matrix product, a non-sparse matrix leaf, etc.).

    Unrecognised terms are deferred to the fallback path, which is correct but slower: it evaluates the full sub-expression via scipy.sparse on every J_() call and extracts values by fancy indexing.

  2. Precompute the output sparsity pattern (once). At model initialisation (inside FormJac, Solverz/equation/equations.py), every mutable-matrix block is evaluated one time with all variables perturbed to distinct non-zero values and all Diag nodes substituted with SpDiag (which emits sps.diags instead of the dense np.diagflat). This perturbed evaluation is the entire reason we go to the trouble of the SpDiag substitution: using non-zero variable values guarantees that no term accidentally collapses to zero at \(y_0\) (which would happen in a flat-start power flow where \(f = 0\) and sps.diags(f)@B is empty). The resulting sparse matrix gives us the union of all terms’ patterns — the structural upper bound that the runtime block must fit into.

  3. Precompute the index mapping arrays (once). For each recognised term, the analyser builds three numpy arrays:

    • out_pos[k] — the index into the block’s output data array where the \(k\)-th nonzero contribution of this term lands.

    • src_idx[k] — for diagonal terms, the row \(i\) in the inner vector \(u\) to read from; for row-scale terms, the row of \(M\)’s \(k\)-th nonzero; for col-scale terms, the column.

    • mat_data (row/col-scale terms only) — a direct reference to M.data (which is why the matrix must be immutable).

    These arrays are stored inside setting, a per-module dict that also holds things like the CSC row, col, and initial data for the full Jacobian.

  4. Generate a dedicated Numba kernel per block. For each mutable matrix block the code printer emits a new top-level function named _mut_block_N (where N is the block index). Its signature takes the diag inner vectors, the base variables needed by any row/col-scale term, and each term’s three mapping arrays. All per-block helper arrays use the reserved _sz_mb_{N}_ prefix so different blocks’ mappings never collide. The body is a pure- Python / Numba scatter-add loop:

# Module-level (loaded once from setting at import time):
_sz_mb_0_diag_out_0 = setting["_sz_mb_0_diag_out_0"]
_sz_mb_0_diag_src_0 = setting["_sz_mb_0_diag_src_0"]
_sz_mb_0_rs_out_0   = setting["_sz_mb_0_rs_out_0"]
_sz_mb_0_rs_src_0   = setting["_sz_mb_0_rs_src_0"]
_sz_mb_0_rs_dat_0   = setting["_sz_mb_0_rs_dat_0"]   # frozen G_nr.data
...

@njit(cache=True)
def _mut_block_0(_sz_mb_0_u0, e, f,
                 _sz_mb_0_diag_out_0, _sz_mb_0_diag_src_0,
                 _sz_mb_0_rs_out_0,  _sz_mb_0_rs_src_0,  _sz_mb_0_rs_dat_0,
                 _sz_mb_0_rs_out_1,  _sz_mb_0_rs_src_1,  _sz_mb_0_rs_dat_1):
    data = zeros(107)
    # Diagonal term: diag(G_nr@e - B_nr@f + p_ref)
    for i in range(_sz_mb_0_diag_out_0.shape[0]):
        data[_sz_mb_0_diag_out_0[i]] += _sz_mb_0_u0[_sz_mb_0_diag_src_0[i]]
    # Row-scale term: diag(e) @ G_nr
    for i in range(_sz_mb_0_rs_out_0.shape[0]):
        data[_sz_mb_0_rs_out_0[i]] += e[_sz_mb_0_rs_src_0[i]] * _sz_mb_0_rs_dat_0[i]
    # Row-scale term: diag(f) @ B_nr
    for i in range(_sz_mb_0_rs_out_1.shape[0]):
        data[_sz_mb_0_rs_out_1[i]] += f[_sz_mb_0_rs_src_1[i]] * _sz_mb_0_rs_dat_1[i]
    return data

Note

Every scatter-add loop uses +=, not =, including the diagonal terms. This matters when a block contains multiple independent Diag(...) additive terms whose output positions coincide — for example the Jacobian of x*(A@x) + x*(B@x) produces diag(A@x) and diag(B@x), both landing on the same (i,i) positions. The += accumulates their contributions correctly; an earlier implementation used = and silently dropped one of the two terms.

  1. Wire the kernel up in the J_ wrapper. For each block the wrapper emits: (a) one line per diag term that materialises its inner vector using scipy.sparse (e.g. _sz_mb_0_u0 = p_ref - (B_nr@f) + (G_nr@e)); (b) a call to _mut_block_N(...) passing the inner vectors, base variables, and mapping arrays; (c) an assignment of the returned data array into the appropriate slice of the overall Jacobian _data_.

    Crucially, each block’s scipy work is capped at one SpMV per diag term — nothing else in the block touches sparse matrices at runtime. The data for row/col-scale terms comes straight from the pre-baked _sz_mb_{N}_rs_dat_{k} arrays.

The net effect on each J_(y, p) call is that Jacobian assembly cost is dominated by (i) the constant inner_J call for element-wise blocks and (ii) a handful of SpMVs for the diag inner vectors. The actual mutable-matrix blocks themselves are assembled in \(\mathcal{O}(\text{nnz})\) vectorised Numba loops — no sparse matrix allocation, no sparsity reconstruction, no format conversion.

This benefit applies to every solver that calls J_(y, p) — algebraic-equation Newton solvers (nr_method, sicnm), DAE integrators (which call J_ at every implicit substep), FDAE solvers, and any future solver that consumes Solverz’s Jacobian contract. The cost model is “per J_ call”, not “per Newton step”.

Two paths: scatter-add (fast) and fancy indexing (fallback)

Solverz keeps both assembly paths in every generated module and picks per Jacobian block at code-gen time. The fast path is the typed scatter-add described above; the fallback path exists for blocks whose term structure the symbolic analyser cannot classify (see Layer 2 step 1) and is described in the next subsection. There is no on/off switch — both paths co-exist, and the runtime always uses whichever the classifier picked.

The fallback path is what an earlier iteration of the pipeline used everywhere: build the sparse matrix with sps.diags and friends, then read back the nonzero values at the precomputed COO positions via fancy indexing (expr.tocsr()[[rows], [cols]]). It is correct and elegant but slow — scipy has to construct several intermediate sparse matrices per call, sum them (with explicit-zero elimination), and then perform a linear-scan lookup for each output position. On a 30-bus power flow case, each J_(y, p) call took ≈ 280 µs in that pre-0.8 architecture. The scatter-add Numba path introduced in 0.8.0 drops that to ≈ 50 µs while producing bit-identical results.

If you want to know whether a particular block in your model is on the fast path or the fallback, profile with pytest --durations=10 or cProfile and look for MutableMatJacDataModule in the traces — its presence in the hot J_() line indicates that block is on the fallback.

Layer 3 — Selective @njit

All generated inner loops — inner_F, every per-equation inner_F{N}, inner_J, every per-block inner_J{N}, and every _mut_block_{N} — carry @njit(cache=True). Only the F_ / J_ wrappers remain pure Python; they host any fallback-path SpMV precomputes, the module-level parameter unpacks, and the final sps.coo_array((data, (row, col)), shape).tocsc() that packages the Jacobian back into a scipy object for the downstream solver (algebraic Newton-style or DAE/ODE integrator).

The selectivity matters: Numba absolutely cannot digest a scipy.sparse.csc_array argument, which is why the wrappers exist at all. Anything that can run inside Numba is pushed there, at zero syntactic cost to the user — they just write Mat_Mul(A, x) and module_printer figures out the rest.

An important corollary for non-Mat_Mul models: if an equation system contains no Mat_Mul at all, the entire precompute / mutable- matrix machinery is disabled and F_ / J_ collapse back to the original pre-0.8 architecture (direct @njit on inner_F receiving only dense vectors and scalar params). Performance on pure element-wise models is therefore unchanged by the Mat_Mul rewrite.

Fallback path — scipy.sparse + fancy indexing

When a Jacobian block contains a term shape the symbolic analyser does not recognise — for example an unusual nested Mat_Mul, a non-parameter matrix, a matrix-valued non-linear expression, or a term with a non-±1 coefficient — that single block is handled by the fallback path. The wrapper rebuilds the block with scipy.sparse and reads the requested nonzero values back via fancy indexing:

data[start:stop] = asarray(
    (sps.diags(u) + sps.diags(v)@M + ...)
    .tocsr()[[rows], [cols]]
).ravel()

It is \(\mathcal{O}(\text{nnz} \log n)\) per call instead of the \(\mathcal{O}(\text{nnz})\) scatter-add path, but it never misses a derivative term. The two paths co-exist in every generated module: only the single affected block falls back; the rest of the Jacobian still uses the fast scatter-add path. There is no fallback-only mode and no opt-in switch — Solverz simply picks per block based on what the classifier recognised at code-gen time.

Performance

The full case30 power-flow benchmark — every phase from Model construction through cold compile to hot F / hot J — lives in the Solverz Cookbook power-flow chapter. This subsection only carries one extra micro-benchmark (DHS hydraulic on Barry Island) that exercises a Jacobian shape the power-flow case does not.

Benchmark environment. All numbers below were measured under:

  • Hardware: 2025 MacBook Air, Apple M4, AC power, no thermal throttling observed during the runs.

  • OS: macOS 26.4 (build 25E246).

  • Python: 3.11.13.

  • Library versions: numpy==2.3.3, scipy==1.16.0, numba==0.65.0, sympy==1.13.3, Solverz==0.8.1 (the post-csc_matvec fast path).

  • Methodology: 10 warm-up calls (to bake the Numba caches and prime the CPU branch predictor) followed by 5000–20000 timed iterations, median of three repeats. The cold compile measurement (Cookbook only) is run in a fresh Python subprocess with __pycache__ and Numba .nbi / .nbc caches wiped beforehand. The same per-call timing helper is used in the Cookbook’s bench_pf_matmul_vs_polar.py; the per-J_ numbers below were captured by importing the rendered DHS module and calling mdl.J(y, mdl.p) in a tight loop.

DHS hydraulic subproblem on BarryIsland (35 unknowns, 1 loop, 1 mutable-matrix block — the loop pressure Jacobian contains two col-scale terms \(L\,\operatorname{diag}(K \odot |m|)\) and \(L\,\operatorname{diag}(K \odot m \odot \operatorname{sign}(m))\) whose scaling vectors are non-trivial expressions of the variable):

Pipeline

J_() per call

Element-wise inline (one scalar Eqn per node + loop)

≈ 103 µs

Mat_Mul inline (lambdify)

≈ 45 µs

Mat_Mul module printer, vectorized @njit

≈ 28 µs

  • Element-wise vs Mat_Mul inline. Writing the same hydraulic system as V @ m - m_inj = 0 plus L @ (K m |m|) = 0 (two vector equations total) instead of 35 scalar equations cuts inline time by more than half, because lambdify has far fewer expression nodes to walk on every call.

  • Inline vs module printer. Even on a small system (only 79 nonzeros), the module printer wins once every mutable-matrix term reaches the vectorised scatter-add path. Here the loop- pressure Jacobian has two col-scale terms whose scaling vectors (K * |m| and K * m * sign(m)) are computed once per J_(y, p) call as dense numpy expressions in the wrapper, then handed to a Numba kernel that does two 10-iteration scatter-add loops. The only remaining overhead is the final coo_array(...).tocsc() packing.

Rule of thumb: the module printer wins whenever the mutable-matrix blocks successfully match the fast path; it can lose to inline mode for models that fall through to the fallback path, or models small enough that the final COO-to-CSC packing dominates. Profile with %timeit mdl.J(y, mdl.p) on a representative iterate if in doubt, and check the generated num_func.py for any lingering MutableMatJacDataModule (= fallback) calls.

The scatter-add path is strictly \(\mathcal{O}(\text{nnz})\) in the block contents, while the old fancy-indexing path was dominated by the cost of constructing the intermediate sparse matrices — so the speed-up scales with nnz and with the number of mutable-matrix blocks. For models with many rows and few blocks the advantage is small; for models with many blocks per variable (such as power flow, with distinct \(\partial/\partial e\) and \(\partial/\partial f\) blocks for both \(P\) and \(Q\) balances) the advantage compounds.

Diagnostic warnings

When a Mat_Mul placeholder or a mutable Jacobian block cannot use the fast path, Solverz emits a UserWarning describing what broke the fast path, why, and how to rewrite the expression to recover it. Warnings fire once per fallback item during code generation (module mode only — the inline printer has no fast/fallback split).

All warnings use the standard Python warnings.warn(..., UserWarning) mechanism. To silence them:

import warnings
warnings.simplefilter('ignore', UserWarning)

Layer 1 — Mat_Mul placeholder fallback

Expression

Reason

Suggested rewrite

Mat_Mul(-A, x)

Negation of a sparse Param

-Mat_Mul(A, x) — move negation outside

Mat_Mul(c*A, x)

Scalar multiple of a Param (numeric or scalar Param)

c * Mat_Mul(A, x) — move scalar outside

Mat_Mul(transpose(A), x)

Transposed matrix Param

Predeclare A_T = Param('A_T', value=A.T, dim=2, sparse=True) and write Mat_Mul(A_T, x)

Mat_Mul(A*B, x)

Element-wise Mul of two matrix Params (*, not Mat_Mul)

Nest as Mat_Mul(A, Mat_Mul(B, x)), or distribute

Mat_Mul(A+B, x)

Sum of matrices

Mat_Mul(A, x) + Mat_Mul(B, x) — distribute

Mat_Mul(A, B, x)

Multi-arg Mat_Mul leaves inner product in operand

Mat_Mul(A, Mat_Mul(B, x)) — nest explicitly

Mat_Mul(A, f(B, x))

Operand references sparse dim=2 Param

Precompute the lookup as a dim=1 vector Param

Mat_Mul(A, x) with dense A

Dense dim=2 Param (from _warn_dense_matmul_params at FormJac time)

Declare with sparse=True

Demotion cascade

Upstream fallback placeholder references this one

Fix the upstream fallback first; demoted placeholders auto-recover

Layer 2 — Mutable Jacobian block fallback

When a mutable Jacobian block contains a term that doesn’t match the Diag / row-scale / col-scale / biscale fast path, the warning names the equation, variable, and the offending term.

Common causes (the L2 classifier reports each one distinctly):

  • Element-wise Mul with Diag factors — e.g. Diag(u) * M * Diag(v) written with Python *. Rewrite as Mat_Mul(Diag(u), M, Diag(v)).

  • Single-Diag Mat_Mul with an unsupported matrix factorMat_Mul(Diag(v), A+B) or similar where the analyzer’s _sparse_matrix_nnz can’t materialise the matrix factor. Split the term so each piece has a single sparse dim=2 Param.

  • Biscale Mat_Mul(Diag(u), M, Diag(v)) with unsupported middle matrix M.

  • No-Diag Mat_Mul of two matrices — wrap one operand in Diag(...) or split into supported shapes.

  • Two-argument Mat_Mul(Diag(u), Diag(v)) — degenerate; rewrite as Diag(u * v).

Restrictions and reserved names

Mat_Mul imposes a few hard restrictions that are enforced at model build time. Each one prevents a class of silent-wrong-answer bugs that would otherwise slip past the generated Jacobian.

Time-varying sparse dim=2 parameters are rejected at construction

Solverz does not support sparse dim=2 Params whose values change at runtime. A declaration like:

from scipy.sparse import csc_array
m.K = Param('K',
            csc_array([[1, 0], [0, 1]]),
            dim=2, sparse=True,
            triggerable=True,
            trigger_var=['x'],
            trigger_fun=lambda x: csc_array([[2*x[0], 0], [0, 2*x[1]]]))
# ⇒ NotImplementedError: Parameter 'K': a sparse 2-D ``Param`` cannot
#   be declared ``triggerable=True``. Time-varying sparse matrices are
#   unsupported because Solverz's code generation caches the matrix's
#   CSC fields at model-build time; a runtime trigger update would
#   silently be ignored.

and the equivalent TimeSeriesParam:

m.G = TimeSeriesParam('G',
                      v_series=[...],
                      time_series=[...],
                      value=np.eye(3),
                      dim=2, sparse=True)
# ⇒ NotImplementedError: Parameter 'G': a sparse 2-D
#   ``TimeSeriesParam`` is not supported.

both raise at the point of construction, not deep inside FormJac, so the error points at the exact line where the user wrote the offending declaration.

Why: every Solverz code path that consumes a sparse dim=2 Param caches its CSC decomposition at model-build time:

  • The legacy MatVecMul code path decomposes every sparse dim=2 Param into flat arrays (<name>_data, <name>_indices, <name>_indptr, <name>_shape0) in Solverz/model/basic.py:56, which are stored in p_ and passed to inner_F as read-only numpy arrays.

  • The 0.8.1 Mat_Mul SolCF.csc_matvec fast path reuses those same CSC flats inside inner_F to do the matvec in @njit land.

  • The mutable-matrix Jacobian scatter-add kernel caches the matrix’s .data array directly as a _sz_mb_{N}_rs_dat_{k} setting entry loaded at module import time.

None of these caches are invalidated when p_["K"].trigger_fun fires or p_["G"].get_v_t(t) returns a fresh matrix. The runtime update simply gets lost, the Jacobian keeps using the initial values, and the downstream solver (Newton-Raphson, DAE integrator, or any other consumer of J_) either diverges or converges to the wrong solution — silently.

Historically the check was narrow (only equations containing Mat_Mul were rejected, and the check ran at FormJac time). In the 0.8.1 hotfix the policy broadened to cover every sparse dim=2 time-varying Param, regardless of where or whether it is used. The check now lives in ParamBase.__init__ and TimeSeriesParam.__init__, with an unconditional backstop in Equations._check_no_timevar_sparse_matrices that runs on every FormJac call in case a tainted Param was constructed via __new__ + attribute assignment (which bypasses the __init__ check).

Workarounds — all of these are supported:

  • Triggerable / time-series vectors next to a Mat_Mul. Scalar or 1-D time-varying parameters are fine; only the dim=2

    • sparse=True combination is rejected. Write m.K * m.x + Mat_Mul(m.A, m.x) - m.b with K a triggerable 1-D vector and A a plain sparse 2-D matrix — the wrapper rebuilds K on every solver step via the trigger_var mechanism and the Mat_Mul(A, x) fast path uses the unchanged A.

  • Element-wise rewrite. If you genuinely need the matrix itself to evolve, write the residual out one scalar Eqn per row, with the per-row coefficients as 1-D Params or TimeSeriesParams. The element-wise form has no CSC caching — every call re-evaluates the full trigonometric / algebraic expression via pure numpy / numba.

  • Dense dim=2 parameters (sparse=False). The MutableMatJacDataModule fallback path used for dense matrices re-evaluates the entire block expression via scipy/numpy on every J_() call, so get_v_t(t) updates DO propagate. Dense parameters do not get the csc_matvec fast path, but they’re a correct (if slower) way to carry a time-varying matrix through the solver. A UserWarning still fires once per such parameter to flag the performance cost.

Reserved symbol prefixes

User symbols (Var, Param, iVar, Para, …) are rejected at construction time if their name starts with any of:

Prefix

Used for

_sz_mm_

Mat_Mul precompute placeholders in the F_/J_ wrapper (e.g. _sz_mm_0 = G @ e).

_sz_mb_

Mutable-matrix Jacobian block helpers (diag inner vectors, row/col-scale scaling vectors, mapping arrays).

m.x = Var('_sz_mm_0', [1.0])
# ⇒ ValueError: Symbol name '_sz_mm_0' starts with Solverz-reserved
#   internal prefix '_sz_mm_'. This prefix is used by the code
#   generator for Mat_Mul / mutable-matrix Jacobian helper names and
#   cannot be used for user variables or parameters.

Why: the code printer emits an unbounded family of helper identifiers of the form <prefix><int> and needs to know that none of them will collide with a user symbol. The prefix ban is the simplest way to guarantee that — users pick any name that does not start with _sz_mm_ / _sz_mb_, and the code printer is free to use any name that does.

Dense dim=2 parameters emit a performance warning

A parameter declared with dim=2, sparse=False that appears inside a Mat_Mul is legal — the fallback path handles it correctly — but it forfeits every optimisation described above:

  • no sparse pattern precomputation (the “output sparsity pattern” step produces a fully dense block),

  • no _sz_mb_{N}_rs_dat_k scatter-add (the block falls through to the MutableMatJacDataModule fancy-indexing path),

  • each Jacobian call allocates and populates a dense intermediate.

Solverz emits a one-shot UserWarning at FormJac time, once per offending parameter, to flag the performance cost:

m.A = Param('A', np.array([[2.0, 1.0], [1.0, 3.0]]), dim=2, sparse=False)
m.eqn = Eqn('f', m.x * Mat_Mul(m.A, m.x) - m.b)
smdl, y0 = m.create_instance()
smdl.FormJac(y0)
# UserWarning: Parameter 'A' is a dense 2-D ``Param(..., dim=2,
#   sparse=False)`` used inside ``Mat_Mul``. Dense matrices bypass
#   the vectorised mutable-matrix Jacobian fast path and fall back
#   to a slower scipy/ndarray indexing path. Declare 'A' with
#   ``sparse=True`` for substantially better performance.

How to fix: wrap the value in a scipy.sparse container and declare sparse=True:

from scipy.sparse import csc_array
m.A = Param('A', csc_array(np.array([[2.0, 1.0], [1.0, 3.0]])),
            dim=2, sparse=True)

On a dense 2×2 matrix the difference is invisible, but for anything larger than a few dozen rows the sparse path is substantially faster and (more importantly) keeps the Jacobian assembly off the fallback path.

API Reference

Solverz.sym_algebra.matrix_calculus.MixedEquationDiff(expr: Expr, symbol: Symbol)

Calculate the derivative of a mixed matrix-vector expression.

Parameters:

expr : sympy.Expr

The expression to differentiate (may contain Mat_Mul, Diag, etc.)

symbol : sympy.Symbol

The variable to differentiate with respect to

Returns:

sympy.Expr

The symbolic derivative expression

class Solverz.sym_algebra.matrix_calculus.TensorExpr(expr: Expr)
diff(s)

Differentiate the expression w.r.t. symbol s using forward mode.

Returns a SymPy expression representing the derivative.

visualize()

Visualize the computation DAG.