Matrix-Vector Calculus

Solverz provides automatic symbolic differentiation for mixed matrix-vector expressions. When an equation contains matrix-vector products (Mat_Mul), Solverz automatically computes analytical Jacobian blocks using tensor calculus with Einstein index notation.

This module is based on the approach described in:

Laue, Mitterreiter, Giesen. A Simple and Efficient Tensor Calculus. Proceedings of the AAAI Conference on Artificial Intelligence, 2020.

See also: matrixcalculus.org

Supported Operations

Operation

Symbolic Form

Example

Derivative

Matrix-vector multiply

Mat_Mul(A, x)

\(Ax\)

\(A\)

Element-wise multiply

x * y

\(x \odot y\)

\(\operatorname{diag}(y)\)

Addition

x + y

\(x + y\)

\(I\)

Absolute value

Abs(x)

\(|x|\)

\(\operatorname{diag}(\operatorname{sign}(x))\)

Exponential

exp(x)

\(e^x\)

\(\operatorname{diag}(e^x)\)

Sine / Cosine

sin(x) / cos(x)

\(\sin(x)\)

\(\operatorname{diag}(\cos(x))\)

Logarithm

ln(x)

\(\ln(x)\)

\(\operatorname{diag}(x^{-1})\)

Power

x**n

\(x^n\)

\(\operatorname{diag}(n x^{n-1})\)

Transpose

transpose(A)

\(A^T\)

\(A^T\)

Diagonal

Diag(x)

\(\operatorname{diag}(x)\)

\(\operatorname{diag}(\cdot)\)

Note

Mat_Mul is the recommended interface for matrix-vector products. The legacy MatVecMul function (which decomposes sparse matrices into CSC components for Numba) is deprecated. Mat_Mul uses scipy.sparse directly, which is both faster and supports full matrix calculus.

Warning

Sparse matrices MUST be immutable after modelling

All sparse matrix parameters declared with Param(..., dim=2, sparse=True) are assumed to be immutable (both sparsity pattern and numerical values) once modelling is complete and model.create_instance() has been called.

Solverz relies on this assumption for maximum runtime efficiency:

  • At code-generation time, the sparsity pattern of every mutable-matrix Jacobian block is analysed once, the output data-array layout is fixed, and a pre- computed index mapping is baked into a Numba-compiled scatter-add loop.

  • At runtime, each Newton step assembles Jacobian block data by directly indexing into Matrix.data using the pre-computed row/column mappings — no scipy.sparse matrix is constructed per iteration.

  • For sparse-matrix parameters used in Mat_Mul(A, x), the matrix-vector product A @ x is precomputed in the F_ wrapper and passed as a dense vector to the @njit-compiled equation functions.

If your application truly needs to update sparse-matrix values between solves (e.g., sweeping a parameter), rebuild the model with model.create_instance() each time. Mutating p_["A"].data after creation will silently produce wrong Jacobians for matrices that flow through the vectorised Mat_Mul fast path — scatter-add kernels cache M.data at analysis time and the cache becomes stale. Matrices that only land in the MutableMatJacDataModule fallback path (see Fallback path) re-evaluate the full expression on every J_() call and would reflect the mutation, but there is no way to guarantee a particular block won’t be promoted to the fast path in a future release, so the rule is still: do not mutate sparse matrix params post-build.

Dense parameters (Param(..., dim=2, sparse=False)) are fine to update via mdl.p["name"] = new_value; the restriction only applies to sparse dim=2 parameters.

All element-wise functions (exp, sin, cos, ln, Abs) can be composed with matrix operations:

from Solverz import Var, Param, Eqn, Model
from Solverz.sym_algebra.functions import Mat_Mul, exp, sin

m = Model()
m.x = Var('x', [1, 2, 3])
m.A = Param('A', [[1, 0, 0], [0, 2, 0], [0, 0, 3]], dim=2)
m.f = Eqn('f', exp(Mat_Mul(m.A, m.x)))  # exp(A@x)

Solverz will automatically compute \(\frac{\partial}{\partial x} e^{Ax} = \operatorname{diag}(e^{Ax}) \cdot A\).

How It Works

The matrix calculus engine is a four-stage pipeline: each expression is lifted into Einstein index notation, rebuilt as a computation DAG, differentiated forward through the DAG, and finally collapsed back to ordinary matrix/vector operations. We walk through each stage with a running example from power flow.

1. Einstein Index Notation

Every tensor is labelled with one index per axis:

  • Scalar: no index (e.g., \(\alpha\)). Zeroth-order tensor, no axes.

  • Vector \(x \in \mathbb{R}^n\): a single index \(x_i\), where \(i\) ranges over \({1, \dots, n}\). The index names the axis; the letter used is arbitrary, only its pattern of repetition matters.

  • Matrix \(A \in \mathbb{R}^{m \times n}\): two indices \(A_{ij}\). The first index \(i\) labels rows, the second \(j\) labels columns.

A repeated index inside a product implies summation (Einstein summation convention, or contraction):

\[A_{ij}\, x_j \;\equiv\; \sum_{j=1}^{n} A_{ij}\, x_j \;=\; (Ax)_i.\]

Here \(j\) is repeated (contracted — it disappears in the result), while \(i\) is a free index that labels the resulting vector’s axis. Conventionally we write the non-contracted indices on the left of the expression, so the indices of the left-hand side \((Ax)_i\) match the remaining free indices \(i\) on the right.

The benefit of this notation is that the shape of an expression is fully encoded in its index signature. Element-wise multiplication uses repeated indices without contraction:

\[(x \odot y)_i \;=\; x_i\, y_i,\]

and the outer product uses two different free indices:

\[(x\, y^\top)_{ij} \;=\; x_i\, y_j.\]

2. Computation DAG

Solverz parses each equation’s symbolic expression into a directed acyclic graph (DAG):

  • Leaf nodes — variables (\(x\), \(y\)) and parameters (\(A\), \(B\), \(G\)). Each leaf carries its index signature (\(x_i\), \(A_{ij}\), etc.).

  • Internal nodes — operations: addition \(+\), element-wise product \(\odot\), matrix product \(@\), element-wise unary function \(f(\cdot)\).

  • Root node — the output expression whose Jacobian we want.

For the power flow active-power sub-expression \(e \odot (Ge - Bf)\):

        *[i]          ← element-wise multiply (free index i)
       /    \
     e[i]   -[i]      ← subtraction (free index i)
           /    \
       @[i]    @[i]   ← matrix-vector products, contraction on j
       / \     / \
     G[ij] e[j] B[ij] f[j]

Each node remembers its free index set. The matrix product nodes @[i] each contract index \(j\) between \(G_{ij}\) and \(e_j\), so the result has free index \(i\). The subtraction and element-wise product keep \(i\) as their only free index, so the whole expression has index signature \([i]\) — a vector.

3. Forward-Mode Automatic Differentiation

To obtain \(\partial f / \partial x\) where \(f\) is the root and \(x\) is some leaf variable, derivatives propagate from leaves upward to the root (pushforward, a.k.a. forward-mode AD). Every internal node computes its own Jacobian contribution and combines it with the Jacobians already accumulated at its children.

Throughout this section, \(A_{s_1}\) and \(B_{s_2}\) denote the free indices of the two operands (before differentiation), \(s_3\) denotes the free index of the result, and \(s_4\) is the derivative-direction index — the axis along which we differentiate, equal to the free index of the variable \(x\) we differentiate with respect to. Tilded tensors \(\tilde A, \tilde B\) carry one extra axis (labelled \(s_4\)) compared to \(A, B\): the derivative of a rank-\(k\) tensor w.r.t. a rank-1 variable is a rank-\((k{+}1)\) tensor.

  1. Leaf variables. Differentiating \(x_j\) w.r.t. \(x_{s_4}\) gives the Kronecker delta:

    \[\tilde{x}_{j\,s_4} \;=\; \frac{\partial x_j}{\partial x_{s_4}} \;=\; \delta_{j s_4}.\]

    All other variables (and all parameters) have derivative \(0\) w.r.t. \(x\).

  2. Multiplication (product rule in index notation):

\[\tilde{C}_{s_3\,s_4} \;=\; B_{s_2}\cdot\tilde{A}_{s_1\,s_4} \;+\; A_{s_1}\cdot\tilde{B}_{s_2\,s_4}\]

where \(A_{s_1} \cdot B_{s_2} = C_{s_3}\) is the original product (the specific index pattern \(s_1, s_2 \to s_3\) depends on the product type — element-wise, matrix-vector, or matrix-matrix). The two summands encode “differentiate the first factor, keep the second” plus “keep the first factor, differentiate the second”.

  1. Element-wise unary function \(f\) (e.g. \(\exp\), \(\sin\), \(\ln\)): since \(f\) acts point-wise, the chain rule gives

\[\frac{\partial f(A_{s_1})}{\partial x_{s_4}} \;=\; f'(A_{s_1})\odot\tilde{A}_{s_1\,s_4}.\]

The derivative \(f'\) is also computed element-wise and multiplied (Hadamard) with the child’s Jacobian.

  1. Addition. Linearity:

\[\tilde{(A+B)}_{s_3\,s_4} \;=\; \tilde{A}_{s_1\,s_4} + \tilde{B}_{s_2\,s_4}.\]

4. TMul2Mul: Index Resolution

After the forward pass, derivatives live as tensor expressions labelled with free indices. The TMul2Mul stage reads the index pattern of each product and emits the corresponding concrete matrix / vector operation. Each row of the table has the form \((s_1, s_2, s_3)\): the free indices of the left operand, right operand, and the result.

Index pattern \((s_1, s_2, s_3)\)

Matrix operation

Meaning

\((i,; i,; i)\)

\(x \odot y\)

element-wise multiply — both operands share axis \(i\), no contraction

\((i,; j,; ij)\)

\(x,y^\top\)

outer product — two disjoint free indices survive into the result

\((ij,; j,; i)\)

\(A,y\)

matrix-vector: index \(j\) is shared (contracted), \(i\) survives

\((ij,; jk,; ik)\)

\(A,B\)

matrix-matrix: inner \(j\) is contracted, \((i,k)\) survive

\((i,; ij,; ij)\)

\(\operatorname{diag}(x),A\)

left diagonal scaling — scale each row of \(A\) by the matching entry of \(x\)

\((ij,; j,; ij)\)

\(A,\operatorname{diag}(y)\)

right diagonal scaling — scale each column of \(A\) by the matching entry of \(y\)

\((i,; ii,; ii)\)

\(\operatorname{diag}(x \odot y)\)

diagonal-from-element-wise — used when a product has matching repeated indices \((ii)\) on one operand

The key insight is that every well-formed index pattern maps to one concrete sparse-preserving operation. Patterns like \((ij, ij, ij)\) that don’t match the table are either simplified earlier (e.g. $(ij, ij, ij) \to A \odot B$, element-wise matrix product) or emitted via a fallback that produces a dense matrix.

Application Examples

Example 1: Power Flow Equations

The active power balance equation in power systems:

\[P = e \odot (Ge - Bf) + f \odot (Be + Gf)\]
from Solverz.sym_algebra.symbols import iVar, Para
from Solverz.sym_algebra.functions import Mat_Mul
from Solverz.sym_algebra.matrix_calculus import TensorExpr

e = iVar('e')
f = iVar('f')
G = Para('G', dim=2)
B = Para('B', dim=2)

P = e * (Mat_Mul(G, e) - Mat_Mul(B, f)) + f * (Mat_Mul(B, e) + Mat_Mul(G, f))
TP = TensorExpr(P)

print(TP.diff(e))  # diag(-B@f + G@e) + diag(e)@G + diag(f)@B
print(TP.diff(f))  # diag(B@e + G@f) + diag(e)@(-B) + diag(f)@G

Example 2: Nonlinear Matrix Equations

from Solverz.sym_algebra.functions import exp

A = Para('A', dim=2)
x = iVar('x')

expr = exp(Mat_Mul(A, x))
TE = TensorExpr(expr)
print(TE.diff(x))  # diag(exp(A@x))@A

Example 3: Heat Network Equations

mL = Para('mL', dim=2)
K = Para('K')
m = iVar('m')
from Solverz.sym_algebra.functions import Abs

expr = Mat_Mul(mL, K * Abs(m) * m)
TE = TensorExpr(expr)
print(TE.diff(m))  # mL@(diag(K*Abs(m)) + diag(K*m*Sign(m)))

How numerical code is generated and optimized

When module_printer compiles an equation system that contains Mat_Mul, it applies a three-layer optimisation strategy to make the generated F_/J_ functions run as fast as possible. All three layers operate on the same insight: at modelling time we already know what the equations and their Jacobians look like — so every computation that does not depend on the current iterate \(y\) can be lifted out of the hot loop and paid for once, at code generation or module load.

Warning

Everything below assumes sparse matrix parameters are immutable after model.create_instance(). See the warning earlier in this document. Every layer described here bakes in the matrix’s sparsity pattern or its .data values as a constant; mutating them in-place will silently corrupt results without raising any error.

Layer 1 — Mat_Mul precomputes as @njit CSC matvec

The code printer walks every residual RHS looking for Mat_Mul(A, v) sub-trees, where A is a sparse parameter and v is any vector-valued expression (a variable, a slice, or a larger expression). It performs three transformations at code gen time:

  1. Hoist. Each Mat_Mul(A, v) is replaced by a fresh placeholder variable _sz_mm_N (where N is a monotonically increasing integer unique across the whole system). The _sz_mm_ prefix is reserved by Solverz — see Restrictions and reserved names below.

  2. Deduplicate. Two Mat_Mul sub-trees that are structurally equal (same matrix, same operand expression after hoisting) are collapsed into a single placeholder. The power-flow example below computes G_nr @ e once even though both the P-balance and Q-balance residuals reference it.

  3. Classify each placeholder. The classifier (implemented in _is_csc_matvec_fast_path in Solverz/code_printer/python/module/module_printer.py) decides whether the precompute can go on the fast path (matrix is a plain sparse dim=2 Para) or the fallback path (anything else: -Para, nested Mat_Mul, dense dim=2 matrices, or a matrix expression).

Fast path — the matvec is computed inside inner_F via the existing SolCF.csc_matvec Numba helper (Solverz/num_api/custom_function.py). Each sparse dim=2 Param is already decomposed at model-build time into its CSC flat arrays (A_data / A_indices / A_indptr / A_shape0) by Solverz/model/basic.py; these flat arrays flow through param_list to inner_F as ordinary numpy arrays, and inner_F issues a SolCF.csc_matvec(A_data, A_indices, A_indptr, A_shape0, operand) call per placeholder. The wrapper F_() does not load the sparse matrix object A at all — the whole matvec lives in Numba land and pays zero scipy dispatch cost:

def F_(y_, p_):
    e = y_[0:29]
    f = y_[29:58]
    G_nr_data = p_["G_nr_data"]
    G_nr_indices = p_["G_nr_indices"]
    G_nr_indptr = p_["G_nr_indptr"]
    G_nr_shape0 = p_["G_nr_shape0"]
    B_nr_data = p_["B_nr_data"]
    ...   # dense vector params follow
    return inner_F(_F_, e, f, p_ref, q_ref,
                   G_nr_data, G_nr_indices, G_nr_indptr, G_nr_shape0,
                   B_nr_data, B_nr_indices, B_nr_indptr, B_nr_shape0,
                   ...)

@njit(cache=True)
def inner_F(_F_, e, f, p_ref, q_ref,
            G_nr_data, G_nr_indices, G_nr_indptr, G_nr_shape0,
            B_nr_data, B_nr_indices, B_nr_indptr, B_nr_shape0, ...):
    # Fast-path matvecs: all inside @njit, no scipy dispatch.
    _sz_mm_0 = SolCF.csc_matvec(G_nr_data, G_nr_indices, G_nr_indptr, G_nr_shape0, e)
    _sz_mm_1 = SolCF.csc_matvec(B_nr_data, B_nr_indices, B_nr_indptr, B_nr_shape0, f)
    _sz_mm_2 = SolCF.csc_matvec(B_nr_data, B_nr_indices, B_nr_indptr, B_nr_shape0, e)
    _sz_mm_3 = SolCF.csc_matvec(G_nr_data, G_nr_indices, G_nr_indptr, G_nr_shape0, f)
    _F_[0:29] = inner_F0(e, f, p_ref, q_ref, _sz_mm_0, _sz_mm_1, _sz_mm_2, _sz_mm_3)
    _F_[29:53] = inner_F1(...)
    return _F_

@njit(cache=True)
def inner_F0(e, f, p_ref, q_ref, _sz_mm_0, _sz_mm_1, _sz_mm_2, _sz_mm_3):
    return e * (p_ref - _sz_mm_1 + _sz_mm_0) + f * (q_ref + _sz_mm_2 + _sz_mm_3) - Pinj

Why this matters: scipy.sparse matrix-vector products are implemented in C and fast per se, but each call crosses Python → scipy dispatch → C → back, which on a 58×58 matrix costs ~1.5 µs of pure dispatch overhead — 10× the actual arithmetic. Eight such SpMVs in a hot F made Mat_Mul catastrophically slow on small networks (case30 hot F was 10× worse than the for-loop form). Moving the matvec into SolCF.csc_matvec inside inner_F eliminates all Python dispatch: the helper is @njit(cache=True) and inner_F calls it as an intra-Numba function call, which the LLVM inline pass typically flattens into the caller’s body. On case30, this drops hot F from ≈14 µs to ≈3 µs (~4.4× speedup) without any other code change.

Fallback path — when the matrix operand is not a plain sparse Para (e.g. -G_nr, Mat_Mul(A, B), or a dense dim=2 parameter), the old path is kept: F_() emits _sz_mm_N = matrix_expr @ operand in the wrapper (using scipy.sparse or numpy @) and passes _sz_mm_N as an extra dense argument to inner_F. These placeholders do not benefit from the fast path and pay the scipy dispatch cost per call. Dense dim=2 parameters fire a one-shot UserWarning at FormJac time — see Restrictions and reserved names.

Note

The fast path reuses infrastructure that predates the Mat_Mul interface: the same SolCF.csc_matvec helper and the same <name>_data / _indices / _indptr / _shape0 Param decomposition were introduced for the legacy MatVecMul operator. Before 0.8.1 fast path, those fields were still emitted into inner_F’s argument list but never referenced — they are now the conduit for the Layer 1 matvec. Legacy MatVecMul users already benefited from this pipeline; from 0.8.1 Mat_Mul users do as well, without any model-level changes.

When to use (and not use) Mat_Mul

Mat_Mul wins decisively for every compile and build phase of the workflow: Model construction, FormJac, made_numerical (inline compile), module_printer.render, and — most importantly — the first-time Numba module compile. On a typical matpower case30 power-flow model the compile phases are 20–40× faster under Mat_Mul than under the equivalent element-wise for-loop formulation. The hot J call is also slightly faster (vectorised scatter-add in the mutable-matrix blocks). See the power-flow chapter of the Solverz Cookbook for the full benchmark table.

Use Mat_Mul when:

  • You are iterating on the model (changing parameters, equations, or dimensions) — compile-time savings are paid on every rebuild and dwarf the runtime differences.

  • The system has more than a handful of unknowns — for ≳ 50 unknowns the scipy/Numba dispatch overhead is amortised over enough arithmetic that hot F is within striking distance of (or beats) the for-loop form.

  • You want compact model code. Mat_Mul(G, e) replaces nb scalar Eqn definitions per bus and makes the equations read like the paper.

  • All sparse matrices used in Mat_Mul are plain Param(..., dim=2, sparse=True) — this unlocks the Layer 1 Numba fast path described above.

Consider the for-loop form when all of the following hold:

  • The system is very small (≲ 30 unknowns);

  • Your hot loop is dominated by F evaluations (not Jacobian, not linear solve), so the raw hot-F number actually matters;

  • The module is compiled once and then reused many times, so the compile-time saving of Mat_Mul is not recovered.

This combination of conditions is rare in practice — most power-flow and DHS workloads hit the Jacobian assembly and linear solve at least as often as F, and both of those are independent of (or favourable to) Mat_Mul.

Known performance regression: on very small networks (≈ case30-scale) the Mat_Mul hot F is ~3 µs per call while the equivalent for-loop form runs at ~1.1 µs (≈ 2.9× slower). The difference is the structural cost of 8 SolCF.csc_matvec calls dispatched through inner_F plus the sub-function dispatch layer, versus 53 inlined scalar kernels the element-wise form collapses into. The 2 µs gap is usually invisible next to the J call (~55 µs) and the linear solve, but if your workload is millions of pure F evaluations on a tiny network, the for-loop form still wins.

The gap shrinks and flips with problem size. For networks large enough that each SpMV does meaningful arithmetic — roughly case118 and above — the element-wise form’s huge cold-compile cost and per-call dispatch overhead on every scalar inner_F{i} make Mat_Mul the strictly better choice on all metrics.

Matrices that fall out of the fast path (and therefore pay the scipy SpMV dispatch cost on every F_ call, ≈ 1.5 µs per SpMV):

  • Negated matricesMat_Mul(-G, x). G is a plain Para but -G is a Mul(-1, G) expression the classifier doesn’t recognise. Workaround: fold the sign into the coefficient of the surrounding expression — write -Mat_Mul(G, x) instead of Mat_Mul(-G, x).

  • Dense dim=2 paramsParam(A, dim=2, sparse=False). FormJac fires a one-shot UserWarning for these; they fall back to the MutableMatJacDataModule path. Convert to sparse with csc_array(A) and declare sparse=True.

  • Nested Mat_MulMat_Mul(A, Mat_Mul(B, x)). The outer call receives the inner placeholder _sz_mm_N as its operand, which hits the fast path; the inner call still hits the fast path because its matrix is a plain Para. Only the inner is made fast, though, and only if both are plain Paras. Triple-nested is uncommon but would degrade similarly.

  • Matrix expressionsMat_Mul(A + B, x) or similar. The classifier recognises only a bare Para as the matrix operand. Workaround: materialise the combined matrix as a single Param (A + B) outside the equation.

Layer 2 — Mutable matrix Jacobian blocks

The Jacobian is handled by a separate, symbolic decomposition because its blocks are matrices (not vectors) whose structure depends on the full expression. Consider the power-flow block

\[\frac{\partial P}{\partial e} \;=\; \operatorname{diag}\!\bigl(G\,e - B\,f + p_\text{ref}\bigr) \;+\; \operatorname{diag}(e)\,G \;+\; \operatorname{diag}(f)\,B.\]

All three terms are mutable — they depend on \(e\) and \(f\) and must be re-evaluated every Newton step. But their sparsity pattern is constant: the union of the diagonal, the pattern of \(G\), and the pattern of \(B\) — all fixed at modelling time.

Solverz exploits this in a pattern-match compiler:

  1. Parse the block into typed terms. The symbolic analyser walks the SymPy expression tree (implemented in Solverz/code_printer/python/module/mutable_mat_analyzer.py). Each additive term is classified as one of:

    Shape

    Generated code contribution

    Diag(u_expr) (diagonal term)

    data[out_pos] += u[src_idx]

    Mat_Mul(Diag(v), M) (row-scale)

    data[out_pos] += v[src_row] * M_data[k]

    Mat_Mul(M, Diag(v)) (col-scale)

    data[out_pos] += v[src_col] * M_data[k]

    A term is deferred to the fallback path when any of the following is true:

    • it has a non-unit numeric coefficient other than ±1 (e.g. 3 * Diag(x)) — the analyser only unwraps ±1 via _extract_sign_and_core;

    • it is Mat_Mul(Diag(v), X) or Mat_Mul(X, Diag(v)) where X is not a plain Para (or -Para) — for instance a matrix expression such as A @ B, or a user-constructed product that the matrix calculus engine hasn’t simplified down to a single leaf matrix;

    • it is a Mat_Mul with three or more arguments that the matrix-calculus engine didn’t fold into a nested binary form;

    • it is any other shape the classifier doesn’t recognise (transpose of a mutable matrix, element-wise matrix product, a non-sparse matrix leaf, etc.).

    Unrecognised terms are deferred to the fallback path, which is correct but slower: it evaluates the full sub-expression via scipy.sparse on every J_() call and extracts values by fancy indexing.

  2. Precompute the output sparsity pattern (once). At model initialisation (inside FormJac, Solverz/equation/equations.py), every mutable-matrix block is evaluated one time with all variables perturbed to distinct non-zero values and all Diag nodes substituted with SpDiag (which emits sps.diags instead of the dense np.diagflat). This perturbed evaluation is the entire reason we go to the trouble of the SpDiag substitution: using non-zero variable values guarantees that no term accidentally collapses to zero at \(y_0\) (which would happen in a flat-start power flow where \(f = 0\) and sps.diags(f)@B is empty). The resulting sparse matrix gives us the union of all terms’ patterns — the structural upper bound that the runtime block must fit into.

  3. Precompute the index mapping arrays (once). For each recognised term, the analyser builds three numpy arrays:

    • out_pos[k] — the index into the block’s output data array where the \(k\)-th nonzero contribution of this term lands.

    • src_idx[k] — for diagonal terms, the row \(i\) in the inner vector \(u\) to read from; for row-scale terms, the row of \(M\)’s \(k\)-th nonzero; for col-scale terms, the column.

    • mat_data (row/col-scale terms only) — a direct reference to M.data (which is why the matrix must be immutable).

    These arrays are stored inside setting, a per-module dict that also holds things like the CSC row, col, and initial data for the full Jacobian.

  4. Generate a dedicated Numba kernel per block. For each mutable matrix block the code printer emits a new top-level function named _mut_block_N (where N is the block index). Its signature takes the diag inner vectors, the base variables needed by any row/col-scale term, and each term’s three mapping arrays. All per-block helper arrays use the reserved _sz_mb_{N}_ prefix so different blocks’ mappings never collide. The body is a pure- Python / Numba scatter-add loop:

# Module-level (loaded once from setting at import time):
_sz_mb_0_diag_out_0 = setting["_sz_mb_0_diag_out_0"]
_sz_mb_0_diag_src_0 = setting["_sz_mb_0_diag_src_0"]
_sz_mb_0_rs_out_0   = setting["_sz_mb_0_rs_out_0"]
_sz_mb_0_rs_src_0   = setting["_sz_mb_0_rs_src_0"]
_sz_mb_0_rs_dat_0   = setting["_sz_mb_0_rs_dat_0"]   # frozen G_nr.data
...

@njit(cache=True)
def _mut_block_0(_sz_mb_0_u0, e, f,
                 _sz_mb_0_diag_out_0, _sz_mb_0_diag_src_0,
                 _sz_mb_0_rs_out_0,  _sz_mb_0_rs_src_0,  _sz_mb_0_rs_dat_0,
                 _sz_mb_0_rs_out_1,  _sz_mb_0_rs_src_1,  _sz_mb_0_rs_dat_1):
    data = zeros(107)
    # Diagonal term: diag(G_nr@e - B_nr@f + p_ref)
    for i in range(_sz_mb_0_diag_out_0.shape[0]):
        data[_sz_mb_0_diag_out_0[i]] += _sz_mb_0_u0[_sz_mb_0_diag_src_0[i]]
    # Row-scale term: diag(e) @ G_nr
    for i in range(_sz_mb_0_rs_out_0.shape[0]):
        data[_sz_mb_0_rs_out_0[i]] += e[_sz_mb_0_rs_src_0[i]] * _sz_mb_0_rs_dat_0[i]
    # Row-scale term: diag(f) @ B_nr
    for i in range(_sz_mb_0_rs_out_1.shape[0]):
        data[_sz_mb_0_rs_out_1[i]] += f[_sz_mb_0_rs_src_1[i]] * _sz_mb_0_rs_dat_1[i]
    return data

Note

Every scatter-add loop uses +=, not =, including the diagonal terms. This matters when a block contains multiple independent Diag(...) additive terms whose output positions coincide — for example the Jacobian of x*(A@x) + x*(B@x) produces diag(A@x) and diag(B@x), both landing on the same (i,i) positions. The += accumulates their contributions correctly; an earlier implementation used = and silently dropped one of the two terms.

  1. Wire the kernel up in the J_ wrapper. For each block the wrapper emits: (a) one line per diag term that materialises its inner vector using scipy.sparse (e.g. _sz_mb_0_u0 = p_ref - (B_nr@f) + (G_nr@e)); (b) a call to _mut_block_N(...) passing the inner vectors, base variables, and mapping arrays; (c) an assignment of the returned data array into the appropriate slice of the overall Jacobian _data_.

    Crucially, each block’s scipy work is capped at one SpMV per diag term — nothing else in the block touches sparse matrices at runtime. The data for row/col-scale terms comes straight from the pre-baked _sz_mb_{N}_rs_dat_{k} arrays.

The net effect on a Newton step is that the Jacobian assembly cost is dominated by (i) the constant inner_J call for element-wise blocks and (ii) a handful of SpMVs for the diag inner vectors. The actual mutable-matrix blocks themselves are assembled in \(\mathcal{O}(\text{nnz})\) vectorised Numba loops — no sparse matrix allocation, no sparsity reconstruction, no format conversion.

Why not fancy indexing?

An earlier iteration of the pipeline used expr.tocsr()[[row],[col]] fancy indexing — build the sparse matrix with sps.diags and read back the nonzero values at the precomputed COO positions. It is correct and elegant but slow: scipy has to construct several intermediate sparse matrices per step, sum them (with explicit-zero elimination), and then perform a linear-scan lookup for each output position. On a 30-bus power flow case, each Jacobian call took ≈ 280 µs. The scatter-add Numba path drops that to ≈ 50 µs while producing bit-identical results.

The fallback path still uses fancy indexing, so if you write an equation whose Jacobian term structure the analyser doesn’t recognise, you get correct results — just slower. Profile with pytest --durations=10 and look for mutable-matrix blocks lingering in the slow path if performance matters.

Layer 3 — Selective @njit

All generated inner loops — inner_F, every per-equation inner_F{N}, inner_J, every per-block inner_J{N}, and every _mut_block_{N} — carry @njit(cache=True). Only the F_ / J_ wrappers remain pure Python; they host the scipy.sparse SpMV precomputes, the module-level parameter unpacks, and the final sps.coo_array((data, (row, col)), shape).tocsc() that packages the Jacobian back into a scipy object for the downstream Newton solver.

The selectivity matters: Numba absolutely cannot digest a scipy.sparse.csc_array argument, which is why the wrappers exist at all. Anything that can run inside Numba is pushed there, at zero syntactic cost to the user — they just write Mat_Mul(A, x) and module_printer figures out the rest.

An important corollary for non-Mat_Mul models: if an equation system contains no Mat_Mul at all, the entire precompute / mutable- matrix machinery is disabled and F_ / J_ collapse back to the original pre-0.8 architecture (direct @njit on inner_F receiving only dense vectors and scalar params). Performance on pure element-wise models is therefore unchanged by the Mat_Mul rewrite.

Fallback path

If a Jacobian block contains a term whose shape the symbolic analyser does not recognise — for example an unusual nested Mat_Mul, a non-parameter matrix, or a matrix-valued non-linear expression — that single block is handled by a slower but always-correct path:

data[start:stop] = asarray(
    (sps.diags(u) + sps.diags(v)@M + ...)
    .tocsr()[[rows], [cols]]
).ravel()

This builds the block with scipy.sparse and reads its values at the precomputed COO positions. It is \(\mathcal{O}(\text{nnz} \log n)\) per call instead of the \(\mathcal{O}(\text{nnz})\) scatter-add path, but it never misses a derivative term. Only the single affected block falls back; the rest of the Jacobian still uses the fast path.

If you want to know whether your model is hitting the fallback, profile with pytest --durations=20 or cProfile and look for MutableMatJacDataModule in the traces — its presence in the hot J_() line indicates a fallback block.

Performance

Measurements below are per J_() call on a 2025 MacBook Air (Apple M4), averaged over 5000–10000 iterations.

Case A — rectangular power flow on MATPOWER case30 (58 unknowns, 450 Jacobian nonzeros, 6 mutable-matrix blocks, the flow has meaningful \(G,e\), \(B,f\), \(B,e\), \(G,f\) combinations):

Pipeline

J_() per call

Inline mode (lambdify + scipy SpMV)

≈ 268 µs

Module printer, old tocsr()[rows, cols]

≈ 283 µs

Module printer, vectorized @njit

≈ 50 µs

Case B — DHS hydraulic subproblem on BarryIsland (35 unknowns, 1 loop, 1 mutable-matrix block — the loop pressure Jacobian contains two col-scale terms \(L,\operatorname{diag}(K \odot |m|)\) and \(L,\operatorname{diag}(K \odot m \odot \operatorname{sign}(m))\) whose scaling vectors are non-trivial expressions of the variable):

Pipeline

J_() per call

Element-wise inline (one scalar Eqn per node + loop)

≈ 103 µs

Mat_Mul inline (lambdify)

≈ 45 µs

Mat_Mul module printer, vectorized @njit

≈ 28 µs

  • Element-wise vs Mat_Mul inline. Writing the same hydraulic system as V @ m - m_inj = 0 plus L @ (K m |m|) = 0 (two vector equations total) instead of 35 scalar equations cuts inline time by more than half, because lambdify has far fewer expression nodes to walk at runtime.

  • Inline vs module printer. Even on a small system (only 79 nonzeros), the module printer wins once every mutable-matrix term reaches the vectorised scatter-add path. Here the loop-pressure Jacobian has two col-scale terms whose scaling vectors (K * |m| and K * m * sign(m)) are computed once per J_ call as dense numpy expressions in the wrapper, then handed to a Numba kernel that does two 10-iteration scatter-add loops. The only remaining overhead is the final coo_array(...).tocsc() packing.

Rule of thumb: module printer wins whenever the mutable-matrix blocks successfully match the fast path; it can lose to inline for models that fall through to the fallback (scipy-sparse + fancy indexing), or models small enough that the final COO-to-CSC packing dominates. Profile with %timeit mdl.J(y, mdl.p) on a representative iterate if in doubt, and check the generated num_func.py for any lingering MutableMatJacDataModule (= fallback) calls.

The scatter-add path is strictly \(\mathcal{O}(\text{nnz})\) in the block contents, while the old fancy-indexing path was dominated by the cost of constructing the intermediate sparse matrices — so the speed-up scales with nnz and with the number of mutable-matrix blocks. For models with many rows and few blocks the advantage is small; for models with many blocks per variable (such as power flow, with distinct \(\partial/\partial e\) and \(\partial/\partial f\) blocks for both \(P\) and \(Q\) balances) the advantage compounds.

Restrictions and reserved names

Mat_Mul imposes a few hard restrictions that are enforced at model build time. Each one prevents a class of silent-wrong-answer bugs that would otherwise slip past the generated Jacobian.

Time-varying sparse dim=2 parameters are rejected at construction

Solverz does not support sparse dim=2 Params whose values change at runtime. A declaration like:

from scipy.sparse import csc_array
m.K = Param('K',
            csc_array([[1, 0], [0, 1]]),
            dim=2, sparse=True,
            triggerable=True,
            trigger_var=['x'],
            trigger_fun=lambda x: csc_array([[2*x[0], 0], [0, 2*x[1]]]))
# ⇒ NotImplementedError: Parameter 'K': a sparse 2-D ``Param`` cannot
#   be declared ``triggerable=True``. Time-varying sparse matrices are
#   unsupported because Solverz's code generation caches the matrix's
#   CSC fields at model-build time; a runtime trigger update would
#   silently be ignored.

and the equivalent TimeSeriesParam:

m.G = TimeSeriesParam('G',
                      v_series=[...],
                      time_series=[...],
                      value=np.eye(3),
                      dim=2, sparse=True)
# ⇒ NotImplementedError: Parameter 'G': a sparse 2-D
#   ``TimeSeriesParam`` is not supported.

both raise at the point of construction, not deep inside FormJac, so the error points at the exact line where the user wrote the offending declaration.

Why: every Solverz code path that consumes a sparse dim=2 Param caches its CSC decomposition at model-build time:

  • The legacy MatVecMul code path decomposes every sparse dim=2 Param into flat arrays (<name>_data, <name>_indices, <name>_indptr, <name>_shape0) in Solverz/model/basic.py:56, which are stored in p_ and passed to inner_F as read-only numpy arrays.

  • The 0.8.1 Mat_Mul SolCF.csc_matvec fast path reuses those same CSC flats inside inner_F to do the matvec in @njit land.

  • The mutable-matrix Jacobian scatter-add kernel caches the matrix’s .data array directly as a _sz_mb_{N}_rs_dat_{k} setting entry loaded at module import time.

None of these caches are invalidated when p_["K"].trigger_fun fires or p_["G"].get_v_t(t) returns a fresh matrix. The runtime update simply gets lost, the Jacobian keeps using the initial values, and the Newton iteration either diverges or converges to the wrong solution — silently.

Historically the check was narrow (only equations containing Mat_Mul were rejected, and the check ran at FormJac time). In the 0.8.1 hotfix the policy broadened to cover every sparse dim=2 time-varying Param, regardless of where or whether it is used. The check now lives in ParamBase.__init__ and TimeSeriesParam.__init__, with an unconditional backstop in Equations._check_no_timevar_sparse_matrices that runs on every FormJac call in case a tainted Param was constructed via __new__ + attribute assignment (which bypasses the __init__ check).

Workarounds — all of these are supported:

  • Triggerable / time-series vectors next to a Mat_Mul. Scalar or 1-D time-varying parameters are fine; only the dim=2

    • sparse=True combination is rejected. Write m.K * m.x + Mat_Mul(m.A, m.x) - m.b with K a triggerable 1-D vector and A a plain sparse 2-D matrix — the wrapper rebuilds K on every Newton step via the trigger_var mechanism and the Mat_Mul(A, x) fast path uses the unchanged A.

  • Element-wise rewrite. If you genuinely need the matrix itself to evolve, write the residual out one scalar Eqn per row, with the per-row coefficients as 1-D Params or TimeSeriesParams. The element-wise form has no CSC caching — every call re-evaluates the full trigonometric / algebraic expression via pure numpy / numba.

  • Dense dim=2 parameters (sparse=False). The MutableMatJacDataModule fallback path used for dense matrices re-evaluates the entire block expression via scipy/numpy on every J_() call, so get_v_t(t) updates DO propagate. Dense parameters do not get the csc_matvec fast path, but they’re a correct (if slower) way to carry a time-varying matrix through the solver. A UserWarning still fires once per such parameter to flag the performance cost.

Reserved symbol prefixes

User symbols (Var, Param, iVar, Para, …) are rejected at construction time if their name starts with any of:

Prefix

Used for

_sz_mm_

Mat_Mul precompute placeholders in the F_/J_ wrapper (e.g. _sz_mm_0 = G @ e).

_sz_mb_

Mutable-matrix Jacobian block helpers (diag inner vectors, row/col-scale scaling vectors, mapping arrays).

m.x = Var('_sz_mm_0', [1.0])
# ⇒ ValueError: Symbol name '_sz_mm_0' starts with Solverz-reserved
#   internal prefix '_sz_mm_'. This prefix is used by the code
#   generator for Mat_Mul / mutable-matrix Jacobian helper names and
#   cannot be used for user variables or parameters.

Why: the code printer emits an unbounded family of helper identifiers of the form <prefix><int> and needs to know that none of them will collide with a user symbol. The prefix ban is the simplest way to guarantee that — users pick any name that does not start with _sz_mm_ / _sz_mb_, and the code printer is free to use any name that does.

Dense dim=2 parameters emit a performance warning

A parameter declared with dim=2, sparse=False that appears inside a Mat_Mul is legal — the fallback path handles it correctly — but it forfeits every optimisation described above:

  • no sparse pattern precomputation (the “output sparsity pattern” step produces a fully dense block),

  • no _sz_mb_{N}_rs_dat_k scatter-add (the block falls through to the MutableMatJacDataModule fancy-indexing path),

  • each Jacobian call allocates and populates a dense intermediate.

Solverz emits a one-shot UserWarning at FormJac time, once per offending parameter, to flag the performance cost:

m.A = Param('A', np.array([[2.0, 1.0], [1.0, 3.0]]), dim=2, sparse=False)
m.eqn = Eqn('f', m.x * Mat_Mul(m.A, m.x) - m.b)
smdl, y0 = m.create_instance()
smdl.FormJac(y0)
# UserWarning: Parameter 'A' is a dense 2-D ``Param(..., dim=2,
#   sparse=False)`` used inside ``Mat_Mul``. Dense matrices bypass
#   the vectorised mutable-matrix Jacobian fast path and fall back
#   to a slower scipy/ndarray indexing path. Declare 'A' with
#   ``sparse=True`` for substantially better performance.

How to fix: wrap the value in a scipy.sparse container and declare sparse=True:

from scipy.sparse import csc_array
m.A = Param('A', csc_array(np.array([[2.0, 1.0], [1.0, 3.0]])),
            dim=2, sparse=True)

On a dense 2×2 matrix the difference is invisible, but for anything larger than a few dozen rows the sparse path is substantially faster and (more importantly) keeps the Jacobian assembly off the fallback path.

API Reference

Solverz.sym_algebra.matrix_calculus.MixedEquationDiff(expr: Expr, symbol: Symbol)

Calculate the derivative of a mixed matrix-vector expression.

Parameters:

expr : sympy.Expr

The expression to differentiate (may contain Mat_Mul, Diag, etc.)

symbol : sympy.Symbol

The variable to differentiate with respect to

Returns:

sympy.Expr

The symbolic derivative expression

class Solverz.sym_algebra.matrix_calculus.TensorExpr(expr: Expr)
diff(s)

Differentiate the expression w.r.t. symbol s using forward mode.

Returns a SymPy expression representing the derivative.

visualize()

Visualize the computation DAG.