Matrix-Vector Calculus¶
Solverz provides automatic symbolic differentiation for mixed matrix-vector expressions.
When an equation contains matrix-vector products (Mat_Mul), Solverz automatically computes
analytical Jacobian blocks using tensor calculus with Einstein index notation.
This module is based on the approach described in:
Laue, Mitterreiter, Giesen. A Simple and Efficient Tensor Calculus. Proceedings of the AAAI Conference on Artificial Intelligence, 2020.
See also: matrixcalculus.org
Glossary¶
This page uses several abbreviations and design terms that recur throughout the document. They are collected here for reference; each one is hyperlinked from its first use in the body so you don’t have to scroll back. The terms describe Solverz’s code-generation strategy and the underlying numerical building blocks.
Term |
Definition |
|---|---|
SpMV |
Sparse Matrix-Vector multiplication — computing \(y = M\,x\) where \(M\) is stored in a sparse format. |
SpMM |
Sparse Matrix-Matrix multiplication — computing \(C = M\,B\) where \(B\) has more than one column. |
CSC / CSR |
Compressed Sparse Column / Compressed Sparse Row — the two scipy.sparse storage formats Solverz uses. CSC stores |
|
The Numba decorator that compiles a Python function to native LLVM-backed code at first call. Solverz uses |
scatter-add |
An assembly pattern where many indexed reads are accumulated ( |
fancy indexing |
NumPy / scipy.sparse indexing with arrays of indices, e.g. |
fast path |
The Solverz code-gen path used for |
fallback path |
The slower-but-correct path used whenever a |
|
SymPy’s symbolic-to-callable converter. Solverz’s “inline” mode uses |
hot F / hot J |
Steady-state per-call wall-clock time for |
cold compile |
The first call into a freshly-rendered module — pays the Numba |
LICM |
Loop-Invariant Code Motion, an LLVM optimisation pass that hoists computations out of inner loops when their inputs do not change inside the loop. |
inline mode |
Solverz’s |
module printer mode |
Solverz’s |
Supported Operations¶
For each supported operation the table below shows three things: the
Solverz symbolic constructor, the mathematical form it represents, and
the Jacobian block \(\partial f / \partial x\) that the matrix
calculus engine emits when the equation is differentiated with respect
to a vector variable x.
The “Jacobian block” column is not the symbolic intermediate
Solverz computes for each equation. For an element-wise function like
sin(x), Solverz first produces the elementwise vector cos(x) —
the same shape as the input — and only at Jacobian assembly time wraps
that vector in diag(...) to express the diagonal structure of the
matrix block. Equation residuals stay vector-valued throughout
symbolic processing; the matrix wrapper only appears in the final
Jacobian. Read the third column as “the matrix you would get if you
materialised \(\partial f / \partial x\) as a dense block”.
Operation |
Symbolic Form |
Example |
Jacobian block (\(\partial f / \partial x\) for vector \(x\)) |
|---|---|---|---|
Matrix-vector multiply |
|
\(Ax\) |
\(A\) |
Element-wise multiply |
|
\(x \odot y\) |
\(\operatorname{diag}(y)\) |
Addition |
|
\(x + y\) |
\(I\) |
Absolute value |
|
\(|x|\) |
\(\operatorname{diag}(\operatorname{sign}(x))\) |
Exponential |
|
\(e^x\) |
\(\operatorname{diag}(e^x)\) |
Sine / Cosine |
|
\(\sin(x)\) |
\(\operatorname{diag}(\cos(x))\) |
Logarithm |
|
\(\ln(x)\) |
\(\operatorname{diag}(x^{-1})\) |
Power |
|
\(x^n\) |
\(\operatorname{diag}(n x^{n-1})\) |
Transpose |
|
\(A^T\) |
\(A^T\) |
Diagonal |
|
\(\operatorname{diag}(x)\) |
\(\operatorname{diag}(\cdot)\) |
Note
Mat_Mul is the recommended interface for matrix-vector products in
new Solverz models. It is symbolic-first: the symbolic engine sees
each Mat_Mul(A, x) as a real operator, walks it through the
matrix-calculus pipeline, and emits an analytical Jacobian block
automatically.
On the fast path — when A is a plain sparse dim=2 Param
— Mat_Mul ends up calling exactly the same Numba helper
(SolCF.csc_matvec) the legacy MatVecMul interface uses. The two
have the same per-call cost; the difference is that Mat_Mul survives
the matrix-calculus engine and MatVecMul does not. On the
fallback path (negated matrices, nested Mat_Mul, dense
dim=2 parameters, or other matrix expressions) Mat_Mul falls
through to scipy.sparse @ in the wrapper — see the
Layer 1
discussion below.
MatVecMul is still deprecated for new models; pick Mat_Mul.
Warning
Sparse matrices MUST be immutable after modelling
All sparse matrix parameters declared with Param(..., dim=2, sparse=True) are
assumed to be immutable (both sparsity pattern and numerical values) once
modelling is complete and model.create_instance() has been called.
Solverz relies on this assumption for maximum runtime efficiency:
At code-generation time, the sparsity pattern of every mutable-matrix Jacobian block is analysed once, the output data-array layout is fixed, and a pre- computed index mapping is baked into a Numba-compiled scatter-add loop.
At runtime, every
J_(y, p)call assembles Jacobian block data by directly indexing intoMatrix.datausing the pre-computed row/column mappings — no scipy.sparse matrix is constructed per call.For sparse-matrix parameters used in
Mat_Mul(A, x), the matrix-vector productA @ xis precomputed in theF_wrapper and passed as a dense vector to the@njit-compiled equation functions.
If your application truly needs to update sparse-matrix values between solves
(e.g., sweeping a parameter), rebuild the model with model.create_instance()
each time. Mutating p_["A"].data after creation will silently produce wrong
Jacobians for matrices that flow through the vectorised Mat_Mul fast path —
scatter-add kernels cache M.data at analysis time and the cache becomes
stale. Matrices that only land in the MutableMatJacDataModule fallback path
(see Fallback path) re-evaluate the full expression on
every J_() call and would reflect the mutation, but there is no way to
guarantee a particular block won’t be promoted to the fast path in a future
release, so the rule is still: do not mutate sparse matrix params
post-build.
Dense parameters (Param(..., dim=2, sparse=False)) are fine to update via
mdl.p["name"] = new_value; the restriction only applies to sparse
dim=2 parameters.
All element-wise functions (exp, sin, cos, ln, Abs) can be composed with matrix operations:
from Solverz import Var, Param, Eqn, Model
from Solverz.sym_algebra.functions import Mat_Mul, exp, sin
m = Model()
m.x = Var('x', [1, 2, 3])
m.A = Param('A', [[1, 0, 0], [0, 2, 0], [0, 0, 3]], dim=2)
m.f = Eqn('f', exp(Mat_Mul(m.A, m.x))) # exp(A@x)
Solverz will automatically compute \(\frac{\partial}{\partial x} e^{Ax} = \operatorname{diag}(e^{Ax}) \cdot A\).
How It Works¶
The matrix calculus engine is a four-stage pipeline: each expression is lifted into Einstein index notation, rebuilt as a computation DAG, differentiated forward through the DAG, and finally collapsed back to ordinary matrix/vector operations. We walk through each stage with a running example from power flow.
1. Einstein Index Notation¶
Every tensor is labelled with one index per axis:
Scalar: no index (e.g., \(\alpha\)). Zeroth-order tensor, no axes.
Vector \(x \in \mathbb{R}^n\): a single index \(x_i\), where \(i\) ranges over \(\{1, \dots, n\}\). The index names the axis; the letter used is arbitrary, only its pattern of repetition matters.
Matrix \(A \in \mathbb{R}^{m \times n}\): two indices \(A_{ij}\). The first index \(i\) labels rows, the second \(j\) labels columns.
A repeated index inside a product implies summation (Einstein summation convention, or contraction):
Here \(j\) is repeated (contracted — it disappears in the result), while \(i\) is a free index that labels the resulting vector’s axis. Conventionally we write the non-contracted indices on the left of the expression, so the indices of the left-hand side \((Ax)_i\) match the remaining free indices \(i\) on the right.
The benefit of this notation is that the shape of an expression is fully encoded in its index signature. Element-wise multiplication uses repeated indices without contraction:
and the outer product uses two different free indices:
2. Computation DAG¶
Solverz parses each equation’s symbolic expression into a directed acyclic graph (DAG):
Leaf nodes — variables (\(x\), \(y\)) and parameters (\(A\), \(B\), \(G\)). Each leaf carries its index signature (\(x_i\), \(A_{ij}\), etc.).
Internal nodes — operations: addition \(+\), element-wise product \(\odot\), matrix product \(@\), element-wise unary function \(f(\cdot)\).
Root node — the output expression whose Jacobian we want.
For the power flow active-power sub-expression \(e \odot (Ge - Bf)\):
*[i] ← element-wise multiply (free index i)
/ \
e[i] -[i] ← subtraction (free index i)
/ \
@[i] @[i] ← matrix-vector products, contraction on j
/ \ / \
G[ij] e[j] B[ij] f[j]
Each node remembers its free index set. The matrix product nodes
@[i] each contract index \(j\) between \(G_{ij}\) and \(e_j\), so the
result has free index \(i\). The subtraction and element-wise product
keep \(i\) as their only free index, so the whole expression has index
signature \([i]\) — a vector.
3. Forward-Mode Automatic Differentiation¶
To obtain \(\partial f / \partial x\) where \(f\) is the root and \(x\) is some leaf variable, derivatives propagate from leaves upward to the root (pushforward, a.k.a. forward-mode AD). Every internal node computes its own Jacobian contribution and combines it with the Jacobians already accumulated at its children.
Throughout this section, \(A_{s_1}\) and \(B_{s_2}\) denote the free indices of the two operands (before differentiation), \(s_3\) denotes the free index of the result, and \(s_4\) is the derivative-direction index — the axis along which we differentiate, equal to the free index of the variable \(x\) we differentiate with respect to. Tilded tensors \(\tilde A, \tilde B\) carry one extra axis (labelled \(s_4\)) compared to \(A, B\): the derivative of a rank-\(k\) tensor w.r.t. a rank-1 variable is a rank-\((k{+}1)\) tensor.
Leaf variables. Differentiating \(x_j\) w.r.t. \(x_{s_4}\) gives the Kronecker delta:
\[\tilde{x}_{j\,s_4} \;=\; \frac{\partial x_j}{\partial x_{s_4}} \;=\; \delta_{j s_4}.\]All other variables (and all parameters) have derivative \(0\) w.r.t. \(x\).
Multiplication (product rule in index notation):
where \(A_{s_1} \cdot B_{s_2} = C_{s_3}\) is the original product (the specific index pattern \(s_1, s_2 \to s_3\) depends on the product type — element-wise, matrix-vector, or matrix-matrix). The two summands encode “differentiate the first factor, keep the second” plus “keep the first factor, differentiate the second”.
Element-wise unary function \(f\) (e.g. \(\exp\), \(\sin\), \(\ln\)): since \(f\) acts point-wise, the chain rule gives
The derivative \(f'\) is also computed element-wise and multiplied (Hadamard) with the child’s Jacobian.
Addition. Linearity:
4. TMul2Mul: Index Resolution¶
After the forward pass, derivatives live as tensor expressions labelled with free indices. The TMul2Mul stage reads the index pattern of each product and emits the corresponding concrete matrix / vector operation. Each row of the table has the form \((s_1, s_2, s_3)\): the free indices of the left operand, right operand, and the result.
Index pattern \((s_1, s_2, s_3)\) |
Matrix operation |
Meaning |
|---|---|---|
\((i,\; i,\; i)\) |
\(x \odot y\) |
element-wise multiply — both operands share axis \(i\), no contraction |
\((i,\; j,\; ij)\) |
\(x\,y^\top\) |
outer product — two disjoint free indices survive into the result |
\((ij,\; j,\; i)\) |
\(A\,y\) |
matrix-vector: index \(j\) is shared (contracted), \(i\) survives |
\((ij,\; jk,\; ik)\) |
\(A\,B\) |
matrix-matrix: inner \(j\) is contracted, \((i,k)\) survive |
\((i,\; ij,\; ij)\) |
\(\operatorname{diag}(x)\,A\) |
left diagonal scaling — scale each row of \(A\) by the matching entry of \(x\) |
\((ij,\; j,\; ij)\) |
\(A\,\operatorname{diag}(y)\) |
right diagonal scaling — scale each column of \(A\) by the matching entry of \(y\) |
\((i,\; ii,\; ii)\) |
\(\operatorname{diag}(x \odot y)\) |
diagonal-from-element-wise — used when a product has matching repeated indices \((ii)\) on one operand |
The key insight is that every well-formed index pattern maps to one concrete sparse-preserving operation. Patterns like \((ij, ij, ij)\) that don’t match the table are either simplified earlier (e.g. \((ij, ij, ij) \to A \odot B\), element-wise matrix product) or emitted via a fallback that produces a dense matrix.
Application Examples¶
Example 1: Power Flow Equations¶
The active power balance equation in power systems:
from Solverz.sym_algebra.symbols import iVar, Para
from Solverz.sym_algebra.functions import Mat_Mul
from Solverz.sym_algebra.matrix_calculus import TensorExpr
e = iVar('e')
f = iVar('f')
G = Para('G', dim=2)
B = Para('B', dim=2)
P = e * (Mat_Mul(G, e) - Mat_Mul(B, f)) + f * (Mat_Mul(B, e) + Mat_Mul(G, f))
TP = TensorExpr(P)
print(TP.diff(e)) # diag(-B@f + G@e) + diag(e)@G + diag(f)@B
print(TP.diff(f)) # diag(B@e + G@f) + diag(e)@(-B) + diag(f)@G
Example 2: Nonlinear Matrix Equations¶
from Solverz.sym_algebra.functions import exp
A = Para('A', dim=2)
x = iVar('x')
expr = exp(Mat_Mul(A, x))
TE = TensorExpr(expr)
print(TE.diff(x)) # diag(exp(A@x))@A
Example 3: Heat Network Equations¶
mL = Para('mL', dim=2)
K = Para('K')
m = iVar('m')
from Solverz.sym_algebra.functions import Abs
expr = Mat_Mul(mL, K * Abs(m) * m)
TE = TensorExpr(expr)
print(TE.diff(m)) # mL@(diag(K*Abs(m)) + diag(K*m*Sign(m)))
How numerical code is generated and optimized¶
When module_printer compiles an equation system that contains
Mat_Mul, it applies a three-layer optimisation strategy to make
the generated F_/J_ functions run as fast as possible. All three
layers operate on the same insight: at modelling time we already know
what the equations and their Jacobians look like — so every
computation that does not depend on the current iterate \(y\) can be
lifted out of the hot loop and paid for once, at code generation or
module load.
Warning
Everything below assumes sparse matrix parameters are immutable
after model.create_instance(). See the warning earlier in this
document. Every layer described here bakes in the matrix’s sparsity
pattern or its .data values as a constant; mutating them in-place
will silently corrupt results without raising any error.
Layer 1 — Mat_Mul precomputes as @njit CSC matvec¶
The code printer walks every residual RHS looking for Mat_Mul(A, v)
sub-trees, where A is a sparse parameter and v is any vector-valued
expression (a variable, a slice, or a larger expression). It performs
three transformations at code gen time:
Hoist. Each
Mat_Mul(A, v)is replaced by a fresh placeholder variable_sz_mm_N(whereNis a monotonically increasing integer unique across the whole system). The_sz_mm_prefix is reserved by Solverz — see Restrictions and reserved names below.Deduplicate. Two
Mat_Mulsub-trees that are structurally equal (same matrix, same operand expression after hoisting) are collapsed into a single placeholder. The power-flow example below computesG_nr @ eonce even though both the P-balance and Q-balance residuals reference it.Classify each placeholder. The classifier (
_classify_matmul_placeholders, with the inner shape predicate_shape_is_fast, inSolverz/code_printer/python/module/module_printer.py) decides whether the precompute can go on the fast path (matrix is a plain sparsedim=2Paraand the operand is a clean vector expression) or the fallback path (anything else:-Para, nestedMat_Mul, densedim=2matrices, matrix expressions, or operands that still contain unresolvedMat_Mul/ sparsedim=2Parareferences).A fast-path candidate is also demoted to the fallback path if another fallback placeholder’s
matrix_argoroperand_argreferences it as a free symbol — otherwise the wrapper would emit a scipy SpMV like_sz_mm_1 = (-A) @ _sz_mm_0whose_sz_mm_0operand was never materialised. The classifier iterates this demotion to a fixed point so chains of nestedMat_Muls land on a topologically valid emission order.
Fast path — the matvec is computed inside inner_F via the
existing SolCF.csc_matvec Numba helper
(Solverz/num_api/custom_function.py). Each sparse dim=2 Param is
already decomposed at model-build time into its CSC flat arrays
(A_data / A_indices / A_indptr / A_shape0) by
Solverz/model/basic.py; these flat arrays flow through
param_list to inner_F as ordinary numpy arrays, and inner_F
issues a SolCF.csc_matvec(A_data, A_indices, A_indptr, A_shape0, operand) call per placeholder. The wrapper F_() does not load the
sparse matrix object A at all — the whole matvec lives in Numba
land and pays zero scipy dispatch cost:
def F_(y_, p_):
e = y_[0:29]
f = y_[29:58]
G_nr_data = p_["G_nr_data"]
G_nr_indices = p_["G_nr_indices"]
G_nr_indptr = p_["G_nr_indptr"]
G_nr_shape0 = p_["G_nr_shape0"]
B_nr_data = p_["B_nr_data"]
... # dense vector params follow
return inner_F(_F_, e, f, p_ref, q_ref,
G_nr_data, G_nr_indices, G_nr_indptr, G_nr_shape0,
B_nr_data, B_nr_indices, B_nr_indptr, B_nr_shape0,
...)
@njit(cache=True)
def inner_F(_F_, e, f, p_ref, q_ref,
G_nr_data, G_nr_indices, G_nr_indptr, G_nr_shape0,
B_nr_data, B_nr_indices, B_nr_indptr, B_nr_shape0, ...):
# Fast-path matvecs: all inside @njit, no scipy dispatch.
_sz_mm_0 = SolCF.csc_matvec(G_nr_data, G_nr_indices, G_nr_indptr, G_nr_shape0, e)
_sz_mm_1 = SolCF.csc_matvec(B_nr_data, B_nr_indices, B_nr_indptr, B_nr_shape0, f)
_sz_mm_2 = SolCF.csc_matvec(B_nr_data, B_nr_indices, B_nr_indptr, B_nr_shape0, e)
_sz_mm_3 = SolCF.csc_matvec(G_nr_data, G_nr_indices, G_nr_indptr, G_nr_shape0, f)
_F_[0:29] = inner_F0(e, f, p_ref, q_ref, _sz_mm_0, _sz_mm_1, _sz_mm_2, _sz_mm_3)
_F_[29:53] = inner_F1(...)
return _F_
@njit(cache=True)
def inner_F0(e, f, p_ref, q_ref, _sz_mm_0, _sz_mm_1, _sz_mm_2, _sz_mm_3):
return e * (p_ref - _sz_mm_1 + _sz_mm_0) + f * (q_ref + _sz_mm_2 + _sz_mm_3) - Pinj
Why this matters: scipy.sparse matrix-vector products are implemented
in C and fast per se, but each call crosses Python → scipy dispatch
→ C → back, which on a 58×58 matrix costs ~1.5 µs of pure dispatch
overhead — 10× the actual arithmetic. Eight such SpMVs in a hot F
made Mat_Mul catastrophically slow on small networks
(case30 hot F was 10× worse than the for-loop form). Moving the
matvec into SolCF.csc_matvec inside inner_F eliminates all Python
dispatch: the helper is @njit(cache=True) and inner_F calls it as
an intra-Numba function call, which the LLVM inline pass typically
flattens into the caller’s body. On case30, this drops hot F from
≈14 µs to ≈3 µs (~4.4× speedup) without any other code change.
Fallback path — when the matrix operand is not a plain sparse
Para (e.g. -G_nr, Mat_Mul(A, B), or a dense dim=2 parameter),
the old path is kept: F_() emits _sz_mm_N = matrix_expr @ operand
in the wrapper (using scipy.sparse or numpy @) and passes
_sz_mm_N as an extra dense argument to inner_F. These
placeholders do not benefit from the fast path and pay the scipy
dispatch cost per call. Dense dim=2 parameters fire a one-shot
UserWarning at FormJac time — see
Restrictions and reserved names.
Note
The fast path reuses infrastructure that predates the Mat_Mul
interface: the same SolCF.csc_matvec helper and the same
<name>_data / _indices / _indptr / _shape0 Param decomposition
were introduced for the legacy MatVecMul operator. Before 0.8.1
fast path, those fields were still emitted into inner_F’s argument
list but never referenced — they are now the conduit for the Layer 1
matvec. Legacy MatVecMul users already benefited from this pipeline;
from 0.8.1 Mat_Mul users do as well, without any model-level
changes.
When to use Mat_Mul¶
API-level rules of thumb. The detailed case study with full benchmark numbers, decision matrix, and the case30-scale hot-F regression analysis lives in the power-flow chapter of the Solverz Cookbook — this section just states the API guidance.
Use Mat_Mul when any of the following hold:
You want compact, paper-faithful equations.
Mat_Mul(G, e)replacesnbscalarEqndefinitions and lets the matrix calculus engine derive the Jacobian automatically.You iterate on the model. Every compile / build phase (
Modelconstruction,FormJac,made_numerical,module_printer.render, and the first-time Numba cold compile) is dramatically faster underMat_Multhan under the equivalent element-wise for-loop form.All matrix operands are plain
Param(..., dim=2, sparse=True)— this unlocks the Layer 1 fast path described above.
The narrow case where the for-loop form still wins (very small network, F-dominated hot loop, single compile) and the underlying case30 numbers are documented in the Cookbook chapter linked above.
Matrix shapes that fall out of the fast path and therefore pay
the scipy SpMV dispatch cost on every F_ call (~1.5 µs
per SpMV — consequential at small network scale, negligible
elsewhere):
Negated matrices —
Mat_Mul(-G, x).Gis a plainParabut-Gis aMul(-1, G)expression the classifier doesn’t recognise. Workaround: write-Mat_Mul(G, x)instead ofMat_Mul(-G, x)so the sign flows through the outer expression andGstays on the fast path.Dense
dim=2params —Param(A, dim=2, sparse=False).FormJacfires a one-shotUserWarningfor these; they fall through to theMutableMatJacDataModulepath. Convert to sparse withcsc_array(A)and declaresparse=True.Nested
Mat_Mul—Mat_Mul(A, Mat_Mul(B, x)). The classifier’s dependency-aware demotion takes care of the topological ordering, but only the innerMat_Mulends up on the fast path when both matrices are plain Paras; the outer pays the wrapper SpMV cost.Three-or-more-arg
Mat_Mul—Mat_Mul(A, B, x). The operand of the placeholder is itself aMat_Mul, so the classifier rejects it from the fast path.Matrix expressions —
Mat_Mul(A + B, x)or similar. The classifier recognises only a bareParaas the matrix operand. Workaround: materialise the combined matrix as a singleParam(A + B, dim=2, sparse=True)outside the equation.
Layer 2 — Mutable matrix Jacobian blocks¶
The Jacobian is handled by a separate, symbolic decomposition because its blocks are matrices (not vectors) whose structure depends on the full expression. Consider the power-flow block
All three terms are mutable — they depend on \(e\) and \(f\) and must be
re-evaluated on every J_(y, p) call. But their sparsity pattern is
constant: the union of the diagonal, the pattern of \(G\), and the
pattern of \(B\) — all fixed at modelling time.
Solverz exploits this in a pattern-match compiler:
Parse the block into typed terms. The symbolic analyser walks the SymPy expression tree (implemented in
Solverz/code_printer/python/module/mutable_mat_analyzer.py). Each additive term is classified as one of:Shape
Generated code contribution
Diag(u_expr)(diagonal term)data[out_pos] += u[src_idx]Mat_Mul(Diag(v), M)(row-scale)data[out_pos] += v[src_row] * M_data[k]Mat_Mul(M, Diag(v))(col-scale)data[out_pos] += v[src_col] * M_data[k]A term is deferred to the fallback path when any of the following is true:
it has a non-unit numeric coefficient other than
±1(e.g.3 * Diag(x)) — the analyser only unwraps±1via_extract_sign_and_core;it is
Mat_Mul(Diag(v), X)orMat_Mul(X, Diag(v))whereXis not a plainPara(or-Para) — for instance a matrix expression such asA @ B, or a user-constructed product that the matrix calculus engine hasn’t simplified down to a single leaf matrix;it is a
Mat_Mulwith three or more arguments that the matrix-calculus engine didn’t fold into a nested binary form;it is any other shape the classifier doesn’t recognise (transpose of a mutable matrix, element-wise matrix product, a non-sparse matrix leaf, etc.).
Unrecognised terms are deferred to the fallback path, which is correct but slower: it evaluates the full sub-expression via scipy.sparse on every
J_()call and extracts values by fancy indexing.Precompute the output sparsity pattern (once). At model initialisation (inside
FormJac,Solverz/equation/equations.py), every mutable-matrix block is evaluated one time with all variables perturbed to distinct non-zero values and allDiagnodes substituted withSpDiag(which emitssps.diagsinstead of the densenp.diagflat). This perturbed evaluation is the entire reason we go to the trouble of theSpDiagsubstitution: using non-zero variable values guarantees that no term accidentally collapses to zero at \(y_0\) (which would happen in a flat-start power flow where \(f = 0\) andsps.diags(f)@Bis empty). The resulting sparse matrix gives us the union of all terms’ patterns — the structural upper bound that the runtime block must fit into.Precompute the index mapping arrays (once). For each recognised term, the analyser builds three numpy arrays:
out_pos[k]— the index into the block’s outputdataarray where the \(k\)-th nonzero contribution of this term lands.src_idx[k]— for diagonal terms, the row \(i\) in the inner vector \(u\) to read from; for row-scale terms, the row of \(M\)’s \(k\)-th nonzero; for col-scale terms, the column.mat_data(row/col-scale terms only) — a direct reference toM.data(which is why the matrix must be immutable).
These arrays are stored inside
setting, a per-module dict that also holds things like the CSCrow,col, and initialdatafor the full Jacobian.Generate a dedicated Numba kernel per block. For each mutable matrix block the code printer emits a new top-level function named
_mut_block_N(whereNis the block index). Its signature takes the diag inner vectors, the base variables needed by any row/col-scale term, and each term’s three mapping arrays. All per-block helper arrays use the reserved_sz_mb_{N}_prefix so different blocks’ mappings never collide. The body is a pure- Python / Numba scatter-add loop:
# Module-level (loaded once from setting at import time):
_sz_mb_0_diag_out_0 = setting["_sz_mb_0_diag_out_0"]
_sz_mb_0_diag_src_0 = setting["_sz_mb_0_diag_src_0"]
_sz_mb_0_rs_out_0 = setting["_sz_mb_0_rs_out_0"]
_sz_mb_0_rs_src_0 = setting["_sz_mb_0_rs_src_0"]
_sz_mb_0_rs_dat_0 = setting["_sz_mb_0_rs_dat_0"] # frozen G_nr.data
...
@njit(cache=True)
def _mut_block_0(_sz_mb_0_u0, e, f,
_sz_mb_0_diag_out_0, _sz_mb_0_diag_src_0,
_sz_mb_0_rs_out_0, _sz_mb_0_rs_src_0, _sz_mb_0_rs_dat_0,
_sz_mb_0_rs_out_1, _sz_mb_0_rs_src_1, _sz_mb_0_rs_dat_1):
data = zeros(107)
# Diagonal term: diag(G_nr@e - B_nr@f + p_ref)
for i in range(_sz_mb_0_diag_out_0.shape[0]):
data[_sz_mb_0_diag_out_0[i]] += _sz_mb_0_u0[_sz_mb_0_diag_src_0[i]]
# Row-scale term: diag(e) @ G_nr
for i in range(_sz_mb_0_rs_out_0.shape[0]):
data[_sz_mb_0_rs_out_0[i]] += e[_sz_mb_0_rs_src_0[i]] * _sz_mb_0_rs_dat_0[i]
# Row-scale term: diag(f) @ B_nr
for i in range(_sz_mb_0_rs_out_1.shape[0]):
data[_sz_mb_0_rs_out_1[i]] += f[_sz_mb_0_rs_src_1[i]] * _sz_mb_0_rs_dat_1[i]
return data
Note
Every scatter-add loop uses +=, not =, including the diagonal
terms. This matters when a block contains multiple independent
Diag(...) additive terms whose output positions coincide — for
example the Jacobian of x*(A@x) + x*(B@x) produces diag(A@x) and
diag(B@x), both landing on the same (i,i) positions. The +=
accumulates their contributions correctly; an earlier implementation
used = and silently dropped one of the two terms.
Wire the kernel up in the
J_wrapper. For each block the wrapper emits: (a) one line per diag term that materialises its inner vector using scipy.sparse (e.g._sz_mb_0_u0 = p_ref - (B_nr@f) + (G_nr@e)); (b) a call to_mut_block_N(...)passing the inner vectors, base variables, and mapping arrays; (c) an assignment of the returneddataarray into the appropriate slice of the overall Jacobian_data_.Crucially, each block’s scipy work is capped at one SpMV per diag term — nothing else in the block touches sparse matrices at runtime. The data for row/col-scale terms comes straight from the pre-baked
_sz_mb_{N}_rs_dat_{k}arrays.
The net effect on each J_(y, p) call is that Jacobian assembly cost
is dominated by (i) the constant inner_J call for element-wise
blocks and (ii) a handful of SpMVs for the diag inner
vectors. The actual mutable-matrix blocks themselves are assembled in
\(\mathcal{O}(\text{nnz})\) vectorised Numba loops — no sparse matrix
allocation, no sparsity reconstruction, no format conversion.
This benefit applies to every solver that calls J_(y, p) —
algebraic-equation Newton solvers (nr_method, sicnm), DAE
integrators (which call J_ at every implicit substep), FDAE
solvers, and any future solver that consumes Solverz’s Jacobian
contract. The cost model is “per J_ call”, not “per Newton step”.
Two paths: scatter-add (fast) and fancy indexing (fallback)¶
Solverz keeps both assembly paths in every generated module and picks per Jacobian block at code-gen time. The fast path is the typed scatter-add described above; the fallback path exists for blocks whose term structure the symbolic analyser cannot classify (see Layer 2 step 1) and is described in the next subsection. There is no on/off switch — both paths co-exist, and the runtime always uses whichever the classifier picked.
The fallback path is what an earlier iteration of the pipeline used
everywhere: build the sparse matrix with sps.diags and friends,
then read back the nonzero values at the precomputed COO positions
via fancy indexing (expr.tocsr()[[rows], [cols]]). It is
correct and elegant but slow — scipy has to construct several
intermediate sparse matrices per call, sum them (with explicit-zero
elimination), and then perform a linear-scan lookup for each output
position. On a 30-bus power flow case, each J_(y, p) call took
≈ 280 µs in that pre-0.8 architecture. The scatter-add Numba path
introduced in 0.8.0 drops that to ≈ 50 µs while producing
bit-identical results.
If you want to know whether a particular block in your model is on
the fast path or the fallback, profile with pytest --durations=10
or cProfile and look for MutableMatJacDataModule in the traces
— its presence in the hot J_() line indicates that block is on
the fallback.
Layer 3 — Selective @njit¶
All generated inner loops — inner_F, every per-equation
inner_F{N}, inner_J, every per-block inner_J{N}, and every
_mut_block_{N} — carry @njit(cache=True). Only the F_ / J_
wrappers remain pure Python; they host any fallback-path
SpMV precomputes, the module-level parameter unpacks, and
the final sps.coo_array((data, (row, col)), shape).tocsc() that
packages the Jacobian back into a scipy object for the downstream
solver (algebraic Newton-style or DAE/ODE integrator).
The selectivity matters: Numba absolutely cannot digest a
scipy.sparse.csc_array argument, which is why the wrappers exist at
all. Anything that can run inside Numba is pushed there, at zero
syntactic cost to the user — they just write Mat_Mul(A, x) and
module_printer figures out the rest.
An important corollary for non-Mat_Mul models: if an equation system
contains no Mat_Mul at all, the entire precompute / mutable-
matrix machinery is disabled and F_ / J_ collapse back to the
original pre-0.8 architecture (direct @njit on inner_F receiving
only dense vectors and scalar params). Performance on pure
element-wise models is therefore unchanged by the Mat_Mul rewrite.
Fallback path — scipy.sparse + fancy indexing¶
When a Jacobian block contains a term shape the symbolic analyser
does not recognise — for example an unusual nested Mat_Mul, a
non-parameter matrix, a matrix-valued non-linear expression, or a
term with a non-±1 coefficient — that single block is handled by
the fallback path. The wrapper rebuilds the block with
scipy.sparse and reads the requested nonzero values back via
fancy indexing:
data[start:stop] = asarray(
(sps.diags(u) + sps.diags(v)@M + ...)
.tocsr()[[rows], [cols]]
).ravel()
It is \(\mathcal{O}(\text{nnz} \log n)\) per call instead of the \(\mathcal{O}(\text{nnz})\) scatter-add path, but it never misses a derivative term. The two paths co-exist in every generated module: only the single affected block falls back; the rest of the Jacobian still uses the fast scatter-add path. There is no fallback-only mode and no opt-in switch — Solverz simply picks per block based on what the classifier recognised at code-gen time.
Performance¶
The full case30 power-flow benchmark — every phase from Model
construction through cold compile to hot F / hot J — lives
in the
Solverz Cookbook power-flow chapter.
This subsection only carries one extra micro-benchmark (DHS
hydraulic on Barry Island) that exercises a Jacobian shape the
power-flow case does not.
Benchmark environment. All numbers below were measured under:
Hardware: 2025 MacBook Air, Apple M4, AC power, no thermal throttling observed during the runs.
OS: macOS 26.4 (build 25E246).
Python: 3.11.13.
Library versions:
numpy==2.3.3,scipy==1.16.0,numba==0.65.0,sympy==1.13.3,Solverz==0.8.1(the post-csc_matvecfast path).Methodology: 10 warm-up calls (to bake the Numba caches and prime the CPU branch predictor) followed by 5000–20000 timed iterations, median of three repeats. The cold compile measurement (Cookbook only) is run in a fresh Python subprocess with
__pycache__and Numba.nbi/.nbccaches wiped beforehand. The same per-call timing helper is used in the Cookbook’sbench_pf_matmul_vs_polar.py; the per-J_numbers below were captured by importing the rendered DHS module and callingmdl.J(y, mdl.p)in a tight loop.
DHS hydraulic subproblem on BarryIsland (35 unknowns, 1 loop,
1 mutable-matrix block — the loop pressure Jacobian contains two
col-scale terms \(L\,\operatorname{diag}(K \odot |m|)\) and
\(L\,\operatorname{diag}(K \odot m \odot \operatorname{sign}(m))\)
whose scaling vectors are non-trivial expressions of the variable):
Pipeline |
|
|---|---|
Element-wise inline (one scalar |
≈ 103 µs |
|
≈ 45 µs |
|
≈ 28 µs |
Element-wise vs
Mat_Mulinline. Writing the same hydraulic system asV @ m - m_inj = 0plusL @ (K m |m|) = 0(two vector equations total) instead of 35 scalar equations cuts inline time by more than half, because lambdify has far fewer expression nodes to walk on every call.Inline vs module printer. Even on a small system (only 79 nonzeros), the module printer wins once every mutable-matrix term reaches the vectorised scatter-add path. Here the loop- pressure Jacobian has two col-scale terms whose scaling vectors (
K * |m|andK * m * sign(m)) are computed once perJ_(y, p)call as dense numpy expressions in the wrapper, then handed to a Numba kernel that does two 10-iteration scatter-add loops. The only remaining overhead is the finalcoo_array(...).tocsc()packing.
Rule of thumb: the module printer wins whenever the mutable-matrix
blocks successfully match the fast path; it can lose to
inline mode for models that fall through to the
fallback path, or models small enough that the final
COO-to-CSC packing dominates. Profile with
%timeit mdl.J(y, mdl.p) on a representative iterate if in doubt,
and check the generated num_func.py for any lingering
MutableMatJacDataModule (= fallback) calls.
The scatter-add path is strictly \(\mathcal{O}(\text{nnz})\) in the block contents, while the old fancy-indexing path was dominated by the cost of constructing the intermediate sparse matrices — so the speed-up scales with nnz and with the number of mutable-matrix blocks. For models with many rows and few blocks the advantage is small; for models with many blocks per variable (such as power flow, with distinct \(\partial/\partial e\) and \(\partial/\partial f\) blocks for both \(P\) and \(Q\) balances) the advantage compounds.
Diagnostic warnings¶
When a Mat_Mul placeholder or a mutable Jacobian block cannot use the
fast path, Solverz emits a UserWarning describing what broke the fast
path, why, and how to rewrite the expression to recover it. Warnings
fire once per fallback item during code generation (module mode only —
the inline printer has no fast/fallback split).
All warnings use the standard Python warnings.warn(..., UserWarning)
mechanism. To silence them:
import warnings
warnings.simplefilter('ignore', UserWarning)
Layer 1 — Mat_Mul placeholder fallback¶
Expression |
Reason |
Suggested rewrite |
|---|---|---|
|
Negation of a sparse Param |
|
|
Scalar multiple of a Param (numeric or scalar |
|
|
Transposed matrix Param |
Predeclare |
|
Element-wise |
Nest as |
|
Sum of matrices |
|
|
Multi-arg |
|
|
Operand references sparse |
Precompute the lookup as a |
|
Dense |
Declare with |
Demotion cascade |
Upstream fallback placeholder references this one |
Fix the upstream fallback first; demoted placeholders auto-recover |
Layer 2 — Mutable Jacobian block fallback¶
When a mutable Jacobian block contains a term that doesn’t match the
Diag / row-scale / col-scale / biscale fast path, the warning names
the equation, variable, and the offending term.
Common causes (the L2 classifier reports each one distinctly):
Element-wise
MulwithDiagfactors — e.g.Diag(u) * M * Diag(v)written with Python*. Rewrite asMat_Mul(Diag(u), M, Diag(v)).Single-Diag
Mat_Mulwith an unsupported matrix factor —Mat_Mul(Diag(v), A+B)or similar where the analyzer’s_sparse_matrix_nnzcan’t materialise the matrix factor. Split the term so each piece has a single sparsedim=2Param.Biscale
Mat_Mul(Diag(u), M, Diag(v))with unsupported middle matrixM.No-
DiagMat_Mulof two matrices — wrap one operand inDiag(...)or split into supported shapes.Two-argument
Mat_Mul(Diag(u), Diag(v))— degenerate; rewrite asDiag(u * v).
Restrictions and reserved names¶
Mat_Mul imposes a few hard restrictions that are enforced at model
build time. Each one prevents a class of silent-wrong-answer bugs that
would otherwise slip past the generated Jacobian.
Time-varying sparse dim=2 parameters are rejected at construction¶
Solverz does not support sparse dim=2 Params whose values change
at runtime. A declaration like:
from scipy.sparse import csc_array
m.K = Param('K',
csc_array([[1, 0], [0, 1]]),
dim=2, sparse=True,
triggerable=True,
trigger_var=['x'],
trigger_fun=lambda x: csc_array([[2*x[0], 0], [0, 2*x[1]]]))
# ⇒ NotImplementedError: Parameter 'K': a sparse 2-D ``Param`` cannot
# be declared ``triggerable=True``. Time-varying sparse matrices are
# unsupported because Solverz's code generation caches the matrix's
# CSC fields at model-build time; a runtime trigger update would
# silently be ignored.
and the equivalent TimeSeriesParam:
m.G = TimeSeriesParam('G',
v_series=[...],
time_series=[...],
value=np.eye(3),
dim=2, sparse=True)
# ⇒ NotImplementedError: Parameter 'G': a sparse 2-D
# ``TimeSeriesParam`` is not supported.
both raise at the point of construction, not deep inside
FormJac, so the error points at the exact line where the user
wrote the offending declaration.
Why: every Solverz code path that consumes a sparse dim=2
Param caches its CSC decomposition at model-build time:
The legacy
MatVecMulcode path decomposes every sparsedim=2Paraminto flat arrays (<name>_data,<name>_indices,<name>_indptr,<name>_shape0) inSolverz/model/basic.py:56, which are stored inp_and passed toinner_Fas read-only numpy arrays.The 0.8.1
Mat_MulSolCF.csc_matvecfast path reuses those same CSC flats insideinner_Fto do the matvec in@njitland.The mutable-matrix Jacobian scatter-add kernel caches the matrix’s
.dataarray directly as a_sz_mb_{N}_rs_dat_{k}setting entry loaded at module import time.
None of these caches are invalidated when p_["K"].trigger_fun
fires or p_["G"].get_v_t(t) returns a fresh matrix. The runtime
update simply gets lost, the Jacobian keeps using the initial
values, and the downstream solver (Newton-Raphson, DAE integrator,
or any other consumer of J_) either diverges or converges to the
wrong solution — silently.
Historically the check was narrow (only equations containing
Mat_Mul were rejected, and the check ran at FormJac time). In
the 0.8.1 hotfix the policy broadened to cover every sparse
dim=2 time-varying Param, regardless of where or whether it is
used. The check now lives in ParamBase.__init__ and
TimeSeriesParam.__init__, with an unconditional backstop in
Equations._check_no_timevar_sparse_matrices that runs on every
FormJac call in case a tainted Param was constructed via
__new__ + attribute assignment (which bypasses the __init__
check).
Workarounds — all of these are supported:
Triggerable / time-series vectors next to a
Mat_Mul. Scalar or 1-D time-varying parameters are fine; only thedim=2sparse=Truecombination is rejected. Writem.K * m.x + Mat_Mul(m.A, m.x) - m.bwithKa triggerable 1-D vector andAa plain sparse 2-D matrix — the wrapper rebuildsKon every solver step via thetrigger_varmechanism and theMat_Mul(A, x)fast path uses the unchangedA.
Element-wise rewrite. If you genuinely need the matrix itself to evolve, write the residual out one scalar
Eqnper row, with the per-row coefficients as 1-DParams orTimeSeriesParams. The element-wise form has no CSC caching — every call re-evaluates the full trigonometric / algebraic expression via pure numpy / numba.Dense
dim=2parameters (sparse=False). TheMutableMatJacDataModulefallback path used for dense matrices re-evaluates the entire block expression via scipy/numpy on everyJ_()call, soget_v_t(t)updates DO propagate. Dense parameters do not get thecsc_matvecfast path, but they’re a correct (if slower) way to carry a time-varying matrix through the solver. AUserWarningstill fires once per such parameter to flag the performance cost.
Reserved symbol prefixes¶
User symbols (Var, Param, iVar, Para, …) are rejected at
construction time if their name starts with any of:
Prefix |
Used for |
|---|---|
|
Mat_Mul precompute placeholders in the |
|
Mutable-matrix Jacobian block helpers (diag inner vectors, row/col-scale scaling vectors, mapping arrays). |
m.x = Var('_sz_mm_0', [1.0])
# ⇒ ValueError: Symbol name '_sz_mm_0' starts with Solverz-reserved
# internal prefix '_sz_mm_'. This prefix is used by the code
# generator for Mat_Mul / mutable-matrix Jacobian helper names and
# cannot be used for user variables or parameters.
Why: the code printer emits an unbounded family of helper
identifiers of the form <prefix><int> and needs to know that none of
them will collide with a user symbol. The prefix ban is the simplest
way to guarantee that — users pick any name that does not start with
_sz_mm_ / _sz_mb_, and the code printer is free to use any name
that does.
Dense dim=2 parameters emit a performance warning¶
A parameter declared with dim=2, sparse=False that appears inside a
Mat_Mul is legal — the fallback path handles it correctly — but it
forfeits every optimisation described above:
no sparse pattern precomputation (the “output sparsity pattern” step produces a fully dense block),
no
_sz_mb_{N}_rs_dat_kscatter-add (the block falls through to theMutableMatJacDataModulefancy-indexing path),each Jacobian call allocates and populates a dense intermediate.
Solverz emits a one-shot UserWarning at FormJac time, once per
offending parameter, to flag the performance cost:
m.A = Param('A', np.array([[2.0, 1.0], [1.0, 3.0]]), dim=2, sparse=False)
m.eqn = Eqn('f', m.x * Mat_Mul(m.A, m.x) - m.b)
smdl, y0 = m.create_instance()
smdl.FormJac(y0)
# UserWarning: Parameter 'A' is a dense 2-D ``Param(..., dim=2,
# sparse=False)`` used inside ``Mat_Mul``. Dense matrices bypass
# the vectorised mutable-matrix Jacobian fast path and fall back
# to a slower scipy/ndarray indexing path. Declare 'A' with
# ``sparse=True`` for substantially better performance.
How to fix: wrap the value in a scipy.sparse container and declare
sparse=True:
from scipy.sparse import csc_array
m.A = Param('A', csc_array(np.array([[2.0, 1.0], [1.0, 3.0]])),
dim=2, sparse=True)
On a dense 2×2 matrix the difference is invisible, but for anything larger than a few dozen rows the sparse path is substantially faster and (more importantly) keeps the Jacobian assembly off the fallback path.
API Reference¶
- Solverz.sym_algebra.matrix_calculus.MixedEquationDiff(expr: Expr, symbol: Symbol)¶
Calculate the derivative of a mixed matrix-vector expression.
- Parameters:
expr : sympy.Expr
The expression to differentiate (may contain Mat_Mul, Diag, etc.)
symbol : sympy.Symbol
The variable to differentiate with respect to
- Returns:
sympy.Expr
The symbolic derivative expression