(matrix_calculus)= # Matrix-Vector Calculus Solverz provides automatic symbolic differentiation for mixed matrix-vector expressions. When an equation contains matrix-vector products (`Mat_Mul`), Solverz automatically computes analytical Jacobian blocks using tensor calculus with Einstein index notation. This module is based on the approach described in: > Laue, Mitterreiter, Giesen. *A Simple and Efficient Tensor Calculus.* > Proceedings of the AAAI Conference on Artificial Intelligence, 2020. See also: [matrixcalculus.org](https://www.matrixcalculus.org/) (matrix-calculus-glossary)= ## Glossary This page uses several abbreviations and design terms that recur throughout the document. They are collected here for reference; each one is hyperlinked from its first use in the body so you don't have to scroll back. The terms describe Solverz's code-generation strategy and the underlying numerical building blocks. | Term | Definition | |---|---| | **SpMV** | *Sparse Matrix-Vector multiplication* — computing $y = M\,x$ where $M$ is stored in a sparse format. | | **SpMM** | *Sparse Matrix-Matrix multiplication* — computing $C = M\,B$ where $B$ has more than one column. | | **CSC / CSR** | *Compressed Sparse Column* / *Compressed Sparse Row* — the two scipy.sparse storage formats Solverz uses. CSC stores `data`, `indices`, `indptr` walking columns; CSR walks rows. The legacy `MatVecMul` and the 0.8.1 `Mat_Mul` fast path both consume CSC. | | **`@njit`** | The Numba decorator that compiles a Python function to native LLVM-backed code at first call. Solverz uses `@njit(cache=True)` so the compiled code is cached on disk and reused across processes. | | **scatter-add** | An assembly pattern where many indexed reads are accumulated (`+=`) into a fixed output buffer. The mutable-matrix Jacobian kernels are pure scatter-add loops over precomputed index arrays — see [Layer 2](#layer-2-mutable-matrix-jacobian-blocks). | | **fancy indexing** | NumPy / scipy.sparse indexing with arrays of indices, e.g. `M[[r0, r1, r2], [c0, c1, c2]]`, returning the values at those positions. The slow {term}`fallback path` uses this to extract block values from a freshly-built sparse expression. | | **fast path** | The Solverz code-gen path used for `Mat_Mul(A, x)` placeholders where `A` is a plain sparse `dim=2` `Para`: the matvec runs **inside** `inner_F` via the Numba `SolCF.csc_matvec` helper, with zero scipy.sparse calls per evaluation. Mutable-matrix Jacobian blocks have a parallel fast path of typed scatter-add kernels. | | **fallback path** | The slower-but-correct path used whenever a `Mat_Mul` operand or a Jacobian block doesn't fit the fast-path shape — for instance negated matrices, nested `Mat_Mul`, dense `dim=2`, or general matrix expressions. The wrapper falls back to `scipy.sparse @` and (for Jacobian blocks) fancy indexing into a freshly-built sparse matrix. The fast and fallback paths **co-exist**: the runtime picks per placeholder / per Jacobian block based on what the symbolic classifier recognises. | | **`lambdify`** | SymPy's symbolic-to-callable converter. Solverz's "inline" mode uses `lambdify` to turn each equation's symbolic expression into a Python function on every call (no JIT). | | **hot F / hot J** | Steady-state per-call wall-clock time for `F_(y, p)` / `J_(y, p)` after JIT warm-up. Distinguished from {term}`cold compile`. | | **cold compile** | The first call into a freshly-rendered module — pays the Numba `@njit` compile cost for every decorated helper. Cached to disk so subsequent process starts skip it. | | **LICM** | *Loop-Invariant Code Motion*, an LLVM optimisation pass that hoists computations out of inner loops when their inputs do not change inside the loop. | | **inline mode** | Solverz's `made_numerical(spf, y0)` code path — uses `lambdify` to build callable F/J from the SymPy expressions on every call. No Numba compilation. | | **module printer mode** | Solverz's `module_printer(spf, y0).render()` code path — emits a Python module file with `@njit`-decorated helpers, cached to disk. The fast / fallback distinction only applies to this mode. | ## Supported Operations For each supported operation the table below shows three things: the Solverz symbolic constructor, the mathematical form it represents, and the **Jacobian block** $\partial f / \partial x$ that the matrix calculus engine emits when the equation is differentiated with respect to a vector variable `x`. The "Jacobian block" column is **not** the symbolic intermediate Solverz computes for each equation. For an element-wise function like `sin(x)`, Solverz first produces the elementwise vector `cos(x)` — the same shape as the input — and only at Jacobian assembly time wraps that vector in `diag(...)` to express the diagonal structure of the matrix block. Equation residuals stay vector-valued throughout symbolic processing; the matrix wrapper only appears in the final Jacobian. Read the third column as "the matrix you would get if you materialised $\partial f / \partial x$ as a dense block". | Operation | Symbolic Form | Example | Jacobian block ($\partial f / \partial x$ for vector $x$) | |-----------|---------------|---------|---| | Matrix-vector multiply | `Mat_Mul(A, x)` | $Ax$ | $A$ | | Element-wise multiply | `x * y` | $x \odot y$ | $\operatorname{diag}(y)$ | | Addition | `x + y` | $x + y$ | $I$ | | Absolute value | `Abs(x)` | $\|x\|$ | $\operatorname{diag}(\operatorname{sign}(x))$ | | Exponential | `exp(x)` | $e^x$ | $\operatorname{diag}(e^x)$ | | Sine / Cosine | `sin(x)` / `cos(x)` | $\sin(x)$ | $\operatorname{diag}(\cos(x))$ | | Logarithm | `ln(x)` | $\ln(x)$ | $\operatorname{diag}(x^{-1})$ | | Power | `x**n` | $x^n$ | $\operatorname{diag}(n x^{n-1})$ | | Transpose | `transpose(A)` | $A^T$ | $A^T$ | | Diagonal | `Diag(x)` | $\operatorname{diag}(x)$ | $\operatorname{diag}(\cdot)$ | ```{note} `Mat_Mul` is the recommended interface for matrix-vector products in new Solverz models. It is *symbolic-first*: the symbolic engine sees each `Mat_Mul(A, x)` as a real operator, walks it through the matrix-calculus pipeline, and emits an analytical Jacobian block automatically. On the {term}`fast path` — when `A` is a plain sparse `dim=2` `Param` — `Mat_Mul` ends up calling exactly the same Numba helper (`SolCF.csc_matvec`) the legacy `MatVecMul` interface uses. The two have the same per-call cost; the difference is that `Mat_Mul` survives the matrix-calculus engine and `MatVecMul` does not. On the {term}`fallback path` (negated matrices, nested `Mat_Mul`, dense `dim=2` parameters, or other matrix expressions) `Mat_Mul` falls through to `scipy.sparse @` in the wrapper — see the [Layer 1](#layer-1-mat_mul-precomputes-as-njit-csc-matvec) discussion below. `MatVecMul` is still deprecated for new models; pick `Mat_Mul`. ``` ```{warning} **Sparse matrices MUST be immutable after modelling** All sparse matrix parameters declared with `Param(..., dim=2, sparse=True)` are assumed to be **immutable** (both sparsity pattern *and* numerical values) once modelling is complete and `model.create_instance()` has been called. Solverz relies on this assumption for maximum runtime efficiency: - At code-generation time, the sparsity pattern of every mutable-matrix Jacobian block is analysed *once*, the output data-array layout is fixed, and a pre- computed index mapping is baked into a Numba-compiled scatter-add loop. - At runtime, every `J_(y, p)` call assembles Jacobian block data by directly indexing into `Matrix.data` using the pre-computed row/column mappings — **no scipy.sparse matrix is constructed per call**. - For sparse-matrix parameters used in `Mat_Mul(A, x)`, the matrix-vector product `A @ x` is precomputed in the `F_` wrapper and passed as a dense vector to the `@njit`-compiled equation functions. If your application truly needs to update sparse-matrix values between solves (e.g., sweeping a parameter), rebuild the model with `model.create_instance()` each time. Mutating `p_["A"].data` after creation will silently produce wrong Jacobians **for matrices that flow through the vectorised Mat_Mul fast path** — scatter-add kernels cache `M.data` at analysis time and the cache becomes stale. Matrices that only land in the `MutableMatJacDataModule` fallback path (see {ref}`Fallback path `) re-evaluate the full expression on every `J_()` call and would reflect the mutation, but there is no way to guarantee a particular block won't be *promoted* to the fast path in a future release, so the rule is still: **do not mutate sparse matrix params post-build**. Dense parameters (`Param(..., dim=2, sparse=False)`) are fine to update via `mdl.p["name"] = new_value`; the restriction only applies to **sparse** dim=2 parameters. ``` All element-wise functions (exp, sin, cos, ln, Abs) can be composed with matrix operations: ```python from Solverz import Var, Param, Eqn, Model from Solverz.sym_algebra.functions import Mat_Mul, exp, sin m = Model() m.x = Var('x', [1, 2, 3]) m.A = Param('A', [[1, 0, 0], [0, 2, 0], [0, 0, 3]], dim=2) m.f = Eqn('f', exp(Mat_Mul(m.A, m.x))) # exp(A@x) ``` Solverz will automatically compute $\frac{\partial}{\partial x} e^{Ax} = \operatorname{diag}(e^{Ax}) \cdot A$. ## How It Works The matrix calculus engine is a four-stage pipeline: each expression is lifted into Einstein index notation, rebuilt as a computation DAG, differentiated forward through the DAG, and finally collapsed back to ordinary matrix/vector operations. We walk through each stage with a running example from power flow. ### 1. Einstein Index Notation Every tensor is labelled with one index per axis: - **Scalar**: no index (e.g., $\alpha$). Zeroth-order tensor, no axes. - **Vector** $x \in \mathbb{R}^n$: a single index $x_i$, where $i$ ranges over $\{1, \dots, n\}$. The index *names* the axis; the letter used is arbitrary, only its pattern of repetition matters. - **Matrix** $A \in \mathbb{R}^{m \times n}$: two indices $A_{ij}$. The first index $i$ labels rows, the second $j$ labels columns. A **repeated index inside a product** implies *summation* (Einstein summation convention, or *contraction*): ```{math} A_{ij}\, x_j \;\equiv\; \sum_{j=1}^{n} A_{ij}\, x_j \;=\; (Ax)_i. ``` Here $j$ is repeated (contracted — it disappears in the result), while $i$ is a **free index** that labels the resulting vector's axis. Conventionally we write the non-contracted indices on the left of the expression, so the indices of the left-hand side $(Ax)_i$ match the remaining free indices $i$ on the right. The benefit of this notation is that the *shape* of an expression is fully encoded in its index signature. Element-wise multiplication uses repeated indices *without* contraction: ```{math} (x \odot y)_i \;=\; x_i\, y_i, ``` and the outer product uses two different free indices: ```{math} (x\, y^\top)_{ij} \;=\; x_i\, y_j. ``` ### 2. Computation DAG Solverz parses each equation's symbolic expression into a **directed acyclic graph (DAG)**: - **Leaf nodes** — variables ($x$, $y$) and parameters ($A$, $B$, $G$). Each leaf carries its index signature ($x_i$, $A_{ij}$, etc.). - **Internal nodes** — operations: addition $+$, element-wise product $\odot$, matrix product $@$, element-wise unary function $f(\cdot)$. - **Root node** — the output expression whose Jacobian we want. For the power flow active-power sub-expression $e \odot (Ge - Bf)$: ``` *[i] ← element-wise multiply (free index i) / \ e[i] -[i] ← subtraction (free index i) / \ @[i] @[i] ← matrix-vector products, contraction on j / \ / \ G[ij] e[j] B[ij] f[j] ``` Each node remembers its free index set. The matrix product nodes `@[i]` each *contract* index $j$ between $G_{ij}$ and $e_j$, so the result has free index $i$. The subtraction and element-wise product keep $i$ as their only free index, so the whole expression has index signature $[i]$ — a vector. ### 3. Forward-Mode Automatic Differentiation To obtain $\partial f / \partial x$ where $f$ is the root and $x$ is some leaf variable, derivatives propagate from **leaves upward to the root** (pushforward, a.k.a. forward-mode AD). Every internal node computes its own Jacobian contribution and combines it with the Jacobians already accumulated at its children. Throughout this section, $A_{s_1}$ and $B_{s_2}$ denote the free indices of the two operands (before differentiation), $s_3$ denotes the free index of the result, and $s_4$ is the **derivative-direction index** — the axis along which we differentiate, equal to the free index of the variable $x$ we differentiate with respect to. Tilded tensors $\tilde A, \tilde B$ carry one *extra* axis (labelled $s_4$) compared to $A, B$: the derivative of a rank-$k$ tensor w.r.t. a rank-1 variable is a rank-$(k{+}1)$ tensor. 1. **Leaf variables.** Differentiating $x_j$ w.r.t. $x_{s_4}$ gives the Kronecker delta: ```{math} \tilde{x}_{j\,s_4} \;=\; \frac{\partial x_j}{\partial x_{s_4}} \;=\; \delta_{j s_4}. ``` All other variables (and all parameters) have derivative $0$ w.r.t. $x$. 2. **Multiplication** (product rule in index notation): ```{math} \tilde{C}_{s_3\,s_4} \;=\; B_{s_2}\cdot\tilde{A}_{s_1\,s_4} \;+\; A_{s_1}\cdot\tilde{B}_{s_2\,s_4} ``` where $A_{s_1} \cdot B_{s_2} = C_{s_3}$ is the original product (the specific index pattern $s_1, s_2 \to s_3$ depends on the product type — element-wise, matrix-vector, or matrix-matrix). The two summands encode "differentiate the first factor, keep the second" plus "keep the first factor, differentiate the second". 3. **Element-wise unary function** $f$ (e.g. $\exp$, $\sin$, $\ln$): since $f$ acts point-wise, the chain rule gives ```{math} \frac{\partial f(A_{s_1})}{\partial x_{s_4}} \;=\; f'(A_{s_1})\odot\tilde{A}_{s_1\,s_4}. ``` The derivative $f'$ is also computed element-wise and multiplied (Hadamard) with the child's Jacobian. 4. **Addition**. Linearity: ```{math} \tilde{(A+B)}_{s_3\,s_4} \;=\; \tilde{A}_{s_1\,s_4} + \tilde{B}_{s_2\,s_4}. ``` ### 4. TMul2Mul: Index Resolution After the forward pass, derivatives live as tensor expressions labelled with free indices. The **TMul2Mul** stage reads the index pattern of each product and emits the corresponding concrete matrix / vector operation. Each row of the table has the form $(s_1, s_2, s_3)$: the free indices of the left operand, right operand, and the result. | Index pattern $(s_1, s_2, s_3)$ | Matrix operation | Meaning | |---|---|---| | $(i,\; i,\; i)$ | $x \odot y$ | element-wise multiply — both operands share axis $i$, no contraction | | $(i,\; j,\; ij)$ | $x\,y^\top$ | outer product — two disjoint free indices survive into the result | | $(ij,\; j,\; i)$ | $A\,y$ | matrix-vector: index $j$ is shared (contracted), $i$ survives | | $(ij,\; jk,\; ik)$ | $A\,B$ | matrix-matrix: inner $j$ is contracted, $(i,k)$ survive | | $(i,\; ij,\; ij)$ | $\operatorname{diag}(x)\,A$ | left diagonal scaling — scale each row of $A$ by the matching entry of $x$ | | $(ij,\; j,\; ij)$ | $A\,\operatorname{diag}(y)$ | right diagonal scaling — scale each column of $A$ by the matching entry of $y$ | | $(i,\; ii,\; ii)$ | $\operatorname{diag}(x \odot y)$ | diagonal-from-element-wise — used when a product has matching repeated indices $(ii)$ on one operand | The key insight is that every well-formed index pattern maps to one concrete sparse-preserving operation. Patterns like $(ij, ij, ij)$ that don't match the table are either simplified earlier (e.g. $(ij, ij, ij) \to A \odot B$, element-wise matrix product) or emitted via a fallback that produces a dense matrix. ## Application Examples ### Example 1: Power Flow Equations The active power balance equation in power systems: ```{math} P = e \odot (Ge - Bf) + f \odot (Be + Gf) ``` ```python from Solverz.sym_algebra.symbols import iVar, Para from Solverz.sym_algebra.functions import Mat_Mul from Solverz.sym_algebra.matrix_calculus import TensorExpr e = iVar('e') f = iVar('f') G = Para('G', dim=2) B = Para('B', dim=2) P = e * (Mat_Mul(G, e) - Mat_Mul(B, f)) + f * (Mat_Mul(B, e) + Mat_Mul(G, f)) TP = TensorExpr(P) print(TP.diff(e)) # diag(-B@f + G@e) + diag(e)@G + diag(f)@B print(TP.diff(f)) # diag(B@e + G@f) + diag(e)@(-B) + diag(f)@G ``` ### Example 2: Nonlinear Matrix Equations ```python from Solverz.sym_algebra.functions import exp A = Para('A', dim=2) x = iVar('x') expr = exp(Mat_Mul(A, x)) TE = TensorExpr(expr) print(TE.diff(x)) # diag(exp(A@x))@A ``` ### Example 3: Heat Network Equations ```python mL = Para('mL', dim=2) K = Para('K') m = iVar('m') from Solverz.sym_algebra.functions import Abs expr = Mat_Mul(mL, K * Abs(m) * m) TE = TensorExpr(expr) print(TE.diff(m)) # mL@(diag(K*Abs(m)) + diag(K*m*Sign(m))) ``` ## How numerical code is generated and optimized When `module_printer` compiles an equation system that contains `Mat_Mul`, it applies a **three-layer optimisation strategy** to make the generated `F_`/`J_` functions run as fast as possible. All three layers operate on the same insight: *at modelling time we already know what the equations and their Jacobians look like* — so every computation that does not depend on the current iterate $y$ can be lifted out of the hot loop and paid for once, at code generation or module load. ```{warning} Everything below assumes **sparse matrix parameters are immutable** after `model.create_instance()`. See the warning earlier in this document. Every layer described here bakes in the matrix's sparsity pattern or its `.data` values as a constant; mutating them in-place will silently corrupt results without raising any error. ``` ### Layer 1 — `Mat_Mul` precomputes as ``@njit`` CSC matvec The code printer walks every residual RHS looking for `Mat_Mul(A, v)` sub-trees, where `A` is a sparse parameter and `v` is any vector-valued expression (a variable, a slice, or a larger expression). It performs **three transformations at code gen time**: 1. **Hoist.** Each `Mat_Mul(A, v)` is replaced by a fresh placeholder variable `_sz_mm_N` (where `N` is a monotonically increasing integer unique across the whole system). The `_sz_mm_` prefix is reserved by Solverz — see {ref}`Restrictions and reserved names ` below. 2. **Deduplicate.** Two `Mat_Mul` sub-trees that are structurally equal (same matrix, same operand expression after hoisting) are collapsed into a single placeholder. The power-flow example below computes `G_nr @ e` once even though both the P-balance and Q-balance residuals reference it. 3. **Classify each placeholder.** The classifier (`_classify_matmul_placeholders`, with the inner shape predicate `_shape_is_fast`, in `Solverz/code_printer/python/module/module_printer.py`) decides whether the precompute can go on the {term}`fast path` (matrix is a plain sparse `dim=2` `Para` **and** the operand is a clean vector expression) or the {term}`fallback path` (anything else: `-Para`, nested `Mat_Mul`, dense `dim=2` matrices, matrix expressions, or operands that still contain unresolved `Mat_Mul` / sparse `dim=2` `Para` references). A fast-path candidate is also **demoted** to the fallback path if another fallback placeholder's `matrix_arg` or `operand_arg` references it as a free symbol — otherwise the wrapper would emit a scipy {term}`SpMV` like `_sz_mm_1 = (-A) @ _sz_mm_0` whose `_sz_mm_0` operand was never materialised. The classifier iterates this demotion to a fixed point so chains of nested `Mat_Mul`s land on a topologically valid emission order. **Fast path** — the matvec is computed **inside `inner_F`** via the existing `SolCF.csc_matvec` Numba helper (`Solverz/num_api/custom_function.py`). Each sparse dim=2 `Param` is already decomposed at model-build time into its CSC flat arrays (`A_data` / `A_indices` / `A_indptr` / `A_shape0`) by `Solverz/model/basic.py`; these flat arrays flow through `param_list` to `inner_F` as ordinary numpy arrays, and `inner_F` issues a `SolCF.csc_matvec(A_data, A_indices, A_indptr, A_shape0, operand)` call per placeholder. The wrapper `F_()` does not load the sparse matrix object `A` at all — the whole matvec lives in Numba land and pays zero scipy dispatch cost: ```python def F_(y_, p_): e = y_[0:29] f = y_[29:58] G_nr_data = p_["G_nr_data"] G_nr_indices = p_["G_nr_indices"] G_nr_indptr = p_["G_nr_indptr"] G_nr_shape0 = p_["G_nr_shape0"] B_nr_data = p_["B_nr_data"] ... # dense vector params follow return inner_F(_F_, e, f, p_ref, q_ref, G_nr_data, G_nr_indices, G_nr_indptr, G_nr_shape0, B_nr_data, B_nr_indices, B_nr_indptr, B_nr_shape0, ...) @njit(cache=True) def inner_F(_F_, e, f, p_ref, q_ref, G_nr_data, G_nr_indices, G_nr_indptr, G_nr_shape0, B_nr_data, B_nr_indices, B_nr_indptr, B_nr_shape0, ...): # Fast-path matvecs: all inside @njit, no scipy dispatch. _sz_mm_0 = SolCF.csc_matvec(G_nr_data, G_nr_indices, G_nr_indptr, G_nr_shape0, e) _sz_mm_1 = SolCF.csc_matvec(B_nr_data, B_nr_indices, B_nr_indptr, B_nr_shape0, f) _sz_mm_2 = SolCF.csc_matvec(B_nr_data, B_nr_indices, B_nr_indptr, B_nr_shape0, e) _sz_mm_3 = SolCF.csc_matvec(G_nr_data, G_nr_indices, G_nr_indptr, G_nr_shape0, f) _F_[0:29] = inner_F0(e, f, p_ref, q_ref, _sz_mm_0, _sz_mm_1, _sz_mm_2, _sz_mm_3) _F_[29:53] = inner_F1(...) return _F_ @njit(cache=True) def inner_F0(e, f, p_ref, q_ref, _sz_mm_0, _sz_mm_1, _sz_mm_2, _sz_mm_3): return e * (p_ref - _sz_mm_1 + _sz_mm_0) + f * (q_ref + _sz_mm_2 + _sz_mm_3) - Pinj ``` Why this matters: scipy.sparse matrix-vector products are implemented in C and fast per se, but each call crosses Python → scipy dispatch → C → back, which on a 58×58 matrix costs ~1.5 µs of pure dispatch overhead — 10× the actual arithmetic. Eight such SpMVs in a hot F made `Mat_Mul` catastrophically slow on small networks (`case30` hot F was 10× worse than the for-loop form). Moving the matvec into `SolCF.csc_matvec` inside `inner_F` eliminates all Python dispatch: the helper is `@njit(cache=True)` and `inner_F` calls it as an intra-Numba function call, which the LLVM inline pass typically flattens into the caller's body. On `case30`, this drops hot F from ≈14 µs to ≈3 µs (~4.4× speedup) without any other code change. **Fallback path** — when the matrix operand is *not* a plain sparse `Para` (e.g. `-G_nr`, `Mat_Mul(A, B)`, or a dense `dim=2` parameter), the old path is kept: `F_()` emits `_sz_mm_N = matrix_expr @ operand` in the wrapper (using scipy.sparse or numpy `@`) and passes `_sz_mm_N` as an extra dense argument to `inner_F`. These placeholders do not benefit from the fast path and pay the scipy dispatch cost per call. Dense `dim=2` parameters fire a one-shot `UserWarning` at `FormJac` time — see {ref}`Restrictions and reserved names `. ```{note} The fast path reuses infrastructure that predates the `Mat_Mul` interface: the same `SolCF.csc_matvec` helper and the same `_data` / `_indices` / `_indptr` / `_shape0` Param decomposition were introduced for the legacy `MatVecMul` operator. Before 0.8.1 fast path, those fields were still emitted into `inner_F`'s argument list but never referenced — they are now the conduit for the Layer 1 matvec. Legacy `MatVecMul` users already benefited from this pipeline; from 0.8.1 `Mat_Mul` users do as well, without any model-level changes. ``` (when-to-use-mat_mul)= #### When to use `Mat_Mul` API-level rules of thumb. The detailed case study with full benchmark numbers, decision matrix, and the case30-scale hot-F regression analysis lives in the [power-flow chapter of the Solverz Cookbook](https://cookbook.solverz.org/latest/ae/pf/pf.html#performance-comparison-mat_mul-vs-for-loop) — this section just states the API guidance. **Use `Mat_Mul`** when any of the following hold: - You want compact, paper-faithful equations. `Mat_Mul(G, e)` replaces `nb` scalar `Eqn` definitions and lets the matrix calculus engine derive the Jacobian automatically. - You iterate on the model. Every compile / build phase (`Model` construction, `FormJac`, `made_numerical`, `module_printer.render`, and the first-time Numba {term}`cold compile`) is dramatically faster under `Mat_Mul` than under the equivalent element-wise for-loop form. - All matrix operands are plain `Param(..., dim=2, sparse=True)` — this unlocks the Layer 1 {term}`fast path` described above. The narrow case where the for-loop form still wins (very small network, F-dominated hot loop, single compile) and the underlying case30 numbers are documented in the Cookbook chapter linked above. **Matrix shapes that fall out of the fast path** and therefore pay the scipy {term}`SpMV` dispatch cost on every `F_` call (~1.5 µs per SpMV — consequential at small network scale, negligible elsewhere): - **Negated matrices** — `Mat_Mul(-G, x)`. `G` is a plain `Para` but `-G` is a `Mul(-1, G)` expression the classifier doesn't recognise. Workaround: write `-Mat_Mul(G, x)` instead of `Mat_Mul(-G, x)` so the sign flows through the outer expression and `G` stays on the fast path. - **Dense `dim=2` params** — `Param(A, dim=2, sparse=False)`. `FormJac` fires a one-shot `UserWarning` for these; they fall through to the `MutableMatJacDataModule` path. Convert to sparse with `csc_array(A)` and declare `sparse=True`. - **Nested `Mat_Mul`** — `Mat_Mul(A, Mat_Mul(B, x))`. The classifier's dependency-aware demotion takes care of the topological ordering, but only the inner `Mat_Mul` ends up on the fast path when both matrices are plain Paras; the outer pays the wrapper SpMV cost. - **Three-or-more-arg `Mat_Mul`** — `Mat_Mul(A, B, x)`. The operand of the placeholder is itself a `Mat_Mul`, so the classifier rejects it from the fast path. - **Matrix expressions** — `Mat_Mul(A + B, x)` or similar. The classifier recognises only a bare `Para` as the matrix operand. Workaround: materialise the combined matrix as a single `Param(A + B, dim=2, sparse=True)` outside the equation. ### Layer 2 — Mutable matrix Jacobian blocks The Jacobian is handled by a **separate, symbolic decomposition** because its blocks are matrices (not vectors) whose structure depends on the full expression. Consider the power-flow block ```{math} \frac{\partial P}{\partial e} \;=\; \operatorname{diag}\!\bigl(G\,e - B\,f + p_\text{ref}\bigr) \;+\; \operatorname{diag}(e)\,G \;+\; \operatorname{diag}(f)\,B. ``` All three terms are *mutable* — they depend on $e$ and $f$ and must be re-evaluated on every `J_(y, p)` call. But their **sparsity pattern is constant**: the union of the diagonal, the pattern of $G$, and the pattern of $B$ — all fixed at modelling time. Solverz exploits this in a pattern-match compiler: 1. **Parse the block into typed terms.** The symbolic analyser walks the SymPy expression tree (implemented in `Solverz/code_printer/python/module/mutable_mat_analyzer.py`). Each additive term is classified as one of: | Shape | Generated code contribution | |---|---| | `Diag(u_expr)` (diagonal term) | `data[out_pos] += u[src_idx]` | | `Mat_Mul(Diag(v), M)` (row-scale) | `data[out_pos] += v[src_row] * M_data[k]` | | `Mat_Mul(M, Diag(v))` (col-scale) | `data[out_pos] += v[src_col] * M_data[k]` | **A term is deferred to the fallback path when any of the following is true**: - it has a non-unit numeric coefficient other than `±1` (e.g. `3 * Diag(x)`) — the analyser only unwraps `±1` via `_extract_sign_and_core`; - it is `Mat_Mul(Diag(v), X)` or `Mat_Mul(X, Diag(v))` where ``X`` is **not** a plain `Para` (or `-Para`) — for instance a matrix expression such as `A @ B`, or a user-constructed product that the matrix calculus engine hasn't simplified down to a single leaf matrix; - it is a ``Mat_Mul`` with three or more arguments that the matrix-calculus engine didn't fold into a nested binary form; - it is any other shape the classifier doesn't recognise (transpose of a mutable matrix, element-wise matrix product, a non-sparse matrix leaf, etc.). Unrecognised terms are deferred to the {ref}`fallback path `, which is correct but slower: it evaluates the full sub-expression via scipy.sparse on every `J_()` call and extracts values by fancy indexing. 2. **Precompute the output sparsity pattern (once).** At model initialisation (inside `FormJac`, `Solverz/equation/equations.py`), every mutable-matrix block is evaluated *one* time with all variables perturbed to distinct non-zero values and all `Diag` nodes substituted with `SpDiag` (which emits `sps.diags` instead of the dense `np.diagflat`). This perturbed evaluation is the entire reason we go to the trouble of the `SpDiag` substitution: using non-zero variable values *guarantees* that no term accidentally collapses to zero at $y_0$ (which would happen in a flat-start power flow where $f = 0$ and `sps.diags(f)@B` is empty). The resulting sparse matrix gives us the **union of all terms' patterns** — the structural upper bound that the runtime block must fit into. 3. **Precompute the index mapping arrays (once).** For each recognised term, the analyser builds three numpy arrays: - `out_pos[k]` — the index into the block's output `data` array where the $k$-th nonzero contribution of this term lands. - `src_idx[k]` — for diagonal terms, the row $i$ in the inner vector $u$ to read from; for row-scale terms, the row of $M$'s $k$-th nonzero; for col-scale terms, the column. - `mat_data` (row/col-scale terms only) — a direct reference to `M.data` (which is why the matrix must be immutable). These arrays are stored inside `setting`, a per-module dict that also holds things like the CSC `row`, `col`, and initial `data` for the full Jacobian. 4. **Generate a dedicated Numba kernel per block.** For each mutable matrix block the code printer emits a new top-level function named `_mut_block_N` (where `N` is the block index). Its signature takes the diag inner vectors, the base variables needed by any row/col-scale term, and each term's three mapping arrays. All per-block helper arrays use the reserved `_sz_mb_{N}_` prefix so different blocks' mappings never collide. The body is a pure- Python / Numba scatter-add loop: ```python # Module-level (loaded once from setting at import time): _sz_mb_0_diag_out_0 = setting["_sz_mb_0_diag_out_0"] _sz_mb_0_diag_src_0 = setting["_sz_mb_0_diag_src_0"] _sz_mb_0_rs_out_0 = setting["_sz_mb_0_rs_out_0"] _sz_mb_0_rs_src_0 = setting["_sz_mb_0_rs_src_0"] _sz_mb_0_rs_dat_0 = setting["_sz_mb_0_rs_dat_0"] # frozen G_nr.data ... @njit(cache=True) def _mut_block_0(_sz_mb_0_u0, e, f, _sz_mb_0_diag_out_0, _sz_mb_0_diag_src_0, _sz_mb_0_rs_out_0, _sz_mb_0_rs_src_0, _sz_mb_0_rs_dat_0, _sz_mb_0_rs_out_1, _sz_mb_0_rs_src_1, _sz_mb_0_rs_dat_1): data = zeros(107) # Diagonal term: diag(G_nr@e - B_nr@f + p_ref) for i in range(_sz_mb_0_diag_out_0.shape[0]): data[_sz_mb_0_diag_out_0[i]] += _sz_mb_0_u0[_sz_mb_0_diag_src_0[i]] # Row-scale term: diag(e) @ G_nr for i in range(_sz_mb_0_rs_out_0.shape[0]): data[_sz_mb_0_rs_out_0[i]] += e[_sz_mb_0_rs_src_0[i]] * _sz_mb_0_rs_dat_0[i] # Row-scale term: diag(f) @ B_nr for i in range(_sz_mb_0_rs_out_1.shape[0]): data[_sz_mb_0_rs_out_1[i]] += f[_sz_mb_0_rs_src_1[i]] * _sz_mb_0_rs_dat_1[i] return data ``` ```{note} Every scatter-add loop uses `+=`, not `=`, including the diagonal terms. This matters when a block contains **multiple independent `Diag(...)` additive terms** whose output positions coincide — for example the Jacobian of `x*(A@x) + x*(B@x)` produces `diag(A@x)` and `diag(B@x)`, both landing on the same `(i,i)` positions. The `+=` accumulates their contributions correctly; an earlier implementation used `=` and silently dropped one of the two terms. ``` 5. **Wire the kernel up in the `J_` wrapper.** For each block the wrapper emits: (a) one line per diag term that materialises its inner vector using scipy.sparse (e.g. `_sz_mb_0_u0 = p_ref - (B_nr@f) + (G_nr@e)`); (b) a call to `_mut_block_N(...)` passing the inner vectors, base variables, and mapping arrays; (c) an assignment of the returned `data` array into the appropriate slice of the overall Jacobian `_data_`. Crucially, each block's **scipy work is capped at one SpMV per diag term** — nothing else in the block touches sparse matrices at runtime. The data for row/col-scale terms comes straight from the pre-baked `_sz_mb_{N}_rs_dat_{k}` arrays. The net effect on each `J_(y, p)` call is that Jacobian assembly cost is dominated by (i) the constant `inner_J` call for element-wise blocks and (ii) a handful of {term}`SpMV`s for the diag inner vectors. The actual mutable-matrix blocks themselves are assembled in $\mathcal{O}(\text{nnz})$ vectorised Numba loops — no sparse matrix allocation, no sparsity reconstruction, no format conversion. This benefit applies to **every solver that calls `J_(y, p)`** — algebraic-equation Newton solvers (`nr_method`, `sicnm`), DAE integrators (which call `J_` at every implicit substep), FDAE solvers, and any future solver that consumes Solverz's Jacobian contract. The cost model is "per `J_` call", not "per Newton step". #### Two paths: scatter-add (fast) and fancy indexing (fallback) Solverz keeps **both** assembly paths in every generated module and picks per Jacobian block at code-gen time. The {term}`fast path` is the typed scatter-add described above; the {term}`fallback path` exists for blocks whose term structure the symbolic analyser cannot classify (see [Layer 2 step 1](#parse-the-block-into-typed-terms)) and is described in the next subsection. There is no on/off switch — both paths co-exist, and the runtime always uses whichever the classifier picked. The fallback path is what an earlier iteration of the pipeline used *everywhere*: build the sparse matrix with `sps.diags` and friends, then read back the nonzero values at the precomputed COO positions via {term}`fancy indexing` (`expr.tocsr()[[rows], [cols]]`). It is correct and elegant but slow — scipy has to construct several intermediate sparse matrices per call, sum them (with explicit-zero elimination), and then perform a linear-scan lookup for each output position. On a 30-bus power flow case, each `J_(y, p)` call took ≈ 280 µs in that pre-0.8 architecture. The scatter-add Numba path introduced in 0.8.0 drops that to ≈ 50 µs while producing bit-identical results. If you want to know whether a particular block in your model is on the fast path or the fallback, profile with `pytest --durations=10` or `cProfile` and look for `MutableMatJacDataModule` in the traces — its presence in the hot `J_()` line indicates that block is on the fallback. ### Layer 3 — Selective `@njit` All generated inner loops — `inner_F`, every per-equation `inner_F{N}`, `inner_J`, every per-block `inner_J{N}`, and every `_mut_block_{N}` — carry `@njit(cache=True)`. Only the `F_` / `J_` wrappers remain pure Python; they host any fallback-path {term}`SpMV` precomputes, the module-level parameter unpacks, and the final `sps.coo_array((data, (row, col)), shape).tocsc()` that packages the Jacobian back into a scipy object for the downstream solver (algebraic Newton-style or DAE/ODE integrator). The selectivity matters: Numba absolutely cannot digest a `scipy.sparse.csc_array` argument, which is why the wrappers exist at all. Anything that *can* run inside Numba is pushed there, at zero syntactic cost to the user — they just write `Mat_Mul(A, x)` and `module_printer` figures out the rest. An important corollary for non-`Mat_Mul` models: if an equation system contains **no** `Mat_Mul` at all, the entire precompute / mutable- matrix machinery is disabled and `F_` / `J_` collapse back to the original pre-0.8 architecture (direct `@njit` on `inner_F` receiving only dense vectors and scalar params). Performance on pure element-wise models is therefore unchanged by the Mat_Mul rewrite. (fallback-path)= ### Fallback path — scipy.sparse + fancy indexing When a Jacobian block contains a term shape the symbolic analyser does not recognise — for example an unusual nested `Mat_Mul`, a non-parameter matrix, a matrix-valued non-linear expression, or a term with a non-`±1` coefficient — that single block is handled by the {term}`fallback path`. The wrapper rebuilds the block with `scipy.sparse` and reads the requested nonzero values back via {term}`fancy indexing`: ```python data[start:stop] = asarray( (sps.diags(u) + sps.diags(v)@M + ...) .tocsr()[[rows], [cols]] ).ravel() ``` It is $\mathcal{O}(\text{nnz} \log n)$ per call instead of the $\mathcal{O}(\text{nnz})$ scatter-add path, but it never misses a derivative term. **The two paths co-exist** in every generated module: only the single affected block falls back; the rest of the Jacobian still uses the fast scatter-add path. There is no fallback-only mode and no opt-in switch — Solverz simply picks per block based on what the classifier recognised at code-gen time. ### Performance The full case30 power-flow benchmark — every phase from `Model` construction through cold compile to {term}`hot F` / hot J — lives in the [Solverz Cookbook power-flow chapter](https://cookbook.solverz.org/latest/ae/pf/pf.html#performance-comparison-mat_mul-vs-for-loop). This subsection only carries one extra micro-benchmark (DHS hydraulic on Barry Island) that exercises a Jacobian shape the power-flow case does not. **Benchmark environment.** All numbers below were measured under: - **Hardware:** 2025 MacBook Air, Apple M4, AC power, no thermal throttling observed during the runs. - **OS:** macOS 26.4 (build 25E246). - **Python:** 3.11.13. - **Library versions:** `numpy==2.3.3`, `scipy==1.16.0`, `numba==0.65.0`, `sympy==1.13.3`, `Solverz==0.8.1` (the post-`csc_matvec` {term}`fast path`). - **Methodology:** 10 warm-up calls (to bake the Numba caches and prime the CPU branch predictor) followed by 5000–20000 timed iterations, median of three repeats. The {term}`cold compile` measurement (Cookbook only) is run in a fresh Python subprocess with `__pycache__` and Numba `.nbi` / `.nbc` caches wiped beforehand. The same per-call timing helper is used in the Cookbook's `bench_pf_matmul_vs_polar.py`; the per-`J_` numbers below were captured by importing the rendered DHS module and calling `mdl.J(y, mdl.p)` in a tight loop. **DHS hydraulic subproblem on `BarryIsland`** (35 unknowns, 1 loop, 1 mutable-matrix block — the loop pressure Jacobian contains two col-scale terms $L\,\operatorname{diag}(K \odot |m|)$ and $L\,\operatorname{diag}(K \odot m \odot \operatorname{sign}(m))$ whose scaling vectors are non-trivial expressions of the variable): | Pipeline | `J_()` per call | |---|---| | Element-wise inline (one scalar `Eqn` per node + loop) | ≈ 103 µs | | `Mat_Mul` inline (lambdify) | ≈ 45 µs | | **`Mat_Mul` module printer, vectorized @njit** | **≈ 28 µs** | - **Element-wise vs `Mat_Mul` inline.** Writing the same hydraulic system as `V @ m - m_inj = 0` plus `L @ (K m |m|) = 0` (two vector equations total) instead of 35 scalar equations cuts inline time by more than half, because {term}`lambdify` has far fewer expression nodes to walk on every call. - **Inline vs module printer.** Even on a small system (only 79 nonzeros), the module printer wins once every mutable-matrix term reaches the vectorised {term}`scatter-add` path. Here the loop- pressure Jacobian has two col-scale terms whose scaling vectors (`K * |m|` and `K * m * sign(m)`) are computed once per `J_(y, p)` call as dense numpy expressions in the wrapper, then handed to a Numba kernel that does two 10-iteration scatter-add loops. The only remaining overhead is the final `coo_array(...).tocsc()` packing. Rule of thumb: **the module printer wins whenever the mutable-matrix blocks successfully match the fast path**; it can lose to {term}`inline mode` for models that fall through to the {term}`fallback path`, or models small enough that the final COO-to-CSC packing dominates. Profile with `%timeit mdl.J(y, mdl.p)` on a representative iterate if in doubt, and check the generated `num_func.py` for any lingering `MutableMatJacDataModule` (= fallback) calls. The scatter-add path is strictly $\mathcal{O}(\text{nnz})$ in the block contents, while the old fancy-indexing path was dominated by the cost of constructing the intermediate sparse matrices — so the speed-up scales with nnz and with the number of mutable-matrix blocks. For models with many rows and few blocks the advantage is small; for models with many blocks per variable (such as power flow, with distinct $\partial/\partial e$ and $\partial/\partial f$ blocks for both $P$ and $Q$ balances) the advantage compounds. (restrictions)= ## Restrictions and reserved names `Mat_Mul` imposes a few hard restrictions that are enforced at model build time. Each one prevents a class of silent-wrong-answer bugs that would otherwise slip past the generated Jacobian. ### Time-varying sparse `dim=2` parameters are rejected at construction Solverz does not support sparse `dim=2` `Param`s whose values change at runtime. A declaration like: ```python from scipy.sparse import csc_array m.K = Param('K', csc_array([[1, 0], [0, 1]]), dim=2, sparse=True, triggerable=True, trigger_var=['x'], trigger_fun=lambda x: csc_array([[2*x[0], 0], [0, 2*x[1]]])) # ⇒ NotImplementedError: Parameter 'K': a sparse 2-D ``Param`` cannot # be declared ``triggerable=True``. Time-varying sparse matrices are # unsupported because Solverz's code generation caches the matrix's # CSC fields at model-build time; a runtime trigger update would # silently be ignored. ``` and the equivalent `TimeSeriesParam`: ```python m.G = TimeSeriesParam('G', v_series=[...], time_series=[...], value=np.eye(3), dim=2, sparse=True) # ⇒ NotImplementedError: Parameter 'G': a sparse 2-D # ``TimeSeriesParam`` is not supported. ``` both raise **at the point of construction**, not deep inside `FormJac`, so the error points at the exact line where the user wrote the offending declaration. **Why:** every Solverz code path that consumes a sparse `dim=2` `Param` caches its CSC decomposition at model-build time: - The legacy `MatVecMul` code path decomposes every sparse `dim=2` `Param` into flat arrays (`_data`, `_indices`, `_indptr`, `_shape0`) in `Solverz/model/basic.py:56`, which are stored in `p_` and passed to `inner_F` as read-only numpy arrays. - The 0.8.1 `Mat_Mul` `SolCF.csc_matvec` fast path reuses those same CSC flats inside `inner_F` to do the matvec in `@njit` land. - The mutable-matrix Jacobian scatter-add kernel caches the matrix's `.data` array directly as a `_sz_mb_{N}_rs_dat_{k}` setting entry loaded at module import time. None of these caches are invalidated when `p_["K"].trigger_fun` fires or `p_["G"].get_v_t(t)` returns a fresh matrix. The runtime update simply gets lost, the Jacobian keeps using the initial values, and the downstream solver (Newton-Raphson, DAE integrator, or any other consumer of `J_`) either diverges or converges to the wrong solution — silently. Historically the check was narrow (only equations containing `Mat_Mul` were rejected, and the check ran at `FormJac` time). In the 0.8.1 hotfix the policy broadened to cover every sparse `dim=2` time-varying `Param`, regardless of where or whether it is used. The check now lives in `ParamBase.__init__` and `TimeSeriesParam.__init__`, with an unconditional backstop in `Equations._check_no_timevar_sparse_matrices` that runs on every `FormJac` call in case a tainted `Param` was constructed via `__new__` + attribute assignment (which bypasses the `__init__` check). **Workarounds** — all of these are supported: - **Triggerable / time-series *vectors* next to a `Mat_Mul`.** Scalar or 1-D time-varying parameters are fine; only the `dim=2` + `sparse=True` combination is rejected. Write `m.K * m.x + Mat_Mul(m.A, m.x) - m.b` with `K` a triggerable 1-D vector and `A` a plain sparse 2-D matrix — the wrapper rebuilds `K` on every solver step via the `trigger_var` mechanism and the `Mat_Mul(A, x)` fast path uses the unchanged `A`. - **Element-wise rewrite.** If you genuinely need the matrix itself to evolve, write the residual out one scalar `Eqn` per row, with the per-row coefficients as 1-D `Param`s or `TimeSeriesParam`s. The element-wise form has no CSC caching — every call re-evaluates the full trigonometric / algebraic expression via pure numpy / numba. - **Dense `dim=2` parameters (`sparse=False`).** The `MutableMatJacDataModule` fallback path used for dense matrices re-evaluates the entire block expression via scipy/numpy on every `J_()` call, so `get_v_t(t)` updates DO propagate. Dense parameters do not get the `csc_matvec` fast path, but they're a correct (if slower) way to carry a time-varying matrix through the solver. A `UserWarning` still fires once per such parameter to flag the performance cost. ### Reserved symbol prefixes User symbols (`Var`, `Param`, `iVar`, `Para`, ...) are rejected at construction time if their name starts with any of: | Prefix | Used for | |---|---| | `_sz_mm_` | Mat_Mul precompute placeholders in the `F_`/`J_` wrapper (e.g. `_sz_mm_0 = G @ e`). | | `_sz_mb_` | Mutable-matrix Jacobian block helpers (diag inner vectors, row/col-scale scaling vectors, mapping arrays). | ```python m.x = Var('_sz_mm_0', [1.0]) # ⇒ ValueError: Symbol name '_sz_mm_0' starts with Solverz-reserved # internal prefix '_sz_mm_'. This prefix is used by the code # generator for Mat_Mul / mutable-matrix Jacobian helper names and # cannot be used for user variables or parameters. ``` **Why:** the code printer emits an unbounded family of helper identifiers of the form `` and needs to know that none of them will collide with a user symbol. The prefix ban is the simplest way to guarantee that — users pick any name that does *not* start with `_sz_mm_` / `_sz_mb_`, and the code printer is free to use any name that *does*. ### Dense `dim=2` parameters emit a performance warning A parameter declared with `dim=2, sparse=False` that appears inside a `Mat_Mul` is legal — the fallback path handles it correctly — but it forfeits every optimisation described above: - no sparse pattern precomputation (the "output sparsity pattern" step produces a fully dense block), - no `_sz_mb_{N}_rs_dat_k` scatter-add (the block falls through to the `MutableMatJacDataModule` fancy-indexing path), - each Jacobian call allocates and populates a dense intermediate. Solverz emits a one-shot `UserWarning` at `FormJac` time, **once per offending parameter**, to flag the performance cost: ```python m.A = Param('A', np.array([[2.0, 1.0], [1.0, 3.0]]), dim=2, sparse=False) m.eqn = Eqn('f', m.x * Mat_Mul(m.A, m.x) - m.b) smdl, y0 = m.create_instance() smdl.FormJac(y0) # UserWarning: Parameter 'A' is a dense 2-D ``Param(..., dim=2, # sparse=False)`` used inside ``Mat_Mul``. Dense matrices bypass # the vectorised mutable-matrix Jacobian fast path and fall back # to a slower scipy/ndarray indexing path. Declare 'A' with # ``sparse=True`` for substantially better performance. ``` **How to fix:** wrap the value in a scipy.sparse container and declare `sparse=True`: ```python from scipy.sparse import csc_array m.A = Param('A', csc_array(np.array([[2.0, 1.0], [1.0, 3.0]])), dim=2, sparse=True) ``` On a dense 2×2 matrix the difference is invisible, but for anything larger than a few dozen rows the sparse path is substantially faster and (more importantly) keeps the Jacobian assembly off the fallback path. ## API Reference ```{eval-rst} .. autofunction:: Solverz.sym_algebra.matrix_calculus.MixedEquationDiff .. autoclass:: Solverz.sym_algebra.matrix_calculus.TensorExpr :members: diff, visualize ```