(matrix_calculus)= # Matrix-Vector Calculus Solverz provides automatic symbolic differentiation for mixed matrix-vector expressions. When an equation contains matrix-vector products (`Mat_Mul`), Solverz automatically computes analytical Jacobian blocks using tensor calculus with Einstein index notation. This module is based on the approach described in: > Laue, Mitterreiter, Giesen. *A Simple and Efficient Tensor Calculus.* > Proceedings of the AAAI Conference on Artificial Intelligence, 2020. See also: [matrixcalculus.org](https://www.matrixcalculus.org/) ## Supported Operations | Operation | Symbolic Form | Example | Derivative | |-----------|---------------|---------|------------| | Matrix-vector multiply | `Mat_Mul(A, x)` | $Ax$ | $A$ | | Element-wise multiply | `x * y` | $x \odot y$ | $\operatorname{diag}(y)$ | | Addition | `x + y` | $x + y$ | $I$ | | Absolute value | `Abs(x)` | $\|x\|$ | $\operatorname{diag}(\operatorname{sign}(x))$ | | Exponential | `exp(x)` | $e^x$ | $\operatorname{diag}(e^x)$ | | Sine / Cosine | `sin(x)` / `cos(x)` | $\sin(x)$ | $\operatorname{diag}(\cos(x))$ | | Logarithm | `ln(x)` | $\ln(x)$ | $\operatorname{diag}(x^{-1})$ | | Power | `x**n` | $x^n$ | $\operatorname{diag}(n x^{n-1})$ | | Transpose | `transpose(A)` | $A^T$ | $A^T$ | | Diagonal | `Diag(x)` | $\operatorname{diag}(x)$ | $\operatorname{diag}(\cdot)$ | ```{note} `Mat_Mul` is the recommended interface for matrix-vector products. The legacy `MatVecMul` function (which decomposes sparse matrices into CSC components for Numba) is deprecated. `Mat_Mul` uses `scipy.sparse` directly, which is both faster and supports full matrix calculus. ``` ```{warning} **Sparse matrices MUST be immutable after modelling** All sparse matrix parameters declared with `Param(..., dim=2, sparse=True)` are assumed to be **immutable** (both sparsity pattern *and* numerical values) once modelling is complete and `model.create_instance()` has been called. Solverz relies on this assumption for maximum runtime efficiency: - At code-generation time, the sparsity pattern of every mutable-matrix Jacobian block is analysed *once*, the output data-array layout is fixed, and a pre- computed index mapping is baked into a Numba-compiled scatter-add loop. - At runtime, each Newton step assembles Jacobian block data by directly indexing into `Matrix.data` using the pre-computed row/column mappings — **no scipy.sparse matrix is constructed per iteration**. - For sparse-matrix parameters used in `Mat_Mul(A, x)`, the matrix-vector product `A @ x` is precomputed in the `F_` wrapper and passed as a dense vector to the `@njit`-compiled equation functions. If your application truly needs to update sparse-matrix values between solves (e.g., sweeping a parameter), rebuild the model with `model.create_instance()` each time. Mutating `p_["A"].data` after creation will silently produce wrong Jacobians **for matrices that flow through the vectorised Mat_Mul fast path** — scatter-add kernels cache `M.data` at analysis time and the cache becomes stale. Matrices that only land in the `MutableMatJacDataModule` fallback path (see {ref}`Fallback path `) re-evaluate the full expression on every `J_()` call and would reflect the mutation, but there is no way to guarantee a particular block won't be *promoted* to the fast path in a future release, so the rule is still: **do not mutate sparse matrix params post-build**. Dense parameters (`Param(..., dim=2, sparse=False)`) are fine to update via `mdl.p["name"] = new_value`; the restriction only applies to **sparse** dim=2 parameters. ``` All element-wise functions (exp, sin, cos, ln, Abs) can be composed with matrix operations: ```python from Solverz import Var, Param, Eqn, Model from Solverz.sym_algebra.functions import Mat_Mul, exp, sin m = Model() m.x = Var('x', [1, 2, 3]) m.A = Param('A', [[1, 0, 0], [0, 2, 0], [0, 0, 3]], dim=2) m.f = Eqn('f', exp(Mat_Mul(m.A, m.x))) # exp(A@x) ``` Solverz will automatically compute $\frac{\partial}{\partial x} e^{Ax} = \operatorname{diag}(e^{Ax}) \cdot A$. ## How It Works The matrix calculus engine is a four-stage pipeline: each expression is lifted into Einstein index notation, rebuilt as a computation DAG, differentiated forward through the DAG, and finally collapsed back to ordinary matrix/vector operations. We walk through each stage with a running example from power flow. ### 1. Einstein Index Notation Every tensor is labelled with one index per axis: - **Scalar**: no index (e.g., $\alpha$). Zeroth-order tensor, no axes. - **Vector** $x \in \mathbb{R}^n$: a single index $x_i$, where $i$ ranges over $\{1, \dots, n\}$. The index *names* the axis; the letter used is arbitrary, only its pattern of repetition matters. - **Matrix** $A \in \mathbb{R}^{m \times n}$: two indices $A_{ij}$. The first index $i$ labels rows, the second $j$ labels columns. A **repeated index inside a product** implies *summation* (Einstein summation convention, or *contraction*): ```{math} A_{ij}\, x_j \;\equiv\; \sum_{j=1}^{n} A_{ij}\, x_j \;=\; (Ax)_i. ``` Here $j$ is repeated (contracted — it disappears in the result), while $i$ is a **free index** that labels the resulting vector's axis. Conventionally we write the non-contracted indices on the left of the expression, so the indices of the left-hand side $(Ax)_i$ match the remaining free indices $i$ on the right. The benefit of this notation is that the *shape* of an expression is fully encoded in its index signature. Element-wise multiplication uses repeated indices *without* contraction: ```{math} (x \odot y)_i \;=\; x_i\, y_i, ``` and the outer product uses two different free indices: ```{math} (x\, y^\top)_{ij} \;=\; x_i\, y_j. ``` ### 2. Computation DAG Solverz parses each equation's symbolic expression into a **directed acyclic graph (DAG)**: - **Leaf nodes** — variables ($x$, $y$) and parameters ($A$, $B$, $G$). Each leaf carries its index signature ($x_i$, $A_{ij}$, etc.). - **Internal nodes** — operations: addition $+$, element-wise product $\odot$, matrix product $@$, element-wise unary function $f(\cdot)$. - **Root node** — the output expression whose Jacobian we want. For the power flow active-power sub-expression $e \odot (Ge - Bf)$: ``` *[i] ← element-wise multiply (free index i) / \ e[i] -[i] ← subtraction (free index i) / \ @[i] @[i] ← matrix-vector products, contraction on j / \ / \ G[ij] e[j] B[ij] f[j] ``` Each node remembers its free index set. The matrix product nodes `@[i]` each *contract* index $j$ between $G_{ij}$ and $e_j$, so the result has free index $i$. The subtraction and element-wise product keep $i$ as their only free index, so the whole expression has index signature $[i]$ — a vector. ### 3. Forward-Mode Automatic Differentiation To obtain $\partial f / \partial x$ where $f$ is the root and $x$ is some leaf variable, derivatives propagate from **leaves upward to the root** (pushforward, a.k.a. forward-mode AD). Every internal node computes its own Jacobian contribution and combines it with the Jacobians already accumulated at its children. Throughout this section, $A_{s_1}$ and $B_{s_2}$ denote the free indices of the two operands (before differentiation), $s_3$ denotes the free index of the result, and $s_4$ is the **derivative-direction index** — the axis along which we differentiate, equal to the free index of the variable $x$ we differentiate with respect to. Tilded tensors $\tilde A, \tilde B$ carry one *extra* axis (labelled $s_4$) compared to $A, B$: the derivative of a rank-$k$ tensor w.r.t. a rank-1 variable is a rank-$(k{+}1)$ tensor. 1. **Leaf variables.** Differentiating $x_j$ w.r.t. $x_{s_4}$ gives the Kronecker delta: ```{math} \tilde{x}_{j\,s_4} \;=\; \frac{\partial x_j}{\partial x_{s_4}} \;=\; \delta_{j s_4}. ``` All other variables (and all parameters) have derivative $0$ w.r.t. $x$. 2. **Multiplication** (product rule in index notation): ```{math} \tilde{C}_{s_3\,s_4} \;=\; B_{s_2}\cdot\tilde{A}_{s_1\,s_4} \;+\; A_{s_1}\cdot\tilde{B}_{s_2\,s_4} ``` where $A_{s_1} \cdot B_{s_2} = C_{s_3}$ is the original product (the specific index pattern $s_1, s_2 \to s_3$ depends on the product type — element-wise, matrix-vector, or matrix-matrix). The two summands encode "differentiate the first factor, keep the second" plus "keep the first factor, differentiate the second". 3. **Element-wise unary function** $f$ (e.g. $\exp$, $\sin$, $\ln$): since $f$ acts point-wise, the chain rule gives ```{math} \frac{\partial f(A_{s_1})}{\partial x_{s_4}} \;=\; f'(A_{s_1})\odot\tilde{A}_{s_1\,s_4}. ``` The derivative $f'$ is also computed element-wise and multiplied (Hadamard) with the child's Jacobian. 4. **Addition**. Linearity: ```{math} \tilde{(A+B)}_{s_3\,s_4} \;=\; \tilde{A}_{s_1\,s_4} + \tilde{B}_{s_2\,s_4}. ``` ### 4. TMul2Mul: Index Resolution After the forward pass, derivatives live as tensor expressions labelled with free indices. The **TMul2Mul** stage reads the index pattern of each product and emits the corresponding concrete matrix / vector operation. Each row of the table has the form $(s_1, s_2, s_3)$: the free indices of the left operand, right operand, and the result. | Index pattern $(s_1, s_2, s_3)$ | Matrix operation | Meaning | |---|---|---| | $(i,\; i,\; i)$ | $x \odot y$ | element-wise multiply — both operands share axis $i$, no contraction | | $(i,\; j,\; ij)$ | $x\,y^\top$ | outer product — two disjoint free indices survive into the result | | $(ij,\; j,\; i)$ | $A\,y$ | matrix-vector: index $j$ is shared (contracted), $i$ survives | | $(ij,\; jk,\; ik)$ | $A\,B$ | matrix-matrix: inner $j$ is contracted, $(i,k)$ survive | | $(i,\; ij,\; ij)$ | $\operatorname{diag}(x)\,A$ | left diagonal scaling — scale each row of $A$ by the matching entry of $x$ | | $(ij,\; j,\; ij)$ | $A\,\operatorname{diag}(y)$ | right diagonal scaling — scale each column of $A$ by the matching entry of $y$ | | $(i,\; ii,\; ii)$ | $\operatorname{diag}(x \odot y)$ | diagonal-from-element-wise — used when a product has matching repeated indices $(ii)$ on one operand | The key insight is that every well-formed index pattern maps to one concrete sparse-preserving operation. Patterns like $(ij, ij, ij)$ that don't match the table are either simplified earlier (e.g. $(ij, ij, ij) \to A \odot B$, element-wise matrix product) or emitted via a fallback that produces a dense matrix. ## Application Examples ### Example 1: Power Flow Equations The active power balance equation in power systems: ```{math} P = e \odot (Ge - Bf) + f \odot (Be + Gf) ``` ```python from Solverz.sym_algebra.symbols import iVar, Para from Solverz.sym_algebra.functions import Mat_Mul from Solverz.sym_algebra.matrix_calculus import TensorExpr e = iVar('e') f = iVar('f') G = Para('G', dim=2) B = Para('B', dim=2) P = e * (Mat_Mul(G, e) - Mat_Mul(B, f)) + f * (Mat_Mul(B, e) + Mat_Mul(G, f)) TP = TensorExpr(P) print(TP.diff(e)) # diag(-B@f + G@e) + diag(e)@G + diag(f)@B print(TP.diff(f)) # diag(B@e + G@f) + diag(e)@(-B) + diag(f)@G ``` ### Example 2: Nonlinear Matrix Equations ```python from Solverz.sym_algebra.functions import exp A = Para('A', dim=2) x = iVar('x') expr = exp(Mat_Mul(A, x)) TE = TensorExpr(expr) print(TE.diff(x)) # diag(exp(A@x))@A ``` ### Example 3: Heat Network Equations ```python mL = Para('mL', dim=2) K = Para('K') m = iVar('m') from Solverz.sym_algebra.functions import Abs expr = Mat_Mul(mL, K * Abs(m) * m) TE = TensorExpr(expr) print(TE.diff(m)) # mL@(diag(K*Abs(m)) + diag(K*m*Sign(m))) ``` ## How numerical code is generated and optimized When `module_printer` compiles an equation system that contains `Mat_Mul`, it applies a **three-layer optimisation strategy** to make the generated `F_`/`J_` functions run as fast as possible. All three layers operate on the same insight: *at modelling time we already know what the equations and their Jacobians look like* — so every computation that does not depend on the current iterate $y$ can be lifted out of the hot loop and paid for once, at code generation or module load. ```{warning} Everything below assumes **sparse matrix parameters are immutable** after `model.create_instance()`. See the warning earlier in this document. Every layer described here bakes in the matrix's sparsity pattern or its `.data` values as a constant; mutating them in-place will silently corrupt results without raising any error. ``` ### Layer 1 — `Mat_Mul` precomputes as ``@njit`` CSC matvec The code printer walks every residual RHS looking for `Mat_Mul(A, v)` sub-trees, where `A` is a sparse parameter and `v` is any vector-valued expression (a variable, a slice, or a larger expression). It performs **three transformations at code gen time**: 1. **Hoist.** Each `Mat_Mul(A, v)` is replaced by a fresh placeholder variable `_sz_mm_N` (where `N` is a monotonically increasing integer unique across the whole system). The `_sz_mm_` prefix is reserved by Solverz — see {ref}`Restrictions and reserved names ` below. 2. **Deduplicate.** Two `Mat_Mul` sub-trees that are structurally equal (same matrix, same operand expression after hoisting) are collapsed into a single placeholder. The power-flow example below computes `G_nr @ e` once even though both the P-balance and Q-balance residuals reference it. 3. **Classify each placeholder.** The classifier (implemented in `_is_csc_matvec_fast_path` in `Solverz/code_printer/python/module/module_printer.py`) decides whether the precompute can go on the **fast path** (matrix is a plain sparse `dim=2` `Para`) or the **fallback path** (anything else: `-Para`, nested `Mat_Mul`, dense `dim=2` matrices, or a matrix expression). **Fast path** — the matvec is computed **inside `inner_F`** via the existing `SolCF.csc_matvec` Numba helper (`Solverz/num_api/custom_function.py`). Each sparse dim=2 `Param` is already decomposed at model-build time into its CSC flat arrays (`A_data` / `A_indices` / `A_indptr` / `A_shape0`) by `Solverz/model/basic.py`; these flat arrays flow through `param_list` to `inner_F` as ordinary numpy arrays, and `inner_F` issues a `SolCF.csc_matvec(A_data, A_indices, A_indptr, A_shape0, operand)` call per placeholder. The wrapper `F_()` does not load the sparse matrix object `A` at all — the whole matvec lives in Numba land and pays zero scipy dispatch cost: ```python def F_(y_, p_): e = y_[0:29] f = y_[29:58] G_nr_data = p_["G_nr_data"] G_nr_indices = p_["G_nr_indices"] G_nr_indptr = p_["G_nr_indptr"] G_nr_shape0 = p_["G_nr_shape0"] B_nr_data = p_["B_nr_data"] ... # dense vector params follow return inner_F(_F_, e, f, p_ref, q_ref, G_nr_data, G_nr_indices, G_nr_indptr, G_nr_shape0, B_nr_data, B_nr_indices, B_nr_indptr, B_nr_shape0, ...) @njit(cache=True) def inner_F(_F_, e, f, p_ref, q_ref, G_nr_data, G_nr_indices, G_nr_indptr, G_nr_shape0, B_nr_data, B_nr_indices, B_nr_indptr, B_nr_shape0, ...): # Fast-path matvecs: all inside @njit, no scipy dispatch. _sz_mm_0 = SolCF.csc_matvec(G_nr_data, G_nr_indices, G_nr_indptr, G_nr_shape0, e) _sz_mm_1 = SolCF.csc_matvec(B_nr_data, B_nr_indices, B_nr_indptr, B_nr_shape0, f) _sz_mm_2 = SolCF.csc_matvec(B_nr_data, B_nr_indices, B_nr_indptr, B_nr_shape0, e) _sz_mm_3 = SolCF.csc_matvec(G_nr_data, G_nr_indices, G_nr_indptr, G_nr_shape0, f) _F_[0:29] = inner_F0(e, f, p_ref, q_ref, _sz_mm_0, _sz_mm_1, _sz_mm_2, _sz_mm_3) _F_[29:53] = inner_F1(...) return _F_ @njit(cache=True) def inner_F0(e, f, p_ref, q_ref, _sz_mm_0, _sz_mm_1, _sz_mm_2, _sz_mm_3): return e * (p_ref - _sz_mm_1 + _sz_mm_0) + f * (q_ref + _sz_mm_2 + _sz_mm_3) - Pinj ``` Why this matters: scipy.sparse matrix-vector products are implemented in C and fast per se, but each call crosses Python → scipy dispatch → C → back, which on a 58×58 matrix costs ~1.5 µs of pure dispatch overhead — 10× the actual arithmetic. Eight such SpMVs in a hot F made `Mat_Mul` catastrophically slow on small networks (`case30` hot F was 10× worse than the for-loop form). Moving the matvec into `SolCF.csc_matvec` inside `inner_F` eliminates all Python dispatch: the helper is `@njit(cache=True)` and `inner_F` calls it as an intra-Numba function call, which the LLVM inline pass typically flattens into the caller's body. On `case30`, this drops hot F from ≈14 µs to ≈3 µs (~4.4× speedup) without any other code change. **Fallback path** — when the matrix operand is *not* a plain sparse `Para` (e.g. `-G_nr`, `Mat_Mul(A, B)`, or a dense `dim=2` parameter), the old path is kept: `F_()` emits `_sz_mm_N = matrix_expr @ operand` in the wrapper (using scipy.sparse or numpy `@`) and passes `_sz_mm_N` as an extra dense argument to `inner_F`. These placeholders do not benefit from the fast path and pay the scipy dispatch cost per call. Dense `dim=2` parameters fire a one-shot `UserWarning` at `FormJac` time — see {ref}`Restrictions and reserved names `. ```{note} The fast path reuses infrastructure that predates the `Mat_Mul` interface: the same `SolCF.csc_matvec` helper and the same `_data` / `_indices` / `_indptr` / `_shape0` Param decomposition were introduced for the legacy `MatVecMul` operator. Before 0.8.1 fast path, those fields were still emitted into `inner_F`'s argument list but never referenced — they are now the conduit for the Layer 1 matvec. Legacy `MatVecMul` users already benefited from this pipeline; from 0.8.1 `Mat_Mul` users do as well, without any model-level changes. ``` (when-to-use-mat_mul)= #### When to use (and not use) `Mat_Mul` `Mat_Mul` wins decisively for **every compile and build phase** of the workflow: `Model` construction, `FormJac`, `made_numerical` (inline compile), `module_printer.render`, and — most importantly — the first-time Numba module compile. On a typical matpower case30 power-flow model the compile phases are **20–40× faster** under `Mat_Mul` than under the equivalent element-wise for-loop formulation. The hot `J` call is also slightly faster (vectorised scatter-add in the mutable-matrix blocks). See the [power-flow chapter of the Solverz Cookbook](https://solverz-cookbook.readthedocs.io/en/latest/ae/pf/pf.html) for the full benchmark table. **Use `Mat_Mul` when**: - You are iterating on the model (changing parameters, equations, or dimensions) — compile-time savings are paid on *every* rebuild and dwarf the runtime differences. - The system has more than a handful of unknowns — for ≳ 50 unknowns the scipy/Numba dispatch overhead is amortised over enough arithmetic that hot F is within striking distance of (or beats) the for-loop form. - You want compact model code. `Mat_Mul(G, e)` replaces `nb` scalar `Eqn` definitions per bus and makes the equations read like the paper. - All sparse matrices used in `Mat_Mul` are plain `Param(..., dim=2, sparse=True)` — this unlocks the Layer 1 Numba fast path described above. **Consider the for-loop form when** *all* of the following hold: - The system is very small (≲ 30 unknowns); - Your hot loop is dominated by `F` evaluations (not Jacobian, not linear solve), so the raw hot-F number actually matters; - The module is compiled once and then reused many times, so the compile-time saving of `Mat_Mul` is not recovered. This combination of conditions is rare in practice — most power-flow and DHS workloads hit the Jacobian assembly and linear solve at least as often as F, and both of those are independent of (or favourable to) `Mat_Mul`. **Known performance regression**: on very small networks (≈ case30-scale) the `Mat_Mul` hot F is ~3 µs per call while the equivalent for-loop form runs at ~1.1 µs (≈ 2.9× slower). The difference is the structural cost of 8 `SolCF.csc_matvec` calls dispatched through `inner_F` plus the sub-function dispatch layer, versus 53 inlined scalar kernels the element-wise form collapses into. The 2 µs gap is usually invisible next to the J call (~55 µs) and the linear solve, but if your workload is millions of pure F evaluations on a tiny network, the for-loop form still wins. The gap shrinks and flips with problem size. For networks large enough that each SpMV does meaningful arithmetic — roughly case118 and above — the element-wise form's huge cold-compile cost and per-call dispatch overhead on every scalar `inner_F{i}` make `Mat_Mul` the strictly better choice on all metrics. **Matrices that fall out of the fast path** (and therefore pay the scipy SpMV dispatch cost on every `F_` call, ≈ 1.5 µs per SpMV): - **Negated matrices** — `Mat_Mul(-G, x)`. `G` is a plain `Para` but `-G` is a `Mul(-1, G)` expression the classifier doesn't recognise. Workaround: fold the sign into the coefficient of the surrounding expression — write `-Mat_Mul(G, x)` instead of `Mat_Mul(-G, x)`. - **Dense `dim=2` params** — `Param(A, dim=2, sparse=False)`. `FormJac` fires a one-shot `UserWarning` for these; they fall back to the `MutableMatJacDataModule` path. Convert to sparse with `csc_array(A)` and declare `sparse=True`. - **Nested `Mat_Mul`** — `Mat_Mul(A, Mat_Mul(B, x))`. The outer call receives the inner placeholder `_sz_mm_N` as its operand, which hits the fast path; the *inner* call still hits the fast path because its matrix is a plain `Para`. Only the inner is made fast, though, and only if both are plain Paras. Triple-nested is uncommon but would degrade similarly. - **Matrix expressions** — `Mat_Mul(A + B, x)` or similar. The classifier recognises only a bare `Para` as the matrix operand. Workaround: materialise the combined matrix as a single `Param` (`A + B`) outside the equation. ### Layer 2 — Mutable matrix Jacobian blocks The Jacobian is handled by a **separate, symbolic decomposition** because its blocks are matrices (not vectors) whose structure depends on the full expression. Consider the power-flow block ```{math} \frac{\partial P}{\partial e} \;=\; \operatorname{diag}\!\bigl(G\,e - B\,f + p_\text{ref}\bigr) \;+\; \operatorname{diag}(e)\,G \;+\; \operatorname{diag}(f)\,B. ``` All three terms are *mutable* — they depend on $e$ and $f$ and must be re-evaluated every Newton step. But their **sparsity pattern is constant**: the union of the diagonal, the pattern of $G$, and the pattern of $B$ — all fixed at modelling time. Solverz exploits this in a pattern-match compiler: 1. **Parse the block into typed terms.** The symbolic analyser walks the SymPy expression tree (implemented in `Solverz/code_printer/python/module/mutable_mat_analyzer.py`). Each additive term is classified as one of: | Shape | Generated code contribution | |---|---| | `Diag(u_expr)` (diagonal term) | `data[out_pos] += u[src_idx]` | | `Mat_Mul(Diag(v), M)` (row-scale) | `data[out_pos] += v[src_row] * M_data[k]` | | `Mat_Mul(M, Diag(v))` (col-scale) | `data[out_pos] += v[src_col] * M_data[k]` | **A term is deferred to the fallback path when any of the following is true**: - it has a non-unit numeric coefficient other than `±1` (e.g. `3 * Diag(x)`) — the analyser only unwraps `±1` via `_extract_sign_and_core`; - it is `Mat_Mul(Diag(v), X)` or `Mat_Mul(X, Diag(v))` where ``X`` is **not** a plain `Para` (or `-Para`) — for instance a matrix expression such as `A @ B`, or a user-constructed product that the matrix calculus engine hasn't simplified down to a single leaf matrix; - it is a ``Mat_Mul`` with three or more arguments that the matrix-calculus engine didn't fold into a nested binary form; - it is any other shape the classifier doesn't recognise (transpose of a mutable matrix, element-wise matrix product, a non-sparse matrix leaf, etc.). Unrecognised terms are deferred to the {ref}`fallback path `, which is correct but slower: it evaluates the full sub-expression via scipy.sparse on every `J_()` call and extracts values by fancy indexing. 2. **Precompute the output sparsity pattern (once).** At model initialisation (inside `FormJac`, `Solverz/equation/equations.py`), every mutable-matrix block is evaluated *one* time with all variables perturbed to distinct non-zero values and all `Diag` nodes substituted with `SpDiag` (which emits `sps.diags` instead of the dense `np.diagflat`). This perturbed evaluation is the entire reason we go to the trouble of the `SpDiag` substitution: using non-zero variable values *guarantees* that no term accidentally collapses to zero at $y_0$ (which would happen in a flat-start power flow where $f = 0$ and `sps.diags(f)@B` is empty). The resulting sparse matrix gives us the **union of all terms' patterns** — the structural upper bound that the runtime block must fit into. 3. **Precompute the index mapping arrays (once).** For each recognised term, the analyser builds three numpy arrays: - `out_pos[k]` — the index into the block's output `data` array where the $k$-th nonzero contribution of this term lands. - `src_idx[k]` — for diagonal terms, the row $i$ in the inner vector $u$ to read from; for row-scale terms, the row of $M$'s $k$-th nonzero; for col-scale terms, the column. - `mat_data` (row/col-scale terms only) — a direct reference to `M.data` (which is why the matrix must be immutable). These arrays are stored inside `setting`, a per-module dict that also holds things like the CSC `row`, `col`, and initial `data` for the full Jacobian. 4. **Generate a dedicated Numba kernel per block.** For each mutable matrix block the code printer emits a new top-level function named `_mut_block_N` (where `N` is the block index). Its signature takes the diag inner vectors, the base variables needed by any row/col-scale term, and each term's three mapping arrays. All per-block helper arrays use the reserved `_sz_mb_{N}_` prefix so different blocks' mappings never collide. The body is a pure- Python / Numba scatter-add loop: ```python # Module-level (loaded once from setting at import time): _sz_mb_0_diag_out_0 = setting["_sz_mb_0_diag_out_0"] _sz_mb_0_diag_src_0 = setting["_sz_mb_0_diag_src_0"] _sz_mb_0_rs_out_0 = setting["_sz_mb_0_rs_out_0"] _sz_mb_0_rs_src_0 = setting["_sz_mb_0_rs_src_0"] _sz_mb_0_rs_dat_0 = setting["_sz_mb_0_rs_dat_0"] # frozen G_nr.data ... @njit(cache=True) def _mut_block_0(_sz_mb_0_u0, e, f, _sz_mb_0_diag_out_0, _sz_mb_0_diag_src_0, _sz_mb_0_rs_out_0, _sz_mb_0_rs_src_0, _sz_mb_0_rs_dat_0, _sz_mb_0_rs_out_1, _sz_mb_0_rs_src_1, _sz_mb_0_rs_dat_1): data = zeros(107) # Diagonal term: diag(G_nr@e - B_nr@f + p_ref) for i in range(_sz_mb_0_diag_out_0.shape[0]): data[_sz_mb_0_diag_out_0[i]] += _sz_mb_0_u0[_sz_mb_0_diag_src_0[i]] # Row-scale term: diag(e) @ G_nr for i in range(_sz_mb_0_rs_out_0.shape[0]): data[_sz_mb_0_rs_out_0[i]] += e[_sz_mb_0_rs_src_0[i]] * _sz_mb_0_rs_dat_0[i] # Row-scale term: diag(f) @ B_nr for i in range(_sz_mb_0_rs_out_1.shape[0]): data[_sz_mb_0_rs_out_1[i]] += f[_sz_mb_0_rs_src_1[i]] * _sz_mb_0_rs_dat_1[i] return data ``` ```{note} Every scatter-add loop uses `+=`, not `=`, including the diagonal terms. This matters when a block contains **multiple independent `Diag(...)` additive terms** whose output positions coincide — for example the Jacobian of `x*(A@x) + x*(B@x)` produces `diag(A@x)` and `diag(B@x)`, both landing on the same `(i,i)` positions. The `+=` accumulates their contributions correctly; an earlier implementation used `=` and silently dropped one of the two terms. ``` 5. **Wire the kernel up in the `J_` wrapper.** For each block the wrapper emits: (a) one line per diag term that materialises its inner vector using scipy.sparse (e.g. `_sz_mb_0_u0 = p_ref - (B_nr@f) + (G_nr@e)`); (b) a call to `_mut_block_N(...)` passing the inner vectors, base variables, and mapping arrays; (c) an assignment of the returned `data` array into the appropriate slice of the overall Jacobian `_data_`. Crucially, each block's **scipy work is capped at one SpMV per diag term** — nothing else in the block touches sparse matrices at runtime. The data for row/col-scale terms comes straight from the pre-baked `_sz_mb_{N}_rs_dat_{k}` arrays. The net effect on a Newton step is that the Jacobian assembly cost is dominated by (i) the constant `inner_J` call for element-wise blocks and (ii) a handful of SpMVs for the diag inner vectors. The actual mutable-matrix blocks themselves are assembled in $\mathcal{O}(\text{nnz})$ vectorised Numba loops — no sparse matrix allocation, no sparsity reconstruction, no format conversion. #### Why not fancy indexing? An earlier iteration of the pipeline used `expr.tocsr()[[row],[col]]` fancy indexing — build the sparse matrix with `sps.diags` and read back the nonzero values at the precomputed COO positions. It is correct and elegant but slow: scipy has to construct several intermediate sparse matrices per step, sum them (with explicit-zero elimination), and then perform a linear-scan lookup for each output position. On a 30-bus power flow case, each Jacobian call took ≈ 280 µs. The scatter-add Numba path drops that to ≈ 50 µs while producing bit-identical results. The fallback path still uses fancy indexing, so if you write an equation whose Jacobian term structure the analyser doesn't recognise, you get *correct* results — just slower. Profile with `pytest --durations=10` and look for mutable-matrix blocks lingering in the slow path if performance matters. ### Layer 3 — Selective `@njit` All generated inner loops — `inner_F`, every per-equation `inner_F{N}`, `inner_J`, every per-block `inner_J{N}`, and every `_mut_block_{N}` — carry `@njit(cache=True)`. Only the `F_` / `J_` wrappers remain pure Python; they host the scipy.sparse SpMV precomputes, the module-level parameter unpacks, and the final `sps.coo_array((data, (row, col)), shape).tocsc()` that packages the Jacobian back into a scipy object for the downstream Newton solver. The selectivity matters: Numba absolutely cannot digest a `scipy.sparse.csc_array` argument, which is why the wrappers exist at all. Anything that *can* run inside Numba is pushed there, at zero syntactic cost to the user — they just write `Mat_Mul(A, x)` and `module_printer` figures out the rest. An important corollary for non-`Mat_Mul` models: if an equation system contains **no** `Mat_Mul` at all, the entire precompute / mutable- matrix machinery is disabled and `F_` / `J_` collapse back to the original pre-0.8 architecture (direct `@njit` on `inner_F` receiving only dense vectors and scalar params). Performance on pure element-wise models is therefore unchanged by the Mat_Mul rewrite. (fallback-path)= ### Fallback path If a Jacobian block contains a term whose shape the symbolic analyser does not recognise — for example an unusual nested `Mat_Mul`, a non-parameter matrix, or a matrix-valued non-linear expression — that single block is handled by a slower but always-correct path: ```python data[start:stop] = asarray( (sps.diags(u) + sps.diags(v)@M + ...) .tocsr()[[rows], [cols]] ).ravel() ``` This builds the block with scipy.sparse and reads its values at the precomputed COO positions. It is $\mathcal{O}(\text{nnz} \log n)$ per call instead of the $\mathcal{O}(\text{nnz})$ scatter-add path, but it never misses a derivative term. Only the single affected block falls back; the rest of the Jacobian still uses the fast path. If you want to know whether your model is hitting the fallback, profile with `pytest --durations=20` or `cProfile` and look for `MutableMatJacDataModule` in the traces — its presence in the hot `J_()` line indicates a fallback block. ### Performance Measurements below are per `J_()` call on a 2025 MacBook Air (Apple M4), averaged over 5000–10000 iterations. **Case A — rectangular power flow on MATPOWER `case30`** (58 unknowns, 450 Jacobian nonzeros, 6 mutable-matrix blocks, the flow has meaningful $G\,e$, $B\,f$, $B\,e$, $G\,f$ combinations): | Pipeline | `J_()` per call | |---|---| | Inline mode (lambdify + scipy SpMV) | ≈ 268 µs | | Module printer, old `tocsr()[rows, cols]` | ≈ 283 µs | | **Module printer, vectorized @njit** | **≈ 50 µs** | **Case B — DHS hydraulic subproblem on `BarryIsland`** (35 unknowns, 1 loop, 1 mutable-matrix block — the loop pressure Jacobian contains two col-scale terms $L\,\operatorname{diag}(K \odot |m|)$ and $L\,\operatorname{diag}(K \odot m \odot \operatorname{sign}(m))$ whose scaling vectors are non-trivial expressions of the variable): | Pipeline | `J_()` per call | |---|---| | Element-wise inline (one scalar `Eqn` per node + loop) | ≈ 103 µs | | Mat_Mul inline (lambdify) | ≈ 45 µs | | **Mat_Mul module printer, vectorized @njit** | **≈ 28 µs** | - **Element-wise vs Mat_Mul inline.** Writing the same hydraulic system as `V @ m - m_inj = 0` plus `L @ (K m |m|) = 0` (two vector equations total) instead of 35 scalar equations cuts inline time by more than half, because lambdify has far fewer expression nodes to walk at runtime. - **Inline vs module printer.** Even on a small system (only 79 nonzeros), the module printer wins once every mutable-matrix term reaches the vectorised scatter-add path. Here the loop-pressure Jacobian has two col-scale terms whose scaling vectors (``K * |m|`` and ``K * m * sign(m)``) are computed once per `J_` call as dense numpy expressions in the wrapper, then handed to a Numba kernel that does two 10-iteration scatter-add loops. The only remaining overhead is the final `coo_array(...).tocsc()` packing. Rule of thumb: **module printer wins whenever the mutable-matrix blocks successfully match the fast path**; it can lose to inline for models that fall through to the fallback (scipy-sparse + fancy indexing), or models small enough that the final COO-to-CSC packing dominates. Profile with `%timeit mdl.J(y, mdl.p)` on a representative iterate if in doubt, and check the generated `num_func.py` for any lingering `MutableMatJacDataModule` (= fallback) calls. The scatter-add path is strictly $\mathcal{O}(\text{nnz})$ in the block contents, while the old fancy-indexing path was dominated by the cost of constructing the intermediate sparse matrices — so the speed-up scales with nnz and with the number of mutable-matrix blocks. For models with many rows and few blocks the advantage is small; for models with many blocks per variable (such as power flow, with distinct $\partial/\partial e$ and $\partial/\partial f$ blocks for both $P$ and $Q$ balances) the advantage compounds. (restrictions)= ## Restrictions and reserved names `Mat_Mul` imposes a few hard restrictions that are enforced at model build time. Each one prevents a class of silent-wrong-answer bugs that would otherwise slip past the generated Jacobian. ### Time-varying sparse `dim=2` parameters are rejected at construction Solverz does not support sparse `dim=2` `Param`s whose values change at runtime. A declaration like: ```python from scipy.sparse import csc_array m.K = Param('K', csc_array([[1, 0], [0, 1]]), dim=2, sparse=True, triggerable=True, trigger_var=['x'], trigger_fun=lambda x: csc_array([[2*x[0], 0], [0, 2*x[1]]])) # ⇒ NotImplementedError: Parameter 'K': a sparse 2-D ``Param`` cannot # be declared ``triggerable=True``. Time-varying sparse matrices are # unsupported because Solverz's code generation caches the matrix's # CSC fields at model-build time; a runtime trigger update would # silently be ignored. ``` and the equivalent `TimeSeriesParam`: ```python m.G = TimeSeriesParam('G', v_series=[...], time_series=[...], value=np.eye(3), dim=2, sparse=True) # ⇒ NotImplementedError: Parameter 'G': a sparse 2-D # ``TimeSeriesParam`` is not supported. ``` both raise **at the point of construction**, not deep inside `FormJac`, so the error points at the exact line where the user wrote the offending declaration. **Why:** every Solverz code path that consumes a sparse `dim=2` `Param` caches its CSC decomposition at model-build time: - The legacy `MatVecMul` code path decomposes every sparse `dim=2` `Param` into flat arrays (`_data`, `_indices`, `_indptr`, `_shape0`) in `Solverz/model/basic.py:56`, which are stored in `p_` and passed to `inner_F` as read-only numpy arrays. - The 0.8.1 `Mat_Mul` `SolCF.csc_matvec` fast path reuses those same CSC flats inside `inner_F` to do the matvec in `@njit` land. - The mutable-matrix Jacobian scatter-add kernel caches the matrix's `.data` array directly as a `_sz_mb_{N}_rs_dat_{k}` setting entry loaded at module import time. None of these caches are invalidated when `p_["K"].trigger_fun` fires or `p_["G"].get_v_t(t)` returns a fresh matrix. The runtime update simply gets lost, the Jacobian keeps using the initial values, and the Newton iteration either diverges or converges to the wrong solution — silently. Historically the check was narrow (only equations containing `Mat_Mul` were rejected, and the check ran at `FormJac` time). In the 0.8.1 hotfix the policy broadened to cover every sparse `dim=2` time-varying `Param`, regardless of where or whether it is used. The check now lives in `ParamBase.__init__` and `TimeSeriesParam.__init__`, with an unconditional backstop in `Equations._check_no_timevar_sparse_matrices` that runs on every `FormJac` call in case a tainted `Param` was constructed via `__new__` + attribute assignment (which bypasses the `__init__` check). **Workarounds** — all of these are supported: - **Triggerable / time-series *vectors* next to a `Mat_Mul`.** Scalar or 1-D time-varying parameters are fine; only the `dim=2` + `sparse=True` combination is rejected. Write `m.K * m.x + Mat_Mul(m.A, m.x) - m.b` with `K` a triggerable 1-D vector and `A` a plain sparse 2-D matrix — the wrapper rebuilds `K` on every Newton step via the `trigger_var` mechanism and the `Mat_Mul(A, x)` fast path uses the unchanged `A`. - **Element-wise rewrite.** If you genuinely need the matrix itself to evolve, write the residual out one scalar `Eqn` per row, with the per-row coefficients as 1-D `Param`s or `TimeSeriesParam`s. The element-wise form has no CSC caching — every call re-evaluates the full trigonometric / algebraic expression via pure numpy / numba. - **Dense `dim=2` parameters (`sparse=False`).** The `MutableMatJacDataModule` fallback path used for dense matrices re-evaluates the entire block expression via scipy/numpy on every `J_()` call, so `get_v_t(t)` updates DO propagate. Dense parameters do not get the `csc_matvec` fast path, but they're a correct (if slower) way to carry a time-varying matrix through the solver. A `UserWarning` still fires once per such parameter to flag the performance cost. ### Reserved symbol prefixes User symbols (`Var`, `Param`, `iVar`, `Para`, ...) are rejected at construction time if their name starts with any of: | Prefix | Used for | |---|---| | `_sz_mm_` | Mat_Mul precompute placeholders in the `F_`/`J_` wrapper (e.g. `_sz_mm_0 = G @ e`). | | `_sz_mb_` | Mutable-matrix Jacobian block helpers (diag inner vectors, row/col-scale scaling vectors, mapping arrays). | ```python m.x = Var('_sz_mm_0', [1.0]) # ⇒ ValueError: Symbol name '_sz_mm_0' starts with Solverz-reserved # internal prefix '_sz_mm_'. This prefix is used by the code # generator for Mat_Mul / mutable-matrix Jacobian helper names and # cannot be used for user variables or parameters. ``` **Why:** the code printer emits an unbounded family of helper identifiers of the form `` and needs to know that none of them will collide with a user symbol. The prefix ban is the simplest way to guarantee that — users pick any name that does *not* start with `_sz_mm_` / `_sz_mb_`, and the code printer is free to use any name that *does*. ### Dense `dim=2` parameters emit a performance warning A parameter declared with `dim=2, sparse=False` that appears inside a `Mat_Mul` is legal — the fallback path handles it correctly — but it forfeits every optimisation described above: - no sparse pattern precomputation (the "output sparsity pattern" step produces a fully dense block), - no `_sz_mb_{N}_rs_dat_k` scatter-add (the block falls through to the `MutableMatJacDataModule` fancy-indexing path), - each Jacobian call allocates and populates a dense intermediate. Solverz emits a one-shot `UserWarning` at `FormJac` time, **once per offending parameter**, to flag the performance cost: ```python m.A = Param('A', np.array([[2.0, 1.0], [1.0, 3.0]]), dim=2, sparse=False) m.eqn = Eqn('f', m.x * Mat_Mul(m.A, m.x) - m.b) smdl, y0 = m.create_instance() smdl.FormJac(y0) # UserWarning: Parameter 'A' is a dense 2-D ``Param(..., dim=2, # sparse=False)`` used inside ``Mat_Mul``. Dense matrices bypass # the vectorised mutable-matrix Jacobian fast path and fall back # to a slower scipy/ndarray indexing path. Declare 'A' with # ``sparse=True`` for substantially better performance. ``` **How to fix:** wrap the value in a scipy.sparse container and declare `sparse=True`: ```python from scipy.sparse import csc_array m.A = Param('A', csc_array(np.array([[2.0, 1.0], [1.0, 3.0]])), dim=2, sparse=True) ``` On a dense 2×2 matrix the difference is invisible, but for anything larger than a few dozen rows the sparse path is substantially faster and (more importantly) keeps the Jacobian assembly off the fallback path. ## API Reference ```{eval-rst} .. autofunction:: Solverz.sym_algebra.matrix_calculus.MixedEquationDiff .. autoclass:: Solverz.sym_algebra.matrix_calculus.TensorExpr :members: diff, visualize ```