(matrix_calculus)=

# Matrix-Vector Calculus

Solverz provides automatic symbolic differentiation for mixed matrix-vector expressions.
When an equation contains matrix-vector products (`Mat_Mul`), Solverz automatically computes
analytical Jacobian blocks using tensor calculus with Einstein index notation.

This module is based on the approach described in:

> Laue, Mitterreiter, Giesen. *A Simple and Efficient Tensor Calculus.*
> Proceedings of the AAAI Conference on Artificial Intelligence, 2020.

See also: [matrixcalculus.org](https://www.matrixcalculus.org/)

(matrix-calculus-glossary)=
## Glossary

This page uses several abbreviations and design terms that recur
throughout the document. They are collected here for reference; each
one is hyperlinked from its first use in the body so you don't have
to scroll back. The terms describe Solverz's code-generation strategy
and the underlying numerical building blocks.

| Term | Definition |
|---|---|
| **SpMV** | *Sparse Matrix-Vector multiplication* — computing $y = M\,x$ where $M$ is stored in a sparse format. |
| **SpMM** | *Sparse Matrix-Matrix multiplication* — computing $C = M\,B$ where $B$ has more than one column. |
| **CSC / CSR** | *Compressed Sparse Column* / *Compressed Sparse Row* — the two scipy.sparse storage formats Solverz uses. CSC stores `data`, `indices`, `indptr` walking columns; CSR walks rows. The legacy `MatVecMul` and the 0.8.1 `Mat_Mul` fast path both consume CSC. |
| **`@njit`** | The Numba decorator that compiles a Python function to native LLVM-backed code at first call. Solverz uses `@njit(cache=True)` so the compiled code is cached on disk and reused across processes. |
| **scatter-add** | An assembly pattern where many indexed reads are accumulated (`+=`) into a fixed output buffer. The mutable-matrix Jacobian kernels are pure scatter-add loops over precomputed index arrays — see [Layer 2](#layer-2-mutable-matrix-jacobian-blocks). |
| **fancy indexing** | NumPy / scipy.sparse indexing with arrays of indices, e.g. `M[[r0, r1, r2], [c0, c1, c2]]`, returning the values at those positions. The slow {term}`fallback path` uses this to extract block values from a freshly-built sparse expression. |
| **fast path** | The Solverz code-gen path used for `Mat_Mul(A, x)` placeholders where `A` is a plain sparse `dim=2` `Para`: the matvec runs **inside** `inner_F` via the Numba `SolCF.csc_matvec` helper, with zero scipy.sparse calls per evaluation. Mutable-matrix Jacobian blocks have a parallel fast path of typed scatter-add kernels. |
| **fallback path** | The slower-but-correct path used whenever a `Mat_Mul` operand or a Jacobian block doesn't fit the fast-path shape — for instance negated matrices, nested `Mat_Mul`, dense `dim=2`, or general matrix expressions. The wrapper falls back to `scipy.sparse @` and (for Jacobian blocks) fancy indexing into a freshly-built sparse matrix. The fast and fallback paths **co-exist**: the runtime picks per placeholder / per Jacobian block based on what the symbolic classifier recognises. |
| **`lambdify`** | SymPy's symbolic-to-callable converter. Solverz's "inline" mode uses `lambdify` to turn each equation's symbolic expression into a Python function on every call (no JIT). |
| **hot F / hot J** | Steady-state per-call wall-clock time for `F_(y, p)` / `J_(y, p)` after JIT warm-up. Distinguished from {term}`cold compile`. |
| **cold compile** | The first call into a freshly-rendered module — pays the Numba `@njit` compile cost for every decorated helper. Cached to disk so subsequent process starts skip it. |
| **LICM** | *Loop-Invariant Code Motion*, an LLVM optimisation pass that hoists computations out of inner loops when their inputs do not change inside the loop. |
| **inline mode** | Solverz's `made_numerical(spf, y0)` code path — uses `lambdify` to build callable F/J from the SymPy expressions on every call. No Numba compilation. |
| **module printer mode** | Solverz's `module_printer(spf, y0).render()` code path — emits a Python module file with `@njit`-decorated helpers, cached to disk. The fast / fallback distinction only applies to this mode. |

## Supported Operations

For each supported operation the table below shows three things: the
Solverz symbolic constructor, the mathematical form it represents, and
the **Jacobian block** $\partial f / \partial x$ that the matrix
calculus engine emits when the equation is differentiated with respect
to a vector variable `x`.

The "Jacobian block" column is **not** the symbolic intermediate
Solverz computes for each equation. For an element-wise function like
`sin(x)`, Solverz first produces the elementwise vector `cos(x)` —
the same shape as the input — and only at Jacobian assembly time wraps
that vector in `diag(...)` to express the diagonal structure of the
matrix block. Equation residuals stay vector-valued throughout
symbolic processing; the matrix wrapper only appears in the final
Jacobian. Read the third column as "the matrix you would get if you
materialised $\partial f / \partial x$ as a dense block".

| Operation | Symbolic Form | Example | Jacobian block ($\partial f / \partial x$ for vector $x$) |
|-----------|---------------|---------|---|
| Matrix-vector multiply | `Mat_Mul(A, x)` | $Ax$ | $A$ |
| Element-wise multiply | `x * y` | $x \odot y$ | $\operatorname{diag}(y)$ |
| Addition | `x + y` | $x + y$ | $I$ |
| Absolute value | `Abs(x)` | $\|x\|$ | $\operatorname{diag}(\operatorname{sign}(x))$ |
| Exponential | `exp(x)` | $e^x$ | $\operatorname{diag}(e^x)$ |
| Sine / Cosine | `sin(x)` / `cos(x)` | $\sin(x)$ | $\operatorname{diag}(\cos(x))$ |
| Logarithm | `ln(x)` | $\ln(x)$ | $\operatorname{diag}(x^{-1})$ |
| Power | `x**n` | $x^n$ | $\operatorname{diag}(n x^{n-1})$ |
| Transpose | `transpose(A)` | $A^T$ | $A^T$ |
| Diagonal | `Diag(x)` | $\operatorname{diag}(x)$ | $\operatorname{diag}(\cdot)$ |

```{note}
`Mat_Mul` is the recommended interface for matrix-vector products in
new Solverz models. It is *symbolic-first*: the symbolic engine sees
each `Mat_Mul(A, x)` as a real operator, walks it through the
matrix-calculus pipeline, and emits an analytical Jacobian block
automatically.

On the {term}`fast path` — when `A` is a plain sparse `dim=2` `Param`
— `Mat_Mul` ends up calling exactly the same Numba helper
(`SolCF.csc_matvec`) the legacy `MatVecMul` interface uses. The two
have the same per-call cost; the difference is that `Mat_Mul` survives
the matrix-calculus engine and `MatVecMul` does not. On the
{term}`fallback path` (negated matrices, nested `Mat_Mul`, dense
`dim=2` parameters, or other matrix expressions) `Mat_Mul` falls
through to `scipy.sparse @` in the wrapper — see the
[Layer 1](#layer-1-mat_mul-precomputes-as-njit-csc-matvec)
discussion below.

`MatVecMul` is still deprecated for new models; pick `Mat_Mul`.
```

```{warning} **Sparse matrices MUST be immutable after modelling**

All sparse matrix parameters declared with `Param(..., dim=2, sparse=True)` are
assumed to be **immutable** (both sparsity pattern *and* numerical values) once
modelling is complete and `model.create_instance()` has been called.

Solverz relies on this assumption for maximum runtime efficiency:

- At code-generation time, the sparsity pattern of every mutable-matrix Jacobian
  block is analysed *once*, the output data-array layout is fixed, and a pre-
  computed index mapping is baked into a Numba-compiled scatter-add loop.
- At runtime, every `J_(y, p)` call assembles Jacobian block data by
  directly indexing into `Matrix.data` using the pre-computed
  row/column mappings — **no scipy.sparse matrix is constructed per
  call**.
- For sparse-matrix parameters used in `Mat_Mul(A, x)`, the matrix-vector
  product `A @ x` is precomputed in the `F_` wrapper and passed as a dense
  vector to the `@njit`-compiled equation functions.

If your application truly needs to update sparse-matrix values between solves
(e.g., sweeping a parameter), rebuild the model with `model.create_instance()`
each time. Mutating `p_["A"].data` after creation will silently produce wrong
Jacobians **for matrices that flow through the vectorised Mat_Mul fast path** —
scatter-add kernels cache `M.data` at analysis time and the cache becomes
stale. Matrices that only land in the `MutableMatJacDataModule` fallback path
(see {ref}`Fallback path <fallback-path>`) re-evaluate the full expression on
every `J_()` call and would reflect the mutation, but there is no way to
guarantee a particular block won't be *promoted* to the fast path in a future
release, so the rule is still: **do not mutate sparse matrix params
post-build**.

Dense parameters (`Param(..., dim=2, sparse=False)`) are fine to update via
`mdl.p["name"] = new_value`; the restriction only applies to **sparse**
dim=2 parameters.
```

All element-wise functions (exp, sin, cos, ln, Abs) can be composed with matrix operations:

```python
from Solverz import Var, Param, Eqn, Model
from Solverz.sym_algebra.functions import Mat_Mul, exp, sin

m = Model()
m.x = Var('x', [1, 2, 3])
m.A = Param('A', [[1, 0, 0], [0, 2, 0], [0, 0, 3]], dim=2)
m.f = Eqn('f', exp(Mat_Mul(m.A, m.x)))  # exp(A@x)
```

Solverz will automatically compute $\frac{\partial}{\partial x} e^{Ax} = \operatorname{diag}(e^{Ax}) \cdot A$.

## How It Works

The matrix calculus engine is a four-stage pipeline: each expression is
lifted into Einstein index notation, rebuilt as a computation DAG,
differentiated forward through the DAG, and finally collapsed back to
ordinary matrix/vector operations. We walk through each stage with a
running example from power flow.

### 1. Einstein Index Notation

Every tensor is labelled with one index per axis:

- **Scalar**: no index (e.g., $\alpha$). Zeroth-order tensor, no axes.
- **Vector** $x \in \mathbb{R}^n$: a single index $x_i$, where $i$ ranges
  over $\{1, \dots, n\}$. The index *names* the axis; the letter used is
  arbitrary, only its pattern of repetition matters.
- **Matrix** $A \in \mathbb{R}^{m \times n}$: two indices $A_{ij}$. The
  first index $i$ labels rows, the second $j$ labels columns.

A **repeated index inside a product** implies *summation*
(Einstein summation convention, or *contraction*):

```{math}
A_{ij}\, x_j \;\equiv\; \sum_{j=1}^{n} A_{ij}\, x_j \;=\; (Ax)_i.
```

Here $j$ is repeated (contracted — it disappears in the result), while
$i$ is a **free index** that labels the resulting vector's axis.
Conventionally we write the non-contracted indices on the left of the
expression, so the indices of the left-hand side $(Ax)_i$ match the
remaining free indices $i$ on the right.

The benefit of this notation is that the *shape* of an expression is
fully encoded in its index signature. Element-wise multiplication uses
repeated indices *without* contraction:

```{math}
(x \odot y)_i \;=\; x_i\, y_i,
```

and the outer product uses two different free indices:

```{math}
(x\, y^\top)_{ij} \;=\; x_i\, y_j.
```

### 2. Computation DAG

Solverz parses each equation's symbolic expression into a **directed
acyclic graph (DAG)**:

- **Leaf nodes** — variables ($x$, $y$) and parameters ($A$, $B$, $G$).
  Each leaf carries its index signature ($x_i$, $A_{ij}$, etc.).
- **Internal nodes** — operations: addition $+$, element-wise product
  $\odot$, matrix product $@$, element-wise unary function $f(\cdot)$.
- **Root node** — the output expression whose Jacobian we want.

For the power flow active-power sub-expression $e \odot (Ge - Bf)$:

```
        *[i]          ← element-wise multiply (free index i)
       /    \
     e[i]   -[i]      ← subtraction (free index i)
           /    \
       @[i]    @[i]   ← matrix-vector products, contraction on j
       / \     / \
     G[ij] e[j] B[ij] f[j]
```

Each node remembers its free index set. The matrix product nodes
`@[i]` each *contract* index $j$ between $G_{ij}$ and $e_j$, so the
result has free index $i$. The subtraction and element-wise product
keep $i$ as their only free index, so the whole expression has index
signature $[i]$ — a vector.

### 3. Forward-Mode Automatic Differentiation

To obtain $\partial f / \partial x$ where $f$ is the root and $x$ is
some leaf variable, derivatives propagate from **leaves upward to the
root** (pushforward, a.k.a. forward-mode AD). Every internal node
computes its own Jacobian contribution and combines it with the
Jacobians already accumulated at its children.

Throughout this section, $A_{s_1}$ and $B_{s_2}$ denote the free
indices of the two operands (before differentiation), $s_3$ denotes the
free index of the result, and $s_4$ is the **derivative-direction
index** — the axis along which we differentiate, equal to the free
index of the variable $x$ we differentiate with respect to. Tilded
tensors $\tilde A, \tilde B$ carry one *extra* axis
(labelled $s_4$) compared to $A, B$: the derivative of a rank-$k$
tensor w.r.t. a rank-1 variable is a rank-$(k{+}1)$ tensor.

1. **Leaf variables.** Differentiating $x_j$ w.r.t. $x_{s_4}$ gives the
   Kronecker delta:

   ```{math}
   \tilde{x}_{j\,s_4} \;=\; \frac{\partial x_j}{\partial x_{s_4}} \;=\; \delta_{j s_4}.
   ```

   All other variables (and all parameters) have derivative $0$ w.r.t.
   $x$.

2. **Multiplication** (product rule in index notation):

```{math}
\tilde{C}_{s_3\,s_4} \;=\; B_{s_2}\cdot\tilde{A}_{s_1\,s_4}
                      \;+\; A_{s_1}\cdot\tilde{B}_{s_2\,s_4}
```

   where $A_{s_1} \cdot B_{s_2} = C_{s_3}$ is the original product (the
   specific index pattern $s_1, s_2 \to s_3$ depends on the product
   type — element-wise, matrix-vector, or matrix-matrix). The two
   summands encode "differentiate the first factor, keep the second"
   plus "keep the first factor, differentiate the second".

3. **Element-wise unary function** $f$ (e.g. $\exp$, $\sin$, $\ln$):
   since $f$ acts point-wise, the chain rule gives

```{math}
\frac{\partial f(A_{s_1})}{\partial x_{s_4}} \;=\; f'(A_{s_1})\odot\tilde{A}_{s_1\,s_4}.
```

   The derivative $f'$ is also computed element-wise and multiplied
   (Hadamard) with the child's Jacobian.

4. **Addition**. Linearity:

```{math}
\tilde{(A+B)}_{s_3\,s_4} \;=\; \tilde{A}_{s_1\,s_4} + \tilde{B}_{s_2\,s_4}.
```

### 4. TMul2Mul: Index Resolution

After the forward pass, derivatives live as tensor expressions labelled
with free indices. The **TMul2Mul** stage reads the index pattern of
each product and emits the corresponding concrete matrix / vector
operation. Each row of the table has the form $(s_1, s_2, s_3)$: the
free indices of the left operand, right operand, and the result.

| Index pattern $(s_1, s_2, s_3)$ | Matrix operation | Meaning |
|---|---|---|
| $(i,\; i,\; i)$                 | $x \odot y$                            | element-wise multiply — both operands share axis $i$, no contraction |
| $(i,\; j,\; ij)$                | $x\,y^\top$                            | outer product — two disjoint free indices survive into the result |
| $(ij,\; j,\; i)$                | $A\,y$                                 | matrix-vector: index $j$ is shared (contracted), $i$ survives |
| $(ij,\; jk,\; ik)$              | $A\,B$                                 | matrix-matrix: inner $j$ is contracted, $(i,k)$ survive |
| $(i,\; ij,\; ij)$               | $\operatorname{diag}(x)\,A$            | left diagonal scaling — scale each row of $A$ by the matching entry of $x$ |
| $(ij,\; j,\; ij)$               | $A\,\operatorname{diag}(y)$            | right diagonal scaling — scale each column of $A$ by the matching entry of $y$ |
| $(i,\; ii,\; ii)$               | $\operatorname{diag}(x \odot y)$       | diagonal-from-element-wise — used when a product has matching repeated indices $(ii)$ on one operand |

The key insight is that every well-formed index pattern maps to one
concrete sparse-preserving operation. Patterns like $(ij, ij, ij)$ that
don't match the table are either simplified earlier (e.g. $(ij, ij, ij)
\to A \odot B$, element-wise matrix product) or emitted via a fallback
that produces a dense matrix.

## Application Examples

### Example 1: Power Flow Equations

The active power balance equation in power systems:

```{math}
P = e \odot (Ge - Bf) + f \odot (Be + Gf)
```

```python
from Solverz.sym_algebra.symbols import iVar, Para
from Solverz.sym_algebra.functions import Mat_Mul
from Solverz.sym_algebra.matrix_calculus import TensorExpr

e = iVar('e')
f = iVar('f')
G = Para('G', dim=2)
B = Para('B', dim=2)

P = e * (Mat_Mul(G, e) - Mat_Mul(B, f)) + f * (Mat_Mul(B, e) + Mat_Mul(G, f))
TP = TensorExpr(P)

print(TP.diff(e))  # diag(-B@f + G@e) + diag(e)@G + diag(f)@B
print(TP.diff(f))  # diag(B@e + G@f) + diag(e)@(-B) + diag(f)@G
```

### Example 2: Nonlinear Matrix Equations

```python
from Solverz.sym_algebra.functions import exp

A = Para('A', dim=2)
x = iVar('x')

expr = exp(Mat_Mul(A, x))
TE = TensorExpr(expr)
print(TE.diff(x))  # diag(exp(A@x))@A
```

### Example 3: Heat Network Equations

```python
mL = Para('mL', dim=2)
K = Para('K')
m = iVar('m')
from Solverz.sym_algebra.functions import Abs

expr = Mat_Mul(mL, K * Abs(m) * m)
TE = TensorExpr(expr)
print(TE.diff(m))  # mL@(diag(K*Abs(m)) + diag(K*m*Sign(m)))
```

## How numerical code is generated and optimized

When `module_printer` compiles an equation system that contains
`Mat_Mul`, it applies a **three-layer optimisation strategy** to make
the generated `F_`/`J_` functions run as fast as possible. All three
layers operate on the same insight: *at modelling time we already know
what the equations and their Jacobians look like* — so every
computation that does not depend on the current iterate $y$ can be
lifted out of the hot loop and paid for once, at code generation or
module load.

```{warning}
Everything below assumes **sparse matrix parameters are immutable**
after `model.create_instance()`. See the warning earlier in this
document. Every layer described here bakes in the matrix's sparsity
pattern or its `.data` values as a constant; mutating them in-place
will silently corrupt results without raising any error.
```

### Layer 1 — `Mat_Mul` precomputes as ``@njit`` CSC matvec

The code printer walks every residual RHS looking for `Mat_Mul(A, v)`
sub-trees, where `A` is a sparse parameter and `v` is any vector-valued
expression (a variable, a slice, or a larger expression). It performs
**three transformations at code gen time**:

1. **Hoist.** Each `Mat_Mul(A, v)` is replaced by a fresh placeholder
   variable `_sz_mm_N` (where `N` is a monotonically increasing integer
   unique across the whole system). The `_sz_mm_` prefix is reserved
   by Solverz — see {ref}`Restrictions and reserved names
   <restrictions>` below.
2. **Deduplicate.** Two `Mat_Mul` sub-trees that are structurally equal
   (same matrix, same operand expression after hoisting) are collapsed
   into a single placeholder. The power-flow example below computes
   `G_nr @ e` once even though both the P-balance and Q-balance
   residuals reference it.
3. **Classify each placeholder.** The classifier
   (`_classify_matmul_placeholders`, with the inner shape predicate
   `_shape_is_fast`, in
   `Solverz/code_printer/python/module/module_printer.py`) decides
   whether the precompute can go on the {term}`fast path` (matrix is
   a plain sparse `dim=2` `Para` **and** the operand is a clean vector
   expression) or the {term}`fallback path` (anything else:
   `-Para`, nested `Mat_Mul`, dense `dim=2` matrices, matrix
   expressions, or operands that still contain unresolved
   `Mat_Mul` / sparse `dim=2` `Para` references).

   A fast-path candidate is also **demoted** to the fallback path if
   another fallback placeholder's `matrix_arg` or `operand_arg`
   references it as a free symbol — otherwise the wrapper would emit
   a scipy {term}`SpMV` like `_sz_mm_1 = (-A) @ _sz_mm_0` whose
   `_sz_mm_0` operand was never materialised. The classifier iterates
   this demotion to a fixed point so chains of nested `Mat_Mul`s
   land on a topologically valid emission order.

**Fast path** — the matvec is computed **inside `inner_F`** via the
existing `SolCF.csc_matvec` Numba helper
(`Solverz/num_api/custom_function.py`). Each sparse dim=2 `Param` is
already decomposed at model-build time into its CSC flat arrays
(`A_data` / `A_indices` / `A_indptr` / `A_shape0`) by
`Solverz/model/basic.py`; these flat arrays flow through
`param_list` to `inner_F` as ordinary numpy arrays, and `inner_F`
issues a `SolCF.csc_matvec(A_data, A_indices, A_indptr, A_shape0,
operand)` call per placeholder. The wrapper `F_()` does not load the
sparse matrix object `A` at all — the whole matvec lives in Numba
land and pays zero scipy dispatch cost:

```python
def F_(y_, p_):
    e = y_[0:29]
    f = y_[29:58]
    G_nr_data = p_["G_nr_data"]
    G_nr_indices = p_["G_nr_indices"]
    G_nr_indptr = p_["G_nr_indptr"]
    G_nr_shape0 = p_["G_nr_shape0"]
    B_nr_data = p_["B_nr_data"]
    ...   # dense vector params follow
    return inner_F(_F_, e, f, p_ref, q_ref,
                   G_nr_data, G_nr_indices, G_nr_indptr, G_nr_shape0,
                   B_nr_data, B_nr_indices, B_nr_indptr, B_nr_shape0,
                   ...)

@njit(cache=True)
def inner_F(_F_, e, f, p_ref, q_ref,
            G_nr_data, G_nr_indices, G_nr_indptr, G_nr_shape0,
            B_nr_data, B_nr_indices, B_nr_indptr, B_nr_shape0, ...):
    # Fast-path matvecs: all inside @njit, no scipy dispatch.
    _sz_mm_0 = SolCF.csc_matvec(G_nr_data, G_nr_indices, G_nr_indptr, G_nr_shape0, e)
    _sz_mm_1 = SolCF.csc_matvec(B_nr_data, B_nr_indices, B_nr_indptr, B_nr_shape0, f)
    _sz_mm_2 = SolCF.csc_matvec(B_nr_data, B_nr_indices, B_nr_indptr, B_nr_shape0, e)
    _sz_mm_3 = SolCF.csc_matvec(G_nr_data, G_nr_indices, G_nr_indptr, G_nr_shape0, f)
    _F_[0:29] = inner_F0(e, f, p_ref, q_ref, _sz_mm_0, _sz_mm_1, _sz_mm_2, _sz_mm_3)
    _F_[29:53] = inner_F1(...)
    return _F_

@njit(cache=True)
def inner_F0(e, f, p_ref, q_ref, _sz_mm_0, _sz_mm_1, _sz_mm_2, _sz_mm_3):
    return e * (p_ref - _sz_mm_1 + _sz_mm_0) + f * (q_ref + _sz_mm_2 + _sz_mm_3) - Pinj
```

Why this matters: scipy.sparse matrix-vector products are implemented
in C and fast per se, but each call crosses Python → scipy dispatch
→ C → back, which on a 58×58 matrix costs ~1.5 µs of pure dispatch
overhead — 10× the actual arithmetic. Eight such SpMVs in a hot F
made `Mat_Mul` catastrophically slow on small networks
(`case30` hot F was 10× worse than the for-loop form). Moving the
matvec into `SolCF.csc_matvec` inside `inner_F` eliminates all Python
dispatch: the helper is `@njit(cache=True)` and `inner_F` calls it as
an intra-Numba function call, which the LLVM inline pass typically
flattens into the caller's body. On `case30`, this drops hot F from
≈14 µs to ≈3 µs (~4.4× speedup) without any other code change.

**Fallback path** — when the matrix operand is *not* a plain sparse
`Para` (e.g. `-G_nr`, `Mat_Mul(A, B)`, or a dense `dim=2` parameter),
the old path is kept: `F_()` emits `_sz_mm_N = matrix_expr @ operand`
in the wrapper (using scipy.sparse or numpy `@`) and passes
`_sz_mm_N` as an extra dense argument to `inner_F`. These
placeholders do not benefit from the fast path and pay the scipy
dispatch cost per call. Dense `dim=2` parameters fire a one-shot
`UserWarning` at `FormJac` time — see
{ref}`Restrictions and reserved names <restrictions>`.

```{note}
The fast path reuses infrastructure that predates the `Mat_Mul`
interface: the same `SolCF.csc_matvec` helper and the same
`<name>_data` / `_indices` / `_indptr` / `_shape0` Param decomposition
were introduced for the legacy `MatVecMul` operator. Before 0.8.1
fast path, those fields were still emitted into `inner_F`'s argument
list but never referenced — they are now the conduit for the Layer 1
matvec. Legacy `MatVecMul` users already benefited from this pipeline;
from 0.8.1 `Mat_Mul` users do as well, without any model-level
changes.
```

(when-to-use-mat_mul)=
#### When to use `Mat_Mul`

API-level rules of thumb. The detailed case study with full
benchmark numbers, decision matrix, and the case30-scale hot-F
regression analysis lives in the
[power-flow chapter of the Solverz Cookbook](https://cookbook.solverz.org/latest/ae/pf/pf.html#performance-comparison-mat_mul-vs-for-loop)
— this section just states the API guidance.

**Use `Mat_Mul`** when any of the following hold:

- You want compact, paper-faithful equations. `Mat_Mul(G, e)`
  replaces `nb` scalar `Eqn` definitions and lets the matrix
  calculus engine derive the Jacobian automatically.
- You iterate on the model. Every compile / build phase
  (`Model` construction, `FormJac`, `made_numerical`,
  `module_printer.render`, and the first-time Numba {term}`cold compile`)
  is dramatically faster under `Mat_Mul` than under the equivalent
  element-wise for-loop form.
- All matrix operands are plain `Param(..., dim=2, sparse=True)` —
  this unlocks the Layer 1 {term}`fast path` described above.

The narrow case where the for-loop form still wins (very small
network, F-dominated hot loop, single compile) and the underlying
case30 numbers are documented in the Cookbook chapter linked above.

**Matrix shapes that fall out of the fast path** and therefore pay
the scipy {term}`SpMV` dispatch cost on every `F_` call (~1.5 µs
per SpMV — consequential at small network scale, negligible
elsewhere):

- **Negated matrices** — `Mat_Mul(-G, x)`. `G` is a plain `Para` but
  `-G` is a `Mul(-1, G)` expression the classifier doesn't recognise.
  Workaround: write `-Mat_Mul(G, x)` instead of `Mat_Mul(-G, x)` so
  the sign flows through the outer expression and `G` stays on the
  fast path.
- **Dense `dim=2` params** — `Param(A, dim=2, sparse=False)`. `FormJac`
  fires a one-shot `UserWarning` for these; they fall through to the
  `MutableMatJacDataModule` path. Convert to sparse with
  `csc_array(A)` and declare `sparse=True`.
- **Nested `Mat_Mul`** — `Mat_Mul(A, Mat_Mul(B, x))`. The
  classifier's dependency-aware demotion takes care of the
  topological ordering, but only the inner `Mat_Mul` ends up on
  the fast path when both matrices are plain Paras; the outer
  pays the wrapper SpMV cost.
- **Three-or-more-arg `Mat_Mul`** — `Mat_Mul(A, B, x)`. The
  operand of the placeholder is itself a `Mat_Mul`, so the
  classifier rejects it from the fast path.
- **Matrix expressions** — `Mat_Mul(A + B, x)` or similar. The
  classifier recognises only a bare `Para` as the matrix operand.
  Workaround: materialise the combined matrix as a single
  `Param(A + B, dim=2, sparse=True)` outside the equation.

### Layer 2 — Mutable matrix Jacobian blocks

The Jacobian is handled by a **separate, symbolic decomposition**
because its blocks are matrices (not vectors) whose structure depends
on the full expression. Consider the power-flow block

```{math}
\frac{\partial P}{\partial e} \;=\;
  \operatorname{diag}\!\bigl(G\,e - B\,f + p_\text{ref}\bigr)
  \;+\; \operatorname{diag}(e)\,G
  \;+\; \operatorname{diag}(f)\,B.
```

All three terms are *mutable* — they depend on $e$ and $f$ and must be
re-evaluated on every `J_(y, p)` call. But their **sparsity pattern is
constant**: the union of the diagonal, the pattern of $G$, and the
pattern of $B$ — all fixed at modelling time.

Solverz exploits this in a pattern-match compiler:

1. **Parse the block into typed terms.** The symbolic analyser walks
   the SymPy expression tree (implemented in
   `Solverz/code_printer/python/module/mutable_mat_analyzer.py`). Each
   additive term is classified as one of:

   | Shape | Generated code contribution |
   |---|---|
   | `Diag(u_expr)` (diagonal term) | `data[out_pos] += u[src_idx]` |
   | `Mat_Mul(Diag(v), M)` (row-scale) | `data[out_pos] += v[src_row] * M_data[k]` |
   | `Mat_Mul(M, Diag(v))` (col-scale) | `data[out_pos] += v[src_col] * M_data[k]` |

   **A term is deferred to the fallback path when any of the
   following is true**:

   - it has a non-unit numeric coefficient other than `±1` (e.g.
     `3 * Diag(x)`) — the analyser only unwraps `±1` via
     `_extract_sign_and_core`;
   - it is `Mat_Mul(Diag(v), X)` or `Mat_Mul(X, Diag(v))` where
     ``X`` is **not** a plain `Para` (or `-Para`) — for instance
     a matrix expression such as `A @ B`, or a user-constructed
     product that the matrix calculus engine hasn't simplified
     down to a single leaf matrix;
   - it is a ``Mat_Mul`` with three or more arguments that the
     matrix-calculus engine didn't fold into a nested binary form;
   - it is any other shape the classifier doesn't recognise
     (transpose of a mutable matrix, element-wise matrix product,
     a non-sparse matrix leaf, etc.).

   Unrecognised terms are deferred to the
   {ref}`fallback path <fallback-path>`, which is correct but slower:
   it evaluates the full sub-expression via scipy.sparse on every
   `J_()` call and extracts values by fancy indexing.

2. **Precompute the output sparsity pattern (once).** At model
   initialisation (inside `FormJac`, `Solverz/equation/equations.py`),
   every mutable-matrix block is evaluated *one* time with all
   variables perturbed to distinct non-zero values and all `Diag`
   nodes substituted with `SpDiag` (which emits `sps.diags` instead of
   the dense `np.diagflat`). This perturbed evaluation is the entire
   reason we go to the trouble of the `SpDiag` substitution: using
   non-zero variable values *guarantees* that no term accidentally
   collapses to zero at $y_0$ (which would happen in a flat-start power
   flow where $f = 0$ and `sps.diags(f)@B` is empty). The resulting
   sparse matrix gives us the **union of all terms' patterns** — the
   structural upper bound that the runtime block must fit into.

3. **Precompute the index mapping arrays (once).** For each recognised
   term, the analyser builds three numpy arrays:

   - `out_pos[k]` — the index into the block's output `data` array
     where the $k$-th nonzero contribution of this term lands.
   - `src_idx[k]` — for diagonal terms, the row $i$ in the inner
     vector $u$ to read from; for row-scale terms, the row of $M$'s
     $k$-th nonzero; for col-scale terms, the column.
   - `mat_data` (row/col-scale terms only) — a direct reference to
     `M.data` (which is why the matrix must be immutable).

   These arrays are stored inside `setting`, a per-module dict that
   also holds things like the CSC `row`, `col`, and initial `data`
   for the full Jacobian.

4. **Generate a dedicated Numba kernel per block.** For each mutable
   matrix block the code printer emits a new top-level function named
   `_mut_block_N` (where `N` is the block index). Its signature takes
   the diag inner vectors, the base variables needed by any
   row/col-scale term, and each term's three mapping arrays. All
   per-block helper arrays use the reserved `_sz_mb_{N}_` prefix so
   different blocks' mappings never collide. The body is a pure-
   Python / Numba scatter-add loop:

```python
# Module-level (loaded once from setting at import time):
_sz_mb_0_diag_out_0 = setting["_sz_mb_0_diag_out_0"]
_sz_mb_0_diag_src_0 = setting["_sz_mb_0_diag_src_0"]
_sz_mb_0_rs_out_0   = setting["_sz_mb_0_rs_out_0"]
_sz_mb_0_rs_src_0   = setting["_sz_mb_0_rs_src_0"]
_sz_mb_0_rs_dat_0   = setting["_sz_mb_0_rs_dat_0"]   # frozen G_nr.data
...

@njit(cache=True)
def _mut_block_0(_sz_mb_0_u0, e, f,
                 _sz_mb_0_diag_out_0, _sz_mb_0_diag_src_0,
                 _sz_mb_0_rs_out_0,  _sz_mb_0_rs_src_0,  _sz_mb_0_rs_dat_0,
                 _sz_mb_0_rs_out_1,  _sz_mb_0_rs_src_1,  _sz_mb_0_rs_dat_1):
    data = zeros(107)
    # Diagonal term: diag(G_nr@e - B_nr@f + p_ref)
    for i in range(_sz_mb_0_diag_out_0.shape[0]):
        data[_sz_mb_0_diag_out_0[i]] += _sz_mb_0_u0[_sz_mb_0_diag_src_0[i]]
    # Row-scale term: diag(e) @ G_nr
    for i in range(_sz_mb_0_rs_out_0.shape[0]):
        data[_sz_mb_0_rs_out_0[i]] += e[_sz_mb_0_rs_src_0[i]] * _sz_mb_0_rs_dat_0[i]
    # Row-scale term: diag(f) @ B_nr
    for i in range(_sz_mb_0_rs_out_1.shape[0]):
        data[_sz_mb_0_rs_out_1[i]] += f[_sz_mb_0_rs_src_1[i]] * _sz_mb_0_rs_dat_1[i]
    return data
```

```{note}
Every scatter-add loop uses `+=`, not `=`, including the diagonal
terms. This matters when a block contains **multiple independent
`Diag(...)` additive terms** whose output positions coincide — for
example the Jacobian of `x*(A@x) + x*(B@x)` produces `diag(A@x)` and
`diag(B@x)`, both landing on the same `(i,i)` positions. The `+=`
accumulates their contributions correctly; an earlier implementation
used `=` and silently dropped one of the two terms.
```

5. **Wire the kernel up in the `J_` wrapper.** For each block the
   wrapper emits: (a) one line per diag term that materialises its
   inner vector using scipy.sparse (e.g. `_sz_mb_0_u0 = p_ref -
   (B_nr@f) + (G_nr@e)`); (b) a call to `_mut_block_N(...)` passing
   the inner vectors, base variables, and mapping arrays; (c) an
   assignment of the returned `data` array into the appropriate slice
   of the overall Jacobian `_data_`.

   Crucially, each block's **scipy work is capped at one SpMV per diag
   term** — nothing else in the block touches sparse matrices at
   runtime. The data for row/col-scale terms comes straight from the
   pre-baked `_sz_mb_{N}_rs_dat_{k}` arrays.

The net effect on each `J_(y, p)` call is that Jacobian assembly cost
is dominated by (i) the constant `inner_J` call for element-wise
blocks and (ii) a handful of {term}`SpMV`s for the diag inner
vectors. The actual mutable-matrix blocks themselves are assembled in
$\mathcal{O}(\text{nnz})$ vectorised Numba loops — no sparse matrix
allocation, no sparsity reconstruction, no format conversion.

This benefit applies to **every solver that calls `J_(y, p)`** —
algebraic-equation Newton solvers (`nr_method`, `sicnm`), DAE
integrators (which call `J_` at every implicit substep), FDAE
solvers, and any future solver that consumes Solverz's Jacobian
contract. The cost model is "per `J_` call", not "per Newton step".

#### Two paths: scatter-add (fast) and fancy indexing (fallback)

Solverz keeps **both** assembly paths in every generated module and
picks per Jacobian block at code-gen time. The {term}`fast path` is
the typed scatter-add described above; the {term}`fallback path`
exists for blocks whose term structure the symbolic analyser cannot
classify (see [Layer 2 step 1](#parse-the-block-into-typed-terms))
and is described in the next subsection. There is no on/off switch —
both paths co-exist, and the runtime always uses whichever the
classifier picked.

The fallback path is what an earlier iteration of the pipeline used
*everywhere*: build the sparse matrix with `sps.diags` and friends,
then read back the nonzero values at the precomputed COO positions
via {term}`fancy indexing` (`expr.tocsr()[[rows], [cols]]`). It is
correct and elegant but slow — scipy has to construct several
intermediate sparse matrices per call, sum them (with explicit-zero
elimination), and then perform a linear-scan lookup for each output
position. On a 30-bus power flow case, each `J_(y, p)` call took
≈ 280 µs in that pre-0.8 architecture. The scatter-add Numba path
introduced in 0.8.0 drops that to ≈ 50 µs while producing
bit-identical results.

If you want to know whether a particular block in your model is on
the fast path or the fallback, profile with `pytest --durations=10`
or `cProfile` and look for `MutableMatJacDataModule` in the traces
— its presence in the hot `J_()` line indicates that block is on
the fallback.

### Layer 3 — Selective `@njit`

All generated inner loops — `inner_F`, every per-equation
`inner_F{N}`, `inner_J`, every per-block `inner_J{N}`, and every
`_mut_block_{N}` — carry `@njit(cache=True)`. Only the `F_` / `J_`
wrappers remain pure Python; they host any fallback-path
{term}`SpMV` precomputes, the module-level parameter unpacks, and
the final `sps.coo_array((data, (row, col)), shape).tocsc()` that
packages the Jacobian back into a scipy object for the downstream
solver (algebraic Newton-style or DAE/ODE integrator).

The selectivity matters: Numba absolutely cannot digest a
`scipy.sparse.csc_array` argument, which is why the wrappers exist at
all. Anything that *can* run inside Numba is pushed there, at zero
syntactic cost to the user — they just write `Mat_Mul(A, x)` and
`module_printer` figures out the rest.

An important corollary for non-`Mat_Mul` models: if an equation system
contains **no** `Mat_Mul` at all, the entire precompute / mutable-
matrix machinery is disabled and `F_` / `J_` collapse back to the
original pre-0.8 architecture (direct `@njit` on `inner_F` receiving
only dense vectors and scalar params). Performance on pure
element-wise models is therefore unchanged by the Mat_Mul rewrite.

(fallback-path)=
### Fallback path — scipy.sparse + fancy indexing

When a Jacobian block contains a term shape the symbolic analyser
does not recognise — for example an unusual nested `Mat_Mul`, a
non-parameter matrix, a matrix-valued non-linear expression, or a
term with a non-`±1` coefficient — that single block is handled by
the {term}`fallback path`. The wrapper rebuilds the block with
`scipy.sparse` and reads the requested nonzero values back via
{term}`fancy indexing`:

```python
data[start:stop] = asarray(
    (sps.diags(u) + sps.diags(v)@M + ...)
    .tocsr()[[rows], [cols]]
).ravel()
```

It is $\mathcal{O}(\text{nnz} \log n)$ per call instead of the
$\mathcal{O}(\text{nnz})$ scatter-add path, but it never misses a
derivative term. **The two paths co-exist** in every generated
module: only the single affected block falls back; the rest of the
Jacobian still uses the fast scatter-add path. There is no
fallback-only mode and no opt-in switch — Solverz simply picks per
block based on what the classifier recognised at code-gen time.

### Performance

The full case30 power-flow benchmark — every phase from `Model`
construction through cold compile to {term}`hot F` / hot J — lives
in the
[Solverz Cookbook power-flow chapter](https://cookbook.solverz.org/latest/ae/pf/pf.html#performance-comparison-mat_mul-vs-for-loop).
This subsection only carries one extra micro-benchmark (DHS
hydraulic on Barry Island) that exercises a Jacobian shape the
power-flow case does not.

**Benchmark environment.** All numbers below were measured under:

- **Hardware:** 2025 MacBook Air, Apple M4, AC power, no thermal
  throttling observed during the runs.
- **OS:** macOS 26.4 (build 25E246).
- **Python:** 3.11.13.
- **Library versions:** `numpy==2.3.3`, `scipy==1.16.0`,
  `numba==0.65.0`, `sympy==1.13.3`, `Solverz==0.8.1` (the
  post-`csc_matvec` {term}`fast path`).
- **Methodology:** 10 warm-up calls (to bake the Numba caches and
  prime the CPU branch predictor) followed by 5000–20000 timed
  iterations, median of three repeats. The {term}`cold compile`
  measurement (Cookbook only) is run in a fresh Python subprocess
  with `__pycache__` and Numba `.nbi` / `.nbc` caches wiped
  beforehand. The same per-call timing helper is used in the
  Cookbook's `bench_pf_matmul_vs_polar.py`; the per-`J_` numbers
  below were captured by importing the rendered DHS module and
  calling `mdl.J(y, mdl.p)` in a tight loop.

**DHS hydraulic subproblem on `BarryIsland`** (35 unknowns, 1 loop,
1 mutable-matrix block — the loop pressure Jacobian contains two
col-scale terms $L\,\operatorname{diag}(K \odot |m|)$ and
$L\,\operatorname{diag}(K \odot m \odot \operatorname{sign}(m))$
whose scaling vectors are non-trivial expressions of the variable):

| Pipeline | `J_()` per call |
|---|---|
| Element-wise inline (one scalar `Eqn` per node + loop) | ≈ 103 µs |
| `Mat_Mul` inline (lambdify) | ≈ 45 µs |
| **`Mat_Mul` module printer, vectorized @njit** | **≈ 28 µs** |

- **Element-wise vs `Mat_Mul` inline.** Writing the same hydraulic
  system as `V @ m - m_inj = 0` plus `L @ (K m |m|) = 0` (two vector
  equations total) instead of 35 scalar equations cuts inline time by
  more than half, because {term}`lambdify` has far fewer expression
  nodes to walk on every call.
- **Inline vs module printer.** Even on a small system (only 79
  nonzeros), the module printer wins once every mutable-matrix term
  reaches the vectorised {term}`scatter-add` path. Here the loop-
  pressure Jacobian has two col-scale terms whose scaling vectors
  (`K * |m|` and `K * m * sign(m)`) are computed once per `J_(y, p)`
  call as dense numpy expressions in the wrapper, then handed to a
  Numba kernel that does two 10-iteration scatter-add loops. The
  only remaining overhead is the final `coo_array(...).tocsc()`
  packing.

Rule of thumb: **the module printer wins whenever the mutable-matrix
blocks successfully match the fast path**; it can lose to
{term}`inline mode` for models that fall through to the
{term}`fallback path`, or models small enough that the final
COO-to-CSC packing dominates. Profile with
`%timeit mdl.J(y, mdl.p)` on a representative iterate if in doubt,
and check the generated `num_func.py` for any lingering
`MutableMatJacDataModule` (= fallback) calls.

The scatter-add path is strictly $\mathcal{O}(\text{nnz})$ in the
block contents, while the old fancy-indexing path was dominated by
the cost of constructing the intermediate sparse matrices — so the
speed-up scales with nnz and with the number of mutable-matrix
blocks. For models with many rows and few blocks the advantage is
small; for models with many blocks per variable (such as power flow,
with distinct $\partial/\partial e$ and $\partial/\partial f$ blocks
for both $P$ and $Q$ balances) the advantage compounds.

(restrictions)=
## Restrictions and reserved names

`Mat_Mul` imposes a few hard restrictions that are enforced at model
build time. Each one prevents a class of silent-wrong-answer bugs that
would otherwise slip past the generated Jacobian.

### Time-varying sparse `dim=2` parameters are rejected at construction

Solverz does not support sparse `dim=2` `Param`s whose values change
at runtime. A declaration like:

```python
from scipy.sparse import csc_array
m.K = Param('K',
            csc_array([[1, 0], [0, 1]]),
            dim=2, sparse=True,
            triggerable=True,
            trigger_var=['x'],
            trigger_fun=lambda x: csc_array([[2*x[0], 0], [0, 2*x[1]]]))
# ⇒ NotImplementedError: Parameter 'K': a sparse 2-D ``Param`` cannot
#   be declared ``triggerable=True``. Time-varying sparse matrices are
#   unsupported because Solverz's code generation caches the matrix's
#   CSC fields at model-build time; a runtime trigger update would
#   silently be ignored.
```

and the equivalent `TimeSeriesParam`:

```python
m.G = TimeSeriesParam('G',
                      v_series=[...],
                      time_series=[...],
                      value=np.eye(3),
                      dim=2, sparse=True)
# ⇒ NotImplementedError: Parameter 'G': a sparse 2-D
#   ``TimeSeriesParam`` is not supported.
```

both raise **at the point of construction**, not deep inside
`FormJac`, so the error points at the exact line where the user
wrote the offending declaration.

**Why:** every Solverz code path that consumes a sparse `dim=2`
`Param` caches its CSC decomposition at model-build time:

- The legacy `MatVecMul` code path decomposes every sparse `dim=2`
  `Param` into flat arrays (`<name>_data`, `<name>_indices`,
  `<name>_indptr`, `<name>_shape0`) in `Solverz/model/basic.py:56`,
  which are stored in `p_` and passed to `inner_F` as read-only
  numpy arrays.
- The 0.8.1 `Mat_Mul` `SolCF.csc_matvec` fast path reuses those
  same CSC flats inside `inner_F` to do the matvec in `@njit` land.
- The mutable-matrix Jacobian scatter-add kernel caches the
  matrix's `.data` array directly as a `_sz_mb_{N}_rs_dat_{k}`
  setting entry loaded at module import time.

None of these caches are invalidated when `p_["K"].trigger_fun`
fires or `p_["G"].get_v_t(t)` returns a fresh matrix. The runtime
update simply gets lost, the Jacobian keeps using the initial
values, and the downstream solver (Newton-Raphson, DAE integrator,
or any other consumer of `J_`) either diverges or converges to the
wrong solution — silently.

Historically the check was narrow (only equations containing
`Mat_Mul` were rejected, and the check ran at `FormJac` time). In
the 0.8.1 hotfix the policy broadened to cover every sparse
`dim=2` time-varying `Param`, regardless of where or whether it is
used. The check now lives in `ParamBase.__init__` and
`TimeSeriesParam.__init__`, with an unconditional backstop in
`Equations._check_no_timevar_sparse_matrices` that runs on every
`FormJac` call in case a tainted `Param` was constructed via
`__new__` + attribute assignment (which bypasses the `__init__`
check).

**Workarounds** — all of these are supported:

- **Triggerable / time-series *vectors* next to a `Mat_Mul`.**
  Scalar or 1-D time-varying parameters are fine; only the `dim=2`
  + `sparse=True` combination is rejected. Write
  `m.K * m.x + Mat_Mul(m.A, m.x) - m.b` with `K` a triggerable 1-D
  vector and `A` a plain sparse 2-D matrix — the wrapper rebuilds
  `K` on every solver step via the `trigger_var` mechanism and the
  `Mat_Mul(A, x)` fast path uses the unchanged `A`.
- **Element-wise rewrite.** If you genuinely need the matrix
  itself to evolve, write the residual out one scalar `Eqn` per
  row, with the per-row coefficients as 1-D `Param`s or
  `TimeSeriesParam`s. The element-wise form has no CSC caching —
  every call re-evaluates the full trigonometric / algebraic
  expression via pure numpy / numba.
- **Dense `dim=2` parameters (`sparse=False`).** The
  `MutableMatJacDataModule` fallback path used for dense matrices
  re-evaluates the entire block expression via scipy/numpy on
  every `J_()` call, so `get_v_t(t)` updates DO propagate. Dense
  parameters do not get the `csc_matvec` fast path, but they're a
  correct (if slower) way to carry a time-varying matrix through
  the solver. A `UserWarning` still fires once per such parameter
  to flag the performance cost.

### Reserved symbol prefixes

User symbols (`Var`, `Param`, `iVar`, `Para`, ...) are rejected at
construction time if their name starts with any of:

| Prefix | Used for |
|---|---|
| `_sz_mm_` | Mat_Mul precompute placeholders in the `F_`/`J_` wrapper (e.g. `_sz_mm_0 = G @ e`). |
| `_sz_mb_` | Mutable-matrix Jacobian block helpers (diag inner vectors, row/col-scale scaling vectors, mapping arrays). |

```python
m.x = Var('_sz_mm_0', [1.0])
# ⇒ ValueError: Symbol name '_sz_mm_0' starts with Solverz-reserved
#   internal prefix '_sz_mm_'. This prefix is used by the code
#   generator for Mat_Mul / mutable-matrix Jacobian helper names and
#   cannot be used for user variables or parameters.
```

**Why:** the code printer emits an unbounded family of helper
identifiers of the form `<prefix><int>` and needs to know that none of
them will collide with a user symbol. The prefix ban is the simplest
way to guarantee that — users pick any name that does *not* start with
`_sz_mm_` / `_sz_mb_`, and the code printer is free to use any name
that *does*.

### Dense `dim=2` parameters emit a performance warning

A parameter declared with `dim=2, sparse=False` that appears inside a
`Mat_Mul` is legal — the fallback path handles it correctly — but it
forfeits every optimisation described above:

- no sparse pattern precomputation (the "output sparsity pattern" step
  produces a fully dense block),
- no `_sz_mb_{N}_rs_dat_k` scatter-add (the block falls through to the
  `MutableMatJacDataModule` fancy-indexing path),
- each Jacobian call allocates and populates a dense intermediate.

Solverz emits a one-shot `UserWarning` at `FormJac` time, **once per
offending parameter**, to flag the performance cost:

```python
m.A = Param('A', np.array([[2.0, 1.0], [1.0, 3.0]]), dim=2, sparse=False)
m.eqn = Eqn('f', m.x * Mat_Mul(m.A, m.x) - m.b)
smdl, y0 = m.create_instance()
smdl.FormJac(y0)
# UserWarning: Parameter 'A' is a dense 2-D ``Param(..., dim=2,
#   sparse=False)`` used inside ``Mat_Mul``. Dense matrices bypass
#   the vectorised mutable-matrix Jacobian fast path and fall back
#   to a slower scipy/ndarray indexing path. Declare 'A' with
#   ``sparse=True`` for substantially better performance.
```

**How to fix:** wrap the value in a scipy.sparse container and declare
`sparse=True`:

```python
from scipy.sparse import csc_array
m.A = Param('A', csc_array(np.array([[2.0, 1.0], [1.0, 3.0]])),
            dim=2, sparse=True)
```

On a dense 2×2 matrix the difference is invisible, but for anything
larger than a few dozen rows the sparse path is substantially faster
and (more importantly) keeps the Jacobian assembly off the fallback
path.

## API Reference

```{eval-rst}
.. autofunction:: Solverz.sym_algebra.matrix_calculus.MixedEquationDiff
.. autoclass:: Solverz.sym_algebra.matrix_calculus.TensorExpr
   :members: diff, visualize
```