I've noticed you have a new way to store model weights in sparse form (which is a great addition), and am wondering if there's some fundamental blocker to using sparse operations at inference time too?
The sparsity is leveraged during execution on certain compute units such as the Neural Engine.
This does not happen automatically for every model that merely contains sparse matrices, though. It is best to use the sparse weight encoding explicitly.
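To illustrate what an explicit sparse encoding buys you, here is a minimal, library-free Python sketch (this is not the Core ML API, just the general idea): a pruned weight matrix is stored as (row, col, value) triples for the non-zero entries only, and can be reconstructed losslessly.

```python
def to_sparse(dense, threshold=0.0):
    """Encode a 2-D weight matrix as (row, col, value) triples,
    keeping only entries whose magnitude exceeds the threshold."""
    return [
        (r, c, v)
        for r, row in enumerate(dense)
        for c, v in enumerate(row)
        if abs(v) > threshold
    ]

def to_dense(sparse, shape):
    """Reconstruct the dense matrix from the sparse encoding."""
    rows, cols = shape
    dense = [[0.0] * cols for _ in range(rows)]
    for r, c, v in sparse:
        dense[r][c] = v
    return dense

# A 4x4 weight matrix pruned to 75% sparsity:
weights = [
    [0.0, 0.0, 0.8, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, -0.3],
]

encoded = to_sparse(weights)
print(len(encoded))                          # 3 entries stored instead of 16
print(to_dense(encoded, (4, 4)) == weights)  # True: lossless round trip
```

A compute unit that understands this encoding can skip the zero entries entirely; one that does not must first expand back to the dense form, which is why the sparsity only pays off during execution on hardware that supports it.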