shivampkumar · 2026-04-24T00:47:13 1776991633

Agreed... I've already been adding some from other contributors, like mtlgemm and mtldiffrast.

shivampkumar · 2026-04-24T00:45:39 1776991539

Not currently - TRELLIS.2 is single-image input only AFAIK

drbscl · 2026-04-28T16:47:05 1777394825

Ah, it’s still very useful though, thanks for the port!

shivampkumar · 2026-04-24T00:42:43 1776991363

The gather-scatter sparse conv should be fairly generic. Any model using 3x3x3 or 5x5x5 sparse convolutions on voxel grids could use it directly.

The main thing that's TRELLIS-specific is the neighbor cache key format, but that's a few lines to adapt.
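For anyone wondering what "gather-scatter" means here, a rough NumPy sketch (not the code from the port, just the general idea). For each kernel offset you gather the features of the neighbors that exist, multiply by that offset's weight slice, and scatter-add into the output rows. `build_neighbor_map` and `sparse_conv3d` are made-up names for illustration, and the map is keyed by plain offset tuples, not TRELLIS's cache key format. It assumes outputs are only computed at active voxels.

```python
import itertools
import numpy as np

def build_neighbor_map(coords, kernel_size):
    """Hypothetical helper: for each kernel offset, list (src, dst) voxel index pairs.

    coords: (N, 3) integer array of active voxel coordinates.
    The result depends only on coords and kernel_size, so it can be cached.
    """
    r = kernel_size // 2
    coord_list = [tuple(c) for c in coords.tolist()]
    index = {c: i for i, c in enumerate(coord_list)}
    pairs = {}
    for off in itertools.product(range(-r, r + 1), repeat=3):
        src, dst = [], []
        for i, (z, y, x) in enumerate(coord_list):
            j = index.get((z + off[0], y + off[1], x + off[2]))
            if j is not None:  # neighbor voxel is active
                src.append(j)
                dst.append(i)
        if src:
            pairs[off] = (np.array(src), np.array(dst))
    return pairs

def sparse_conv3d(feats, neighbor_map, weight):
    """feats: (N, C_in); weight: (K, K, K, C_in, C_out). Returns (N, C_out)."""
    r = weight.shape[0] // 2
    out = np.zeros((feats.shape[0], weight.shape[-1]), dtype=feats.dtype)
    for (dz, dy, dx), (src, dst) in neighbor_map.items():
        gathered = feats[src]                                # gather neighbor features
        contrib = gathered @ weight[dz + r, dy + r, dx + r]  # per-offset matmul
        np.add.at(out, dst, contrib)                         # scatter-add into outputs
    return out

coords = np.array([[0, 0, 0], [0, 0, 1], [5, 5, 5]])
feats = np.array([[1.0], [2.0], [4.0]])
nmap = build_neighbor_map(coords, 3)
out = sparse_conv3d(feats, nmap, np.ones((3, 3, 3, 1, 1)))
```

With all-ones weights each output is just the sum over active neighbors, so the two adjacent voxels see each other and the isolated one only sees itself. Nothing in it is model-specific apart from how you key the cached map.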

The SDPA attention swap is even more reusable - it's just padding variable-length sequences into batches and calling torch.nn.functional.scaled_dot_product_attention.
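A minimal sketch of that padding approach, assuming self-attention (q, k, v have the same length within each sequence) and per-sequence tensors shaped (length, heads, head_dim). `padded_sdpa` is a made-up name, not a function from the port. The boolean mask keeps real tokens from attending to padded keys.

```python
import torch
import torch.nn.functional as F

def padded_sdpa(qs, ks, vs):
    """qs, ks, vs: lists of (L_i, H, D) tensors, one entry per sequence."""
    lengths = [q.shape[0] for q in qs]
    B, L_max = len(qs), max(lengths)
    H, D = qs[0].shape[1:]

    def pad(seqs):
        # Zero-pad every sequence to L_max and move heads in front: (B, H, L_max, D)
        out = seqs[0].new_zeros(B, H, L_max, D)
        for b, s in enumerate(seqs):
            out[b, :, : s.shape[0]] = s.transpose(0, 1)
        return out

    q, k, v = pad(qs), pad(ks), pad(vs)
    # Boolean mask, True = this key may be attended to. Shape (B, 1, 1, L_max)
    # broadcasts over heads and query positions.
    lens = torch.tensor(lengths, device=q.device)
    valid = torch.arange(L_max, device=q.device)[None, :] < lens[:, None]
    out = F.scaled_dot_product_attention(q, k, v, attn_mask=valid[:, None, None, :])
    # Drop the padded query rows and restore the (L_i, H, D) layout.
    return [out[b, :, :n].transpose(0, 1) for b, n in enumerate(lengths)]
```

Padded query rows still get computed and are simply thrown away at the end, so this wastes some compute when lengths vary a lot.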

shivampkumar · 2026-04-20T08:33:02 1776673982

that makes so much sense...I am exploring if I can find someone who has done this well...If not I'll try to do it myself.

shivampkumar · 2026-04-20T04:45:46 1776660346

The model needed about 15GB at peak during generation - the 4B model loads multiple sub-models (1.3B each for shape and texture flow). 8GB won't be enough, but 24GB and 32GB should both be fine.

post-it · 2026-04-20T05:07:54 1776661674

Thanks! Could it conceivably load the sub-models in series rather than parallel? 8 still won't be enough but I wonder if those with 16 could eke something out.

shivampkumar · 2026-04-24T00:45:10 1776991510

In theory yes - the pipeline already does this to some extent with its low_vram mode, offloading models to CPU between stages. The challenge at 16GB is that even a single 1.3B sub-model at fp32 plus activations can push past what's available after macOS takes its share. Someone on an M1 iMac with 16GB did get geometry generation working tho (issue #5 on the repo), so 16GB is probably possible. 24GB gives comfortable headroom though.
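Back-of-envelope numbers for the weights alone, counting 4 bytes per fp32 parameter and using decimal GB. Activations and whatever macOS reserves come on top and aren't estimated here. `weights_gb` is just a helper for this calculation.

```python
def weights_gb(n_params, bytes_per_param=4):
    """Memory for the weights only, in decimal GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

one_sub_model = weights_gb(1.3e9)      # a single 1.3B sub-model at fp32
two_sub_models = weights_gb(2 * 1.3e9)  # shape + texture flow resident together

print(f"one sub-model:  {one_sub_model:.1f} GB")
print(f"two sub-models: {two_sub_models:.1f} GB")
```

So one sub-model is about 5.2 GB before activations, and keeping both flow models resident is about 10.4 GB, which is why loading them one at a time matters on a 16GB machine.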

shivampkumar · 2026-04-20T03:57:40 1776657460

added! will add more, maybe even a GIF

shivampkumar · 2026-04-20T03:37:06 1776656226

i was able to get it in 3.5 mins from a single image on my 24gb m4 pro macbook

I'm still working on this to try to replicate nvdiffrast better. Found an open source port, might look at it tonight

shivampkumar · 2026-04-20T03:32:15 1776655935

thanks!

shivampkumar · 2026-04-20T03:27:53 1776655673

I mean I can see that it's niche. Did not expect so many upvotes, but ig it's less niche than I thought

If you're not working with 3D on Apple Silicon this isn't relevant to you. For the subset of people who are, running this 4B parameter 3D generation model locally on a Mac was previously blocked by hard CUDA dependencies with no workaround.

svnt · 2026-04-20T04:27:03 1776659223

Right but it is at most a couple of hours with claude code and posted on Sunday night.

atultw · 2026-04-20T11:03:29 1776683009

Exactly, I know because I did the same thing!

shivampkumar · 2026-04-20T03:26:17 1776655577

I thought it was cool, and then I found the open issue mentioned above, which convinced me it's def something more people want.

It IS significantly slower, about 3.5 minutes on my MacBook vs seconds on an H100. That's partly the pure-PyTorch backend overhead and partly just the hardware difference.

For my use case the tradeoff works -- iterate locally without paying for cloud GPUs or waiting in queues.