TL;DR:
My iPhone 16 Pro Max produces garbage output when running MLX LLMs. An iPhone 15 Pro and a MacBook Pro both run the same code perfectly. The tensor outputs on the 16 are off by an order of magnitude, which I suspect points to a device-specific issue rather than a bug in the code.
Well, was it quantized in a way that the iPhone 16 supports?
Often it’s the quantization where things break down: the hardware has to support the quantization format (you can’t run FP16 on int8-only hardware), and sometimes the act of quantizing itself introduces problems too.
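One way to separate "the quantization itself is lossy" from "this device's quantized kernels are broken" is to round-trip a random weight matrix through MLX's quantize/dequantize and compare the quantized matmul kernel against a plain matmul on the dequantized weights. A minimal sketch using mlx.core's Python API (group_size=64, bits=4 are assumptions here; match whatever the model actually ships with):

```python
import mlx.core as mx

# Round-trip a weight matrix through MLX's quantization, then compare
# the quantized-kernel matmul against a plain matmul on the dequantized
# weights. Small error = the quantization itself is fine; a large
# mismatch on one device only would point at its quantized kernels.
# group_size/bits are assumptions; use the model's actual config.
w = mx.random.normal((4096, 4096))
x = mx.random.normal((1, 4096))

w_q, scales, biases = mx.quantize(w, group_size=64, bits=4)
w_back = mx.dequantize(w_q, scales, biases, group_size=64, bits=4)

ref = x @ w_back.T  # plain kernel on dequantized weights
out = mx.quantized_matmul(x, w_q, scales, biases,
                          transpose=True, group_size=64, bits=4)

print("round-trip error:", mx.abs(w - w_back).max().item())
print("kernel mismatch: ", mx.abs(out - ref).max().item())
```

If the round-trip error looks reasonable everywhere but the kernel mismatch blows up on one device only, that's a kernel/hardware problem rather than a quantization problem.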
And yeah, LLMs are likely going to be very hit or miss anyway.
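And to rule out the LLM stack entirely, I'd drop down to a raw FP16 op and compare it against an FP32 reference; outputs "an order of magnitude wrong" should show up even at this level. A rough sketch (Python mlx.core, so this covers the Mac side; on the phones the equivalent check would go through MLX Swift):

```python
import mlx.core as mx

# Compare an FP16 matmul against an FP32 reference on the same inputs.
# Healthy FP16 should land around 1e-3 normalized error; outputs that
# are an order of magnitude off would push this toward 1.0 or beyond.
a = mx.random.normal((512, 512)).astype(mx.float16)
b = mx.random.normal((512, 512)).astype(mx.float16)

out = mx.matmul(a, b).astype(mx.float32)
ref = mx.matmul(a.astype(mx.float32), b.astype(mx.float32))

rel_err = (mx.abs(out - ref).max() / mx.abs(ref).max()).item()
print(f"max normalized error: {rel_err:.5f}")
```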