I assume that "pretty fast" depends on the phone. My old Pixel 4a ran Gemma-3n-E2B-it-int4 without problems. Still, it took over 10 minutes to finish answering "What can you see?" when given an image from my recent photos.
Final stats:
1st token: 15.9s
Prefill speed: 16.4 tokens/s
Decode speed: 0.33 tokens/s
Latency: 662s
As another data point: on E4B, my Pixel 6 Pro (Tensor v1, Oct 2021) gets about 4.4 t/s decode on a picture of a glass of milk, and over 6 t/s on text chat. It's amazing; I never dreamed I'd be viably running an 8-billion-parameter model on it when I got it 4 years ago. And kudos to the Pixel team for including 12 GB of RAM when, even today, PC makers think they can get away with selling 8 GB.
("What can you see?"; photo of small monitor displaying stats in my home office)
1st token: 7.48s
Prefill speed: 35.02 tokens/s
Decode speed: 5.72 tokens/s
Latency: 86.88s
It did a pretty good job. The photo had lots of glare, was taken at a bad angle and from a distance, and the text was small. It picked out the weather, outdoor temperature, CO2 (ppm), temperature (°C), and PM2.5 (µg/m³) readings in the office. It misread "Homelab" as "Household" but got the UPS load and power correct, misread "Homelab" again (in smaller text this time) as "Hereford" but got the power in W, and misread "Wed May 21" on the weather map as "World May 21".
Overall very good considering how poor the input image was.
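For anyone wanting to sanity-check numbers like these: the reported latency should roughly decompose as time-to-first-token plus decoded tokens divided by decode speed, and prompt size as time-to-first-token times prefill speed, so you can back out approximate token counts. A minimal Kotlin sketch (the RunStats type and its field names are mine, not from the app; the figures are the ones posted in this thread):

    // Rough relationships between the stats the app reports:
    //   promptTokens  ≈ timeToFirstToken * prefillSpeed
    //   decodedTokens ≈ (latency - timeToFirstToken) * decodeSpeed
    data class RunStats(
        val label: String,
        val ttftSec: Double,     // time to first token, seconds
        val prefillTps: Double,  // prefill speed, tokens/s
        val decodeTps: Double,   // decode speed, tokens/s
        val latencySec: Double   // total time to finish the answer, seconds
    )

    fun main() {
        val runs = listOf(
            RunStats("Pixel 6 Pro, E4B", 7.48, 35.02, 5.72, 86.88),
            RunStats("Pixel 4a, E2B", 15.9, 16.4, 0.33, 662.0)
        )
        for (r in runs) {
            // Prompt tokens processed during prefill, before the first output token.
            val promptTokens = r.ttftSec * r.prefillTps
            // Output tokens generated after the first token appeared.
            val decodedTokens = (r.latencySec - r.ttftSec) * r.decodeTps
            println("%s: ~%.0f prompt tokens, ~%.0f decoded tokens"
                .format(r.label, promptTokens, decodedTokens))
        }
    }

Interestingly, both runs work out to roughly 260 prompt tokens, which would be consistent with the vision encoder contributing a fixed per-image token budget plus a short text prompt.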
In my case it was pretty fast, I would say: using an S24 FE with Gemma 3n E2B int4, it took around 20 seconds to answer "Describe this image", and the result was pretty amazing.