> All memory would be prepared in advance and the massively parallel accelerator would "stream walk" through it without having to load/store arbitrary memory locations.
Dedicated accelerators do that already, e.g. Google's TPUs, Tesla's D1, or Apple's Neural Engine. You have to load the data into compute-unit-local memory before executing matmuls; keeping the weights there and only piping the dynamic data through saves memory bandwidth.
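A minimal NumPy sketch of that weight-stationary dataflow, with made-up sizes and variable names (nothing here is taken from a real TPU/D1/Neural Engine API): the weights are "loaded" once into a stand-in for local memory, and only activation tiles stream past them.

```python
import numpy as np

# Hypothetical sizes, purely for illustration.
D_IN, D_OUT, BATCH = 512, 512, 4096
TILE = 128  # how many rows of activations stream through per step

rng = np.random.default_rng(0)
W = rng.standard_normal((D_IN, D_OUT)).astype(np.float32)   # static weights
X = rng.standard_normal((BATCH, D_IN)).astype(np.float32)   # dynamic data

# "Load" the weights into the accelerator's local memory once.
local_weights = W.copy()

# Stream the dynamic data past the stationary weights, tile by tile;
# only the activation tile crosses the memory bus on each step.
Y = np.empty((BATCH, D_OUT), dtype=np.float32)
for start in range(0, BATCH, TILE):
    x_tile = X[start:start + TILE]
    Y[start:start + TILE] = x_tile @ local_weights

assert np.allclose(Y, X @ W, atol=1e-2)
```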
This sort of architecture only really works when the network is small enough to fit in that local memory, which has been a perennial problem for neural network accelerators: networks grow but accelerators don't. LLMs and the like will often prefer the opposite form of "stream walking" (streaming the weights through the data) or a hybrid.
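And a matching sketch of that opposite dataflow, for when the weights are too big to stay resident: the (small) activation block stays put and weight tiles are streamed past it. Sizes and names are again illustrative assumptions only.

```python
import numpy as np

# Hypothetical sizes: here the weight matrix is the big operand,
# so it is streamed through in column tiles instead of kept resident.
D_IN, D_OUT, BATCH = 512, 8192, 64
W_TILE = 1024  # columns of W streamed per step

rng = np.random.default_rng(0)
W = rng.standard_normal((D_IN, D_OUT)).astype(np.float32)
X = rng.standard_normal((BATCH, D_IN)).astype(np.float32)

# The small activation block stays resident; weight tiles flow past it.
resident_x = X.copy()
Y = np.empty((BATCH, D_OUT), dtype=np.float32)
for start in range(0, D_OUT, W_TILE):
    w_tile = W[:, start:start + W_TILE]   # this tile crosses the bus exactly once
    Y[:, start:start + W_TILE] = resident_x @ w_tile

assert np.allclose(Y, X @ W, atol=1e-2)
```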
Streaming weights instead of data sounds really interesting - I had never considered it.
Something else that might be theoretically possible:
Large arrays of FPGAs are apparently used to simulate and verify chips [1]; could the same be done to run LLMs? Could we get 0.25 to 1 token per cycle, would the engineering effort be worth it, and would it be financially feasible from a TCO standpoint?
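A rough back-of-envelope on the "token per cycle" part, using assumed numbers (a 70B-parameter fp16 model and a ~100 MHz fabric clock, neither taken from the thread): if every weight has to be read once per token, one token per cycle implies an extreme off-chip bandwidth.

```python
# Back-of-envelope only: assumes dense decoding where every weight is read
# once per generated token, with no batching or sparsity tricks.
params = 70e9          # assumed 70B-parameter model
bytes_per_param = 2    # fp16 weights
clock_hz = 100e6       # assumed ~100 MHz FPGA fabric clock

bytes_per_token = params * bytes_per_param       # ~140 GB of weight traffic
bandwidth_for_1_tok_per_cycle = bytes_per_token * clock_hz

print(f"weight traffic per token: {bytes_per_token / 1e9:.0f} GB")
print(f"bandwidth for 1 token/cycle: {bandwidth_for_1_tok_per_cycle / 1e18:.0f} EB/s")
```

So a token per cycle only looks remotely plausible if essentially the whole model sits in the array's aggregate on-chip memory, which runs straight into the "networks grow but accelerators don't" issue mentioned above.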