> All memory would be prepared in advance and the massively parallel accelerator would "stream walk" through it without having to load/store arbitrary memory locations.
Dedicated accelerators do that already, e.g. Google's TPUs, Tesla's D1, or Apple's Neural Engine. You have to load the data into compute-unit-local memory before executing matmuls; keeping the weights there and only piping the dynamic data through saves memory bandwidth.
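A minimal NumPy sketch of that weight-stationary dataflow, with made-up sizes and variable names (nothing here is taken from a real TPU/D1/Neural Engine API): the weights are "loaded" once into a stand-in for local memory, and only activation tiles stream past them.

```python
import numpy as np

# Hypothetical sizes, purely for illustration.
D_IN, D_OUT, BATCH = 512, 512, 4096
TILE = 128  # how many rows of activations stream through per step

rng = np.random.default_rng(0)
W = rng.standard_normal((D_IN, D_OUT)).astype(np.float32)   # static weights
X = rng.standard_normal((BATCH, D_IN)).astype(np.float32)   # dynamic data

# "Load" the weights into the accelerator's local memory once.
local_weights = W.copy()

# Stream the dynamic data past the stationary weights, tile by tile;
# only the activation tile crosses the memory bus on each step.
Y = np.empty((BATCH, D_OUT), dtype=np.float32)
for start in range(0, BATCH, TILE):
    x_tile = X[start:start + TILE]
    Y[start:start + TILE] = x_tile @ local_weights

assert np.allclose(Y, X @ W, atol=1e-2)
```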
This sort of architecture only really works when the network is small enough to fit in that local memory, which has been a perennial problem for neural network accelerators: networks grow but accelerators don't. LLMs and the like will often prefer the opposite form of "stream walking" (streaming the weights through the data) or a hybrid.
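And a matching sketch of that opposite dataflow, for when the weights are too big to stay resident: the (small) activation block stays put and weight tiles are streamed past it. Sizes and names are again illustrative assumptions only.

```python
import numpy as np

# Hypothetical sizes: here the weight matrix is the big operand,
# so it is streamed through in column tiles instead of kept resident.
D_IN, D_OUT, BATCH = 512, 8192, 64
W_TILE = 1024  # columns of W streamed per step

rng = np.random.default_rng(0)
W = rng.standard_normal((D_IN, D_OUT)).astype(np.float32)
X = rng.standard_normal((BATCH, D_IN)).astype(np.float32)

# The small activation block stays resident; weight tiles flow past it.
resident_x = X.copy()
Y = np.empty((BATCH, D_OUT), dtype=np.float32)
for start in range(0, D_OUT, W_TILE):
    w_tile = W[:, start:start + W_TILE]   # this tile crosses the bus exactly once
    Y[:, start:start + W_TILE] = resident_x @ w_tile

assert np.allclose(Y, X @ W, atol=1e-2)
```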
Streaming weights instead of data sounds really interesting - I had never considered it.
Something else that might be theoretically possible:
Large arrays of FPGAs are apparently used to simulate and verify chips [1]; could the same be done to run LLMs? Could we get 0.25 to 1 token per cycle, would the engineering effort be worth it, and would it be financially feasible from a TCO standpoint?
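A rough back-of-envelope on the "token per cycle" part, using assumed numbers (a 70B-parameter fp16 model and a ~100 MHz fabric clock, neither taken from the thread): if every weight has to be read once per token, one token per cycle implies an extreme off-chip bandwidth.

```python
# Back-of-envelope only: assumes dense decoding where every weight is read
# once per generated token, with no batching or sparsity tricks.
params = 70e9          # assumed 70B-parameter model
bytes_per_param = 2    # fp16 weights
clock_hz = 100e6       # assumed ~100 MHz FPGA fabric clock

bytes_per_token = params * bytes_per_param       # ~140 GB of weight traffic
bandwidth_for_1_tok_per_cycle = bytes_per_token * clock_hz

print(f"weight traffic per token: {bytes_per_token / 1e9:.0f} GB")
print(f"bandwidth for 1 token/cycle: {bandwidth_for_1_tok_per_cycle / 1e18:.0f} EB/s")
```

So a token per cycle only looks remotely plausible if essentially the whole model sits in the array's aggregate on-chip memory, which runs straight into the "networks grow but accelerators don't" issue mentioned above.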