There’s no way that I know of to see the per prompt usage for commercially available models. They obviously hide that. I admit I don’t research them much but I am assuming each chip is processing prompts one at a time.
Its pretty simple arithmetic - if it’s running exclusively on a single GPU system, and a prompt takes X seconds to generate on said gpu, then you take the GPUs power over X seconds plus whatever fraction of the datacenter overhead power that gpu uses. For locally run models on your own hardware this is also trivial to calculate.
Alternatively, GPU’s run at a certain number of “tokens” per second and each prompt is a certain number of tokens being fed into the model, generally scaling with the length of prompt.
openai actually released some figures on power use per prompt, but the caveat is that a single prompt to their services can trigger multiple responses (the “thinking” mode) so they’re not consistent.
There’s no way that I know of to see the per prompt usage for commercially available models. They obviously hide that. I admit I don’t research them much but I am assuming each chip is processing prompts one at a time.
Its pretty simple arithmetic - if it’s running exclusively on a single GPU system, and a prompt takes X seconds to generate on said gpu, then you take the GPUs power over X seconds plus whatever fraction of the datacenter overhead power that gpu uses. For locally run models on your own hardware this is also trivial to calculate.
Alternatively, GPU’s run at a certain number of “tokens” per second and each prompt is a certain number of tokens being fed into the model, generally scaling with the length of prompt.
openai actually released some figures on power use per prompt, but the caveat is that a single prompt to their services can trigger multiple responses (the “thinking” mode) so they’re not consistent.