• ☆ Yσɠƚԋσʂ ☆@lemmy.mlOP
    20 hours ago

    That’s part of the idea with the whole mixture of experts (MoE) approach in newer models actually.

    Rather than using a single neural net that’s, say, 512 wide, you split it into eight channels/experts of 64 each. If the network can pick the correct expert for each input, then you only have to run 1/8th of the neurons on every forward pass. Of course, once you have your 8 experts in parallel, you need to decide which one to use for each token you want to process. That’s the job of the router, which takes in an input and decides which expert to send it to. The router itself is a tiny neural network: essentially a matrix that converts the input vector into a score for each expert, and that score becomes the routing choice. Its small set of trainable weights gets trained together with the rest of the MoE.
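
To make the routing step concrete, here is a rough sketch in plain Python of the forward pass described above. It is not how any particular model implements it: the sizes (`D_IN`), the random weights, the ReLU, and the helper names (`rand_matrix`, `matvec`, `moe_forward`) are all made up for illustration, and it assumes the simplest case where each token goes to exactly one expert. Training is not shown.

```python
import random

random.seed(0)

# All sizes below are illustrative assumptions, not taken from a real model.
D_IN = 16        # width of the token vector
N_EXPERTS = 8    # eight experts instead of one wide layer
D_EXPERT = 64    # each expert is 64 wide, so 8 * 64 = 512 in total


def rand_matrix(rows, cols):
    """Random weights standing in for trained parameters."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, x) for x in vector]


# The router is a single matrix: one row of weights per expert.
router_w = rand_matrix(N_EXPERTS, D_IN)

# Each expert is a small two-layer net: D_IN -> D_EXPERT -> D_IN.
experts = [
    (rand_matrix(D_EXPERT, D_IN), rand_matrix(D_IN, D_EXPERT))
    for _ in range(N_EXPERTS)
]


def moe_forward(x):
    # 1. The router scores every expert for this token.
    scores = matvec(router_w, x)
    # 2. Pick the highest-scoring expert (one expert per token in this sketch).
    chosen = max(range(N_EXPERTS), key=lambda i: scores[i])
    # 3. Run only that expert; the other seven are never evaluated.
    w_in, w_out = experts[chosen]
    hidden = relu(matvec(w_in, x))
    return chosen, matvec(w_out, hidden)


token = [random.gauss(0.0, 1.0) for _ in range(D_IN)]
chosen, out = moe_forward(token)
print(f"token routed to expert {chosen}; "
      f"ran {D_EXPERT} of {N_EXPERTS * D_EXPERT} hidden units")
```

The point of the sketch is step 3: the router's matrix multiply is cheap (8 rows here), and after it only one expert's 64 hidden units are computed instead of all 512.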