The final layer of the transformer, just before the logits, already does pretty much what you want. The keys and values from the final layers are taken into account when generating the final hidden state for your new token. The early layers are mostly just contextualizing the token within the sentence, so forcing a hidden state from a higher layer back through the lower layers isn't going to help you much. Replacing the last 5 or so layers with an RNN could certainly be interesting, though.
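
For concreteness, here is a rough sketch of what that hybrid could look like: a toy decoder-only stack where the top few blocks are swapped for a GRU that runs over the sequence. All of the sizes, layer counts, and the `TransformerBlock` / `HybridDecoder` names are made up for illustration, not taken from any real model.

```python
# Sketch of "replace the last few transformer layers with an RNN".
# Everything here (block count, hidden size, the toy TransformerBlock) is a
# hypothetical placeholder, not any particular model's actual architecture.
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    """Toy pre-norm self-attention block standing in for a real decoder layer."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x, attn_mask=None):
        h = self.norm1(x)
        a, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + a
        x = x + self.mlp(self.norm2(x))
        return x


class HybridDecoder(nn.Module):
    """The first (n_layers - n_rnn) blocks contextualize tokens with attention;
    the last n_rnn blocks are swapped for a GRU over the sequence."""

    def __init__(self, vocab_size=32000, d_model=512, n_heads=8, n_layers=12, n_rnn=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.blocks = nn.ModuleList(
            [TransformerBlock(d_model, n_heads) for _ in range(n_layers - n_rnn)]
        )
        # The GRU replaces the top n_rnn transformer layers; being recurrent, it
        # carries a fixed-size state forward instead of attending over a KV cache.
        self.rnn = nn.GRU(d_model, d_model, num_layers=n_rnn, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, tokens, rnn_state=None):
        x = self.embed(tokens)
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        for block in self.blocks:
            x = block(x, attn_mask=causal)
        x, rnn_state = self.rnn(x, rnn_state)  # recurrent "final layers"
        return self.lm_head(self.norm(x)), rnn_state


if __name__ == "__main__":
    model = HybridDecoder()
    logits, state = model(torch.randint(0, 32000, (1, 16)))
    print(logits.shape)  # torch.Size([1, 16, 32000])
```

The upside in this arrangement is that the lower attention layers still do the contextualizing while the recurrent top only needs a fixed-size state per sequence instead of a KV cache for those layers; whether that trades off well against quality is exactly the open question.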