> Claude writes text one word at a time. Is it only focusing on predicting the next word or does it ever plan ahead?
When an LLM outputs a word, it commits to that word without knowing what the next word is going to be. "Commits" meaning that once it settles on that token, it will not backtrack.
That is kind of weird. Why would you do that, and how would you be sure?
People can sort of do that too. Sometimes?
Say you're asked to describe a 2D scene in which a blue triangle partially occludes a red circle.
Without thinking about the relationship of the objects at all, you know that your first word is going to be "The", so you can output that token into your answer. Then the sentence will need a subject, which is going to be "blue", "triangle". You can commit to the tokens "The blue triangle" just from knowing that you are talking about a 2D scene with a blue triangle in it, without considering how it relates to anything else, like the red circle. You can perhaps commit to the next token "is", if you have a way to express any possible relationship using the verb "to be", such as "the blue triangle is partially covering the red circle".
I don't think this analogy necessarily fits what LLMs are doing.
This was obvious to me very early on with GPT-3.5-Turbo.
I created structured outputs with very clear rules and a process that, if followed, would funnel behavior the way I wanted. And lo and behold, the model would anticipate preconditions that would allow it to hallucinate a certain final output, and it would push those back earlier in the output. The model had effectively found wiggle room in the rules and injected the intermediate value into a field that would then be used later in the process to build the final output.
The instant I saw it doing that, I knew 100% this model "plans"/anticipates way earlier than I thought originally.
> ‘One token at a time’ is how a model generates its output, not how it comes up with that output.
I do not believe you are correct.
Now, yes, when we write printf("Hello, world\n"), of course the characters 'H', 'e', ... are output one at a time into the stream. But the program has the string all at once. It was prepared before the program was even run.
This is not what LLMs are doing with tokens; they have not prepared a batch of tokens which they are shifting out left-to-right from a dumb buffer. They output a token when they have calculated it, and are sure that the token will not have to be backtracked over. In doing so they might have calculated additional tokens, and backtracked over those, sure, and undoubtedly are carrying state from such activities into the next token prediction.
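The loop being described can be sketched in a few lines. The `model` callable here is a hypothetical stand-in for a full forward pass; the point is only the shape of the loop: one pass per emitted token, and the prefix is only ever extended.

```python
def generate(model, prompt_tokens, max_new=10):
    """Autoregressive decoding sketch: one forward pass per emitted token.

    `model` is a hypothetical stand-in: given the committed prefix, it
    returns one score (logit) per vocabulary entry for the next position.
    """
    tokens = list(prompt_tokens)
    for _ in range(max_new):
        logits = model(tokens)  # recompute over the whole committed prefix
        # Greedy pick of the highest-scoring token. Once appended,
        # it is never revised -- the commitment the comments above describe.
        tokens.append(max(range(len(logits)), key=logits.__getitem__))
    return tokens
```

Whatever internal lookahead happens inside `model`, the output side of the loop is strictly append-only.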
But the fact is they reach a decision where they commit to a certain output token, and have not yet committed to what the next one will be. Maybe it's narrowed down already to only a few candidates; but that doesn't change that there is a sharp horizon between committed and unknown which moves from left to right.
Responses can be large. Think about how mind-boggling it is that the machine can be sure that the first 10 words of a 10,000-word response are the right ones (having already put them out beyond possibility of backtracking), at a point where it has no idea what the last 10 will be. Maybe there are some activations which are narrowing down what the second batch of 10 words will be, but surely the last ones are distant.
> it commits to that word, without knowing what the next word is going to be
Sounds like you may not have read the article, because it's exploring exactly that relationship and how LLMs will often have a 'target word' in mind that they're working toward.
Further, that's partially the point of thinking models: allowing LLMs space to output tokens that they don't have to commit to in the final answer.
That makes no difference. At some point it decides that it has predicted the word, outputs it, and then it will not backtrack over it. Internally it may have predicted some other words and backtracked over those. But the fact is, it accepts a word without being sure what the next one will be, and the one after that, and so on.
Externally, it manifests the generation of words one by one, with lengthy computation in between.
It isn't ruminating over, say, a five word sequence and then outputting five words together at once when that is settled.
> It isn't ruminating over, say, a five word sequence and then outputting five words together at once when that is settled.
True, and it's a good intuition that some words are much more complicated to generate than others and should obviously require more computation. For example, if the user asks a yes/no question, ideally the answer should start with "Yes" or with "No", followed by some justification. To compute this first token, the model can only do a single forward pass and must decide the path to take.
But this is precisely why chain-of-thought was invented, and later on "reasoning" models. These take it "step by step" and generate a sort of stream-of-consciousness monologue where each word follows more smoothly from the previous ones, rather than abruptly pinning down a Yes or a No right away.
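One way to see why that helps: each emitted token buys the model one more forward pass before it has to commit to the answer token. A toy bit of accounting (a simplification of real inference, which caches and batches these passes):

```python
def passes_before_answer(reasoning_tokens):
    # One forward pass per emitted token; the answer token comes last.
    # With no intermediate tokens, the model must pin down "Yes"/"No"
    # after a single pass over the prompt.
    return len(reasoning_tokens) + 1
```

So a direct answer gets one pass of computation, while a chain-of-thought answer gets one pass per intermediate token on top of that.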
LLMs are an extremely well researched space where armies of researchers, engineers, grad and undergrad students, enthusiasts and everyone in between has been coming up with all manners of ideas. It is highly unlikely that you can easily point to some obvious thing they missed.
While the output is a single word (more precisely, token), the internal activations are very high dimensional and can already contain information related to words that will only appear later. This information is just not given to the output at the very last layer. You can imagine the internal feature vector as encoding the entire upcoming sentence/thought/paragraph/etc. and the last layer "projects" that down to whatever the next word (token) has to be to continue expressing this "thought".
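That last-layer "projection" can be sketched with toy numbers (all sizes and weights here are made up; real models use learned matrices thousands of dimensions wide):

```python
import math
import random

# Toy sketch: a high-dimensional hidden state h may encode information
# about upcoming words, but the final layer projects it down to a
# distribution over just the single next token.
random.seed(0)
d_model, vocab = 8, 5
h = [random.gauss(0, 1) for _ in range(d_model)]        # internal activation
W_out = [[random.gauss(0, 1) for _ in range(d_model)]   # final projection
         for _ in range(vocab)]

logits = [sum(w * x for w, x in zip(row, h)) for row in W_out]
m = max(logits)
exps = [math.exp(z - m) for z in logits]
probs = [e / sum(exps) for e in exps]       # softmax: one number per token
next_token = probs.index(max(probs))        # only this single choice is emitted
```

Everything else encoded in `h` stays internal; only the one-token distribution reaches the output.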
But the activations at some point lead to a 100% confidence that the right word has been identified for the current slot. That is output, and it proceeds to the next one.
Like for a 500-token response, at some point it was certain that the first 25 words are the right ones, such that it won't have to take any of them back when eventually calculating the last 25.
This is true, but it doesn't mean that it decided those first 25 without "considering" whether those 25 can be afterwards continued meaningfully with further 25. It does have some internal "lookahead" and generates things that "lead" somewhere. The rhyming example from the article is a great choice to illustrate this.
By the way, there was recently a HN submission about a project studying the use of diffusion models rather than an LLM for token prediction. With diffusion, tokens aren't predicted strictly left to right any more; there can be gaps that are backfilled. But it's still essentially the same, I think. Once that type of model settles on a given token at a given position, it commits to that. Just more possible permutations of the token-filling sequence have been permitted.
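That filling order can be sketched as follows. The `propose` callable is hypothetical, standing in for the denoising model; the point is that positions are settled in an arbitrary order, yet each one, once filled, is final:

```python
import random

MASK = None  # placeholder for a not-yet-committed position

def infill(length, propose, seed=0):
    """Diffusion-style infilling sketch: positions are settled in an
    arbitrary order, not left to right, but each position, once filled,
    is committed and never revisited."""
    rng = random.Random(seed)
    seq = [MASK] * length
    order = list(range(length))
    rng.shuffle(order)                # a non-left-to-right filling order
    for pos in order:
        seq[pos] = propose(seq, pos)  # commit this position for good
    return seq
```

So the commitment horizon is no longer a single left-to-right boundary, but every filled position is still a one-way commitment.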
That's a really interesting point about committing to words one by one. It highlights how fundamentally different current LLM inference is from human thought, as you pointed out with the scene description analogy. You're right that it feels odd, like building something brick by brick without seeing the final blueprint. To add to this, most text-based LLMs do currently operate this way. However, there are emerging approaches challenging this model. For instance, Inception Labs recently released "Mercury," a text-diffusion coding model that takes a different approach by generating responses more holistically. It’s interesting to see how these alternative methods address the limitations of sequential generation and could potentially lead to faster inference and better contextual coherence. It'll be fascinating to see how techniques like this evolve!
But as I noted yesterday in a follow-up comment to my own above, the diffusion-based approaches to text response generation still generate tokens one at a time, just not in strict left-to-right order. So it looks the same: they commit to a token in some position, possibly preceded by gaps, and then calculate more tokens.