Character prefix completion
I saw a fun post on Twitter today, which led to the following blog post from Cursor, a code generation startup. The problem outlined in their blog post is roughly as follows.
A common task in code generation is completion, where a robot must propose completions given a prefix sequence of characters. This is complicated by the fact that humans and robots operate with different text encodings. Humans type in characters (usually) while robots operate over subword tokens. This mismatch in encoding can lead to awkward situations if one is not careful.
Here’s an example. Consider the following sentence (tokenizer visualizer):
As you can see, segmentation according to tokens does not perfectly align with characters.
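To make this concrete, here is a minimal sketch; the GPT-2 tokenizer and the example string are assumptions for illustration, not necessarily the tokenizer or sentence shown above. Decoding each token individually shows which characters it covers.

```python
from transformers import AutoTokenizer

# Assumed tokenizer, for illustration; any subword tokenizer behaves similarly.
tok = AutoTokenizer.from_pretrained("gpt2")

text = "Welcome to gpt"  # stand-in for the example sentence above
ids = tok(text).input_ids

# Decode each token on its own to see the characters it covers.
# The boundaries between pieces need not line up with an arbitrary character prefix.
print([tok.decode([i]) for i in ids])
```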
If we wanted to sample a completion given the prefix
Welcome to gp
we would be in the middle of a token!
This hints at an obvious solution: backtrack a little, then only consider tokens that agree with the characters you’ve moved over. In this example, that would mean starting with the prefix Welcome to g and only considering tokens that start with p.
We can visualize this with a cursor:
Welcome to g|p
where everything to the left of the cursor is fed as a prefix to the model, and everything to the right leads to logit constraints in our language model’s output distribution. In the worst case, you may want to backtrack all the way to the start of the current word! This can be visualized with multiple cursors:
Welcome to |g|p
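As a rough sketch of the single-backtrack version described above (not the full algorithm developed later in the post), the snippet below assumes a HuggingFace causal LM; gpt2 is a stand-in model choice. It feeds Welcome to g to the model and masks the next-token logits so that only tokens whose decoded text starts with p can be sampled.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in model/tokenizer for illustration.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prefix = "Welcome to g"   # everything left of the cursor: fed to the model
remainder = "p"           # everything right of the cursor: constrains the output

# Allowed next tokens: those whose string starts with the remaining characters.
allowed = [tid for tid in range(len(tok))
           if tok.decode([tid]).startswith(remainder)]

ids = tok(prefix, return_tensors="pt").input_ids
with torch.no_grad():
    next_logits = model(ids).logits[0, -1]

# Mask out every disallowed token, renormalize, and sample.
mask = torch.full_like(next_logits, float("-inf"))
mask[allowed] = 0.0
probs = torch.softmax(next_logits + mask, dim=-1)
next_id = torch.multinomial(probs, 1).item()
print(prefix + tok.decode([next_id]))
```

After this first constrained step, the character prefix has been fully consumed and sampling can proceed as usual.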
Note that this is not a new problem. Conditional sampling given a mismatched segmentation is well known in classic NLP, and is most prominent in latent segmentation models such as hidden semi-Markov models. In the rest of this post, we will identify a minimal solution to this problem and provide some code. Our solution will require $k$ forward passes of an LM, where $k$ is the largest number of characters in a token.
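For a sense of what $k$ looks like in practice, it can be read directly off a tokenizer's vocabulary; a small sketch, again with an assumed tokenizer:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # assumed tokenizer, for illustration

# k = the largest number of characters any single token can span
# (special tokens included, so this is a slight overestimate).
k = max(len(tok.decode([tid])) for tid in range(len(tok)))
print(k)
```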
Problem setup
We are given a character stream