Character prefix completion


I saw a fun post on twitter today, which lead to the following blogpost from Cursor, a code generation startup. The problem outlined in their blogpost is roughly as follows.

A common task in code generation is completion, where a robot must propose completions given a prefix sequence of characters. This is complicated by the fact that humans and robots operate with different text encodings. Humans type in characters (usually) while robots operate over subword tokens. This mismatch in encoding can lead to awkward situations if one is not careful.

Here’s an example. Consider the following sentence (tokenizer visualizer):

Welcome to gpt-tokenizer. Replace this with your text to see how tokenization works.

As you can see, token segmentations do not align with characters. If we wanted to sample a completion given the prefix Welcome to gp, we would be in the middle of a token! This hints at an obvious solution: backtrack a little, then only consider tokens that agree with the observed characters. In this example, that would mean starting with the prefix Welcome to g and only considering next tokens that start with p. We can visualize this with a cursor:

Welcome to g|p

where everything to the left of the cursor is fed as a prefix to the model, and everything to the right leads to logit constraints in our language model’s output distribution. In the worst case, you may want to backtrack all the way to the start of the current word! This can be visualized with multiple cursors:

Welcome to |g|p

When sampling, we can simply compute the distribution over backtracked-cursors, constrain the next-token distribution to agree with the observed characters, then sample a token.

Note that this is not a new problem. The problem of conditional sampling with mismatched segmentations is well-studied in classic NLP. It is most prominent in latent segmentation models, such as hidden semi-Markov models.

In the rest of this post, we will give a minimal formalization of this problem, a solution, and (eventually) some code.

Problem setup

We are given a sequence of characters x=x1,x2,,xtx = x_1,x_2,\ldots,x_t. Our goal is to