Teaching chat models to solve chess puzzles
Turns out GPT completion models are pretty good at chess, with gpt-3.5-turbo-instruct playing at around 1800 Elo1. However, chat models typically struggle. It's been shown that combining finetuning and prompt optimization can be powerful2. To see whether chat models could be brought up to a level competitive with completion models, I ran several experiments.
I tried automatically optimizing the chess puzzle solving prompt with tuned few shot examples using DSPy. Additionally, in light of OpenAI's 2 million tokens of free finetuning per day, I also tried finetuning gpt-4o-mini on the optimized chain of thought outputs from gpt-4o.
Glossary
- Finetuning: The process of further training a pre-trained language model on a specific dataset to improve its performance on a particular task.
- Prompt optimization: The technique of refining and improving the input prompts given to an AI model to elicit better and more accurate responses.
- Few shot prompting: A method where a language model is given a small number of examples (usually 2-5) within the prompt to help it understand and perform a specific task.
- Chain of thought (CoT): An approach that encourages AI models to break down complex problems into step-by-step reasoning, often by including intermediate steps in prompts or training data.
- DSPy: Very good™️ framework for algorithmically optimizing LM prompts and weights.
Play gpt-4o-mini
Challenge rawbot to a game, or watch games.
Contents
- Dataset
- Can completion models solve chess puzzles?
- Promptization
- DSPy Compiled Results
- Finetuning
- Finetuning Results
- Notes
- TL;DR
- Play gpt-4o-mini
Dataset
To create a dataset to evaluate the models on, I used puzzles from the lichess database3, which provides 4 million chess puzzles with metadata. For each puzzle, I called the Lichess API to get the full list of game moves in PGN format, creating one example for each move of the puzzle.
[Example puzzle position: solution (hint: knight move)4]
The processed dataset is available on huggingface.
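As a rough sketch (the actual pipeline isn't shown here), each puzzle can be expanded into (PGN so far, best next move) pairs once the game's SAN moves are in hand; `moves_san`, `puzzle_start_ply`, and `solution_len` are assumed names for the relevant metadata:

```python
def puzzle_to_examples(moves_san, puzzle_start_ply, solution_len):
    """Build one (PGN so far -> best next move) pair per move of the puzzle."""
    examples = []
    for ply in range(puzzle_start_ply, puzzle_start_ply + solution_len):
        # Rebuild the numbered PGN movetext up to (but not including) the target move,
        # e.g. "1. e4 e5 2. Nf3 Nc6 3. Bb5"
        parts = []
        for i, san in enumerate(moves_san[:ply]):
            if i % 2 == 0:
                parts.append(f"{i // 2 + 1}.")
            parts.append(san)
        examples.append({"pgn": " ".join(parts), "answer": moves_san[ply]})
    return examples

# e.g. puzzle_to_examples(["e4", "e5", "Nf3", "Nc6", "Bb5", "a6"], puzzle_start_ply=5, solution_len=1)
# -> [{"pgn": "1. e4 e5 2. Nf3 Nc6 3. Bb5", "answer": "a6"}]
```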
Can completion models solve chess puzzles?
GPT completion models provide a strong baseline, with babbage and davinci getting 61.23% and 74.45% accuracy on the evaluation set respectively.
Completion models baseline
Model | Accuracy |
---|---|
babbage-002 | 61.23% |
davinci-002 | 74.45% |
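Prompting a completion model amounts to handing it the PGN movetext and letting it continue with the next move; the exact prompt format used for these baselines is an assumption, but a minimal sketch looks like this:

```python
from openai import OpenAI

client = OpenAI()

def completion_move(pgn_so_far: str, model: str = "davinci-002") -> str:
    """Ask a completion model to continue the PGN movetext with the next move."""
    response = client.completions.create(
        model=model,
        prompt=pgn_so_far,  # e.g. "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4."
        max_tokens=8,
        temperature=0,
        stop=["\n"],
    )
    return response.choices[0].text.strip()
```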
Promptization
Without showing the LLM any examples, chat model performance is pretty poor, with gpt-4o-mini only solving 17.4% of puzzles, and gpt-4o solving 28.44%.
DSPy Program
DSPy: The framework for programming—not prompting—foundation models
To optimize our program with DSPy, we can express it using an LLM signature and module. This simple signature will be optimized into high-quality prompts with few shot examples.
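A minimal version of such a signature and module might look like this (the class names, field names, and docstring are my assumptions, not the post's exact code):

```python
import dspy

class ChessPuzzle(dspy.Signature):
    """Given a chess game in PGN format, find the best next move."""

    pgn = dspy.InputField(desc="the moves of the game so far, in PGN format")
    answer = dspy.OutputField(desc="the best next move, in standard algebraic notation")

class ChessProgram(dspy.Module):
    def __init__(self):
        super().__init__()
        # ChainOfThought adds a rationale field so the model reasons before answering
        self.solve = dspy.ChainOfThought(ChessPuzzle)

    def forward(self, pgn):
        return self.solve(pgn=pgn)
```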
This approach alone improves gpt-4o-mini's accuracy to 25.99%, roughly a 50% relative increase over the zero-shot baseline. When combined with a self-correction step for illegal moves (up to 2 retries), accuracy further improves to 31.72%.
Check if moves are valid:
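A minimal sketch of that check with python-chess, using the 2-retry cap described above; the real pipeline may also feed the error back to the model, which is omitted here:

```python
import io
import chess
import chess.pgn

def is_legal_san(pgn: str, move_san: str) -> bool:
    """Replay the PGN, then check whether move_san is a legal move in the final position."""
    game = chess.pgn.read_game(io.StringIO(pgn))
    board = game.board()
    for move in game.mainline_moves():
        board.push(move)
    try:
        board.parse_san(move_san)  # raises ValueError if the move is illegal or malformed
        return True
    except ValueError:
        return False

def solve_with_self_correction(program, pgn: str, max_retries: int = 2) -> str:
    """Query the program; if the predicted move is illegal, retry up to max_retries times."""
    answer = program(pgn=pgn).answer
    for _ in range(max_retries):
        if is_legal_san(pgn, answer):
            break
        answer = program(pgn=pgn).answer
    return answer
```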
By optimizing over few shot examples with DSPy, gpt-4o-mini performance is boosted to 31.72%. Using the newly optimized few shot chain of thought prompt with gpt-4o brings some impressive gains, raising accuracy to 63.88%.
Compiling
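The post doesn't name the optimizer used; a sketch of compiling the program with DSPy's BootstrapFewShotWithRandomSearch and an exact-match metric, capped at 5 bootstrapped demos to match the compiled prompt below (`train_rows` is an assumed variable holding the dataset rows):

```python
import dspy
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

# Build DSPy examples from the puzzle dataset rows (format assumed from the dataset section)
trainset = [
    dspy.Example(pgn=row["pgn"], answer=row["answer"]).with_inputs("pgn")
    for row in train_rows
]

def exact_move_match(example, pred, trace=None):
    """A prediction counts as correct only if it exactly matches the puzzle's best move."""
    return example.answer == pred.answer

optimizer = BootstrapFewShotWithRandomSearch(
    metric=exact_move_match,
    max_bootstrapped_demos=5,  # the compiled prompt ends up with 5 few shot examples
)
compiled_program = optimizer.compile(ChessProgram(), trainset=trainset)
```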
A few million tokens later...
Compiled prompt
DSPy optimized chain of thought prompt with 5 few shot examples:
Interestingly, the rationale in the first example has a bug: it repeats the PGN from the instructions. Removing it actually drops performance significantly.
DSPy Compiled Results
Chat models + DSPy
Model | Accuracy |
---|---|
gpt-4o-mini [zero shot] | 17.4% |
gpt-4o-mini | 25.99% |
gpt-4o-mini [SELF CORRECT 3 tries max] | 31.72% |
gpt-4o [zero shot] | 28.44% |
gpt-4o | 55.07% |
gpt-4o [SELF CORRECT 3 tries max] | 63.88% |
Finetuning
Constructing good examples
By running the best pipeline available (gpt-4o with the compiled prompt), we can construct finetuning examples containing the chain of thought of a strong model.
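A sketch of writing those examples out in OpenAI's chat finetuning format; the system prompt, the `gpt4o_outputs` structure, and the filter on correct answers are assumptions:

```python
import json

def to_finetune_record(pgn: str, rationale: str, answer: str) -> dict:
    """One chat finetuning record: the puzzle as the user turn, the teacher's CoT as the assistant turn."""
    return {
        "messages": [
            {"role": "system", "content": "Given a chess game in PGN format, find the best next move."},
            {"role": "user", "content": pgn},
            {"role": "assistant", "content": f"Reasoning: {rationale}\nAnswer: {answer}"},
        ]
    }

with open("train.jsonl", "w") as f:
    for ex in gpt4o_outputs:  # assumed: dicts collected from the compiled gpt-4o program
        if ex["answer"] == ex["expected"]:  # keep only examples the teacher actually solved
            f.write(json.dumps(to_finetune_record(ex["pgn"], ex["rationale"], ex["answer"])) + "\n")
```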
gpt-4o-mini
The resulting finetuned gpt-4o-mini model gets 65.64% accuracy, roughly a 280% improvement over the zero shot baseline. This beats gpt-4o (63.88%) and babbage-002 (61.23%), but loses to davinci-002 (74.45%).
gpt-4o
A mixture of curiosity and gpt-4o finetuning going public at the perfect time led me to finetune gpt-4o, to the detriment of my wallet. The resulting model gets 71.37%, nearly matching the davinci completion model.
davinci
For completeness, I also tried finetuning the davinci-002 completion model on just the plain PGN → move pairs, seeing a modest improvement.
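For the legacy completion models, finetuning records are plain prompt/completion pairs; a sketch of one record (the exact whitespace conventions are an assumption):

```python
import json

# One record per puzzle example: the PGN prefix as the prompt, the best move as the completion.
record = {
    "prompt": "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4.",
    "completion": " Ba4",  # leading space so the model learns to emit the move right after the prompt
}

with open("davinci_train.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```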
Finetuning Results
Completion models
Model | Accuracy |
---|---|
babbage-002 | 61.23% |
davinci-002 | 74.45% |
davinci-002 finetune | 76.21% |
Chat models + DSPy
Model | Accuracy |
---|---|
gpt-4o-mini [zero shot] | 17.4% |
gpt-4o-mini | 25.99% |
gpt-4o-mini [SELF CORRECT 3 tries max] | 31.72% |
gpt-4o-mini finetune | 57.71% |
gpt-4o-mini finetune [SELF CORRECT 3 tries max] | 65.64% |
gpt-4o [zero shot] | 28.44% |
gpt-4o | 55.07% |
gpt-4o [SELF CORRECT 3 tries max] | 63.88% |
gpt-4o finetune | 58.59% |
gpt-4o finetune [SELF CORRECT 3 tries max] | 71.37% |
Notes
I asked Claude for some takeaways:
- The combination of prompt engineering and fine-tuning yields the best results, with the fine-tuned gpt-4o-mini model outperforming even gpt-4o in few-shot settings.
- The synergy between these approaches suggests that they address different aspects of model performance.
- Smaller, fine-tuned models can compete with larger models in specific tasks, offering potential efficiency gains.
TL;DR
First, use DSPy with a cheap model to get a really good prompt, then use the optimized prompt with a strong model to create a finetuning dataset. Finetuning on this dataset can make the cheap model's performance competitive with the expensive teacher model.