Figure 1. Chain of density prompt and resulting improvement in quality
Because summarization is such an important task for LLM app observability, we also built summarization quality evals into open-source TruLens. For this evaluation we use a different approach, called chain of thought prompting, which asks the LLM to carefully identify the key points in the source and then check that each one is included in the resulting app response.
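To make this concrete, here's a minimal sketch of a chain-of-thought summarization eval. It illustrates the idea only and is not TruLens's actual implementation; the prompt wording, the `gpt-4` model choice, and the OpenAI-style client are all assumptions.

```python
# Illustrative sketch of a chain-of-thought summarization eval; not
# TruLens's implementation. Assumes the `openai` v1 Python package and
# an OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

COT_EVAL_PROMPT = """\
You are grading a summary against its source.

Step 1: List the key points of the SOURCE, one per line.
Step 2: For each key point, say whether the SUMMARY includes it.
Step 3: Output a score from 0 to 10: the fraction of key points
included, times 10.

SOURCE:
{source}

SUMMARY:
{summary}
"""

def eval_summary(source: str, summary: str) -> str:
    """Ask the LLM to reason step by step before scoring the summary."""
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model choice
        messages=[{
            "role": "user",
            "content": COT_EVAL_PROMPT.format(source=source, summary=summary),
        }],
    )
    return response.choices[0].message.content
```

Prompting the model to enumerate key points before scoring is what makes this chain-of-thought: the intermediate reasoning steps ground the final score rather than asking for a number directly.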
Mistral 7B model achieves SOTA
Mistral 7B is the latest LLM to make a splash in open-source, boasting new SOTA results for its size. Even better, they've released it under the permissive Apache 2.0 license, so it can be used without restrictions.
- Outperforms Llama 2 13B on all benchmarks
- Outperforms Llama 1 34B on many benchmarks
- Approaches CodeLlama 7B performance on code, while remaining good at English tasks
How did they do it? One key advancement Mistral 7B uses is sliding window attention, which handles longer sequences of tokens at a lower cost.
That’s a lot of jargon - let’s break it down.
Sliding window attention
Sliding window attention is an approach pioneered by a pair of recent papers (1, 2) that improves upon self-attention. While self-attention can be extremely powerful at capturing contextual information, every token attends to every other token in the sequence, so compute and memory grow quadratically with sequence length. This gets quite expensive as the sequence grows.
Sliding window attention (compared below), on the other hand, lets each token attend to only a recent window of tokens. For a fixed window size, cost grows linearly rather than quadratically with sequence length, yet the model can still build contextual representations of the entire context: because attention layers are stacked, information propagates roughly one window further with each layer, so after k layers a token can draw on information up to about k × window tokens back.
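To see the difference concretely, here's a minimal PyTorch sketch contrasting the two attention masks. The dimensions and window size are toy values for illustration (Mistral 7B itself uses a 4096-token window), and the attention function is the standard scaled dot-product form, not Mistral's optimized implementation.

```python
import torch

def causal_mask(n: int) -> torch.Tensor:
    """Full causal attention: token i attends to every token j <= i,
    so the score matrix has O(n^2) active entries."""
    return torch.tril(torch.ones(n, n, dtype=torch.bool))

def sliding_window_mask(n: int, window: int) -> torch.Tensor:
    """Sliding window attention: token i attends only to tokens j with
    i - window < j <= i, so each row has at most `window` active
    entries, for O(n * window) total."""
    too_old = torch.tril(torch.ones(n, n, dtype=torch.bool), diagonal=-window)
    return causal_mask(n) & ~too_old

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention, with disallowed positions set to
    -inf before the softmax so they receive zero weight."""
    scores = (q @ k.transpose(-2, -1)) / k.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Toy example: 8 tokens, head dim 16, window of 3.
n, d, w = 8, 16, 3
q, k, v = (torch.randn(n, d) for _ in range(3))
out = masked_attention(q, k, v, sliding_window_mask(n, w))
print(sliding_window_mask(n, w).int())  # banded lower-triangular mask
```

The only change between the two schemes is the mask: the windowed version zeroes out everything below a band of width `window`, which is where the cost savings come from.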