Figure 3. The architecture of the LLM-Augmenter proposed in the paper "Check Your Facts and Try Again"
By feeding these evaluations back to the LLM, LLM-Augmenter significantly reduced hallucinations relative to ChatGPT without sacrificing the fluency or informativeness of the response. As we continue to build out LLM applications, we should consider incorporating evaluations not only for observability, but as a mechanism for improving output quality itself, as the researchers do here.
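The loop described above can be sketched as follows. This is a minimal illustration of the generate-evaluate-revise pattern, not the paper's actual system: the model call and the fact-checking utility here are hypothetical stand-ins.

```python
# Sketch of an evaluation-feedback loop: generate a response, score it
# against retrieved evidence, and feed the critique back to the model
# until the score clears a threshold. Both functions below are toy
# stand-ins, not LLM-Augmenter's real components.
from typing import Optional


def generate(prompt: str, feedback: Optional[str] = None) -> str:
    """Stand-in for an LLM call; a real app would query a chat model."""
    if feedback is None:
        return "The Eiffel Tower is in Berlin."  # hallucinated first draft
    return "The Eiffel Tower is in Paris."       # revised after critique


def utility_score(response: str, evidence: list) -> float:
    """Toy fact check: fraction of evidence snippets the response contains."""
    hits = sum(1 for fact in evidence if fact in response)
    return hits / len(evidence)


def respond_with_feedback(prompt: str, evidence: list,
                          threshold: float = 1.0, max_turns: int = 3) -> str:
    """Regenerate with feedback until the utility score passes the threshold."""
    feedback = None
    response = ""
    for _ in range(max_turns):
        response = generate(prompt, feedback)
        if utility_score(response, evidence) >= threshold:
            return response
        feedback = "Response contradicts the retrieved evidence; revise."
    return response
```

The key design point is that the evaluation result is not merely logged; it is routed back into the prompt so the model can self-correct before the response ever reaches the user.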
OpenAI fine-tuning announced
OpenAI continues to lead the way in growing the adoption of LLMs, recently announcing the availability of GPT-3.5 Turbo fine-tuning. At a high level, fine-tuning improves on few-shot learning by training on many more examples than can fit in the prompt. Once a model has been fine-tuned, you can achieve better results on tasks similar to your training examples while including fewer examples in your prompts, which in turn can reduce the cost and latency of your LLM app.
Importantly, fine-tuning is not a replacement for retrieval methods; it is best suited to style changes and to tasks that are hard to describe in a prompt. Even with fine-tuning, the principle of letting other components of your app handle memorization still applies.
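In practice, the first step of fine-tuning is assembling training data in the JSONL chat format the fine-tuning endpoint expects: one JSON object per line, each holding a short conversation. A minimal sketch, with invented example conversations and an assumed output filename of `train.jsonl`:

```python
# Build fine-tuning examples in OpenAI's JSONL chat format: each line is
# a JSON object with a "messages" list of role/content pairs. The
# conversations below are illustrative, not real training data.
import json

examples = [
    {"messages": [
        {"role": "system", "content": "You are a terse support bot."},
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "Settings > Security > Reset."},
    ]},
    {"messages": [
        {"role": "system", "content": "You are a terse support bot."},
        {"role": "user", "content": "Where can I view my invoices?"},
        {"role": "assistant", "content": "Billing > Invoices."},
    ]},
]


def to_jsonl(records) -> str:
    """Serialize each record onto its own line, as the API expects."""
    return "\n".join(json.dumps(r) for r in records)


with open("train.jsonl", "w") as f:
    f.write(to_jsonl(examples))
```

From there, the file is uploaded and a fine-tuning job is created against `gpt-3.5-turbo` via the API; consult OpenAI's fine-tuning guide for the current upload and job-creation calls, as the SDK surface has changed over time.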
Federal AI licensing
In response to the growing number of companies seeking to develop LLMs, the Center for AI Policy has outlined policy recommendations to “[ensure] safety in the development of AI that is advanced enough to pose major threats to public safety or international security.” These recommendations include the establishment of a federal agency that would regulate so-called “frontier AI” development, covering any ML model that meets at least one of the following conditions:
- Uses at least 10^24 FLOPs during training
- Has at least 80B parameters
- Costs at least $10M to train
- Achieves at least 70% on MMLU or 1300 on the SAT
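Because any single condition triggers coverage, the thresholds above amount to a simple disjunction. A purely illustrative encoding (the field names and schema are invented for this sketch; the proposal defines no such schema):

```python
# Illustrative encoding of CAIP's proposed "frontier AI" thresholds: a
# model is covered if it meets ANY one of the conditions. All field
# names are invented for this sketch.
from dataclasses import dataclass


@dataclass
class ModelProfile:
    training_flops: float    # total FLOPs used in training
    parameters: int
    training_cost_usd: float
    mmlu_score: float        # percent correct on MMLU
    sat_score: int


def is_frontier(m: ModelProfile) -> bool:
    """True if the model meets at least one proposed threshold."""
    return (
        m.training_flops >= 1e24
        or m.parameters >= 80_000_000_000
        or m.training_cost_usd >= 10_000_000
        or m.mmlu_score >= 70.0
        or m.sat_score >= 1300
    )
```

The disjunctive structure is exactly why the criticism below has teeth: a model that is modest on four of the five axes is still regulated if it crosses any single threshold.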
Current examples of such frontier AI include GPT-4, PaLM 2, and Claude 2. The proposed agency would be granted oversight of firms engaged in frontier research, with the authority to monitor and license the stockpiling of large AI hardware clusters, the development and deployment of frontier AI systems, and access to frontier AI model weights.
While the risks around LLMs and the need for regulation have been enumerated in many places, the specific parameters of this proposal have been met with scrutiny. In the comments on the CAIP announcement post by Thomas Larsen (the Center’s Executive Director), developers express concern that the conditions defining “frontier AI” above could encompass firms that are not even working on generative LLMs (e.g., SOTA image classifiers, self-driving cars, healthcare applications, etc.). Some have even noted that the proposed performance and cost thresholds already capture the open-source Llama 2 model, which many LLM developers are fine-tuning for their own applications.
Wherever these thresholds ultimately land, policymakers should strive to craft regulations that do not hamstring open-source work and academic research into LLMs. Doing so may cause more harm than good.