We present a practical distillation approach to fine-tune LLMs for invoking tools in real-time applications. We focus on visual editing tasks; specifically, we modify images and videos by interpreting user stylistic requests specified in natural language ("golden hour"), using an LLM to select the appropriate tools and their parameters to achieve the desired visual effect. We found that proprietary LLMs such as GPT-3.5-Turbo show potential in this task, but their high cost and latency make them unsuitable for real-time applications. In our approach, we fine-tune a (smaller) student LLM with guidance from a (larger) teacher LLM and behavioral signals. We introduce offline metrics to evaluate student LLMs. Both online and offline experiments show that our student models match the performance of our teacher model (GPT-3.5-Turbo) while significantly reducing costs and latency. Lastly, we show that LLM-based data augmentation improves fine-tuning by 25% in low-data regimes.
An illustration of our visual editing task. Users input an image/video and specify the desired visual appearance (upper row: source images, middle: user intents). An LLM interprets these intents, selects tools, and sets parameters. The bottom row displays the images generated by applying the LLM's output in our app. For example, inputting "Morocco" (left) results in warm hues typical of Moroccan landscapes, reflecting its deserts.
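To make the task concrete, the snippet below shows the kind of structured output we expect the LLM to return for an intent such as "golden hour". The tool names and parameter values are illustrative assumptions, not our app's actual tool set.

# Hypothetical tool-chaining output for the intent "golden hour".
# Tool names and parameter values are illustrative, not the app's real API.
llm_output = {
    "intent": "golden hour",
    "tools": [
        {"name": "color_temperature", "params": {"warmth": 0.7}},
        {"name": "exposure", "params": {"value": -0.1}},
        {"name": "saturation", "params": {"value": 0.3}},
    ],
}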
(1) We create a dataset by collecting user intents and the output (or potentially multiple outputs, if several users expressed the same intent) of our teacher LLM.
We ensure high quality by keeping only outputs that users frequently chose to export (per intent, the single output with the highest export rate).
After this processing, we randomly split the data into training and test sets (a data-processing sketch follows this list).
(2) We fine-tune a smaller student LLM on our dataset (a fine-tuning sketch also follows the list).
(3) Offline, we evaluate the student LLM's selection of tools and predicted parameters.
(4) To improve fine-tuning in low-data regimes, we use an LLM to augment the training data by generating samples similar to the student LLM's mistakes (e.g., "cool tone" from "cool morning"); an augmentation sketch also follows the list.
(5) If a better student model is found offline, we conduct an online A/B test.
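A minimal sketch of the data processing in step (1), assuming the raw logs live in a pandas DataFrame; the column names are ours, for illustration only.

import pandas as pd

# Raw logs: one row per (user intent, teacher output); column names are assumptions.
logs = pd.DataFrame({
    "intent": ["golden hour", "golden hour", "Morocco"],
    "teacher_output": ['{"tools": [...]}', '{"tools": [...]}', '{"tools": [...]}'],
    "export_rate": [0.42, 0.61, 0.55],
})

# Keep, per intent, the teacher output that users exported most often.
dataset = (logs.sort_values("export_rate", ascending=False)
               .drop_duplicates(subset="intent", keep="first"))

# Random train/test split.
train = dataset.sample(frac=0.9, random_state=0)
test = dataset.drop(train.index)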
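Step (2) can be sketched with Hugging Face transformers, distilling into FlanT5-base by mapping each user intent to the teacher's serialized tool-chain output; the dataset columns and hyperparameters below are placeholder assumptions.

from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

def preprocess(batch):
    # Input: the user intent; target: the teacher's tool-chain output (serialized JSON).
    features = tokenizer(batch["intent"], truncation=True)
    features["labels"] = tokenizer(text_target=batch["teacher_output"], truncation=True)["input_ids"]
    return features

# `train` is the DataFrame from the data-processing sketch above.
train_ds = Dataset.from_pandas(train).map(preprocess, batched=True)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="student-flan-t5-base", num_train_epochs=3),
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()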
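And a sketch of the augmentation in step (4): prompting an LLM to generate intents similar to those the student got wrong, each of which is then labeled by the teacher and added to the training set. The prompt wording and the OpenAI client call are illustrative, not our production setup.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def augment_intent(intent: str, n: int = 3) -> list[str]:
    """Ask an LLM for user intents similar to one the student model got wrong."""
    prompt = (f"Generate {n} short visual-editing user intents that are similar "
              f"in style to: '{intent}'. Return one intent per line.")
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return [line.strip() for line in response.choices[0].message.content.splitlines() if line.strip()]

# e.g., augment_intent("cool morning") might return ["cool tone", ...]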
Offline evaluation results for our student models. For each tool we report (tool-selection score, quality score, final score); Overall is the average final score across tools. Results show that FlanT5-base performs very similarly to Llama-2-7b-chat-hf, with only a 0.02 gap (rows 1, 4). Interestingly, both models perform better on test subsets with more popular user intents (r_5 > r_3 > All), where r_i denotes the subset of user intents with at least i calls.
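The exact metric definitions appear in the paper; the sketch below is one plausible per-tool instantiation, assuming the tool-selection score checks agreement with the teacher on whether a tool is invoked and the quality score measures closeness of the predicted parameter values (here assumed to lie in [0, 1]).

def tool_selection_score(teacher_tools: set[str], student_tools: set[str], tool: str) -> float:
    # 1.0 if the student agrees with the teacher on whether this tool is invoked.
    return float((tool in teacher_tools) == (tool in student_tools))

def quality_score(teacher_params: dict[str, float], student_params: dict[str, float]) -> float:
    # Closeness of the student's parameter values to the teacher's (parameters assumed in [0, 1]).
    if not teacher_params:
        return 1.0
    diffs = [abs(v - student_params.get(k, 0.0)) for k, v in teacher_params.items()]
    return 1.0 - sum(diffs) / len(diffs)

def final_score(selection: float, quality: float) -> float:
    # One way to combine the two per-tool scores; the paper's combination may differ.
    return selection * quality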
Output images for reality check. Here are examples of samples given to our annotators to evaluate. For each sample, they were asked two binary questions: (1) whether the image is relevant to the intent, and (2) whether the student models correctly mimic the teacher model (see Section 4.2). Each sample includes the source image and the outputs of the teacher LLM along with the outputs of both of our student LLMs. Based on the annotators’ majority vote: In the first sample, (1) all models produced results relevant to the intent “Morocco” (e.g., warm hues, typical of Moroccan landscapes, reflecting its deserts), and (2) both student models successfully mimicked the teacher LLM. In the second sample, (1) all models produced results relevant to the intent “The Matrix” (e.g., darkness, green tint, and cyberpunk aesthetic), but (2) neither student model mimicked the teacher LLM well.
In addition to offline evaluation, we conducted two online A/B tests.
First, we compared our teacher, GPT-3.5-Turbo (tested on 94,317 projects), with Llama-2-7b-chat-hf (93,495 projects).
We measured project completion rates as an indicator of user satisfaction.
The teacher's completion rate was 96.1% of Llama-2-7b-chat-hf's, a difference that is not statistically significant; we thus conclude the two models are comparable.
In our second A/B test, we compared our student models. FlanT5-base (tested on 20,294 projects) achieved a completion rate of 99% of that of Llama-2-7b-chat-hf (20,282 projects).
Thus, we conclude they are comparable and choose FlanT5-base for its lower latency and cost.
Importantly, we are encouraged that our offline metrics align with the results of the online A/B tests.
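The paper does not state which significance test was used; a two-proportion z-test is one standard choice for comparing completion rates, sketched below with hypothetical completion counts (only the per-arm project counts come from the text above).

from statsmodels.stats.proportion import proportions_ztest

# Hypothetical completion counts; only the number of projects per arm is from the paper.
completions = [89_700, 88_950]   # teacher (GPT-3.5-Turbo), student (Llama-2-7b-chat-hf)
projects = [94_317, 93_495]

z_stat, p_value = proportions_ztest(completions, projects)
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")  # a large p-value indicates no significant difference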
FlanT5-base’s performance on subsets of the training set, with and without augmentation. Augmentation is effective in low-data regimes, increasing the overall score by 0.13 for the 1/8 subset. With larger training subsets, the proportion of augmented samples (%) decreases, and the overall improvement shrinks, as expected.
@article{sultan2024visualediting,
title={Visual Editing with LLM-based Tool Chaining: An Efficient Distillation Approach for Real-Time Applications},
author={Sultan, Oren and Khasin, Alex and Shiran, Guy and Greenstein-Messica, Asnat and Shahaf, Dafna},
journal={arXiv preprint arXiv:2210.12197},
year={2024}
}