🎥🔧🔗 Visual Editing with LLM-based Tool Chaining: An Efficient Distillation Approach for Real-Time Applications

(1) The Hebrew University of Jerusalem; (2) Lightricks

EMNLP 2024 (Main Conference, Industry Track)

An illustration of our visual editing task. Users input an image/video and specify the desired visual appearance.
For example:
(1) "Golden hour" will result in more yellow warm temperature and golden tone of the golden hour filter look.
(2) "Dark atmosphere" will result in darker colors.
(3) "🥶" cold face emoji will result in more blue colors to emphasize the cold temperature and freezing weather.

Abstract

We present a practical distillation approach to fine-tune LLMs for invoking tools in real-time applications. We focus on visual editing tasks; specifically, we modify images and videos by interpreting user stylistic requests, specified in natural language ("golden hour"), using an LLM to select the appropriate tools and their parameters to achieve the desired visual effect. We found that proprietary LLMs such as GPT-3.5-Turbo show potential in this task, but their high cost and latency make them unsuitable for real-time applications. In our approach, we fine-tune a (smaller) student LLM with guidance from a (larger) teacher LLM and behavioral signals. We introduce offline metrics to evaluate student LLMs. Both online and offline experiments show that our student models match the performance of our teacher model (GPT-3.5-Turbo) while significantly reducing costs and latency. Lastly, we show that fine-tuning was improved by 25% in low-data regimes using data augmentation.

Our Task


An illustration of our visual editing task. Users input an image/video and specify the desired visual appearance (upper row: source images; middle row: user intents). An LLM interprets these intents, selects tools, and sets parameters. The bottom row shows the images generated by applying the LLM's output in our app. For example, inputting "Morocco" (left) results in warm hues typical of Moroccan landscapes, reflecting its deserts.
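As a rough illustration of this interpret-then-apply flow, the sketch below assumes the LLM returns its tool chain as a JSON list; the prompt, the tool registry, and `call_student_llm` are hypothetical stand-ins for the app's real tools and the deployed model.

```python
import json

SYSTEM_PROMPT = (
    "You are a visual editing assistant. Given a user's stylistic intent, "
    "return a JSON list of editing tools and their parameters."
)

def call_student_llm(system_prompt: str, user_intent: str) -> str:
    """Placeholder for a request to the fine-tuned student LLM."""
    raise NotImplementedError

# Stand-in tool implementations; the real tools operate on image/video frames.
TOOLS = {
    "white_balance": lambda frame, temperature=0.0: frame,
    "color_grade": lambda frame, **params: frame,
    "exposure": lambda frame, brightness=0.0: frame,
}

def apply_intent(frame, user_intent: str):
    """Ask the LLM for a tool chain, then apply it tool by tool."""
    raw = call_student_llm(SYSTEM_PROMPT, user_intent)
    # Expected shape: [{"name": "white_balance", "params": {"temperature": 0.35}}, ...]
    for step in json.loads(raw):
        tool = TOOLS[step["name"]]
        frame = tool(frame, **step.get("params", {}))
    return frame
```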

Our Distillation Framework


(1) We create a dataset by collecting user intents and the output (or potentially multiple outputs, if several users expressed the same intent) of our teacher LLM. We ensure high quality by keeping only outputs that users frequently chose to export (the one output with the highest export rate per intent). After data processing, we randomly split the data into training and test sets.
(2) We fine-tune a smaller student LLM on our dataset.
(3) Offline, we evaluate the student LLM's tool selection and predicted parameters (see the metrics sketch after this list).
(4) To improve fine-tuning in low-data regimes, we use an LLM to augment the training data by generating samples similar to the student LLM's mistakes (e.g., "cool tone" from "cool morning"); a sketch of this step also follows the list.
(5) If a better student model is found offline, we conduct an online A/B test.
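A minimal sketch of offline metrics in the spirit of step (3), comparing the student's output to the teacher's for the same intent; the metric names and formulas below are assumptions, not the paper's exact definitions.

```python
def tool_selection_f1(student_tools: set, teacher_tools: set) -> float:
    """F1 between the sets of tools the student and teacher selected."""
    if not student_tools and not teacher_tools:
        return 1.0
    tp = len(student_tools & teacher_tools)
    precision = tp / len(student_tools) if student_tools else 0.0
    recall = tp / len(teacher_tools) if teacher_tools else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

def mean_parameter_error(student_params: dict, teacher_params: dict) -> float:
    """Mean absolute difference over parameters that both models set."""
    shared = student_params.keys() & teacher_params.keys()
    if not shared:
        return 0.0
    return sum(abs(student_params[k] - teacher_params[k]) for k in shared) / len(shared)

# Example:
# tool_selection_f1({"white_balance", "exposure"}, {"white_balance", "color_grade"})  -> 0.5
# mean_parameter_error({"temperature": 0.35}, {"temperature": 0.30})                  -> ~0.05
```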
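And a sketch of the LLM-based augmentation in step (4): paraphrase the intents the student got wrong, then label the new intents with the teacher LLM to enlarge the training set. `call_paraphraser_llm` and `call_teacher_llm` are hypothetical helpers, not functions from the paper.

```python
def call_paraphraser_llm(prompt: str) -> str:
    """Placeholder for an LLM call returning one paraphrased intent per line."""
    raise NotImplementedError

def call_teacher_llm(intent: str) -> dict:
    """Placeholder for the teacher LLM producing a tool chain for an intent."""
    raise NotImplementedError

def augment_from_mistakes(mistaken_intents, n_variants=3):
    """Generate intents similar to the student's mistakes, labeled by the teacher."""
    augmented = []
    for intent in mistaken_intents:
        prompt = (
            f"Write {n_variants} short user intents describing a visual style similar to "
            f"'{intent}' (e.g., 'cool tone' for 'cool morning'), one per line."
        )
        for variant in call_paraphraser_llm(prompt).splitlines():
            variant = variant.strip()
            if variant:
                augmented.append({"intent": variant, "target": call_teacher_llm(variant)})
    return augmented
```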

EMNLP 2024 Poster

BibTeX

@article{sultan2024visualediting,
  title={Visual Editing with LLM-based Tool Chaining: An Efficient Distillation Approach for Real-Time Applications},
  author={Sultan, Oren and Khasin, Alex and Shiran, Guy and Greenstein-Messica, Asnat and Shahaf, Dafna},
  journal={arXiv preprint arXiv:2210.12197},
  year={2024}
}