
Title: JudgeLM: Fine-tuned Large Language Models are Scalable Judges

Link: https://arxiv.org/abs/2310.17631

 

To try: build data in the dataset format they created here, then have JudgeLM render the judgements (GitHub)

 

Summary

A novel approach to evaluating large language models (LLMs), which are advanced AI models capable of generating human-like text. The main challenge in assessing LLMs is that existing benchmarks and metrics don't comprehensively measure their performance in open-ended tasks.

 

  • Vocabulary: open-ended / versatility / succinctly

 

Motivation: Evaluating LLMs comprehensively has been challenging due to the lack of benchmarks and metrics that can capture their capabilities.

JudgeLM is a new method that fine-tunes LLMs as scalable judges to evaluate LLMs efficiently and effectively on open-ended benchmarks.

 

Contribution

  1. A large-scale dataset for judge models, with diverse seed tasks, LLM-generated answers, and detailed judgements from GPT-4.
  2. JudgeLM achieves high agreement with the teacher judge, plus extended capabilities: grading single answers, judging multi-modal models, handling multiple answers, and engaging in multi-turn chat.
  3. An analysis of the biases inherent in fine-tuning LLMs as judges (position bias, knowledge bias, and format bias), addressed with techniques such as swap augmentation, reference support, and reference drop.

 

 

 


Related works

  • Instruction fine-tuning of large language models
  • Evaluation of large language models

 


Dataset
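
The judge-training data pairs a seed question with two LLM answers and a detailed GPT-4 judgement (scores plus reasoning). As a rough sketch of what one sample might look like — field names here are illustrative assumptions, not the exact released schema:

```python
# Illustrative sketch of a single judge-training sample.
# Field names are assumptions, not the exact JudgeLM schema.
sample = {
    "question": "Explain why the sky is blue.",
    "answer1": "Rayleigh scattering attenuates shorter wavelengths less ...",
    "answer2": "The sky reflects the color of the ocean.",
    "judgement": {
        "score1": 8,   # teacher (GPT-4) grade for answer1, 1-10
        "score2": 2,   # teacher grade for answer2
        "reasoning": "Answer 1 correctly cites Rayleigh scattering ...",
    },
}

def winner(s):
    """Pick the preferred answer from the paired scores (ties allowed)."""
    s1, s2 = s["judgement"]["score1"], s["judgement"]["score2"]
    return "answer1" if s1 > s2 else "answer2" if s2 > s1 else "tie"

print(winner(sample))  # -> answer1
```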

 

 

Method

JudgeLM → open-ended scenarios (goals: resolve the biases that arise during fine-tuning, and extend the model's consistency and adaptability across diverse scenarios)

They suggest reference drop, reference support, and swap augmentation for fine-tuning LLMs as judges in order to overcome format, knowledge, and position biases, respectively.

  • format bias: optimal performance only under specific prompt formats
  • knowledge bias: over-reliance on knowledge from the pre-trained model
  • position bias: preference for answers in a particular position
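
The three fixes can be pictured as simple transforms applied while building the fine-tuning data. A minimal sketch, assuming flat `answer1`/`answer2`/`score1`/`score2` fields (the actual data pipeline differs):

```python
import random

def swap_augmentation(sample):
    """Counter position bias: swap the two answers and their scores,
    so the judge cannot learn to favor a fixed answer slot."""
    out = dict(sample)
    out["answer1"], out["answer2"] = sample["answer2"], sample["answer1"]
    out["score1"], out["score2"] = sample["score2"], sample["score1"]
    return out

def reference_support(sample, reference):
    """Counter knowledge bias: attach an external reference answer,
    so judging is not limited to the pre-trained model's knowledge."""
    out = dict(sample)
    out["reference"] = reference
    return out

def reference_drop(sample, p=0.5, rng=random):
    """Counter format bias: randomly drop the reference, so the judge
    learns to handle both with- and without-reference prompt formats."""
    out = dict(sample)
    if "reference" in out and rng.random() < p:
        del out["reference"]
    return out
```

For example, interleaving originals with their `swap_augmentation` copies during fine-tuning balances how often each position holds the better answer.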

 

Applications → grading single answers, judging multi-modal models, handling multiple answers, and multi-turn conversation

 

Conclusion

By fine-tuning LLMs as judges and addressing these biases with the techniques above, JudgeLM achieves high agreement with the teacher judge and demonstrates versatility across tasks.

 
