
Title: JudgeLM: Fine-tuned Large Language Models are Scalable Judges

Link: https://arxiv.org/abs/2310.17631

 

To try: build data in the dataset format they created here, then have JudgeLM render the judgements (GitHub)

 

Summary

A novel approach to evaluating large language models (LLMs), which are advanced AI models capable of generating human-like text. The main challenge in assessing LLMs is that existing benchmarks and metrics don't comprehensively measure their performance in open-ended tasks.

 

  • Vocabulary: open-ended / versatility / succinctly

 

Motivation: Evaluating LLMs comprehensively has been challenging due to the lack of benchmarks and metrics that can capture their capabilities.

JudgeLM is a new method that fine-tunes LLMs as scalable judges to evaluate LLMs efficiently and effectively on open-ended benchmarks.

 

Contribution

  1. A large-scale dataset for judge models, with diverse seed tasks, LLM-generated answers, and detailed judgements from GPT-4.
  2. JudgeLM achieves high agreement with the teacher judge, plus extended capabilities: grading single answers, judging multi-modal models, handling multiple answers, and engaging in multi-turn chat.
  3. An analysis of the biases inherent in fine-tuning LLMs as judges (position bias, knowledge bias, and format bias), addressed with techniques such as swap augmentation, reference support, and reference drop.

 

 

 


Related works

  • Instruction fine-tuning of large language models
  • Evaluation of large language models

 


Dataset
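
The judge-training data pairs a seed question with two LLM answers and a detailed GPT-4 judgement (scores plus reasoning). As a rough sketch of what one sample might look like — field names here are illustrative assumptions, not the exact released schema:

```python
# Illustrative sketch of a single judge-training sample.
# Field names are assumptions, not the exact JudgeLM schema.
sample = {
    "question": "Explain why the sky is blue.",
    "answer1": "Rayleigh scattering attenuates shorter wavelengths less ...",
    "answer2": "The sky reflects the color of the ocean.",
    "judgement": {
        "score1": 8,   # teacher (GPT-4) grade for answer1, 1-10
        "score2": 2,   # teacher grade for answer2
        "reasoning": "Answer 1 correctly cites Rayleigh scattering ...",
    },
}

def winner(s):
    """Pick the preferred answer from the paired scores (ties allowed)."""
    s1, s2 = s["judgement"]["score1"], s["judgement"]["score2"]
    return "answer1" if s1 > s2 else "answer2" if s2 > s1 else "tie"

print(winner(sample))  # -> answer1
```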

 

 

Method

JudgeLM → open-ended scenarios (goals: resolve the biases that arise during fine-tuning, and extend the model's consistency and adaptability across diverse scenarios)

They suggest reference drop, reference support, and swap augmentation for fine-tuning LLMs as judges in order to overcome format, knowledge, and position biases, respectively.

  • format bias: optimal performance only under specific prompt formats
  • knowledge bias: over-reliance on knowledge from the pre-trained model
  • position bias: preference for answers in a particular position
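
The three fixes can be pictured as simple transforms applied while building the fine-tuning data. A minimal sketch, assuming flat `answer1`/`answer2`/`score1`/`score2` fields (the actual data pipeline differs):

```python
import random

def swap_augmentation(sample):
    """Counter position bias: swap the two answers and their scores,
    so the judge cannot learn to favor a fixed answer slot."""
    out = dict(sample)
    out["answer1"], out["answer2"] = sample["answer2"], sample["answer1"]
    out["score1"], out["score2"] = sample["score2"], sample["score1"]
    return out

def reference_support(sample, reference):
    """Counter knowledge bias: attach an external reference answer,
    so judging is not limited to the pre-trained model's knowledge."""
    out = dict(sample)
    out["reference"] = reference
    return out

def reference_drop(sample, p=0.5, rng=random):
    """Counter format bias: randomly drop the reference, so the judge
    learns to handle both with- and without-reference prompt formats."""
    out = dict(sample)
    if "reference" in out and rng.random() < p:
        del out["reference"]
    return out
```

For example, interleaving originals with their `swap_augmentation` copies during fine-tuning balances how often each position holds the better answer.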

 

Applications → grading single answers, judging multi-modal models, handling multiple answers, and multi-turn conversation

 

Conclusion

By fine-tuning LLMs as judges and addressing these biases with the techniques above, JudgeLM achieves high agreement with the teacher judge and demonstrates versatility across tasks.

 
