[논문 리뷰] JudgeLM: Fine-tuned Large Language Models are Scalable Judges
Suyeon Cha · 2024. 4. 20. 21:09
Link: https://arxiv.org/abs/2310.17631
Idea: build data in the dataset format created here and have JudgeLM render the evaluation (GitHub)
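As a rough sketch of what "data in this format" might look like: the paper's judge-training set pairs a seed question, two LLM-generated answers, and a detailed GPT-4 judgement with per-answer scores. The field names below are my assumptions, not the official schema; check the project's GitHub for the exact format.

```python
# Minimal sketch of a JudgeLM-style judge sample (field names are assumptions,
# not the official schema -- see the project's GitHub for the exact format).
import json

sample = {
    "question": "Explain why the sky is blue.",           # seed task / instruction
    "answer1": "Rayleigh scattering of sunlight ...",      # answer from LLM A
    "answer2": "Because the ocean reflects onto it ...",   # answer from LLM B
    "reference": "Shorter wavelengths scatter more ...",   # optional reference answer
    "scores": [8, 3],                                       # GPT-4 scores for answer1 / answer2
    "judgement": "Answer 1 correctly cites Rayleigh scattering; Answer 2 repeats a myth.",
}

# One JSON object per line is a common convention for this kind of data.
with open("judge_samples.jsonl", "w") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```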
Summary
The paper proposes a novel approach to evaluating large language models (LLMs), advanced AI models capable of generating human-like text. The main challenge in assessing LLMs is that existing benchmarks and metrics don't comprehensively measure their performance on open-ended tasks.
- Vocabulary: open-ended / versatility / succinctly
Motivation: Evaluating LLMs comprehensively has been challenging due to the lack of benchmarks and metrics that can capture their capabilities.
JudgeLM is a new method that fine-tunes LLMs as scalable judges to evaluate LLMs efficiently and effectively on open-ended benchmarks.
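To make "fine-tuned LLM as a judge" concrete, the sketch below formats a question and two candidate answers into a single prompt and asks a judge model for scores plus a written judgement. The checkpoint path and prompt template are placeholders of mine, not the official JudgeLM ones; the released code defines its own template and weights.

```python
# Sketch of pairwise judging with a fine-tuned judge model.
# "path/to/judge-checkpoint" and the prompt layout are placeholders, not JudgeLM's official ones.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/judge-checkpoint"  # e.g. a released JudgeLM checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)

def judge_pair(question: str, answer1: str, answer2: str) -> str:
    # Assumed prompt layout: the question, the two answers, then a request
    # for a score per answer and a short explanation.
    prompt = (
        f"Question: {question}\n"
        f"Answer 1: {answer1}\n"
        f"Answer 2: {answer2}\n"
        "Rate each answer from 1 to 10 and explain which one is better.\n"
        "Judgement:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=256)
    # Return only the newly generated judgement text, not the echoed prompt.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

print(judge_pair("Explain why the sky is blue.",
                 "Rayleigh scattering of sunlight ...",
                 "Because the ocean reflects onto it ..."))
```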
Contribution
- A large-scale dataset for judge models with diverse seed tasks, LLM-generated answers, and detailed judgements from GPT-4.
- JudgeLM achieves high agreement with the teacher judge and shows extended capabilities: grading single answers, judging multi-modal models, handling multiple answers, and engaging in multi-turn chat.
- Analysis of the key biases inherent in fine-tuning LLMs as judges (position bias, knowledge bias, and format bias) and techniques to mitigate them: swap augmentation, reference support, and reference drop.
Related works
- Instruction fine-tuning of large language models
- Evaluation of large language models
Dataset
Method
JudgeLM → open-ended scenarios (purpose: to resolve the biases introduced during fine-tuning and to extend the model's consistency and adaptability across diverse scenarios)
They propose reference drop, reference support, and swap augmentation for fine-tuning LLMs as judges, in order to overcome format, knowledge, and position biases, respectively (see the sketch after the list below).
- Format bias: the judge performs optimally only with a specific prompt format
- Knowledge bias: over-reliance on the knowledge baked into the pre-trained model
- Position bias: favoring an answer because of where it appears in the prompt
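A minimal sketch of how two of these fixes could be applied to a judge-training sample (using the sample layout sketched earlier; this is my illustration, not the authors' released code): swap augmentation exchanges the two answers and their scores so the judge cannot learn to favor a fixed position, and reference drop randomly removes the reference answer so the judge works both with and without reference support.

```python
# Illustration of swap augmentation and reference drop on a judge-training sample.
# The sample layout follows the earlier sketch; this is not the authors' code.
import random
from copy import deepcopy

def swap_augment(sample: dict) -> dict:
    """Counter position bias: swap the two answers and their scores."""
    swapped = deepcopy(sample)
    swapped["answer1"], swapped["answer2"] = sample["answer2"], sample["answer1"]
    swapped["scores"] = list(reversed(sample["scores"]))
    return swapped

def reference_drop(sample: dict, drop_prob: float = 0.5) -> dict:
    """Counter format bias: randomly remove the reference so the judge
    learns to handle prompts both with and without a reference answer."""
    out = deepcopy(sample)
    if random.random() < drop_prob:
        out.pop("reference", None)
    return out

def augment(sample: dict) -> list[dict]:
    # Train on the original, its swapped copy, and a reference-dropped variant.
    return [sample, swap_augment(sample), reference_drop(sample)]
```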
Applications → judging multi-modal models, multi-turn conversation, grading single answers, and judging multiple answers
Conclusion
By fine-tuning LLMs as judges and addressing biases with the techniques above, JudgeLM achieves high agreement with the teacher judge and demonstrates versatility across a range of tasks.