[논문 리뷰] JudgeLM: Fine-tuned Large Language Models are Scalable Judges
미남잉 · 2024. 4. 20. 21:09
Link: https://arxiv.org/abs/2310.17631
Note to self: build data in the dataset format introduced here and have JudgeLM produce the evaluations (Github)
Summary
This paper presents a novel approach to evaluating large language models (LLMs), advanced AI models capable of generating human-like text. The main challenge in assessing LLMs is that existing benchmarks and metrics do not comprehensively measure their performance on open-ended tasks.
- Vocabulary: open-ended / versatility / succinctly
Motivation: Evaluating LLMs comprehensively has been challenging due to the lack of benchmarks and metrics that can capture their full capabilities.
JudgeLM is a new method, which fine-tunes LLMs as scalable judges to efficiently and effectively evaluate LLMs in open-ended benchmarks.
Contribution
- A large-scale dataset for judge models, with diverse seed tasks, LLM-generated answers, and detailed judgments from GPT-4.
- JudgeLM achieves high agreement with the GPT-4 teacher judge and extends to grading single answers, judging multi-modal models, handling multiple answers, and engaging in multi-turn chat.
- An analysis of the biases inherent in fine-tuning LLMs as judges (position bias, knowledge bias, and format bias), mitigated with techniques such as swap augmentation, reference support, and reference drop.
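The dataset contribution above can be pictured as single training samples pairing two model answers with GPT-4's scores and rationale. The sketch below is a minimal illustration; the field names and score scale are my assumptions, not the paper's exact schema:

```python
# One hypothetical JudgeLM-style training sample: a question, two candidate
# answers, the teacher's per-answer scores, and a written rationale.
# Field names are illustrative assumptions, not the paper's exact format.
sample = {
    "question": "Explain the difference between a list and a tuple in Python.",
    "answer1": "A list is mutable, a tuple is immutable ...",   # model A's answer
    "answer2": "Lists and tuples both store sequences ...",     # model B's answer
    "score1": 8,   # teacher (GPT-4) score for answer1
    "score2": 6,   # teacher (GPT-4) score for answer2
    "judgement": "Answer 1 is more complete because ...",       # detailed rationale
}

def verdict(s):
    """Derive the pairwise verdict from the teacher's scores."""
    if s["score1"] > s["score2"]:
        return "answer1"
    if s["score1"] < s["score2"]:
        return "answer2"
    return "tie"

print(verdict(sample))  # → answer1
```

Fine-tuning a judge on many such samples is what lets it reproduce the teacher's pairwise verdicts at scale.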
Related works
- Instruction fine-tuning of large language models
- Evaluation of large language models
Dataset
Method
JudgeLM → open-ended scenarios (goal: resolve the biases that arise during fine-tuning, and extend the model's consistency across diverse scenarios)
They propose reference drop, reference support, and swap augmentation for fine-tuning LLMs as judges, to overcome format, knowledge, and position biases, respectively.
- Format bias: the judge performs well only under a specific prompt format
- Knowledge bias: the judge over-relies on knowledge from its pre-training
- Position bias: the judge prefers answers in a particular position
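Of the three fixes, swap augmentation is the easiest to picture: swap the two candidate answers in a training sample and swap the teacher's scores to match, so the judge cannot learn to favor a fixed position. A minimal sketch, assuming a simple dict-based sample format (the field names are my assumptions):

```python
# Swap augmentation sketch: exchange the two answers and their teacher
# scores so position carries no signal. Field names are illustrative.
def swap_augment(sample):
    swapped = dict(sample)  # shallow copy; leave the original untouched
    swapped["answer1"], swapped["answer2"] = sample["answer2"], sample["answer1"]
    swapped["score1"], swapped["score2"] = sample["score2"], sample["score1"]
    return swapped

sample = {"answer1": "A", "answer2": "B", "score1": 8, "score2": 6}
aug = swap_augment(sample)
print(aug["answer1"], aug["score1"])  # → B 6
```

Training on both the original and swapped copies is what pushes the judge toward position-invariant verdicts; reference support and reference drop instead vary whether a reference answer appears in the prompt.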
Applications → extended capabilities such as multi-turn conversation, grading single replies, judging multiple answers, and multi-modal models
Conclusion
By fine-tuning LLMs as judges and addressing biases through advanced techniques, JudgeLM achieves high agreement with human judges and demonstrates versatility in various tasks.