InstructEval: Towards Holistic Evaluation of Instruction-Tuned Large Language Models

Link: https://arxiv.org/abs/2306.04757

 

pp. 11-12

A.3.1 Writing Evaluation Rubrics

To evaluate the model outputs automatically, we use ChatGPT as an evaluator model. Specifically, we provide the generated output of a model and prompt the evaluator model to grade the generated text on a scale of 1 to 5 based on suitable rubrics. As relevance and coherence have different requirements, we provide a specific rubric for each aspect.

 

Relevance: How relevant is the text to the prompt? Select a suitable option number between 1 and 5 based on the options below.

  1. Inadequate: The text fails to provide any relevant information or insights related to the given prompt.
  2. Limited: The text may contain some relevant information, but significant gaps exist, and key aspects of the prompt are not adequately covered.
  3. Satisfactory: The text covers the main aspects of the prompt and provides relevant information, but it lacks depth and may not explore the topic in great detail.
  4. Proficient: The text provides a comprehensive response by addressing the key aspects of the prompt, offering relevant and well-supported information or arguments.
  5. Excellent: The text thoroughly and thoughtfully addresses the prompt, demonstrating a comprehensive understanding of the topic. It offers insightful and original ideas, supported by relevant arguments and information.

 

Coherence: How coherent is the text? Select a suitable option number between 1 and 5 based on the options below.

  1. Inadequate: The text lacks logical organization, making it difficult to follow. Ideas are disjointed and phrased awkwardly, requiring significant effort to understand.
  2. Limited: The text demonstrates some attempt at organization, but there are significant gaps in coherence. Ideas may be loosely connected, and the arguments lack clarity.
  3. Satisfactory: The text generally follows a logical organization, but occasional disruptions or awkward phrasing may occur. There is an acceptable level of readability and understanding.
  4. Proficient: The text is clearly organized and easy to understand. Ideas and arguments flow smoothly, contributing to easy comprehension and a pleasant reading experience.
  5. Excellent: The text presents exceptionally coherent writing with a fluent and engaging flow of ideas, ensuring effortless comprehension and a delightful reading experience.
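As a rough illustration of how rubrics like the two above could be used with an evaluator model, here is a minimal Python sketch. It is not the paper's code: the function names (`build_grading_prompt`, `parse_score`, `grade`), the prompt layout, and the `evaluator` callable are all assumptions made for this note. The evaluator is passed in as a plain function so that no particular API is implied.

```python
import re
from typing import Callable, Optional


def build_grading_prompt(rubric: str, task_prompt: str, model_output: str) -> str:
    """Combine a rubric, the original prompt, and the text to grade.

    The layout below is a hypothetical one, not the paper's exact template.
    """
    return (
        f"{rubric}\n\n"
        f"Prompt: {task_prompt}\n\n"
        f"Text: {model_output}\n\n"
        "Answer with a single option number between 1 and 5."
    )


def parse_score(reply: str) -> Optional[int]:
    """Return the first standalone digit from 1 to 5 in the reply, or None."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None


def grade(
    rubric: str,
    task_prompt: str,
    model_output: str,
    evaluator: Callable[[str], str],
) -> Optional[int]:
    """Ask the evaluator (any text-in, text-out function) for a 1-5 score."""
    reply = evaluator(build_grading_prompt(rubric, task_prompt, model_output))
    return parse_score(reply)


if __name__ == "__main__":
    # Stub evaluator standing in for a real evaluator model call.
    def fake_evaluator(prompt: str) -> str:
        return "4. Proficient"

    rubric = "Coherence: How coherent is the text? Select an option between 1 and 5."
    print(grade(rubric, "Write a short story.", "John was a time traveler.", fake_evaluator))
```

Returning `None` for an unparseable reply keeps a malformed answer from being silently counted as a score; how such cases are handled in the actual evaluation is not stated in the excerpt.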

 

 


Writing Evaluation

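The goal is to evaluate how well models demonstrate general writing ability across diverse usage scenarios. The scenarios considered are informative writing, professional writing, argumentative writing, and creative writing.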

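  • Informative writing: self-help advice in response to user queries, or explanations of various concepts.
  • Professional writing: takes the form of presentations or emails in a business setting.
  • Argumentative writing: requires debating positions on ethical and societal questions.
  • Creative writing: covers diverse forms of writing such as stories, poems, and songs.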

The writing benchmark is IMPACT (InforMative, Professional, Argumentative, CreaTive). The authors annotate 50 prompts for each writing category. Because writing tasks require long-form answers and usually have no single correct answer, evaluating them is challenging.

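For evaluation, the authors take an automatic approach and use ChatGPT to judge the quality of the generated answers. Given suitable rubrics, the evaluator model grades each answer on a Likert scale of 1 to 5 for relevance and for coherence. Relevance measures how well the answer engages with the given prompt, and coherence covers general text quality such as organization and logical flow. The models are evaluated on the given prompts in a zero-shot setting, using sampling-based decoding with a temperature of 1.0.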
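To make "sampling-based decoding with a temperature of 1.0" concrete, the sketch below shows generic temperature sampling over next-token logits in plain Python. This is the standard technique, not code from the paper; the function names and the toy logits are assumptions for illustration. At a temperature of 1.0 the logits are left unscaled, so tokens are drawn from the model's ordinary softmax distribution.

```python
import math
import random
from typing import List, Optional


def softmax_with_temperature(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(
    logits: List[float],
    temperature: float = 1.0,
    rng: Optional[random.Random] = None,
) -> int:
    """Draw one token index at random according to the tempered distribution."""
    rng = rng or random.Random()
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
    print(softmax_with_temperature(toy_logits, temperature=1.0))
    print(sample_token(toy_logits, temperature=1.0, rng=random.Random(0)))
```

Temperatures below 1.0 sharpen the distribution toward the highest-scoring token, and temperatures above 1.0 flatten it. Because each run draws tokens at random, repeated generations for the same prompt can differ.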

 


A.3.5 Creative Samples

Creative Task Definition: You are tasked to demonstrate your creative writing skills for the following question.


Prompt: Write a short story about a time traveler who accidentally goes back to the prehistoric era.

 

ChatGPT response (Coherence: 4, Relevance: 4)

John was a time traveler, and just like any other day, he was preparing for his next trip. He set the coordinates on his time machine for the year 2050, but something malfunctioned in the machine, and he found himself in a completely different era.

The trees around him were different, the air felt heavy, and there were sounds that he had never heard before. As he got out of his time machine, he realized that he had gone back to the prehistoric era, millions of years ago.

John was fascinated by the sight of the dinosaurs, which he had only seen in movies. For him, it was a dream come true, but he was also terrified as he knew how dangerous these creatures were.


As he was about to take a closer look, he heard a loud roar behind him. He turned around to see a Tyrannosaurus Rex coming his way. John knew that he was in grave danger, and he had to act fast.


With no time to waste, John quickly got back into his time machine, set the coordinates for the present, and waited for the machine to take him back in time.


As he was leaving, he felt a strange sense of sadness. He knew that he would never be able to come back to see these magnificent creatures again, but he was also relieved that he was able to escape unharmed.

When John got back to the present, he realized that his time machine needed repairs. He had used it a lot, and it was bound to break down. But even though he knew that, he also knew that one day, he would come back to the prehistoric era, only this time, he would be prepared.

 

Other model responses

  • Flan-Alpaca response
  • Flan-T5 response
  • StableVicuna response

 


 

Dataset

https://huggingface.co/datasets/declare-lab/InstructEvalImpact

 

 

Results

https://huggingface.co/datasets/declare-lab/InstructEvalImpact

 

 

 
