A Model-Based Evaluation Metric for Question Answering Systems

Bakır, D.; Aktas, M.S.; Yıldız, B.; Yildiz, Beytullah; Bakir, Dilan

Date

2025

Publisher

World Scientific

Green Open Access

No

Publicly Funded

No

Impulse

Average

Influence

Average

Popularity

Average

Abstract

The paper addresses the limitations of traditional evaluation metrics for Question Answering (QA) systems, which focus primarily on syntax and n-gram similarity. We propose a novel model-based evaluation metric, MQA-metric, and create two human-judgment-based datasets, squad-qametric and marco-qametric, to validate our approach. The research aims to solve several key problems: objectivity in dataset labeling, the effectiveness of metrics when there is no syntactic similarity, the impact of answer length on metric performance, and the influence of real answer quality on metric results. To tackle these challenges, we designed an interface for dataset labeling and conducted extensive experiments with human reviewers. Our analysis shows that the MQA-metric outperforms traditional metrics such as BLEU, ROUGE and METEOR. Unlike existing metrics, MQA-metric leverages semantic comprehension through large language models (LLMs), enabling it to capture contextual nuances and synonymous expressions more effectively. This approach sets a standard for evaluating QA systems by prioritizing semantic accuracy over surface-level similarities. The proposed metric correlates better with human judgment, making it a more reliable tool for evaluating QA systems. Our contributions include the development of a robust evaluation workflow, the creation of high-quality datasets, and an extensive comparison with existing evaluation methods. The results indicate that our model-based approach provides a significant improvement in assessing the quality of QA systems, which is crucial for their practical application and trustworthiness. © 2025 World Scientific Publishing Company.
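The limitation the abstract describes, surface-overlap metrics failing when a correct answer shares no wording with the reference, can be illustrated with a minimal sketch. The function below is a simplified unigram-overlap F1, used here only as a stand-in for the n-gram family; it is not BLEU, ROUGE, METEOR, or the MQA-metric itself, and the function name and example sentences are assumptions made for illustration.

```python
from collections import Counter


def token_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate answer and a reference answer.

    Hypothetical helper for illustration only; a simplified stand-in for
    surface-similarity metrics, not any metric defined in the paper.
    """
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up example answers (not taken from squad-qametric or marco-qametric).
reference = "She is a physician"
paraphrase = "Her job involves treating patients"  # same meaning, no shared words
wrong_answer = "She is a musician"                 # different meaning, shared words

paraphrase_score = token_f1(paraphrase, reference)
wrong_score = token_f1(wrong_answer, reference)
print(paraphrase_score, wrong_score)
```

In this toy case the overlap score ranks the wrong answer above the paraphrase, which is the kind of mismatch with human judgment that motivates a semantic, model-based metric.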

ORCID

Aktas, Mehmet

YILDIZ, Beytullah

Keywords

Evaluation Metric, Generative Model, Large Language Model, Natural Language Processing, Question Answering, Transformer Models

WoS Q

Q4

Scopus Q

Q3

OpenCitations Citation Count

N/A

Source

International Journal of Software Engineering and Knowledge Engineering

Volume

35

Issue

2

Start Page

243

End Page

262

URI

https://doi.org/10.1142/S0218194025500032
https://hdl.handle.net/20.500.14411/10470

Collections

Scopus
WoS

PlumX Metrics

Citations

Scopus : 0

Captures

Mendeley Readers : 4

Page Views

1

checked on Apr 13, 2026

OpenAlex FWCI

0.7252

Sustainable Development Goals

SDG data is not available