A Model-Based Evaluation Metric for Question Answering Systems

Bakır, D.; Aktas, M.S.; Yıldız, B.; Yildiz, Beytullah; Bakir, Dilan

dc.contributor.author	Bakır, D.
dc.contributor.author	Aktas, M.S.
dc.contributor.author	Yıldız, B.
dc.contributor.author	Yildiz, Beytullah
dc.contributor.author	Bakir, Dilan
dc.date.accessioned	2025-03-05T20:47:03Z
dc.date.available	2025-03-05T20:47:03Z
dc.date.issued	2025
dc.description.abstract	This paper addresses the limitations of traditional evaluation metrics for Question Answering (QA) systems, metrics that focus primarily on syntax and n-gram similarity. We propose a novel model-based evaluation metric, MQA-metric, and create two human-judgment-based datasets, squad-qametric and marco-qametric, to validate our approach. The research aims to solve several key problems: objectivity in dataset labeling, the effectiveness of metrics when there is no syntactic similarity, the impact of answer length on metric performance, and the influence of real answer quality on metric results. To tackle these challenges, we designed an interface for dataset labeling and conducted extensive experiments with human reviewers. Our analysis shows that the MQA-metric outperforms traditional metrics such as BLEU, ROUGE, and METEOR. Unlike existing metrics, MQA-metric leverages semantic comprehension through large language models (LLMs), enabling it to capture contextual nuances and synonymous expressions more effectively. This approach sets a standard for evaluating QA systems by prioritizing semantic accuracy over surface-level similarities. The proposed metric correlates better with human judgment, making it a more reliable tool for evaluating QA systems. Our contributions include the development of a robust evaluation workflow, the creation of high-quality datasets, and an extensive comparison with existing evaluation methods. The results indicate that our model-based approach provides a significant improvement in assessing the quality of QA systems, which is crucial for their practical application and trustworthiness. © 2025 World Scientific Publishing Company.	en_US
dc.identifier.doi	10.1142/S0218194025500032
dc.identifier.issn	0218-1940
dc.identifier.issn	1793-6403
dc.identifier.scopus	2-s2.0-86000436474
dc.identifier.uri	https://doi.org/10.1142/S0218194025500032
dc.identifier.uri	https://hdl.handle.net/20.500.14411/10470
dc.language.iso	en	en_US
dc.publisher	World Scientific	en_US
dc.relation.ispartof	International Journal of Software Engineering and Knowledge Engineering	en_US
dc.rights	info:eu-repo/semantics/closedAccess	en_US
dc.subject	Evaluation Metric	en_US
dc.subject	Generative Model	en_US
dc.subject	Large Language Model	en_US
dc.subject	Natural Language Processing	en_US
dc.subject	Question Answering	en_US
dc.subject	Transformer Models	en_US
dc.title	A Model-Based Evaluation Metric for Question Answering Systems	en_US
dc.type	Article	en_US
dspace.entity.type	Publication
gdc.author.id	Aktas, Mehmet/0000-0001-7908-5067
gdc.author.id	YILDIZ, Beytullah/0000-0001-7664-5145
gdc.author.scopusid	59677045200
gdc.author.scopusid	8410237700
gdc.author.scopusid	59677422200
gdc.author.wosid	Aktas, Mehmet/G-9710-2012
gdc.bip.impulseclass	C5
gdc.bip.influenceclass	C5
gdc.bip.popularityclass	C5
gdc.coar.access	metadata only access
gdc.coar.type	text::journal::journal article
gdc.collaboration.industrial	false
gdc.description.department	Atılım University	en_US
gdc.description.departmenttemp	Bakır D., Computer Engineering Department, Yildiz Technical University, Istanbul, Turkey; Aktas M.S., Computer Engineering Department, Yildiz Technical University, Istanbul, Turkey; Yıldız B., Software Engineering Department, Atilim University, Ankara, Turkey	en_US
gdc.description.endpage	262	en_US
gdc.description.issue	2	en_US
gdc.description.publicationcategory	Article - International Refereed Journal - Institutional Faculty Member	en_US
gdc.description.scopusquality	Q3
gdc.description.startpage	243	en_US
gdc.description.volume	35	en_US
gdc.description.woscitationindex	Science Citation Index Expanded
gdc.description.wosquality	Q4
gdc.identifier.openalex	W4404838894
gdc.identifier.wos	WOS:001405946200001
gdc.index.type	WoS
gdc.index.type	Scopus
gdc.oaire.diamondjournal	false
gdc.oaire.impulse	0.0
gdc.oaire.influence	2.3811355E-9
gdc.oaire.isgreen	false
gdc.oaire.popularity	2.5970819E-9
gdc.oaire.publicfunded	false
gdc.openalex.collaboration	National
gdc.openalex.fwci	0.7252
gdc.openalex.normalizedpercentile	0.78
gdc.opencitations.count	0
gdc.plumx.mendeley	4
gdc.plumx.scopuscites	0
gdc.scopus.citedcount	0
gdc.virtual.author	Yıldız, Beytullah
gdc.wos.citedcount	0
relation.isAuthorOfPublication	8eb144cb-95ff-4557-a99c-cd0ffa90749d
relation.isAuthorOfPublication.latestForDiscovery	8eb144cb-95ff-4557-a99c-cd0ffa90749d
relation.isOrgUnitOfPublication	50be38c5-40c4-4d5f-b8e6-463e9514c6dd
relation.isOrgUnitOfPublication	4abda634-67fd-417f-bee6-59c29fc99997
relation.isOrgUnitOfPublication.latestForDiscovery	50be38c5-40c4-4d5f-b8e6-463e9514c6dd

Collections

Scopus
WoS