SHAP-Guided Feature Selection for Cross-Dataset Generalization in Network Intrusion Detection Systems
| dc.contributor.author | Şengül, Gökhan | |
| dc.contributor.author | Kılıç, Can | |
| dc.date.accessioned | 2026-06-23T11:10:30Z | |
| dc.date.issued | 2026 | |
| dc.description.abstract | Flow-based machine learning intrusion detection systems (IDS) often achieve near-perfect performance when trained and tested on a single benchmark dataset; nonetheless, their ability to generalize across datasets is a crucial and mostly unresolved challenge. This study analyzes the cross-dataset generalization behavior of an explainable, flow-based IDS trained on CICIDS2017 and externally evaluated on the CSE-CIC-IDS2018 dataset, which represents a more realistic network environment with varying attack implementations, traffic compositions, and background services. Two frequently used ensemble models, Random Forest and XGBoost, are trained solely on flow-level metadata without packet payload examination. After removing non-behavioral identifiers (Flow ID, Source IP, Destination IP, and Timestamp) and harmonizing feature schemas, the datasets are aligned into a unified 80-dimensional feature space extracted with CICFlowMeter. SHAP (TreeSHAP) is used to calculate global feature importance and create multiple explainability-driven feature subsets, such as model-specific Top-20 sets, a COMMON-10 intersection, and a UNION-30 superset. Although both models attain near-perfect accuracy and weighted F1-scores on CICIDS2017 (macro-F1 ≈ 0.90), when evaluated on CSE-CIC-IDS2018, macro-F1 drops to 0.127 for Random Forest and 0.119 for XGBoost, despite high overall accuracy, indicating a strong bias toward majority classes under domain shift conditions. SHAP-guided feature reduction provides a measurable but limited improvement for Random Forest, increasing macro-F1 from 0.127 to 0.166, while an additional port-removal ablation further improves macro-F1 to 0.207. In contrast, no significant cross-dataset improvement is observed for XGBoost.
An additional practical observation is that SHAP-guided feature rankings remain highly stable across sample sizes: class-balanced subsets of approximately 400 flows (50 samples per class) produce highly similar Top-20 rankings to those obtained from 10,000 flows (1250 samples per class), supporting the feasibility of computationally efficient explainability. Overall, the results show that explainability-driven feature analysis improves transparency, compactness, and feature prioritization; however, it does not fully resolve the broader distributional shift challenges that limit cross-dataset generalization in flow-based intrusion detection systems. | |
| dc.identifier.doi | 10.1109/ACCESS.2026.3703481 | |
| dc.identifier.issn | 2169-3536 | |
| dc.identifier.uri | https://hdl.handle.net/20.500.14411/11624 | |
| dc.identifier.uri | https://doi.org/10.1109/ACCESS.2026.3703481 | |
| dc.language.iso | en | |
| dc.publisher | IEEE | |
| dc.relation.ispartof | IEEE Access | |
| dc.rights | info:eu-repo/semantics/openAccess | |
| dc.subject | Cross-dataset generalization | |
| dc.subject | explainable artificial intelligence | |
| dc.subject | flow-based traffic analysis | |
| dc.subject | network intrusion detection | |
| dc.subject | random forest | |
| dc.subject | SHAP | |
| dc.subject | XGBoost | |
| dc.title | SHAP-Guided Feature Selection for Cross-Dataset Generalization in Network Intrusion Detection Systems | |
| dc.type | Article | |
| dspace.entity.type | Publication | |
| gdc.description.department | Computer Engineering | |
| gdc.description.publicationcategory | Article - National Refereed Journal - Institutional Faculty Member | |
| gdc.description.scopusquality | Q1 | |
| gdc.description.volume | 14 | |
| gdc.description.wosquality | Q2 | |
| relation.isAuthorOfPublication.latestForDiscovery | f291b4ce-c625-4e8e-b2b7-b8cddbac6c7b | |
| relation.isOrgUnitOfPublication.latestForDiscovery | 50be38c5-40c4-4d5f-b8e6-463e9514c6dd |
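
The abstract describes ranking features by global SHAP importance and deriving model-specific Top-20 sets, a COMMON-10 intersection, and a UNION-30 superset. The sketch below illustrates one plausible way to build such subsets; it is not the authors' implementation. It assumes each model's SHAP values are available as a 2-D array (samples × features), and it uses synthetic numbers in place of real TreeSHAP output. The function names, the feature names `f0`..`f79`, and the importance scales are all hypothetical.

```python
import numpy as np


def top_k_features(shap_values, feature_names, k):
    """Rank features by mean absolute SHAP value and return the k highest."""
    importance = np.abs(shap_values).mean(axis=0)
    order = np.argsort(-importance, kind="stable")
    return [feature_names[i] for i in order[:k]]


def build_subsets(top_a, top_b):
    """Return the intersection and union of two ranked feature lists."""
    set_a, set_b = set(top_a), set(top_b)
    common = [f for f in top_a if f in set_b]
    union = list(top_a) + [f for f in top_b if f not in set_a]
    return {"common": common, "union": union}


# Synthetic stand-in for SHAP output (assumption: 400 flows, 80 features).
rng = np.random.default_rng(0)
feature_names = [f"f{i}" for i in range(80)]

# Hypothetical importance scales: model A favours f0..f19, model B f10..f29.
scale_rf = np.full(80, 0.1)
scale_rf[0:20] = 10.0
scale_xgb = np.full(80, 0.1)
scale_xgb[10:30] = 10.0

shap_rf = rng.normal(size=(400, 80)) * scale_rf
shap_xgb = rng.normal(size=(400, 80)) * scale_xgb

top_rf = top_k_features(shap_rf, feature_names, k=20)
top_xgb = top_k_features(shap_xgb, feature_names, k=20)
subsets = build_subsets(top_rf, top_xgb)

print(len(subsets["common"]), len(subsets["union"]))
```

In this toy setup the two Top-20 lists share ten features, so the intersection has 10 entries and the union has 20 + 20 - 10 = 30, which matches the COMMON-10 and UNION-30 sizes named in the abstract. With real data, the array passed to `top_k_features` would come from an explainer applied to the trained model, and multiclass SHAP output would first need to be reduced to a single samples × features array.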
Files
Original bundle
- Name:
- SHAP-Guided_Feature_Selection_for_Cross-Dataset_Generalization_in_Network_Intrusion_Detection_Systems_IEEE_ACCESS_2026.pdf
- Size:
- 1.79 MB
- Format:
- Adobe Portable Document Format