THE UNSEEN DATA: A STATISTICAL AND ENGINEERING PERSPECTIVE ON BIASES IN LARGE LANGUAGE MODELS
DOI:
https://doi.org/10.30888/2663-5712.2025-33-01-078
Keywords:
LLM, AI Bias, Data Imbalance, Fairness, Machine Learning, Computational Linguistics, Statistical Bias, Ethical AI
Abstract
The paper argues that bias in large language models (LLMs) is a fundamentally statistical problem rooted in the nature of their training data. The unfiltered datasets used for training are not representative samples of human language, but rather deeply im
References
Alqahtani, T., Badreldin, H. A., Alrashed, M., Alshaya, A. I., Alghamdi, S. S., bin Saleh, K., Alowais, S. A., Alshaya, O. A., Rahman, I., Al Yami, M. S. and Albekairy, A. M. (2023) 'The emergent role of artificial intelligence, natural learning processing, and large language models in higher education and research', Research in Social and Administrative Pharmacy, 19(8), pp. 1236–1242. doi: https://doi.org/10.1016/j.sapharm.2023.05.016.
Chiarello, F., Giordano, V., Spada, I., Barandoni, S. and Fantoni, G. (2024) 'Future applications of generative large language models: A data-driven case study on ChatGPT', Technovation, 133, p. 103002. doi: https://doi.org/10.1016/j.technovation.2024.103002.
De-Arteaga, M. et al. (2019) 'Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting', in Proceedings of the Conference on Fairness, Accountability, and Transparency, ACM. doi: https://doi.org/10.1145/3287560.3287572.
Gallegos, I. O. et al. (2024) 'Bias and Fairness in Large Language Models: A Survey', Computational Linguistics, 50(3), pp. 1097–1179. doi: https://doi.org/10.1162/coli_a_00524.
Guo, Y. et al. (2024) 'Bias in Large Language Models: Origin, Evaluation, and Mitigation', arXiv. doi: https://doi.org/10.48550/arXiv.2411.10915.
Makwana, D., Engineer, P., Dabhi, A. and Chudasama, H. (2023) 'Sampling Methods in Research: A Review', 7, pp. 762–768.
Mesko, B. (2023) 'The ChatGPT (Generative Artificial Intelligence) Revolution Has Made Artificial Intelligence Approachable for Medical Professionals', Journal of Medical Internet Research, 25, p. e48392. doi: https://doi.org/10.2196/48392.
Noguer I Alonso, M. (2024) 'Large Language Models in Finance: Reasoning', SSRN. doi: http://dx.doi.org/10.2139/ssrn.5048316.
Sakaguchi, K., Le Bras, R., Bhagavatula, C. and Choi, Y. (2020) 'WinoGrande: An Adversarial Winograd Schema Challenge at Scale', in Proceedings of the AAAI Conference on Artificial Intelligence, 34(05), pp. 8732–8740. doi: https://doi.org/10.1609/aaai.v34i05.6399.
Tumanov, O. O. (2019) 'Aspects of Using Social Media in Research', Scientific Bulletin of the National Academy of Statistics, Accounting and Audit, (4), pp. 24–29. doi: https://doi.org/10.31767/nasoa.4.2019.03.
Tumanov, O. O. (2019) 'Social media as an object of statistical research', Business Inform, (12), pp. 8–14. doi: https://doi.org/10.32983/2222-4459-2019-12-8-14.
Tumanov, O. O. (2020) 'Statistical methods for analyzing social media data', Business Inform, (2), pp. 266–272. doi: https://doi.org/10.32983/2222-4459-2020-2-266-272.
License
Copyright (c) 2025 The Authors

This work is licensed under a Creative Commons Attribution 4.0 International License.


