ISLS 2024

Fine-Tuning Large Language Models for Data Augmentation to Detect At-Risk Students in Online Learning Communities


Li, H., & Botelho, A. F. (2024, June). Fine-Tuning Large Language Models for Data Augmentation to Detect At-Risk Students in Online Learning Communities. In Clarke-Midura, J., Kollar, I., Gu, X., & D'Angelo, C. (Eds.), Proceedings of the 17th International Conference on Computer-Supported Collaborative Learning - CSCL 2024 (pp. 441-442). International Society of the Learning Sciences, Buffalo, NY, USA. https://doi.org/10.22318/cscl2024.208036

Poster

Final_ISLS2024_poster.pdf

This is the poster we presented during the poster session at ISLS 2024.

Full text

Li, H., & Botelho, A. F. (2024, June). Fine-Tuning Large Language Models for Data Augmentation to Detect At-Risk Students in Online Learning Communities.pdf

Here's the full text of our study.

Authors

Chip is a Ph.D. student at the University of Florida interested in learning analytics, educational data mining, and leveraging AI, LLMs, and NLP technologies to enhance learning experiences and support student success.

[CV] [Publications]

Anthony is an Assistant Professor at the University of Florida researching educational technologies that blend learning theory and quantitative methods to support teachers and students.

[CV] [Publications]

Content Overview

Workflow for Data Augmentation Using Fine-Tuning
