Scalable and Explainable Learner-Video Interaction Prediction using Multimodal Large Language Models

📰 ArXiv cs.AI

Researchers propose a scalable, explainable model that uses multimodal large language models to predict how learners interact with educational videos

Level: Advanced · Published 7 Apr 2026
Action Steps
  1. Collect video content and learner interaction data
  2. Preprocess data using multimodal large language models
  3. Train a predictive model to forecast watching, pausing, skipping, and rewinding behavior
  4. Evaluate model performance and interpret results to inform instructional design decisions
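The steps above can be sketched as a minimal prediction pipeline. Everything here is illustrative: the `segment_type` feature stands in for whatever content descriptors a multimodal LLM would extract in step 2, and the majority-class baseline stands in for the paper's actual predictive model in step 3.

```python
from collections import Counter
from typing import List, Tuple

# Interaction labels from the task description: watch, pause, skip, rewind.
ACTIONS = ["watch", "pause", "skip", "rewind"]

def train_baseline(examples: List[Tuple[dict, str]]) -> dict:
    """Fit a per-segment-type majority-class baseline.

    `examples` pairs a feature dict (here just a hypothetical
    'segment_type', assumed to be extracted upstream by a multimodal
    LLM) with the observed learner action.
    """
    by_type: dict = {}
    for features, action in examples:
        by_type.setdefault(features["segment_type"], Counter())[action] += 1
    # For each segment type, predict the action seen most often.
    return {seg: counts.most_common(1)[0][0] for seg, counts in by_type.items()}

def predict(model: dict, features: dict) -> str:
    """Predict the most likely interaction for a new segment."""
    return model.get(features["segment_type"], "watch")  # default: keep watching

# Toy interaction log (entirely illustrative, not the paper's dataset).
log = [
    ({"segment_type": "lecture"}, "watch"),
    ({"segment_type": "lecture"}, "watch"),
    ({"segment_type": "derivation"}, "rewind"),
    ({"segment_type": "derivation"}, "pause"),
    ({"segment_type": "derivation"}, "rewind"),
    ({"segment_type": "recap"}, "skip"),
]

model = train_baseline(log)
print(predict(model, {"segment_type": "derivation"}))  # rewind
print(predict(model, {"segment_type": "intro"}))       # unseen type -> watch
```

A baseline like this is also where step 4's interpretation begins: segment types whose majority action is "rewind" or "pause" are candidate signals of high cognitive load worth flagging to instructional designers.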
Who Needs to Know This

Data scientists and AI engineers gain a novel approach to predicting learner behavior, while instructional designers can apply the resulting insights to improve educational video content

Key Insight

💡 Multimodal large language models can be used to predict learner-video interaction and provide insights into cognitive load and instructional design quality

Read full paper →