Quantitative Introspection in Language Models: Tracking Internal States Across Conversation

ArXiv ID: 2603.18893

Summary: This research explores whether LLMs can track their own emotive states (wellbeing, interest, focus, impulsivity) via numeric self-reports. Using logit-based readouts of these reports, the study finds a strong causal coupling between the reported values and concept-matched internal states identified by linear probes. Introspective capacity scales with model size, suggesting numeric self-report as a tool for safety and interpretability.
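A logit-based numeric self-report can be read out as an expectation over the probability mass the model places on candidate rating tokens at the answer position. The sketch below illustrates that idea only; the function name, rating scale, and toy logit values are hypothetical stand-ins for a real forward pass, not details from the paper:

```python
import math

def expected_rating(logits_by_rating):
    """Softmax over candidate rating tokens, then take the expectation.

    `logits_by_rating` maps each numeric rating (here 0-10) to the
    model's logit for that rating token at the answer position.
    """
    m = max(logits_by_rating.values())  # subtract max for numerical stability
    exps = {r: math.exp(v - m) for r, v in logits_by_rating.items()}
    z = sum(exps.values())
    # Expected rating under the softmax distribution over rating tokens
    return sum(r * e / z for r, e in exps.items())

# Toy logits peaked at rating 7 (made up for illustration)
logits = {r: -1.5 * abs(r - 7) for r in range(11)}
print(f"{expected_rating(logits):.2f}")
```

Taking an expectation rather than the argmax token yields a continuous score, which makes small shifts in the model's self-report distribution visible across turns of a conversation.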

Read on ArXiv