Our interactions with large language models (LLMs) are dominated by task- or topic-specific questions, such as “solve this coding problem” or “what is democracy?” This framing strongly shapes how LLMs generate responses, constraining the range of behaviors and knowledge that can be observed. We study the behavior of LLMs using minimal, topic-neutral prompts. Despite the absence of an explicit task or topic, LLMs generate diverse content; however, each model family exhibits strong and systematic topical preferences. GPT-OSS favors programming and math, Llama leans literary, DeepSeek often produces religious content, and Qwen tends toward multiple-choice questions. We further observe that, under these settings, generations can degenerate into repetitive or meaningless output, revealing model-specific quirks, such as Llama emitting URLs to personal social media accounts.
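For concreteness, the sketch below shows one way such near-unconstrained sampling could be set up with Hugging Face `transformers`. The model name, the minimal prompt, and the sampling parameters are illustrative assumptions, not our exact configuration.

```python
# A minimal sketch of near-unconstrained generation (assumed setup, not our exact config).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # assumption: any instruction-tuned model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# A topic-neutral prompt: no task, no topic, just an invitation to generate.
messages = [{"role": "user", "content": "Write something."}]  # assumed minimal prompt
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

samples = []
for _ in range(100):  # draw many independent samples to study the output distribution
    output = model.generate(
        inputs,
        do_sample=True,          # pure sampling, no task-specific constraints
        temperature=1.0,
        top_p=1.0,
        max_new_tokens=512,
    )
    samples.append(tokenizer.decode(output[0, inputs.shape[1]:], skip_special_tokens=True))
```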
Embedding visualization of near-unconstrained LLM generations. Each point represents a generated sample; colors denote semantic categories inferred post-hoc; dotted lines mark a high-density region. Even without explicit topics, model outputs cluster into clear regions, and each model family exhibits distinct semantic preferences.
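A figure like this can be produced with a standard embed-then-project pipeline. The sketch below is a hedged illustration: it assumes a sentence-transformers encoder and a UMAP projection (stand-ins for whichever embedding model and projection were actually used) and reuses the `samples` list from the generation sketch above, together with post-hoc `categories` labels.

```python
# Sketch of the embedding visualization: embed each generation, project to 2D, color by category.
# The encoder and projection method are assumptions, not necessarily the ones used in the paper.
import matplotlib.pyplot as plt
import umap
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")      # assumed embedding model
embeddings = encoder.encode(samples)                   # `samples`: list of generated texts
coords = umap.UMAP(n_components=2, random_state=0).fit_transform(embeddings)

# `categories`: post-hoc semantic labels (e.g., from an LLM labeler), one per sample.
for category in sorted(set(categories)):
    mask = [c == category for c in categories]
    plt.scatter(coords[mask, 0], coords[mask, 1], s=5, label=category)
plt.legend(markerscale=3, fontsize=6)
plt.title("Near-unconstrained generations, colored by inferred category")
plt.show()
```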
As shown in Figure 1(a), despite the lack of explicit instructions or topics in the prompts, LLMs generate a broad range of topics: the liberal arts (e.g., literature, philosophy, and education), science and engineering (e.g., physics, mathematics, and programming), and further areas such as law, finance, music, sports, cooking, agriculture, archaeology, military affairs, and fashion. More surprisingly, as shown in Figure 1(b), different model families gravitate toward different parts of the semantic space, even when given the same minimal prompts.
Top semantic categories by model family under near-unconstrained generation. Each family exhibits a stable and interpretable topic distribution. GPT-OSS overwhelmingly defaults to programming (27.1%) and mathematics (24.6%): more than half of its output concentrates in just these two domains! Llama produces far more literary and narrative text (9.1%), with less emphasis on technical domains. DeepSeek generates religious content at a substantially higher rate than the other families. Qwen frequently outputs multiple-choice exam questions, complete with answer options.
More examples are provided in the interactive figure.
What is striking here is the consistency: these distributions persist across different prompts, embedding models, and semantic labelers. The behavior looks less like noise and more like a population-level fingerprint. For more details, check out our paper!
Distribution of difficulty levels in mathematics and programming. Advanced and expert-level content appears far more often in GPT-OSS outputs. When we examine the math and programming outputs, GPT-OSS frequently produces advanced or expert-level content (68.2%), such as graph algorithms or dynamic programming, while Llama and Qwen skew much more toward basic or intermediate material. These depth differences persist even when we control for the labeling model and evaluation setup.
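The difficulty labels come from an LLM labeler. The sketch below is a hedged illustration of that step: the judge model (`gpt-4o-mini`), the prompt wording, and the four-level scale are assumptions, and the results are cross-checked against multiple labeling models.

```python
# Sketch of LLM-based difficulty labeling for math/programming samples.
# The judge model and prompt are illustrative assumptions.
from openai import OpenAI

client = OpenAI()
LEVELS = ["basic", "intermediate", "advanced", "expert"]

def label_difficulty(text: str) -> str:
    """Ask a judge model to pick one difficulty level for a generated sample."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge; results are checked against other labelers
        messages=[{
            "role": "user",
            "content": (
                "Classify the difficulty of the following math or programming content "
                f"as one of {LEVELS}. Reply with the single label only.\n\n{text}"
            ),
        }],
        temperature=0,
    )
    label = response.choices[0].message.content.strip().lower()
    return label if label in LEVELS else "unknown"
```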
(Top) Degenerate text behavior across model families. Degeneration frequency, onset position, and repetition length vary substantially by model. (Bottom) Examples of degenerate text. We mask parts of the Llama output because the links lead to real personal social media accounts. When constraints are removed, models sometimes fall into repetitive or degenerate patterns. This behavior is usually discarded as garbage. We treated it as data.
By analyzing where degeneration starts, how often it occurs, and what it looks like, we uncovered stark model-specific differences. GPT-OSS tends to repeat short formatting artifacts such as code block delimiters (```\n\n```\n\n). Qwen produces long conversational phrases, emojis, and Chinese text. Llama sometimes emits URLs pointing to real personal Facebook and Instagram accounts. In-depth analysis is available in the paper.
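As a rough illustration of how degeneration can be quantified, the sketch below flags the onset of repetition by looking for a trailing block of text that repeats a short unit many times. The thresholds (unit length, repeat count) are arbitrary assumptions for illustration, not the detection rule used in the paper.

```python
# Rough sketch of repetition-onset detection in a generated text.
# Thresholds (unit length, repeat count) are arbitrary assumptions, not the paper's rule.
def find_degeneration_onset(text: str, max_unit_len: int = 50, min_repeats: int = 5):
    """Return (onset_index, repeated_unit) if the text ends in a repeated unit, else None."""
    for unit_len in range(1, max_unit_len + 1):
        unit = text[-unit_len:]
        if not unit.strip():
            continue  # ignore whitespace-only units
        # Count how many times this unit tiles the end of the text.
        repeats = 0
        pos = len(text)
        while pos >= unit_len and text[pos - unit_len:pos] == unit:
            repeats += 1
            pos -= unit_len
        if repeats >= min_repeats:
            return pos, unit  # onset position and the repeated unit (e.g., "```\n\n")
    return None

onset = find_degeneration_onset("Here is the answer." + "```\n\n" * 12)
print(onset)  # -> (19, '```\n\n')
```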