Across research roundups and hands-on training notes, the common thread is how Transformers consolidated their lead through scale, tooling, and growing interpretability. Gwern.net's 2021 newsletters track rapid acceleration, from GPT-3's API-driven adoption to a wave of 100B+ industrial models such as Naver's HyperCLOVA, alongside new architectures (Perceiver, Set Transformers) and multimodal advances (CLIP, SEER). Complementing the macro view, a checkpoint-by-checkpoint GPT-2-style training experiment shows coherence emerging quickly and inheriting web-data biases, underscoring evaluation and dataset-curation concerns. A technical explainer on encoder–decoder attention ties these trends back to the core mechanism that makes large models practical to scale and stack.
Gwern.net's March 2021 newsletter aggregates recent updates to the site and a curated list of AI, genetics, and broader science links. Site updates include mobile "popins" and new recursive Wikipedia popups; the AI roundup covers the dissection of CLIP's multimodal neurons (Goh et al.), large-scale self-supervised vision work such as SEER, critiques of ImageNet transfer learning, Vision Transformer robustness studies, and discussion of GPT-3 API adoption. The newsletter also links reinforcement-learning papers, autonomous-vehicle simulation analyses, and debates about global AI leadership. Other topics include genetics GWAS findings, evolutionary biology, and meta-science items. It matters as a concise signal of influential papers and trends for researchers and practitioners tracking ML, self-supervision, and AI tool adoption.
Gwern.net's April 2021 newsletter compiles links and commentary on AI/ML research and large-model developments, noting GPT-3's continuing impact and a wave of giant language and multimodal models from industry (OpenAI, Naver HyperCLOVA, Huawei PanGu-α, Google LaMDA/MUM, Alibaba PLUG). It highlights papers on Set Transformers, Perceiver, Z-IL (local learning rules matching backprop), super-convergence, and creative uses of generative models (CogView, VideoGPT, GODIVA). The newsletter flags trends: continued Transformer dominance, rapid scale-ups to 100B+ parameters, Chinese and multinational efforts, and open checkpoints and releases. This matters for researchers, engineers, and policy watchers because it maps where compute, datasets, and architectures are driving AI capability gains, and where reproducibility, efficiency, and governance questions will arise.
A developer trained a GPT-2-small-style transformer (163M parameters) on 3.2B tokens from Hugging Face's FineWeb dataset, saving 57 checkpoints across a two-day run to sample how generations evolve during training. Prompted with "Every effort moves you," each checkpoint's output evolves from token salad with partial words, to common-token guesses, to plausible but generic sentences, and finally to content reflecting the web-scraped training distribution (business/self-help tones). The piece highlights how quickly modern token-based LLMs acquire fluency and topical biases compared with older character-level RNNs, and shows how the prompt seed and training data shape early emergent coherence. It matters for model evaluation, dataset curation, and understanding emergent behaviors during training.
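To make the methodology concrete, here is a minimal sketch of that checkpoint-sampling loop, assuming Hugging Face-style `checkpoint-*` directories under a hypothetical `runs/gpt2-small-fineweb` path; the author's actual layout, step counts, and sampling settings are not given in the post:

```python
from pathlib import Path
from transformers import AutoModelForCausalLM, AutoTokenizer

PROMPT = "Every effort moves you"
run_dir = Path("runs/gpt2-small-fineweb")  # hypothetical run directory

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # GPT-2 BPE tokenizer
inputs = tokenizer(PROMPT, return_tensors="pt")

# Sort checkpoints numerically by training step so generations appear
# in training order (lexicographic sort would put 1000 before 200).
checkpoints = sorted(run_dir.glob("checkpoint-*"),
                     key=lambda p: int(p.name.split("-")[-1]))

for ckpt in checkpoints:
    model = AutoModelForCausalLM.from_pretrained(ckpt)
    model.eval()
    out = model.generate(
        **inputs,
        max_new_tokens=40,
        do_sample=True,    # sample rather than greedy-decode, to expose
        temperature=0.8,   # what the model's output distribution looks like
        top_k=50,
    )
    print(f"{ckpt.name}: {tokenizer.decode(out[0], skip_special_tokens=True)!r}")
```

Fixing the prompt and sampling settings across all 57 checkpoints isolates the training step as the only variable, which is what makes the progression from token salad to fluent generic prose visible.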
The article explains how encoder–decoder attention in transformers combines value vectors: it scales each source token's value vector by a softmax attention weight (derived from queries and keys) and sums the results to produce the encoder–decoder attention output. It emphasizes that the weight matrices for queries, keys, and values in encoder–decoder attention differ from those in self-attention, yet are shared across token positions, which is what lets the layer handle variable input and output lengths. The piece notes that encoder–decoder attention layers can be stacked, like self-attention layers, to model more complex phrases, and signals that follow-up articles will provide additional detail. The post also includes a brief promotional blurb for Installerpedia, a community installer tool.
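As a concrete illustration of that weighted sum, here is a minimal single-head cross-attention sketch in NumPy; the dimensions, random weights, and variable names are illustrative, not taken from the article:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
src_len, tgt_len = 5, 3  # encoder and decoder sequence lengths

enc_out = rng.normal(size=(src_len, d_model))  # encoder outputs (keys/values)
dec_hid = rng.normal(size=(tgt_len, d_model))  # decoder states (queries)

W_q = rng.normal(size=(d_model, d_model))  # cross-attention projections:
W_k = rng.normal(size=(d_model, d_model))  # distinct from self-attention's
W_v = rng.normal(size=(d_model, d_model))  # weights, but reused at every position

Q = dec_hid @ W_q   # queries come from the decoder
K = enc_out @ W_k   # keys and values come from the encoder
V = enc_out @ W_v

scores = Q @ K.T / np.sqrt(d_model)  # (tgt_len, src_len) similarity scores
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # softmax over source positions

output = weights @ V      # weighted sum of value vectors
print(output.shape)       # (tgt_len, d_model)
```

Because W_q, W_k, and W_v are applied identically at every position, the same layer accepts any src_len and tgt_len, which is the variable-length property the article highlights.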