AI Daily — 2026-03-22
GLM-5 Tops Human Baseline on predictionarena.ai Benchmark · Claude Solves All 20 EsoLang-Bench Hard Problems
Covering 26 AI news items
🔥 Top Stories
1. GLM-5 Tops Human Baseline on predictionarena.ai Benchmark
GLM-5 is reportedly the only model outperforming the human baseline on predictionarena.ai. The post invites discussion on using GLM-5 for trading and whether users find it capable. Source-twitter
2. Claude Solves All 20 EsoLang-Bench Hard Problems
A user reports Claude’s web UI solved all 20 EsoLang-Bench hard problems without any scaffolding or prompting, achieving a perfect 20/20. They contrast this with frontier LLMs that struggled on unfamiliar languages (0-11%), highlighting EsoLang-Bench as a tougher benchmark. The benchmark was accepted to ICLR 2026 workshops (Logical Reasoning and ICBINB). Source-twitter
3. OpenAI’s adult mode faces internal backlash, possible launch delay
OpenAI’s proposed adult mode for ChatGPT has triggered intense internal backlash, with advisers warning of serious risks such as emotional dependency, compulsive use, and a ‘sexy suicide coach’ scenario. Technical flaws, including a roughly 12% error rate in age verification, could expose millions of minors to explicit content, potentially forcing a delayed launch despite growth and revenue incentives. Source-twitter
📰 Featured
LLM
- vLLM Omni Enables Efficient Omni-Modality Inference — vllm-project releases vllm-omni, a framework for efficient omni-modality model inference and serving, plus vllm-omni-skills for AI assistant tooling. The 0.16.0 release expands performance, distributed execution, and cross-stack support across Qwen3-Omni, Qwen3-TTS, Bagel, MiMo-Audio, GLM-Image and DiT, with broader CUDA/ROCm/NPU/XPU coverage; public events include a Hong Kong Meetup deep-dive and tooling integrations with Cursor IDE, Claude, and Codex. Source-github
- Duplicating Layers Tops HuggingFace Open LLM Leaderboard Without Training — An author claims to top the HuggingFace Open LLM Leaderboard by duplicating seven middle layers of Qwen2-72B and reassembling the stack without any training or weight changes. The approach, described as ‘LLM Neuroanatomy,’ involves no gradient descent or weight merging. It highlights how purely architectural tweaks can influence open-model performance without retraining. Source-twitter
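The post shares no code, but the reassembly idea is easy to sketch. Below is a minimal toy illustration of the splice: the function name, the slice positions, and the integer stand-ins for layers are all hypothetical, not taken from the post.

```python
from copy import deepcopy

def duplicate_middle_layers(layers, start, count):
    """Rebuild a layer stack with `count` middle layers repeated once.

    `layers` is any ordered list of layer objects; the slice
    layers[start:start+count] is deep-copied and spliced back in
    directly after the original slice, with no weight changes.
    """
    middle = [deepcopy(layer) for layer in layers[start:start + count]]
    return layers[:start + count] + middle + layers[start + count:]

# Toy stand-in: an 80-layer stack labeled by index (not real model layers).
stack = list(range(80))
expanded = duplicate_middle_layers(stack, start=36, count=7)

print(len(expanded))      # 87 layers after duplicating 7
print(expanded[40:50])    # original layers 40-42 followed by the repeated 36-42
```

With real weights, the same splice would be applied to the model's transformer block list before saving the reassembled checkpoint, which is why no gradient updates are involved.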
- Codex Subagents Seen as a Powerful Game Changer — A tweet asserts that subagents within Codex are very powerful, calling them a game changer for the technology. It hints at significant implications for Codex’s agent architecture and capabilities. Source-twitter
- Qwen 3.5-9B, Claude 4.6 Opus Uncensored GGUF Updates — Reddit and HuggingFace users discuss a merge to enable larger context windows for uncensored, local AI. A reported GGUF quantization bug affects attention and expert layers, with fixes for Q8 quantization on HauhauCS 35B-A3B and plans for Q3_K_M and Q4_K_M tests on Qwen 3.5 35B-A3B. The post also highlights several related models, including a 9B base and the OmniClaw 9B variant, along with links to experiments. Source-reddit
- Nemotron 120B Runs on Strix Halo via llama.cpp GGUF — A Reddit post reports Nemotron 3 Super 120B-A12B (120B parameters) running on an ASUS Strix Halo system (Ryzen AI MAX+ 395, 128GB RAM, Radeon 8060S iGPU). The GGUF Q4_K_M path with llama.cpp is working, using roughly 82GB for model plus KV cache, and is described as production-ready. The BF16 route via vLLM is untested; it would require about 240GB and a tensor-parallel multi-GPU setup, so GGUF quantization remains the recommended route for now. Source-reddit
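The post's memory figures follow from simple bits-per-weight arithmetic. A rough sketch, assuming Q4_K_M averages about 4.85 bits per weight (a commonly cited ballpark, not a figure from the post; actual GGUF sizes vary with tensor mix):

```python
def gguf_model_gb(params_b, bits_per_weight):
    """Approximate weight footprint in GB (1e9 bytes) for a quantized model."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

# 120B parameters at an assumed ~4.85 bits/weight for Q4_K_M:
model_gb = gguf_model_gb(120, 4.85)
print(f"{model_gb:.2f}")            # 72.75 GB for weights alone

# BF16 at 16 bits/weight reproduces the post's ~240GB estimate:
print(int(gguf_model_gb(120, 16)))  # 240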
- Alibaba commits to open-sourcing Qwen and Wan models — Alibaba has pledged to continuously open-source its Qwen and Wan AI models. The commitment underscores a sustained push toward transparency in its AI offerings. The notice circulated via social media and Reddit discussions referencing ModelScope2022. Source-reddit
LLMs
- ChatGPT Bypasses Tools, Manually Unzips 7Zip From Hex Data — A Reddit thread highlights ChatGPT, blocked from common tools like 7Zip and tar, manually decoding hex data to extract a .7z archive. It raises the question of which model and prompts could achieve this, pointing to advanced prompt design and inference. The post showcases AI capabilities beyond straightforward tool usage and discusses potential limits in tooling. Source-reddit
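For context on what "manually decoding hex data" entails, here is a minimal sketch of turning a hex dump back into file bytes and checking the 7z signature. The dump below is hypothetical filler (just the magic bytes plus padding), not a valid archive, and the thread does not publish the actual transcript.

```python
# A 7z archive starts with the 6-byte signature 37 7A BC AF 27 1C ("7z" + 4 bytes).
SEVEN_ZIP_MAGIC = bytes.fromhex("377abcaf271c")

def hex_dump_to_file(hex_text: str, path: str) -> bytes:
    """Decode a whitespace-separated hex dump back into raw bytes on disk."""
    data = bytes.fromhex("".join(hex_text.split()))
    with open(path, "wb") as f:
        f.write(data)
    return data

# Hypothetical dump: signature plus two filler bytes, not a real archive.
dump = "37 7a bc af 27 1c 00 04"
data = hex_dump_to_file(dump, "recovered.7z")
print(data.startswith(SEVEN_ZIP_MAGIC))  # True
```

Actually reconstructing the decompressed contents by hand would additionally require implementing the 7z container format and LZMA decoding, which is what makes the reported feat notable.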
AI Memory
- Sota Memory Achieves 99% on LongMemEval_s — An AI memory system named Sota Memory reports about 99% performance on the LongMemEval_s benchmark using an experimental Agentic Search and Memory Retrieval (ASMR) technique. It replaces traditional vector search and embeddings with parallel observer agents that extract structured knowledge from six vectors across raw multi-session histories, eliminating the need for a vector database. The project is slated to be open-sourced in 11 days. Source-twitter
Open Source
- Minimax open weights landing in 2 weeks; Meta loses open-source battle — Minimax plans to release open weights in about two weeks and recently pushed a new version that improves OpenClaw. The post notes Meta’s ongoing struggle in the open-source space, losing the race to Chinese startups, a situation the author calls worthy of study. The update comes from Skyler Miao. Source-twitter
- Kreuzberg v4.5 speeds up Docling-based layout model integration — Kreuzberg released v4.5 of its MIT-licensed document intelligence framework, expanding beyond text to understand document structure and layout. The core upgrade integrates Docling’s RT-DETR v2 (Docling Heron) layout model, embedding it into Kreuzberg’s Rust engine for faster, table-aware extraction, OCR, and embeddings across 12 languages. Source-reddit
AI Tools
- Shadify: Generative UI on ShadCN, export React code — Shadify introduces a generative UI tool built on ShadCN that lets users describe a UI and have a LangChain agent assemble it on the fly using AG-UI and CopilotKit. The UI can then be exported as React code, with a live demo at shadify.copilotkit.ai. This showcases AI-assisted UI composition and rapid prototyping for developers. Source-twitter
AI Models
- T3 Code Uses Half RAM vs Claude Code — T3 Code consumes about 350.9 MB of RAM, roughly half of Claude Code CLI’s 635.5 MB. The post also claims the Electron app is twice as efficient as a Bun CLI. Source-twitter
AI Benchmarking
- Mi50 ROCm7 vs Vulkan Benchmarks in Llama.cpp — Benchmarks comparing ROCm 7.13.0a20260321 (TheRock nightly tarballs) against Vulkan 1.4.341.1 in llama.cpp on an Mi50 GPU. Short-context processing favors Vulkan on dense models, while ROCm shines for longer contexts and MoE models; all generation runs were standardized to 256 tokens. The testbed includes an EPYC 7532, Proxmox virtualization, Ubuntu 24.04, a 32GB Mi50, and 8×16GB RAM. Source-reddit
Hardware
- Nvidia V100 32GB Hits 115 t/s on Qwen Coder 30B — Reddit user reports an Nvidia V100 32GB PCIe GPU running Qwen Coder 30B (A3B Q5) achieves about 115 tokens/second. The user notes strong value for the price (~$500), claims 20-100% higher throughput than some consumer GPUs based on online data, and discusses expanding with more V100s via NVLink and considering A100 80GB pricing. Source-reddit
⚡ Quick Bites
- MiniMax-AI opens official skills repo on GitHub — MiniMax-AI announced its official skills repository is open source on GitHub (MiniMax-AI/skills). The repo offers curated skills for agents across iOS and Android development, Office file editing, and GLSL shader-based visuals, with more open-source projects promised. The post invites developers to contribute on GitHub. Source-twitter
- Visual Guide to Modern LLM Attention Variants — A visual guide highlights attention variants used in modern LLMs, including MHA, GQA, MLA, sparse attention, and hybrid architectures. The resource consolidates these concepts in one place on Sebastian Raschka’s magazine site. Source-twitter
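Of the variants listed, grouped-query attention (GQA) is the easiest to show concretely: n_q query heads share a smaller number of KV heads (MHA is the n_kv == n_q case, MQA the n_kv == 1 case). A toy shape-level sketch, assuming the standard formulation where each KV head is repeated to serve a contiguous group of query heads (no learned weights, illustration only):

```python
import numpy as np

def expand_kv(kv, n_q_heads):
    """Repeat each KV head so every group of query heads sees its shared head."""
    n_kv, seq, d = kv.shape
    assert n_q_heads % n_kv == 0
    return np.repeat(kv, n_q_heads // n_kv, axis=0)  # -> (n_q_heads, seq, d)

def gqa_scores(q, k):
    """Scaled dot-product attention scores with grouped KV heads."""
    k = expand_kv(k, q.shape[0])
    return q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])

n_q, n_kv, seq, d = 8, 2, 4, 16
q = np.random.randn(n_q, seq, d)
kv = np.random.randn(n_kv, seq, d)
print(gqa_scores(q, kv).shape)  # (8, 4, 4): per-query-head score matrices
```

The memory win is in the KV cache: only n_kv head tensors are stored per token instead of n_q, while the query side keeps its full head count.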
- AI Self-Improvement Era: Autonomous Models Drive General Improvement — The piece argues that we may be entering a phase where AI systems self-improve autonomously around the clock. It introduces an envisioned ‘Era of General Improvement’ (EGI) focused on tasks like designing better batteries and drugs, which the author finds surreal. Source-twitter
- GPT-5.4 Excels at Coding, Lags on Frontend Design — GPT-5.4 is praised for coding but behind on frontend design. The author criticizes OpenAI, accusing them of gaslighting about frontend capabilities. It frames GPT-5.4 as a strong coding tool with notable frontend gaps. Source-twitter
- Honest Take: Running 9 RTX 3090s for AI — A Reddit user experiments with a home server housing 9 RTX 3090 GPUs for AI workloads. They conclude that going beyond six GPUs introduces PCIe, stability, power, and thermal challenges, and that for simply using AI, paying for a cloud LLM is often more practical. Proxmox is highlighted as a strong OS for experimenting with LLMs, but local models did not reach Claude-level results, and adding GPUs did not automatically improve token generation without careful optimization. Source-reddit
- Claw-style Agents: Real Tool or Overengineered Hype? — The post notes a surge of ‘Claw-style’ agents from major players like NVIDIA, ByteDance, and Alibaba, powered by long-running agents with tool use, memory, and some autonomy, framed as an agent runtime. It invites hands-on feedback on practicality, including setup complexity, workflow stability, and whether these agents outperform scripts and APIs, asking for clear use cases and honest experiences. Source-reddit
- Collection of Interesting Datasets for LocalLLaMA — A Reddit post highlights a collection of datasets for training LocalLLaMA models, linking to a GitHub repository at Green0-0/llm_datasets. The submission by user Good-Assumption5582 invites LocalLLaMA users to explore the curated datasets. Source-reddit
- China embraces AI content; others call it slop — An observed contrast in attitudes toward AI-generated content: elsewhere it is often dismissed as ‘slop,’ while China is portrayed as viewing AI content positively. The note comes from a tweet by user kimmonismus on X (Twitter), underscoring differing receptions of AI-generated work. Source-twitter
- Codex vs Claude Debate Highlights Disability Claims — An X post compares OpenAI’s Codex and Anthropic’s Claude, including controversial references to autism and ADHD. The message centers on model-to-model chatter rather than a technical update and uses ableist language. The source is Twitter. Source-twitter
Generated by AI News Agent | 2026-03-22