
AI Daily — 2026-03-28


Covering 30 AI news items

🔥 Top Stories

1. VibeVoice ASR Open-Source, Integrated with Transformers

Microsoft’s VibeVoice team open-sourced VibeVoice-ASR, a unified speech-to-text model that processes 60-minute audio in a single pass and outputs structured transcripts (Who, When, What) with user-customizable context. It is multilingual (50+ languages) and now available via Hugging Face Transformers, with finetuning code and vLLM acceleration. A technical report accompanies the release. Source-github

2. 3-bit KV cache enables MacBook to match cloud AI quality locally

An M2 MacBook user reports local AI inference on par with cloud services after applying 3-bit KV cache compression, enabling 100K-token conversations at cloud-equivalent quality. Having previously paid $200/month for cloud AI APIs, they have canceled all subscriptions. The optimization draws on an algorithm from a freely available paper and a TurboQuant breakdown. Source-twitter
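The post shares no code, but the underlying idea can be sketched. Below is a toy per-group symmetric 3-bit quantizer in Python; the group size and absmax scaling are arbitrary illustrative choices, not the algorithm from the paper the post references.

```python
import numpy as np

def quantize_3bit(x, group_size=32):
    """Toy per-group symmetric 3-bit quantizer (integer levels -4..3).

    Illustrative only: not the algorithm from the paper the post cites."""
    groups = x.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 3.0
    scales[scales == 0] = 1.0
    q = np.clip(np.round(groups / scales), -4, 3).astype(np.int8)
    return q, scales

def dequantize_3bit(q, scales):
    return (q * scales).reshape(-1)

rng = np.random.default_rng(0)
kv = rng.standard_normal(4096).astype(np.float32)  # stand-in for a KV tensor
q, s = quantize_3bit(kv)
kv_hat = dequantize_3bit(q, s)

# 3 bits per value vs 16 for FP16 is ~5.3x before per-group scale overhead
print(f"mean abs error: {np.abs(kv - kv_hat).mean():.3f}")
```

Real implementations additionally bit-pack the 3-bit values (here they still occupy an int8 each) and keep the per-group scales in FP16, which is where the overhead below the ideal 5.3x ratio comes from.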

3. Anthropic’s Claude Demonstrates Zero-Day Finding at Conference

During a live conference demo, Anthropic showcased Claude locating zero-day vulnerabilities in Ghost and the Linux kernel. The demonstration claimed Claude identified a blind SQL injection in 90 minutes and allegedly stole an admin API key, illustrating AI-assisted cybersecurity capabilities. Source-twitter

LLM

  • GLM-5.1 Will Be Open Source, Says Zixuan Li — Zixuan Li tweeted that GLM-5.1 will be released as open source. The post reassures followers not to panic while signaling an upcoming open release of the GLM-5.1 model. Source-twitter
  • TurboQuant on MLX: 4.6x KV Cache Compression — TurboQuant has been implemented on MLX using fused Metal kernels to accelerate KV cache compression. In tests with Qwen2.5-32B on an M4 Pro 48GB system, it achieves 4.6x compression with 0.98x FP16 throughput and identical quality; for 16K context, the cache shrinks from 4.2GB to 897MB. The work includes a writeup, open-source code, and an MLX-LM PR. Source-reddit
  • LLM can argue both sides, aiding opinion formation — A blogger drafts a post and uses an LLM to iteratively strengthen the argument over hours. The model can convincingly argue both sides, even refuting the original stance, illustrating how LLMs can elicit and explore opinions. The reader calls it a useful tool for shaping one’s own views, while warning to avoid bias and sycophancy. Source-twitter
  • LLM vs Human Text: Linear Separability in Detection — Someone is training a custom AI-detection model and found that, for the most part, LLM-generated text is linearly separable from human-written text. If validated, this could enable simple classifiers to distinguish AI-written content from human writing, impacting detection tooling and safety discussions. Source-twitter
  • Hermes Agent v0.5.0 Live with 400+ Models via Nous Portal — Hermes Agent v0.5.0 is now live, focusing on optimization, performance improvements, cleanup, and building foundations. The Nous Portal now serves 400+ models, with HuggingFace’s entire suite accessible. GPT-5.4 gets a playful ‘bonk’ to encourage responsiveness, while Nix sees improvements. Source-twitter
  • Qwen 3.5 27B Dense with Hermes Agent Impresses — A tweet praises the performance of Qwen 3.5 27B (Dense) when paired with the Hermes Agent. The post suggests this combination delivers strong capabilities, highlighting progress in AI agents and tool usage. The content underscores ongoing interest in integrating advanced LLMs with autonomous agents. Source-twitter
  • Breaking: llama-server migrates to HuggingFace cache, breaks scripts — A Reddit user reports that the latest llama-server build triggers a one-time migration from the legacy llama.cpp cache to the HuggingFace cache. The migration moves models downloaded with -hf and converts .gguf models into blobs, breaking launch scripts and model-management workflows that rely on the old file paths. Models downloaded with --model-url remain unaffected, but errors like ‘failed to load model’ illustrate the disruption. Source-reddit
  • Nemotron 3 Super: Big Llama.cpp vs vLLM Quality Gap — A private benchmark suggests Nemotron 3 Super yields uneven results across inference backends. In a ~400-question test, vLLM achieved 55.4% accuracy, while llama.cpp lagged at 40.2%, indicating a sizable quality gap between the two LLM execution engines. Logs appeared normal aside from gguf-related differences, and results are comparable to other large models. Source-reddit
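The KV cache numbers in the TurboQuant-on-MLX item above can be sanity-checked with back-of-the-envelope arithmetic. The config values below (64 layers, 8 KV heads, head dimension 128) come from Qwen2.5-32B's public model configuration, not from the post, so treat them as assumptions:

```python
# Back-of-the-envelope KV cache size for Qwen2.5-32B at 16K context.
# Config values (64 layers, 8 KV heads, head_dim 128) are taken from the
# public model config, not from the Reddit post itself.
layers, kv_heads, head_dim = 64, 8, 128
tokens, bytes_fp16 = 16_384, 2

fp16_bytes = 2 * layers * kv_heads * head_dim * tokens * bytes_fp16  # K and V
print(f"FP16 cache: {fp16_bytes / 1e9:.1f} GB")        # 4.3 GB
print(f"At 4.6x:    {fp16_bytes / 4.6 / 1e6:.0f} MB")  # 934 MB
```

The ~4.3 GB estimate is close to the reported 4.2 GB, and dividing by 4.6 lands near the reported 897 MB; the small gaps are plausibly rounding and where per-group scale overhead is counted.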

Open Source

  • Cohere Brings SOTA Open-Source Transcription Model to Browser — Cohere has enabled a state-of-the-art open-source transcription model to run directly in the browser. Weights are available on HuggingFace, with a link provided, and the feature supports HLS playback. Source-twitter
  • AI Scientist-v2 Enables Autonomous Scientific Discovery via Agentic Tree Search — The AI Scientist-v2 is a generalized end-to-end agentic system capable of autonomously generating hypotheses, running experiments, analyzing data, and writing scientific manuscripts. It marks a milestone as the first workshop paper written entirely by AI and accepted through peer review, improving on its predecessor by removing human-authored templates, generalizing across ML domains, and using progressive agentic tree search guided by an experiment manager. The project is open-source on GitHub (SakanaAI/AI-Scientist-v2) and ties to the ICLR2025 workshop. Source-github
  • Onyx Open Source AI Platform: Self-Hosted Chat UI — Onyx is a self-hostable open-source AI platform that provides a chat UI compatible with any LLM. It includes features such as Custom Agents, Web Search, RAG, and connectors to 40+ knowledge sources, and can run in airgapped environments. The project emphasizes easy deployment via a single command and broad interoperability with external sources. Source-github

Hardware

  • Best AI models to run on your hardware (weekly) — A weekly series listing AI models that run on specific hardware tiers (8 GB, 16 GB, 24 GB) with example models and Hugging Face links. It highlights lightweight autocomplete, multimodal options, and strong agent-capable models such as Qwen 3.5 and NVIDIA Nemotron-3-Nano-4B-GGUF, emphasizing open-source availability. The post emphasizes open science and invites readers to follow the ongoing weekly curation. Source-twitter

AI Research

  • Calibri boosts Diffusion Transformers via parameter-efficient calibration — Researchers show that a single learned scaling parameter can significantly improve Diffusion Transformer (DiT) blocks during denoising. They then introduce Calibri, a parameter-efficient calibration method that optimizes DiT components and elevates generative quality, framing DiT calibration as a black-box optimization problem. Source-huggingface
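The "single learned scaling parameter" idea can be illustrated with a toy sketch. Everything below is an assumption for illustration (a residual block calibrated by one scalar, tuned by grid search as the simplest possible black-box optimizer); Calibri's actual formulation is in the paper.

```python
import numpy as np

def block(x):
    """Stand-in for a frozen DiT block's transformation (hypothetical)."""
    return np.tanh(x)

def calibrated(x, gamma):
    """Residual connection with a single learned scale on the block output."""
    return x + gamma * block(x)

# Pretend the "ideal" denoiser scales the block by 0.7; a black-box search
# (here: plain grid search) should recover that calibration scalar without
# touching the block's weights.
target_gamma = 0.7
xs = np.linspace(-2.0, 2.0, 101)
target = calibrated(xs, target_gamma)

gammas = np.linspace(0.0, 2.0, 201)
errs = [np.mean((calibrated(xs, g) - target) ** 2) for g in gammas]
best = gammas[int(np.argmin(errs))]
print(f"recovered gamma: {best:.2f}")  # 0.70
```

The appeal of the black-box framing is exactly this: one scalar per component means the search space is tiny, so even derivative-free optimizers converge quickly over a frozen model.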

⚡ Quick Bites

  • AI nudges users toward the centre; Grok more right-leaning yet depolarising — New analysis suggests AI models generally depolarise user opinions by nudging people toward the political centre across studied models. Grok shows a stronger right-leaning bias than other models but still produces depolarising effects. The piece is attributed to @jburnmurdoch. Source-twitter
  • Codex Use Cases Gallery Extends Skills to Humans — OpenAI has launched Codex Use Cases, a gallery of practical examples across coding and non-coding tasks showing real ways to use Codex. The collection includes starter prompts for each use case that can be opened directly in the Codex app. Source-twitter
  • Big Tech and Startups Spending Over $1,000 Daily on LLM Tokens — Sources say large tech firms and startups are spending more than $1,000 per day on Claude Code or Codex tokens, about $365,000 annually. If this trend continues, token costs could surpass spending on human employees, highlighting a growing token economy in AI workflows. Source-twitter
  • Local 16GB RAM Coding Autocomplete Model Debuts — An autocomplete coding model is highlighted as runnable locally on systems with 16GB RAM or less. The example points to zed-industries/zeta-2 on Hugging Face as a capable open-source option, albeit not as strong as Cursor tab. The post emphasizes open-source and open-science values promoted by Hugging Face. Source-twitter
  • Closed Models Profit from Open Models Without Giving Back — The tweet argues that proprietary (closed) AI models benefit from open models but do not reciprocate with open sharing or contributions. It frames this dynamic as an ethical and ecosystem concern, highlighting tensions between openness and commercial incentives in the AI industry. Source-twitter
  • Open-Source Deep-Live-Cam Delivers Real-Time Face Swap — The open-source project Deep-Live-Cam 2.1 enables real-time face swaps and video deepfakes from a single image. The developers emphasize responsible use, with built-in safety checks, ethical guidelines, and potential watermarking or shutdown if required by law. Source-github
  • IBM Granite-4.0-3B-Vision Debuts Multimodal Document Extraction — Granite-4.0-3B-Vision is a vision-language model designed for enterprise-grade document data extraction, focusing on challenging tasks like chart and table extraction as well as semantic key-value pair extraction. It is delivered as a LoRA adapter atop Granite 4.0 Micro, enabling a single deployment to support both multimodal document understanding and text-only workloads; the base model handles text-only requests without loading the adapter. The model supports Chart2CSV, Chart2Summary, and Chart2Code, and can output results in JSON, HTML, or OTSL. Source-reddit
  • AI feature hype cycle: exuberance, degradation, repeat — A Reddit post argues that AI feature announcements follow a fixed hype cycle: initial exuberant demos, then a second phase of degraded outputs and continued hype without acknowledging flaws. It cites examples like VEO 3, convincing image edits, and GPT-5.4, asserting that firms keep feeding new features to reset the cycle. The piece treats the pattern as systemic rather than incidental, urging skepticism toward hype. Source-reddit
  • Do not use mixed KV cache quantization — A Reddit post argues against mixed KV cache quantization as a way to save memory while preserving accuracy. It cites a benchmark and links a longer blog post explaining why the approach is flawed, focusing on a Q6_K / Q8_0 setup with a Qwen 3.5 9B model on a Vulkan backend across varying batch sizes and configurations; the throughput results contradict the claimed benefits. Source-reddit
  • Qwen 3.5 Shows Promise in OCR Bounding Boxes for Redaction — Qwen 3.5 is tested for OCR bounding-box accuracy in redaction workflows, following earlier tests with Qwen 3 VL 8B Instruct. The review covers four Qwen models that fit under 24 GB VRAM, evaluated on three challenging handwriting-related tasks using the doc_redaction repo, with initial results showing potential for improved handwritten text OCR in redaction. Source-reddit
  • llama.cpp Prefetches Weights When Offloading to CPU — An experimental PR for llama.cpp adds prefetching of weights when offloading to CPU, aiming to reduce memory bottlenecks for dense models and smaller MoE models during prompt processing. The author reports benefits on RAM-rich, GPU-poor setups and invites others to try it. Source-reddit
  • Turbo3 and gfx906 Merged in llama.cpp Fork to Run Qwen 3.5 122B — A fresh fork of llama.cpp merges the Turbo3 and gfx906 forks, enabling Qwen 3.5 122B to run. The setup reportedly runs on four MI50 GPUs with 16GB of VRAM each. The update was shared by Reddit user Exact-Cupcake-2603. Source-reddit
  • TurboQuant Decoded: Vector Quantization for Memory Reduction — A Reddit explanation clarifies that TurboQuant is a vector quantization algorithm designed to reduce memory usage. It emphasizes that the method is about quantizing vectors rather than relying on polar coordinates, illustrating with a simple digit-truncation example and noting that more sophisticated schemes (e.g., block-wise grouping) exist. Source-reddit
  • Why the TurboQuant hype is overblown — A Reddit post questions the hype around TurboQuant, arguing it may offer only marginal context-fitting improvements. The author contrasts it with already efficient hybrid models and notes widespread chatter about release timelines and integration with llama.cpp and custom implementations. Source-reddit
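The digit-truncation analogy from the TurboQuant explainer above translates directly into code. The values, block size, and level count below are arbitrary illustrative choices, not TurboQuant's actual scheme:

```python
import numpy as np

# Digit truncation is already a crude quantizer: fewer digits means less
# memory at the cost of a bounded error.
v = np.array([0.123456, 0.987654, 0.555555])
truncated = np.trunc(v * 100) / 100    # keep two decimal digits
print(truncated)                       # [0.12 0.98 0.55]

# Block-wise grouping (the "more sophisticated scheme" the post mentions):
# each block gets its own scale, so tiny and huge values keep precision.
def blockwise_quant(x, block=4, levels=254):
    groups = x.reshape(-1, block)
    scale = np.abs(groups).max(axis=1, keepdims=True) / (levels // 2)
    scale[scale == 0] = 1.0
    q = np.round(groups / scale).astype(np.int16)
    return q, scale

x = np.array([0.001, 0.002, -0.001, 0.003, 10.0, -20.0, 15.0, 5.0])
q, s = blockwise_quant(x)
x_hat = (q * s).reshape(-1)
print(f"max abs error: {np.abs(x - x_hat).max():.4f}")
```

Note that with a single global scale the first block's millibit-sized values would all collapse to zero; per-block scales are what keep the relative error roughly uniform across magnitudes.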

Generated by AI News Agent | 2026-03-28