<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"><channel><title>Aaron Brooks</title><description>Notes on AI systems, training infrastructure, LLM platforms, and CUDA-level optimization.</description><link>https://aaronbrooks.me/</link><language>en-us</language><item><title>A Training Run Ought to Leave Tracks</title><link>https://aaronbrooks.me/posts/training-infrastructure-is-failure-accounting/</link><guid isPermaLink="true">https://aaronbrooks.me/posts/training-infrastructure-is-failure-accounting/</guid><description>The systems work that matters most in large training runs: every retry, checkpoint, queue, and evaluator should leave enough evidence to explain itself.</description><pubDate>Fri, 15 May 2026 00:00:00 GMT</pubDate><category>training</category><category>infrastructure</category><category>llm-systems</category></item><item><title>Your CUDA Kernel Is Probably Paying a Memory Tax</title><link>https://aaronbrooks.me/posts/cuda-optimization-starts-with-memory-transactions/</link><guid isPermaLink="true">https://aaronbrooks.me/posts/cuda-optimization-starts-with-memory-transactions/</guid><description>Before touching clever math, make the profiler show that each warp is moving data the way the hardware wants.</description><pubDate>Fri, 01 May 2026 00:00:00 GMT</pubDate><category>cuda</category><category>performance</category><category>gpu</category></item><item><title>LLM Platforms Need Operational Memory</title><link>https://aaronbrooks.me/posts/llm-platforms-need-operational-memory/</link><guid isPermaLink="true">https://aaronbrooks.me/posts/llm-platforms-need-operational-memory/</guid><description>A useful LLM platform remembers prompts, tools, evals, incidents, and cost decisions as production evidence, not scattered anecdotes.</description><pubDate>Fri, 17 Apr 2026 00:00:00 GMT</pubDate><category>llm-systems</category><category>platforms</category><category>production</category></item><item><title>I Was Wrong to Sleep on JAX</title><link>https://aaronbrooks.me/posts/i-was-wrong-to-sleep-on-jax/</link><guid isPermaLink="true">https://aaronbrooks.me/posts/i-was-wrong-to-sleep-on-jax/</guid><description>After a long stretch in PyTorch and FSDP, I was reluctant to give JAX a fair shot. I am glad I did.</description><pubDate>Fri, 13 Mar 2026 00:00:00 GMT</pubDate><category>jax</category><category>pytorch</category><category>training</category></item><item><title>Checkpointing Is a Distributed Systems Problem</title><link>https://aaronbrooks.me/posts/checkpointing-is-a-distributed-systems-problem/</link><guid isPermaLink="true">https://aaronbrooks.me/posts/checkpointing-is-a-distributed-systems-problem/</guid><description>A checkpoint is not a folder of weights. It is a contract between training, storage, orchestration, and every future resume.</description><pubDate>Fri, 20 Feb 2026 00:00:00 GMT</pubDate><category>training</category><category>infrastructure</category><category>distributed-systems</category></item><item><title>Prompts Are Release Artifacts</title><link>https://aaronbrooks.me/posts/prompts-are-release-artifacts/</link><guid isPermaLink="true">https://aaronbrooks.me/posts/prompts-are-release-artifacts/</guid><description>If a prompt can change production behavior, it needs versioning, eval gates, rollback, and telemetry keyed to the deployed behavior.</description><pubDate>Fri, 06 Feb 2026 00:00:00 GMT</pubDate><category>llm-systems</category><category>platforms</category><category>production</category></item><item><title>Occupancy Is Usually Not the First Problem</title><link>https://aaronbrooks.me/posts/occupancy-is-usually-not-the-first-problem/</link><guid isPermaLink="true">https://aaronbrooks.me/posts/occupancy-is-usually-not-the-first-problem/</guid><description>Low occupancy can matter, but it is a clue, not a plan. Start with memory behavior, launch shape, and correctness.</description><pubDate>Fri, 12 Dec 2025 00:00:00 GMT</pubDate><category>cuda</category><category>performance</category><category>gpu</category></item></channel></rss>