Aaron Brooks

Notes on AI systems, training infrastructure, LLM platforms, and CUDA-level optimization.
https://aaronbrooks.me/
Language: en-us

A Training Run Ought to Leave Tracks
https://aaronbrooks.me/posts/training-infrastructure-is-failure-accounting/
The systems work that matters most in large training runs: every retry, checkpoint, queue, and evaluator should leave enough evidence to explain itself.
Fri, 15 May 2026 00:00:00 GMT
Tags: training, infrastructure, llm-systems

Your CUDA Kernel Is Probably Paying a Memory Tax
https://aaronbrooks.me/posts/cuda-optimization-starts-with-memory-transactions/
Before touching clever math, make the profiler show that each warp is moving data the way the hardware wants.
Fri, 01 May 2026 00:00:00 GMT
Tags: cuda, performance, gpu

LLM Platforms Need Operational Memory
https://aaronbrooks.me/posts/llm-platforms-need-operational-memory/
A useful LLM platform remembers prompts, tools, evals, incidents, and cost decisions as production evidence, not scattered anecdotes.
Fri, 17 Apr 2026 00:00:00 GMT
Tags: llm-systems, platforms, production

I Was Wrong to Sleep on JAX
https://aaronbrooks.me/posts/i-was-wrong-to-sleep-on-jax/
After a long stretch in PyTorch and FSDP, I was reluctant to give JAX a fair shot. I am glad I did.
Fri, 13 Mar 2026 00:00:00 GMT
Tags: jax, pytorch, training

Checkpointing Is a Distributed Systems Problem
https://aaronbrooks.me/posts/checkpointing-is-a-distributed-systems-problem/
A checkpoint is not a folder of weights. It is a contract between training, storage, orchestration, and every future resume.
Fri, 20 Feb 2026 00:00:00 GMT
Tags: training, infrastructure, distributed-systems

Prompts Are Release Artifacts
https://aaronbrooks.me/posts/prompts-are-release-artifacts/
If a prompt can change production behavior, it needs versioning, eval gates, rollback, and telemetry keyed to the deployed behavior.
Fri, 06 Feb 2026 00:00:00 GMT
Tags: llm-systems, platforms, production

Occupancy Is Usually Not the First Problem
https://aaronbrooks.me/posts/occupancy-is-usually-not-the-first-problem/
Low occupancy can matter, but it is a clue, not a plan. Start with memory behavior, launch shape, and correctness.
Fri, 12 Dec 2025 00:00:00 GMT
Tags: cuda, performance, gpu