The thing about ML is that, more and more, it's an infrastructure-first discipline.
What I mean is: the hard problems aren't about algorithms anymore. They're about systems.
The models mostly work now. GPT-4, Claude, Llama—they're all pretty good. The algorithms are kind of solved, or at least commoditized.
But getting them into production? That's where the real challenge is.
How do you serve a 70B parameter model with acceptable latency? How do you handle inference costs? How do you deploy and version models without breaking things? How do you build observability into a black box?
These aren't ML problems. They're distributed systems problems.
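To make that concrete, here's what a slice of the work often looks like: not math, just plumbing. This is a minimal sketch of a versioned serving endpoint with a health check and per-request latency timing, assuming a FastAPI-style server. The `generate()` function, the route layout, and the version tag are all placeholders for illustration, not any particular product's API.

```python
# Toy illustration of the "systems" side of serving: a versioned route,
# request timing, and a health probe. The model itself is a stub.
import time

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

MODEL_VERSION = "2024-06-01"  # hypothetical version tag; deployments pin to it


class Prompt(BaseModel):
    text: str
    max_tokens: int = 128


def generate(prompt: str, max_tokens: int) -> str:
    # Stand-in for the real model call (e.g. a forward pass on sharded GPUs).
    return prompt[:max_tokens]


@app.get("/healthz")
def healthz():
    # Liveness probe for the orchestrator (e.g. Kubernetes) to hit.
    return {"status": "ok", "model_version": MODEL_VERSION}


@app.post(f"/v1/{MODEL_VERSION}/generate")
def generate_endpoint(req: Prompt):
    start = time.perf_counter()
    output = generate(req.text, req.max_tokens)
    latency_ms = (time.perf_counter() - start) * 1000
    # In a real system this would feed a metrics backend, not the response body.
    return {"output": output, "latency_ms": round(latency_ms, 2)}
```

Notice how little of that is "ML": it's routing, versioning, health checks, and telemetry. The hard parts come when you scale it.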
And the industry knows it. The hottest roles aren't "research scientist"—they're "ML infrastructure engineer." People who can take a notebook model and turn it into something serving a billion requests a day.
The skill set is shifting. Less linear algebra, more systems design. Less PyTorch, more Kubernetes. Less "how does backpropagation work," more "how do we shard this across 8 GPUs without killing throughput."
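For a flavor of that sharding question, here's a toy sketch of naive layer-wise placement in PyTorch: split a stack of layers round-robin across whatever GPUs are available and hop activations between them. The layer sizes, the placement scheme, and the forward loop are illustrative assumptions; real systems use tensor and pipeline parallelism with far more care about overlap and throughput.

```python
# Toy sketch of naive layer-wise sharding across available devices.
import torch
import torch.nn as nn

if torch.cuda.is_available():
    devices = [torch.device(f"cuda:{i}") for i in range(torch.cuda.device_count())]
else:
    devices = [torch.device("cpu")]  # fallback so the sketch runs anywhere

# Stand-in "model": 8 big linear blocks instead of transformer layers.
layers = [nn.Sequential(nn.Linear(4096, 4096), nn.GELU()) for _ in range(8)]

# Round-robin placement: layer i lives on device i % len(devices).
for i, layer in enumerate(layers):
    layer.to(devices[i % len(devices)])


def forward(x: torch.Tensor) -> torch.Tensor:
    # Move the activation to each layer's device before applying it.
    for i, layer in enumerate(layers):
        x = layer(x.to(devices[i % len(devices)]))
    return x


out = forward(torch.randn(2, 4096).to(devices[0]))
print(out.shape)  # torch.Size([2, 4096])
```

Every one of those `.to(device)` hops is a data-movement decision, and data movement is what kills throughput. That's a systems problem, not an algorithms one.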
ML is becoming less about inventing algorithms and more about being really good at distributed systems engineering.