You're right that a computer vision algorithm detects objects without context. But as you likely know, contextual judgment is attainable with a broader suite of AI techniques -- in this case maybe an ensemble of computer vision, natural language processing, and clustering models, each scoring a different facet of the upload. Complex models simply require a lot of training data and tuning. That's arguably challenging, but it's not impossible.
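To make that concrete, here's a toy sketch of what an ensemble decision could look like. Everything in it is invented for illustration -- the three scorers are stand-ins for a real vision model, a real NLP model, and a channel-clustering signal, and the weights are arbitrary; it's nobody's actual moderation pipeline.

    # Hypothetical ensemble: blend vision, language, and clustering signals
    # into one contextual risk score. All three scorers are placeholders.

    def score_frames(frames):
        # Stand-in for a vision model: fraction of frames flagged as graphic.
        return sum(1 for f in frames if f.get("graphic")) / max(len(frames), 1)

    def score_transcript(text):
        # Stand-in for an NLP model: crude keyword signal on the transcript.
        hits = sum(text.lower().count(w) for w in ("attack", "slur", "threat"))
        return min(hits / 10.0, 1.0)

    def score_channel(cluster):
        # Stand-in for a clustering signal: prior risk of the channel's cluster.
        return {"news": 0.1, "gaming": 0.2}.get(cluster, 0.5)

    def contextual_risk(frames, transcript, cluster, weights=(0.5, 0.3, 0.2)):
        # Weighted blend; real weights would be fit on labeled validation data.
        signals = (score_frames(frames), score_transcript(transcript), score_channel(cluster))
        return sum(w * s for w, s in zip(weights, signals))

    frames = [{"graphic": True}, {"graphic": True}, {"graphic": False}]
    # Same imagery, different context: a news channel with a reporting transcript
    # scores lower than an unknown channel with a hostile transcript.
    print(contextual_risk(frames, "reporting on the attack", "news"))   # ~0.38
    print(contextual_risk(frames, "threat threat slur", "unknown"))     # ~0.52

The point isn't the particular scorers, just that context (who uploaded it, what's being said) can shift the final call on identical imagery.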
One simple example: a few years back Google got in hot water when its photo-labeling algorithm tagged a Black man as a gorilla. We both probably know the reason -- it wasn't racism, it was inadequate model training. The algo hadn't seen enough photos of people who looked like him to learn the relevant feature space, and it likely put too much weight on skin tone as a feature. If it had been trained predominantly on African faces and few Caucasian ones, it might have labeled a white guy as a maggot or a snowman instead. As we know, it's only math, and math doesn't know or care about human sensibilities. Google's quick fix was reportedly just to suppress the offending label; the durable fix is exactly what you'd expect -- better, more representative training data -- and the problem is solvable that way.
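If you want to see the imbalance effect in a dozen lines, here's a minimal scikit-learn sketch (toy synthetic data, obviously not Google's model or data): the same classifier, fit on the same skewed data, misses far fewer of the rare class once the loss is reweighted to compensate.

    # Toy illustration of training-data imbalance and one standard remedy.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import recall_score
    from sklearn.model_selection import train_test_split

    # 95% of examples are class 0; class 1 is the underrepresented group.
    X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

    print("rare-class recall, plain:   ", recall_score(y_te, plain.predict(X_te)))
    print("rare-class recall, balanced:", recall_score(y_te, balanced.predict(X_te)))

Reweighting is the crudest fix; collecting more representative data is the better one. But the mechanism -- the model under-serves whoever it rarely saw in training -- is the same.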
So training is everything. The question is: can an AI algo be trained to navigate these contextual challenges? I think the answer is yes, but it would take serious, sustained engineering effort, and the corporate-culture changes you reference may impede exactly that kind of effort. More broadly, a company as large as YouTube or Google should design a workflow that automatically narrows the possibility set and kicks the genuinely questionable content up for internal live vetting. I'm editorializing now, but there's a public-good aspect to fair moderation that they shouldn't be able to duck. In my view, creative and skilled engineering could ameliorate this problem and maybe eliminate it.
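For what it's worth, the triage I have in mind is nothing exotic -- something like the following, where only the uncertain middle band ever reaches a human. The thresholds are invented for illustration; in practice you'd set them from measured precision/recall and reviewer capacity.

    # Hypothetical triage: auto-decide the clear cases, escalate the rest.
    AUTO_APPROVE_BELOW = 0.2   # confidently benign
    AUTO_REMOVE_ABOVE = 0.9    # confidently violating

    def route(item_id, risk_score):
        # Route a piece of content based on the ensemble's risk score.
        if risk_score < AUTO_APPROVE_BELOW:
            return f"{item_id}: auto-approve"
        if risk_score > AUTO_REMOVE_ABOVE:
            return f"{item_id}: auto-remove"
        return f"{item_id}: human review queue"

    for item, score in [("vid-001", 0.05), ("vid-002", 0.55), ("vid-003", 0.97)]:
        print(route(item, score))

That middle queue is where the internal live vetting happens, and shrinking it over time is mostly a matter of better models and better training data, which loops back to the point above.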