Yeah, it might be brittle for more complex instructions. In those cases you could indeed use an embedding model or an LLM judge to assess similarity.
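Rough sketch of the embedding route, in case it helps. Everything here is made up for illustration: `toy_embed` is just a word-count stand-in so the snippet runs without dependencies, and `is_similar` plus the 0.8 threshold are hypothetical. You'd swap in a real embedding model and tune the threshold on your own data.

```python
import math
from collections import Counter
from typing import Callable, Dict

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    # Placeholder for a real embedding model (assumption): lowercase word counts.
    return dict(Counter(text.lower().split()))


def cosine_similarity(a: Vector, b: Vector) -> float:
    # Cosine similarity between two sparse vectors; 0.0 if either is empty.
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_similar(
    expected: str,
    actual: str,
    embed: Callable[[str], Vector] = toy_embed,
    threshold: float = 0.8,  # hypothetical cutoff, tune for your use case
) -> bool:
    # Treat two instructions as matching when their embeddings are close enough.
    return cosine_similarity(embed(expected), embed(actual)) >= threshold


if __name__ == "__main__":
    print(is_similar("Reply in French", "reply in french"))
    print(is_similar("Reply in French", "Summarize the text in two sentences"))
```

The `embed` parameter is there so the comparison logic stays the same when you plug in an actual model. An LLM judge would replace `is_similar` entirely with a prompt that asks whether the two instructions mean the same thing.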