Yes, data quality is definitely crucial. The LIMA paper, for example, showed that fine-tuning on just 1k high-quality instruction examples can give better performance than fine-tuning on the roughly 50k instructions from Alpaca (which were supposedly lower quality):
arxiv.org