Besides size, the other big difference is that most decoder models are trained on far more data now (e.g., the 0.5B Qwen 2.5 model was trained on 18 trillion tokens, versus the roughly 3.3 billion words BERT was originally trained on).
That being said, here are some additional comparisons to BERT models:
github.com
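
For context, here's a minimal sketch of how you could do a parameter-count comparison yourself, assuming the Hugging Face transformers library and the public bert-base-uncased and Qwen/Qwen2.5-0.5B checkpoints:

```python
# Minimal sketch: compare parameter counts of an encoder-only and a
# decoder-only model. Assumes the Hugging Face `transformers` library
# and the public checkpoint names below.
from transformers import AutoModel

checkpoints = {
    "BERT-base": "bert-base-uncased",     # encoder-only, ~110M params
    "Qwen2.5-0.5B": "Qwen/Qwen2.5-0.5B",  # decoder-only, ~0.5B params
}

for name, repo_id in checkpoints.items():
    model = AutoModel.from_pretrained(repo_id)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```

Note this only captures model size; it says nothing about the training-token gap, which is where most of the difference comes from.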