GRPO with Wordle: poor performance

Hi guys, I have implemented the Wordle GRPO setup from the short course and tried a variety of different rewards, but I never get a model that can complete the game; performance is always poor. I have tried a Qwen 3B model both with and without SFT beforehand.

I took the config values from the course and have also tried a variety of other rewards. I have managed to get a model that mostly guesses different 5-letter words, but it refuses to learn from the x, - and tick feedback. That said, from playing around with Gemini Pro (with no training, of course), I have seen that even that model seems unable to play properly. Has anyone else actually got good results, and if so, do you have a repo?
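
To make the question concrete, here is a rough sketch of one way a reward could explicitly score whether a guess uses the feedback. This is not the course's reward: the function names `feedback` and `consistency_reward` are hypothetical, and it assumes ✓ means right letter in the right spot, - means right letter in the wrong spot, and x means the letter is absent.

```python
from collections import Counter


def feedback(guess: str, secret: str) -> str:
    """Wordle-style feedback (assumed symbols): ✓ right spot, - wrong spot, x absent."""
    marks = ["x"] * len(guess)
    remaining = Counter()  # secret letters not matched exactly
    for i, (g, s) in enumerate(zip(guess, secret)):
        if g == s:
            marks[i] = "✓"
        else:
            remaining[s] += 1
    # Second pass: mark misplaced letters, respecting duplicate counts.
    for i, g in enumerate(guess):
        if marks[i] != "✓" and remaining[g] > 0:
            marks[i] = "-"
            remaining[g] -= 1
    return "".join(marks)


def consistency_reward(guess: str, history: list[tuple[str, str]]) -> float:
    """Fraction of past (guess, feedback) pairs that would be reproduced
    if the new guess were the secret. 1.0 means the guess is still a
    possible answer given everything seen so far."""
    if not history:
        return 0.0
    hits = sum(feedback(past, guess) == fb for past, fb in history)
    return hits / len(history)


history = [("crane", feedback("crane", "slate"))]
print(consistency_reward("blade", history))  # consistent with the feedback
print(consistency_reward("crane", history))  # repeats a guess already ruled out
```

The idea is that a guess which contradicts earlier feedback scores lower, so the signal does not depend only on winning the game. Whether this actually helps GRPO training here is untested.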