Reading accuracy

Hello, I need to evaluate and compare the accuracy of readings from audio recordings. Is it better to transcribe the speech to text and then evaluate the text with AI, or to evaluate the audio directly with AI?

What sort of recordings and readings are you referring to?

The reading aloud of a story.

There are lots of speech-to-text tools already.
Then you could run the transcript through another AI tool for analysis.
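
To give a rough idea of what the second step could look like, here is a minimal sketch, assuming you already have a transcript from whichever speech-to-text tool you pick and the original story text to compare it against. It scores a reading with word error rate (word-level edit distance divided by the number of words in the original), which is one common way to measure this and not the only one. The names `tokenize` and `word_error_rate` are made up for illustration and do not come from any particular library.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the two texts, divided by the
    number of words in the reference (hypothetical helper, not a library API).

    reference:  the story text the reader was supposed to read
    hypothesis: the transcript produced by a speech-to-text tool
    """
    ref = tokenize(reference)
    hyp = tokenize(hypothesis)
    if not ref:
        raise ValueError("reference text is empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j words of the transcript.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # word skipped by the reader
                curr[j - 1] + 1,                  # extra word inserted
                prev[j - 1] + substitution_cost,  # word matched or misread
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, comparing the original "The quick brown fox jumps over the lazy dog." with the transcript "the quick brown fox jumped over lazy dog" counts one misread word and one skipped word out of nine, so the score is 2/9, or about 0.22. Lower is better, and 0 means the transcript matches the original word for word. Keep in mind that the score also includes any mistakes made by the speech-to-text tool itself, not only the reader's.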

And what type of AI would be best suited to comparing the accuracy of the readings?

What two things are you comparing?