I did try the smallest model, and I still noticed CPU usage alongside the GPU, so it wasn't a memory-size issue.
As I wrote in another post:
when I tested with LangChain's llama.cpp integration there was no extra CPU usage. So I'm ditching CTransformers and settling on llama.cpp for my local work.
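For anyone who wants to reproduce the comparison, here's a rough sketch of the two setups I tested side by side. The model path, layer counts, and prompt are placeholders, not my exact configuration, and I'm assuming the `langchain_community` wrappers; adjust for your install:

```python
# Sketch of the comparison: same local GGUF model loaded through both
# LangChain wrappers. Paths and layer counts are placeholders.
from langchain_community.llms import CTransformers, LlamaCpp

# CTransformers: the setup where I saw CPU usage alongside the GPU.
ct_llm = CTransformers(
    model="/path/to/model.gguf",   # placeholder: any local GGUF file
    model_type="llama",
    config={"gpu_layers": 50},     # offload layers to the GPU
)

# llama.cpp: same model, no extra CPU usage in my test.
lc_llm = LlamaCpp(
    model_path="/path/to/model.gguf",  # same placeholder model
    n_gpu_layers=-1,                   # -1 offloads all layers to the GPU
    n_ctx=2048,
)

print(ct_llm.invoke("Say hello in one sentence."))
print(lc_llm.invoke("Say hello in one sentence."))
```

Running both while watching a CPU monitor (e.g. htop) and nvidia-smi is how I spotted the difference.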
If anyone reads this and knows of any benefits that LangChain's CTransformers offers over its llama.cpp integration (or anything better than llama.cpp), I'd love to know.