CTransformers uses CPU as well as GPU for a model that should fit in VRAM

Hi everyone!

I’m hoping to tap into the wisdom of the crowd.

  1. I’m using ctransformers to load various GGUF models (7B Q4/Q5) that I have downloaded locally. I’ve installed ctransformers[cuda] and the other NVIDIA GPU dependencies needed on Windows, and when I run inference my GPU is definitely being used (a minimal sketch of my loading code is below).
    The models should all fit into 8 GB of VRAM, but when I run inference my CPU also spikes to 100%.
    I have gpu_layers set to the maximum, and there is definitely a speed improvement over gpu_layers=0 (CPU only).
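For reference, this is roughly how I’m loading the model. It’s only a sketch: the model filename and the gpu_layers value are placeholders, not my exact settings.

```python
from ctransformers import AutoModelForCausalLM

# Load a locally downloaded GGUF model and offload layers to the GPU.
# Path and gpu_layers are example values.
llm = AutoModelForCausalLM.from_pretrained(
    "./models/mistral-7b-instruct-v0.1.Q4_K_M.gguf",  # local GGUF file
    model_type="mistral",
    gpu_layers=50,  # high enough to cover every layer of a 7B model
)

# Running a prompt like this is when I see both GPU activity and the CPU spike.
print(llm("Write a haiku about GPUs:"))
```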

Why is my CPU also being used?
BTW, if I use the LangChain GPT4All binding with device='gpu', then it only uses my GPU; the CPU doesn’t spike at all.
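For comparison, here is a minimal sketch of that GPT4All setup. I’m assuming the current langchain_community import path, and the model path is just an example.

```python
from langchain_community.llms import GPT4All

# Same local GGUF file, but loaded through LangChain's GPT4All binding.
llm = GPT4All(
    model="./models/mistral-7b-instruct-v0.1.Q4_0.gguf",  # local GGUF file
    device="gpu",  # run inference on the GPU instead of the CPU
)

# With this setup, only the GPU is busy during generation.
print(llm.invoke("Write a haiku about GPUs:"))
```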

PS: It just occurred to me to download a tiny model to make sure there is ample VRAM available and see if the CPU still spikes, just to remove any doubt about VRAM space.

Thanks in advance!

I did try the smallest model and still noticed CPU use alongside the GPU, so it wasn’t a memory-size issue.

As I wrote in another post:

when I tested with LangChain’s llama.cpp binding there was no additional CPU usage. So I’m ditching CTransformers and settling on llama.cpp for my local work.
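Here is a minimal sketch of the llama.cpp setup I settled on via LangChain. The model path, n_gpu_layers, and n_ctx values are examples, not exact settings.

```python
from langchain_community.llms import LlamaCpp

# Same GGUF model, loaded through llama-cpp-python via LangChain.
llm = LlamaCpp(
    model_path="./models/mistral-7b-instruct-v0.1.Q4_K_M.gguf",  # local GGUF file
    n_gpu_layers=-1,  # -1 offloads all layers to the GPU
    n_ctx=2048,       # context window size
)

# With this setup I see no extra CPU spike during generation.
print(llm.invoke("Write a haiku about GPUs:"))
```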

If anyone reading this knows of any benefits that LangChain’s CTransformers binding offers over its llama.cpp binding (or anything better than llama.cpp), I’d love to hear about it.