I have installed Trax on my Ubuntu machine and trained the model on one core. However, when I pass a new parameter to training.Loop, n_devices = 12 (see below), it throws a JAX error: "JAX cannot work yet with n_devices != all devices: 12 != 1", which implies that JAX only sees one device. Could I ask for your help in exposing my entire CPU to JAX/Trax?
With kind regards,
Jaan Übi
training_loop = training.Loop(
    NER,                        # the model to train
    train_task,                 # the training task
    eval_tasks=[eval_task],     # the evaluation task(s)
    output_dir=output_dir,      # the output directory
    n_devices=12                # request all 12 CPU cores (triggers the error)
)
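For anyone hitting the same error: by default JAX exposes the host CPU as a single device, so jax.device_count() returns 1. One known workaround is to ask XLA to split the host into several logical devices before JAX is imported. This is a minimal sketch of that flag, not something from the original post, and whether training.Loop then accepts n_devices=12 depends on your Trax version:

import os

# Must be set before jax is imported; splits the host CPU into
# 12 logical devices so jax.device_count() reports 12.
os.environ["XLA_FLAGS"] = "--xla_force_host_platform_device_count=12"

import jax
print(jax.device_count())  # expected: 12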
Once a device is chosen, the available cores are still used to parallelize computations within it.
So the CPU cores on your laptop are being used even when you don't set n_devices. Open a system monitor and see for yourself by setting train_steps to a number like 500. Be sure to close other applications and run the Python file from the command line.
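A quick way to verify this (a minimal sketch, assuming a standard jax install): JAX reports a single CPU device, yet a large computation will light up every core in your system monitor:

import jax
import jax.numpy as jnp

print(jax.devices())  # typically one logical CPU device, e.g. [CpuDevice(id=0)]

x = jnp.ones((4096, 4096))
# XLA parallelizes this matmul across all CPU cores,
# even though JAX reports a single device.
(x @ x).block_until_ready()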
I have a follow-up question. If we want to train an intermediate-size model (data size in the 20 MB range, model complexity like an LSTM) with Trax, could a single computer with a 6-core i5 CPU handle it? How do we determine the required system spec? Many thanks.
I’m no expert on this, so what I usually do is run a few test runs and check the memory footprint and the time to complete an epoch, to get a rough estimate of how long the model will train (see the sketch below). That is a pretty lame “strategy”, but it is what it is…
From what I understand from your post, you shouldn’t have problems fitting the model into memory (which is usually the biggest problem in general), but the training time could be quite long (slow training).
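Concretely, the probe looks something like this (a hypothetical sketch: training_loop is the Loop object from earlier in the thread, and both step counts are made-up numbers, not from this thread):

import time

n_probe_steps = 100          # short probe run (made-up number)
start = time.perf_counter()
training_loop.run(n_steps=n_probe_steps)
per_step = (time.perf_counter() - start) / n_probe_steps

target_steps = 50_000        # hypothetical full training budget
print(f"~{per_step:.2f} s/step, so ~{per_step * target_steps / 3600:.1f} h total")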
I’m not sure if you have access to the Generative AI course lecture “Computational challenges of training LLMs” (if you don’t, it doesn’t matter; the main point is discussed here, and by the way I also think the calculations there are off, but a follow-up might reveal the truth).
Again, I’m no expert, but in my experience small/intermediate models (fewer than 300M parameters) are not a problem to train on GPUs, whereas on a CPU I think it would take ages.
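For the memory side, a back-of-the-envelope calculation along the lines of that lecture (the parameter count and the optimizer-state accounting here are assumptions for illustration, not numbers from this thread):

n_params = 10_000_000      # hypothetical LSTM-sized model
bytes_per_param = 4        # float32 weights
# Adam keeps two extra float32 buffers (m and v) per parameter, and you
# also hold gradients, so roughly 4x the raw weight memory in total.
train_mem_gb = n_params * bytes_per_param * 4 / 1e9
print(f"~{train_mem_gb:.2f} GB for weights + gradients + Adam state")

That comes out well under 1 GB for a model this size, which matches the point above: memory is unlikely to be the bottleneck, training time is.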