The following code snippet multiplies two 20000 x 20000 matrices. It took 21 seconds on my laptop, which has an Intel Core i7-11800H CPU and an Nvidia RTX 3070 GPU with 8 GB of VRAM.
The code also shows that CUDA is enabled:
    import torch
    import time

    N = 20000
    a = torch.ones(N,N)
    b = torch.ones(N,N)
    print(torch.cuda.is_available())
    a.to("cuda:0")
    b.to("cuda:0")
    start = time.time()
    c = a @ b
    delta = time.time() - start
    print(delta)
While it runs, I see CPU usage at 800%. I wonder what is causing this.
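One thing worth checking: `Tensor.to()` is not an in-place operation. It returns a new tensor and leaves `a` and `b` on the CPU unless the result is assigned back, so `a @ b` in the snippet would run as a multi-threaded CPU matmul, which would be consistent with the 800% CPU load. The sketch below assigns the result of `.to()` and also synchronizes before reading the clock, since CUDA kernels are launched asynchronously. `timed_matmul` is a hypothetical helper name used only for illustration, the CPU fallback is an assumption so the code runs on machines without a GPU, and the matrix size is reduced to keep the run short.

```python
import time

import torch


def timed_matmul(n, device):
    """Multiply two n x n matrices of ones on `device`; return (result, seconds)."""
    # .to() returns a new tensor, so the result has to be assigned.
    a = torch.ones(n, n).to(device)
    b = torch.ones(n, n).to(device)
    if device.type == "cuda":
        # Make sure the host-to-device copies are done before timing starts.
        torch.cuda.synchronize(device)
    start = time.time()
    c = a @ b
    if device.type == "cuda":
        # The matmul is queued asynchronously; wait until it has finished.
        torch.cuda.synchronize(device)
    return c, time.time() - start


# Fall back to the CPU when CUDA is not available (assumption for portability).
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
c, delta = timed_matmul(2000, device)
print(c.device, delta)
```

If `c.device` prints `cuda:0`, the multiplication ran on the GPU. With the original `N = 20000`, each float32 matrix takes about 1.6 GB, so `a`, `b`, and `c` together should still fit in 8 GB of VRAM.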