Similar to this question and this issue, I encountered the problem of getting different results depending on whether an operation is performed on a whole batch or on a single sample. The difference is that in those posts the errors are in the range of 1e-5 to 1e-6, while for me they are in the range of 0.2 to 0.7.
Running the example below prints the largest absolute difference between the activations computed for the first data sample alone and those computed for that same sample as part of the full batch. The result is some value ranging from 0.2 to 0.7, depending on the seed. Tested on both CUDA and CPU.
import torch
import torch.nn as nn

num_samples = 100
num_neurons = 200
data_size = 150_000

dataset = torch.randn(num_samples, data_size)
linear_layer = nn.Linear(data_size, num_neurons)

# copy the first num_samples data vectors into the weight matrix,
# so each of those rows is itself a full data sample
with torch.no_grad():
    for i in range(num_samples):
        linear_layer.weight[i].copy_(dataset[i])

# compare the activations of sample 0 computed alone vs. within the batch
print(torch.max(torch.abs(linear_layer(dataset[0]) - linear_layer(dataset)[0])))
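For what it's worth, here is the same comparison run entirely in float64 (my assumption: if the mismatch comes from float32 accumulation order in these very long dot products, it should shrink by several orders of magnitude in double precision):

import torch
import torch.nn as nn

num_samples = 100
num_neurons = 200
data_size = 150_000

# same setup as above, but in float64; if float32 accumulation is the
# culprit, I would expect the printed difference to drop well below 1e-6
dataset = torch.randn(num_samples, data_size, dtype=torch.float64)
linear_layer = nn.Linear(data_size, num_neurons, dtype=torch.float64)
with torch.no_grad():
    for i in range(num_samples):
        linear_layer.weight[i].copy_(dataset[i])
print(torch.max(torch.abs(linear_layer(dataset[0]) - linear_layer(dataset)[0])))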
Choosing a smaller data_size, e.g. 150, reduces the error to the 1e-5/1e-6 range, as does not initializing the weights from the data samples (see the control below); however, both the large data size and the data-based initialization are necessary for my use case.
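For reference, the "not using the data samples to initialize the weights" control is simply the repro script without the weight copy, everything else unchanged; with it, the printed difference stays in the 1e-5/1e-6 range for me:

import torch
import torch.nn as nn

num_samples = 100
num_neurons = 200
data_size = 150_000

# control: keep nn.Linear's default (Kaiming-uniform) initialization,
# whose entries are O(1/sqrt(data_size)) rather than O(1) data values
dataset = torch.randn(num_samples, data_size)
linear_layer = nn.Linear(data_size, num_neurons)
print(torch.max(torch.abs(linear_layer(dataset[0]) - linear_layer(dataset)[0])))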