...
Versions
version | module | python | Supek | Padobran |
---|---|---|---|---|
1.8.0 | scientific/pytorch/1.8.0-ngc | 3.8 | ||
1.14.0 | scientific/pytorch/1.14.0-ngc | 3.8 | ||
2.0.0 | scientific/pytorch/2.0.0 | 3.10 | ||
2.0.0 | scientific/pytorch/2.0.0-ngc | 3.10 | ||
Note | ||
---|---|---|
| ||
Python applications and libraries on Supek are delivered as containers and must be run through wrappers, as described below. More information about Python applications and containers on Supek is available at the following links: |
Documentation
- Official site - https://pytorch.org/
- Manual - https://pytorch.org/docs/stable/index.html
- distributed API - https://pytorch.org/docs/stable/distributed.html
- torchrun
- accelerate
...
Supek
Below are examples of invoking commands and applications inside the container, together with a synthetic benchmark application that tests performance on the ResNet50 model.
...
Code Block | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
[korisnik@x3000c0s25b0n0] $ module load scientific/pytorch/1.14.0-ngc
[korisnik@x3000c0s25b0n0] $ run-command.sh pip3 list
INFO: underlay of /etc/localtime required more than 50 (95) bind mounts
INFO: underlay of /usr/bin/nvidia-smi required more than 50 (474) bind mounts
13:4: not a valid test operator: (
13:4: not a valid test operator: 510.47.03
Package                 Version
----------------------- -------------------------------
absl-py                 1.3.0
accelerate              0.19.0
apex                    0.1
appdirs                 1.4.4
argon2-cffi             21.3.0
argon2-cffi-bindings    21.2.0
asttokens               2.2.1
... |
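When only a few packages matter, scanning the full `pip3 list` output is unnecessary. The following is a minimal sketch that looks up individual versions with the standard-library `importlib.metadata` module. The helper name `installed_version` and the file name `check_versions.py` are hypothetical, and the package names are simply taken from the listing above.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Package names taken from the listing above; extend as needed.
    for name in ("torch", "accelerate", "apex"):
        print(f"{name:12s} {installed_version(name) or 'not installed'}")
```

Assuming the wrapper forwards arbitrary commands the same way it forwards `pip3 list`, the script could be started with something like `run-command.sh python3 check_versions.py`; that invocation is an assumption, not taken from the page.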
torchrun
...
Running PyTorch code on a single GPU
Code Block | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
# source
# - https://github.com/horovod/horovod/blob/master/examples/pytorch/pytorch_synthetic_benchmark.py

import time

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision.models import resnet50
from torchvision.datasets import FakeData
from torchvision.transforms import ToTensor

def main():

    # vars
    batch = 256
    samples = 256*100
    epochs = 1

    # model: untrained ResNet50 placed on the GPU
    model = resnet50(weights=None)
    model.cuda()
    optimizer = optim.SGD(model.parameters(), lr=0.001)
    loss_fn = nn.CrossEntropyLoss()

    # data: synthetic images generated on the fly
    dataset = FakeData(samples,
                       num_classes=1000,
                       transform=ToTensor())
    loader = DataLoader(dataset,
                        batch_size=batch,
                        shuffle=False,
                        num_workers=1,
                        pin_memory=True)

    # train
    for epoch in range(epochs):
        start = time.time()
        for batch, (images, labels) in enumerate(loader):
            images = images.cuda()
            labels = labels.cuda()
            outputs = model(images)
            classes = torch.argmax(outputs, dim=1)
            loss = loss_fn(outputs, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # report the loss every 10 batches
            if (batch%10 == 0):
                print('--- Epoch %i, Batch %3i / %3i, Loss = %0.2f ---' % (epoch,
                                                                          batch,
                                                                          len(loader),
                                                                          loss.item()))
        # throughput for the whole epoch
        elapsed = time.time()-start
        imgsec = samples/elapsed
        print('--- Epoch %i finished: %0.2f img/sec ---' % (epoch, imgsec))

if __name__ == "__main__":
    main() |
Code Block | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
#!/bin/bash
#PBS -q gpu
#PBS -l ngpus=1
# load the module
module load scientific/pytorch/2.0.0-ngc
# change to the directory containing the script
cd ${PBS_O_WORKDIR:-""}
# run the script using run-singlegpu.sh
run-singlegpu.sh singlegpu.py |
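The benchmark reports throughput as the number of samples divided by the wall-clock time of an epoch. The sketch below isolates that arithmetic using only the standard library. The helper names `images_per_second` and `timed_epoch` are hypothetical, and `run_epoch` stands in for whatever training loop is being measured.

```python
import time


def images_per_second(samples: int, elapsed_seconds: float) -> float:
    """Throughput as reported by the benchmark: samples / elapsed time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return samples / elapsed_seconds


def timed_epoch(run_epoch, samples: int) -> float:
    """Run one epoch (any zero-argument callable) and return images per second."""
    start = time.perf_counter()  # monotonic clock, better suited to intervals than time.time()
    run_epoch()
    elapsed = time.perf_counter() - start
    return images_per_second(samples, elapsed)


if __name__ == "__main__":
    # Stand-in workload; a real run would pass the training loop instead.
    rate = timed_epoch(lambda: time.sleep(0.05), samples=25600)
    print(f"{rate:0.2f} img/sec")
```

On a GPU, CUDA kernels are launched asynchronously, so a measurement that needs to be precise would typically call `torch.cuda.synchronize()` before reading the clock at the end of the epoch.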
torchrun/distributed
Note | ||
---|---|---|
| ||
Using wrappers |
Application on multiple GPUs and a single node
Code Block | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
# source
# - https://pytorch.org/tutorials/intermediate/dist_tuto.html
# - https://pytorch.org/vision/main/generated/torchvision.datasets.FakeData.html
# - https://tuni-itc.github.io/wiki/Technical-Notes/Distributed_dataparallel_pytorch/#setting-up-the-same-model-with-distributeddataparallel

import time

import torch
import torch.nn as nn
import torch.optim as optim
import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel as DDP
from torchvision.models import resnet50
from torchvision.datasets import FakeData
from torchvision.transforms import ToTensor

def main():

    # vars
    batch = 256
    samples = 25600
    epochs = 3

    # init: one process per GPU, NCCL backend
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    ngpus = torch.cuda.device_count()

    # model: each process wraps its own replica in DDP
    model = resnet50(weights=None)
    model = model.to(rank)
    model = DDP(model, device_ids=[rank])
    optimizer = optim.SGD(model.parameters(), lr=0.001)
    loss_fn = nn.CrossEntropyLoss()

    # data: the sampler gives every process its own shard,
    # and the global batch is split evenly across GPUs
    dataset = FakeData(samples,
                       num_classes=1000,
                       transform=ToTensor())
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset,
                        batch_size=batch//ngpus,
                        sampler=sampler,
                        shuffle=False,
                        num_workers=2,
                        pin_memory=True)

    # train
    for epoch in range(epochs):
        start = time.time()
        for batch, (images, labels) in enumerate(loader):
            images = images.to(rank)
            labels = labels.to(rank)
            outputs = model(images)
            classes = torch.argmax(outputs, dim=1)
            loss = loss_fn(outputs, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # only rank 0 reports progress
            if (rank == 0) and (batch%10 == 0):
                print('epoch: %3d, batch: %3d, loss: %0.4f' % (epoch+1,
                                                               batch,
                                                               loss.item()))
        if (rank == 0):
            elapsed = time.time()-start
            img_sec = samples/elapsed
            print('Epoch complete in %s seconds [%f img/sec] ' % (elapsed, img_sec))

    # clean
    dist.destroy_process_group()

if __name__ == "__main__":
    main() |
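The example above divides the global batch by the number of GPUs (`batch//ngpus`) and lets `DistributedSampler` hand each process its own shard of the dataset. The sketch below works through the resulting numbers in plain Python. The function name `per_rank_plan` is hypothetical, and the arithmetic assumes the default `drop_last=False` for both the sampler and the loader.

```python
import math


def per_rank_plan(samples: int, global_batch: int, world_size: int):
    """Return (per-GPU batch size, samples per rank, steps per epoch).

    Assumes DistributedSampler and DataLoader are both used with their
    default drop_last=False, so shards and the last batch are rounded up.
    """
    per_gpu_batch = global_batch // world_size
    if per_gpu_batch == 0:
        raise ValueError("global_batch must be at least world_size")
    samples_per_rank = math.ceil(samples / world_size)
    steps = math.ceil(samples_per_rank / per_gpu_batch)
    return per_gpu_batch, samples_per_rank, steps


if __name__ == "__main__":
    # Values from the example above, for a few hypothetical GPU counts.
    for gpus in (1, 2, 4):
        print(gpus, per_rank_plan(samples=25600, global_batch=256, world_size=gpus))
```

Because the global batch is held constant, the number of optimizer steps per epoch stays the same as GPUs are added, while each step processes a smaller slice per device.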
...
Code Block | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
# source
# - https://pytorch.org/tutorials/intermediate/dist_tuto.html
# - https://pytorch.org/vision/main/generated/torchvision.datasets.FakeData.html
# - https://tuni-itc.github.io/wiki/Technical-Notes/Distributed_dataparallel_pytorch/#setting-up-the-same-model-with-distributeddataparallel

import os
import time

import torch
import torch.nn as nn
import torch.optim as optim
import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel as DDP
from torchvision.models import resnet50
from torchvision.datasets import FakeData
from torchvision.transforms import ToTensor

def main():

    # vars
    batch = 256
    samples = 256*100
    epochs = 3

    # init: LOCAL_RANK selects the GPU on this node,
    # RANK identifies the process across all nodes
    dist.init_process_group("nccl")
    rank = int(os.environ['LOCAL_RANK'])
    global_rank = int(os.environ['RANK'])

    # model
    model = resnet50(weights=None)
    model = model.to(rank)
    model = DDP(model, device_ids=[rank])
    optimizer = optim.SGD(model.parameters(), lr=0.001)
    loss_fn = nn.CrossEntropyLoss()

    # data
    dataset = FakeData(samples,
                       num_classes=1000,
                       transform=ToTensor())
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset,
                        batch_size=batch,
                        sampler=sampler,
                        shuffle=False,
                        num_workers=1,
                        pin_memory=True)

    # train
    for epoch in range(epochs):
        start = time.time()
        for batch, (images, labels) in enumerate(loader):
            images = images.to(rank)
            labels = labels.to(rank)
            outputs = model(images)
            classes = torch.argmax(outputs, dim=1)
            loss = loss_fn(outputs, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # only the global rank 0 process reports progress
            if (global_rank == 0) and (batch%10 == 0):
                print('epoch: %3d, batch: %3d/%3d, loss: %0.4f' % (epoch+1,
                                                                   batch,
                                                                   len(loader),
                                                                   loss.item()))
        if (global_rank == 0):
            elapsed = time.time()-start
            img_sec = samples/elapsed
            print('Epoch complete in %0.2f seconds [%0.2f img/sec] ' % (elapsed, img_sec))

    # clean
    dist.destroy_process_group()

if __name__ == "__main__":
    main() |
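The multi-node script relies on the environment variables that `torchrun` sets for every worker: `RANK` (global index), `LOCAL_RANK` (index on the node, used to pick the GPU) and `WORLD_SIZE` (total number of processes). A minimal sketch of reading them in one place follows. `RankInfo` and `read_rank_info` are hypothetical names, and the single-process defaults are an assumption so that the same script can also run without a launcher.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RankInfo:
    rank: int        # global index of this process across all nodes
    local_rank: int  # index of this process on its node (GPU selector)
    world_size: int  # total number of processes in the job

    @property
    def is_main(self) -> bool:
        """True for the single process that should print and log."""
        return self.rank == 0


def read_rank_info(env=None) -> RankInfo:
    """Read the launcher-provided variables, defaulting to a single process."""
    env = os.environ if env is None else env
    return RankInfo(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )


if __name__ == "__main__":
    info = read_rank_info()
    if info.is_main:
        print(f"running with {info.world_size} process(es)")
```

Keeping the two ranks in one object makes it harder to mix them up: `local_rank` is the value to pass as the device index, while `rank` decides which process reports results.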
Code Block | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
#!/bin/bash
#PBS -q gpu
#PBS -l select=8:ngpus=1:ncpus=4
# load the module
module load scientific/pytorch/1.14.0-ngc
# change to the directory containing the script
cd ${PBS_O_WORKDIR:-""}
# run the script using torchrun-multinode.sh
torchrun-multinode.sh multigpu-multinode.py |
accelerate
Application on a single node
Code Block | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
#!/bin/bash
#PBS -q gpu
#PBS -l select=1:ngpus=2:ncpus=8
# env
module load scientific/pytorch/2.0.0
# cd
cd ${PBS_O_WORKDIR:-""}
# run
accelerate-singlenode.sh accelerate-singlenode.py |
Code Block | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
# source
# - https://github.com/horovod/horovod/blob/master/examples/pytorch/pytorch_synthetic_benchmark.py

import os
import time

import torch
import torch.nn as nn
import torch.optim as optim
from accelerate import Accelerator
from torchvision import models
from torch.utils.data import DataLoader
from torchvision.datasets import FakeData
from torchvision.transforms import ToTensor

def main():

    # settings
    epochs = 3
    batch_size = 256
    image_number = 256*30
    model_name = 'resnet50'

    # accelerator
    accelerator = Accelerator()

    # model
    model = getattr(models, model_name)()
    model.to(accelerator.device)

    # optimizer
    optimizer = optim.SGD(model.parameters(), lr=0.01)
    loss_function = nn.CrossEntropyLoss()

    # loader
    data = FakeData(image_number,
                    num_classes=1000,
                    transform=ToTensor())
    loader = DataLoader(data,
                        batch_size=batch_size)

    # scheduler
    scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

    # prepare: let accelerate wrap everything for the available devices
    model, optimizer, loader, scheduler = accelerator.prepare(model,
                                                              optimizer,
                                                              loader,
                                                              scheduler)

    # only the process with RANK 0 (or a run without a launcher) prints
    is_main = 'RANK' not in os.environ or os.environ['RANK'] == '0'

    # fit
    for epoch in range(epochs):
        start = time.time()
        for batch, (images, labels) in enumerate(loader):
            optimizer.zero_grad()
            images = images.to(accelerator.device)
            labels = labels.to(accelerator.device)
            outputs = model(images)
            classes = torch.argmax(outputs, dim=1)
            loss = loss_function(outputs, labels)
            accelerator.backward(loss)
            optimizer.step()
            scheduler.step()
            if (batch%1 == 0) and is_main:
                print('--- Epoch %2i, Batch %3i: Loss = %0.2f ---' % (epoch, batch, loss.item()))
        if is_main:
            end = time.time()
            imgsec = image_number/(end-start)
            print('--- Epoch %2i, Finished: %0.2f img/sec ---' % (epoch, imgsec))

if __name__ == '__main__':
    main() |
Application on multiple nodes
Code Block | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
#!/bin/bash
#PBS -q gpu
#PBS -l select=2:ngpus=2:ncpus=8
#PBS -o output/
#PBS -e output/
# env
module load scientific/pytorch/2.0.0
# cd
cd ${PBS_O_WORKDIR:-""}
# run
accelerate-multinode.sh accelerate-multinode.py |
Code Block | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
# source
# - https://github.com/horovod/horovod/blob/master/examples/pytorch/pytorch_synthetic_benchmark.py

import os
import time

import torch
import torch.nn as nn
import torch.optim as optim
from accelerate import Accelerator
from torchvision import models
from torch.utils.data import DataLoader
from torchvision.datasets import FakeData
from torchvision.transforms import ToTensor

def main():

    # settings
    epochs = 10
    batch_size = 256
    image_number = 256*30
    model_name = 'resnet50'

    # accelerator
    accelerator = Accelerator()

    # model
    model = getattr(models, model_name)()
    model.to(accelerator.device)

    # optimizer
    optimizer = optim.SGD(model.parameters(), lr=0.01)
    loss_function = nn.CrossEntropyLoss()

    # loader
    data = FakeData(image_number,
                    num_classes=1000,
                    transform=ToTensor())
    loader = DataLoader(data,
                        batch_size=batch_size)

    # scheduler
    scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

    # prepare: let accelerate wrap everything for the available devices
    model, optimizer, loader, scheduler = accelerator.prepare(model,
                                                              optimizer,
                                                              loader,
                                                              scheduler)

    # only the process with RANK 0 (or a run without a launcher) prints
    is_main = 'RANK' not in os.environ or os.environ['RANK'] == '0'

    # fit
    for epoch in range(epochs):
        start = time.time()
        for batch, (images, labels) in enumerate(loader):
            optimizer.zero_grad()
            images = images.to(accelerator.device)
            labels = labels.to(accelerator.device)
            outputs = model(images)
            classes = torch.argmax(outputs, dim=1)
            loss = loss_function(outputs, labels)
            accelerator.backward(loss)
            optimizer.step()
            scheduler.step()
            if (batch%1 == 0) and is_main:
                print('--- Epoch %2i, Batch %3i: Loss = %0.2f ---' % (epoch, batch, loss.item()))
        if is_main:
            end = time.time()
            imgsec = image_number/(end-start)
            print('--- Epoch %2i, Finished: %0.2f img/sec ---' % (epoch, imgsec))

if __name__ == '__main__':
    main() |
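The accelerate examples call `scheduler.step()` after every batch, so `ExponentialLR` with `gamma=0.9` multiplies the learning rate by 0.9 once per step rather than once per epoch. The sketch below works out that decay in plain Python under the settings shown above (`lr=0.01`, 7680 images, batch size 256). The function names are hypothetical, and the numbers describe plain `ExponentialLR` in a single process only; the scheduler returned by `accelerator.prepare` may step differently in multi-process runs.

```python
import math


def steps_per_epoch(image_number: int, batch_size: int) -> int:
    """Number of batches a single process sees per epoch (last batch kept)."""
    return math.ceil(image_number / batch_size)


def exponential_lr(base_lr: float, gamma: float, steps: int) -> float:
    """Learning rate after `steps` calls to ExponentialLR.step()."""
    return base_lr * gamma ** steps


if __name__ == "__main__":
    n = steps_per_epoch(image_number=256 * 30, batch_size=256)
    for epoch in range(1, 4):
        lr = exponential_lr(base_lr=0.01, gamma=0.9, steps=n * epoch)
        print(f"after epoch {epoch}: lr = {lr:.3e}")
```

With 30 steps per epoch the rate falls by more than an order of magnitude within the first epoch, which is harmless for a throughput benchmark but would matter if the same schedule were reused for real training.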
...
Vrančić
Below is an example of a synthetic benchmark application that tests performance on the ResNet50 model.
Application on a single node
Code Block | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
#!/bin/bash
#PBS -q cpu
#PBS -l ncpus=32
#PBS -l mem=50GB
# environment
module load scientific/pytorch/2.0.0
# set thread number to the cpu one
export OMP_NUM_THREADS=${NCPUS}
# run
cd ${PBS_O_WORKDIR:-""}
python singlenode.py |
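The job script exports `OMP_NUM_THREADS=${NCPUS}` so that the CPU thread pool matches the cores granted by PBS. A script can also resolve the same value itself; the sketch below shows one way using only the standard library. The helper name `resolve_thread_count` and the fallback order are assumptions, not part of the original example.

```python
import os


def resolve_thread_count(env=None) -> int:
    """Pick a thread count: OMP_NUM_THREADS, then NCPUS, then the machine's core count."""
    env = os.environ if env is None else env
    for key in ("OMP_NUM_THREADS", "NCPUS"):
        value = env.get(key, "")
        # Accept only positive integers; ignore empty or malformed values.
        if value.isdigit() and int(value) > 0:
            return int(value)
    return os.cpu_count() or 1


if __name__ == "__main__":
    print(f"using {resolve_thread_count()} thread(s)")
```

The returned value could then be passed to `torch.set_num_threads` at the start of the benchmark, which keeps the thread count explicit even when the environment variable is not exported.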
Code Block | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
import os
import time

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision.models import resnet50
from torchvision.datasets import FakeData
from torchvision.transforms import ToTensor

def main():

    # vars
    batch = 16
    samples = 16*30
    epochs = 3

    # model: untrained ResNet50, kept on the CPU
    model = resnet50(weights=None)
    optimizer = optim.SGD(model.parameters(), lr=0.001)
    loss_fn = nn.CrossEntropyLoss()

    # data: synthetic images generated on the fly
    dataset = FakeData(samples,
                       num_classes=1000,
                       transform=ToTensor())
    loader = DataLoader(dataset,
                        batch_size=batch,
                        shuffle=False,
                        num_workers=1,
                        pin_memory=True)

    # train
    for epoch in range(epochs):
        start = time.time()
        for batch, (images, labels) in enumerate(loader):
            outputs = model(images)
            classes = torch.argmax(outputs, dim=1)
            loss = loss_fn(outputs, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # report the loss every 10 batches
            if (batch%10 == 0):
                print('--- Epoch %i, Batch %3i / %3i, Loss = %0.2f ---' % (epoch,
                                                                          batch,
                                                                          len(loader),
                                                                          loss.item()))
        # throughput for the whole epoch
        elapsed = time.time()-start
        imgsec = samples/elapsed
        print('--- Epoch %i finished: %0.2f img/sec ---' % (epoch, imgsec))

if __name__ == "__main__":
    main() |
Notes
Tip | ||
---|---|---|
| ||
This library is delivered as a container because of the load that pip/conda virtual environments place on Lustre shared file systems. To run the Python applications or commands it contains correctly, wrappers must be used in PBS job scripts:
The ways of invoking the wrappers are described in the examples above. |
...