Profile
Andrej Karpathy, Stanford (https://twitter.com/karpathy). I like to train Deep Neural Nets on large datasets.

karpathy/char-rnn 10340

Multi-layer Recurrent Neural Networks (LSTM, GRU, RNN) for character-level language models in Torch

karpathy/convnetjs 10043

Deep Learning in Javascript. Train Convolutional Neural Networks (or ordinary ones) in your browser.

karpathy/minGPT 5298

A minimal PyTorch re-implementation of the OpenAI GPT (Generative Pretrained Transformer) training

karpathy/neuraltalk 5148

NeuralTalk is a Python+numpy project for learning Multimodal Recurrent Neural Networks that describe images with sentences.

karpathy/neuraltalk2 4955

Efficient Image Captioning code in Torch, runs on GPU

karpathy/arxiv-sanity-preserver 3984

Web interface for browsing, searching, and filtering recent arXiv submissions

jcjohnson/densecap 1401

Dense image captioning in Torch

karpathy/reinforcejs 1063

Reinforcement Learning Agents in Javascript (Dynamic Programming, Temporal Difference, Deep Q-Learning, Stochastic/Deterministic Policy Gradients)

karpathy/micrograd 1009

A tiny scalar-valued autograd engine and a neural net library on top of it with PyTorch-like API

karpathy/recurrentjs 882

Deep Recurrent Neural Networks and LSTMs in Javascript. More generally also arbitrary expression graphs with automatic differentiation.

issue comment karpathy/arxiv-sanity-preserver

This site can’t be reached

Yeah, so now it's dying because I'm running out of RAM on the machine. I can't simultaneously host the site and re-compute SVMs for all 121K+ papers. The whole thing needs a refactor, sorry.

Anonymous-so

comment created time in 16 days

Pull request review comment karpathy/minGPT

Feature/lightning

Review diff context (bench.py as added in this PR):

"""
Temporary benchmarking script while integrating Lightning, will remove before merge to master
"""

import os
import time
import math
import logging
import argparse

import numpy as np
import torch
from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader
import torch.backends.cudnn as cudnn

from mingpt.model import GPT
from mingpt.lr_decay import WarmupCosineLearningRateDecay
from mingpt.utils import sample

logger = logging.getLogger(__name__)
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",
    level=logging.INFO,
)

torch.backends.cudnn.benchmark = True # autotune kernels

# -----------------------------------------------------------------------------
if int(os.environ.get('USE_LIGHTNING', 0)):
    logging.info("USING LIGHTNING!!")
    import pytorch_lightning as pl
else:
    import mingpt.fake_lightning as pl
    logging.info("using our humble trainer")
# -----------------------------------------------------------------------------

class Text8Dataset(Dataset):
    """
    e.g. Text8 dataset is often used: http://mattmahoney.net/dc/textdata.html
    Vocabulary is lowercase English characters and space for total of 27.
    Training data: First 90M characters.
    Validation data: First 5M characters out of the last 10M characters.
    Testing data: Last 5M characters.
    """

    def __init__(self, data_path, block_size, crop=None, override_vocab=None):

        # load the data and crop it appropriately
        with open(data_path, 'r') as f:
            if crop is None:
                data = f.read()
            else:
                f.seek(crop[0])
                data = f.read(crop[1])

        # build a vocabulary from data or inherit it
        vocab = sorted(list(set(data))) if override_vocab is None else override_vocab
        data_size, vocab_size = len(data), len(vocab)
        logging.info('data of crop %s has %d characters, vocab of size %d.' % (str(crop), data_size, vocab_size))

        self.stoi = { ch:i for i,ch in enumerate(vocab) }
        self.itos = { i:ch for i,ch in enumerate(vocab) }
        self.block_size = block_size
        self.vocab_size = vocab_size
        self.data = data
        self.vocab = vocab

    def __len__(self):
        return len(self.data) // self.block_size

    def __getitem__(self, idx):
        # attempt to fetch a chunk of (block_size + 1) items, but (block_size) will work too
        chunk = self.data[idx*self.block_size : min(len(self.data), (idx+1)*self.block_size + 1)]
        # map the string into a sequence of integers
        ixes = [self.stoi[s] for s in chunk]
        # if stars align (last idx and len(self.data) % self.block_size == 0), pad with -100, to skip training at the last position
        if len(ixes) < self.block_size + 1:
            assert len(ixes) == self.block_size # i believe this is the only way this could happen, make sure
            ixes.append(-100)
        dix = torch.tensor(ixes, dtype=torch.long)
        return dix[:-1], dix[1:]

# -----------------------------------------------------------------------------

parser = argparse.ArgumentParser()
parser.add_argument('-x', '--num-epochs', type=int, default=5, help="number of epochs to train for")
parser.add_argument('-b', '--batch-size', type=int, default=64, help="batch size to train with")
parser.add_argument('-l', '--block-size', type=int, default=128, help="block size for the model (length of window of context)")
parser.add_argument('-n', '--num-workers', type=int, default=0, help="number of workers for dataloading")
parser.add_argument('-g', '--num-gpus', type=int, default=1, help="number of gpus to train on")
parser.add_argument('-p', '--pin-memory', type=int, default=1, help="pin memory on dataloaders?")
parser.add_argument('-r', '--precision', type=int, default=32, help="fp precision to use, e.g. 32/16")
parser.add_argument('-o', '--default_root_dir', type=str, default='.', help="best model checkpoint will be written at this location")
args = parser.parse_args()
print(vars(args))

logging.info("preparing the data loaders")
# NOTE: REDUCED DATA SIZE FOR DEBUGGING, TODO CLEAN BEFORE MERGE IF EVER
train_dataset = Text8Dataset('text8', args.block_size, crop=(0,         int(1e6)))
val_dataset   = Text8Dataset('text8', args.block_size, crop=(int(90e6), int(1e5)), override_vocab=train_dataset.vocab)
test_dataset  = Text8Dataset('text8', args.block_size, crop=(int(95e6), int(1e5)), override_vocab=train_dataset.vocab)
common = {'batch_size': args.batch_size, 'pin_memory': bool(args.pin_memory), 'num_workers': args.num_workers}
train_dataloader = DataLoader(train_dataset, shuffle=True, **common)
val_dataloader   = DataLoader(val_dataset, shuffle=False, **common)
test_dataloader  = DataLoader(test_dataset, shuffle=False, **common)

logging.info("creating the model")
model = GPT(train_dataset.vocab_size, args.block_size, n_layer=6, n_head=8, n_embd=256)

logging.info("preparing the learning rate schedule")
iter_tokens = args.batch_size * args.block_size # number of tokens backpropped in one iteration
epoch_tokens = math.ceil(len(train_dataset) / args.batch_size) * iter_tokens
lr_decay = WarmupCosineLearningRateDecay(learning_rate=6e-4, warmup_tokens=epoch_tokens//2,
                                         final_tokens=args.num_epochs*epoch_tokens)

t0 = time.time()
logging.info("training...")
trainer = pl.Trainer(gpus=args.num_gpus, max_epochs=args.num_epochs, gradient_clip_val=1.0, callbacks=[lr_decay],
                     precision=args.precision, default_root_dir=args.default_root_dir)
trainer.fit(model, train_dataloader, val_dataloader)
t1 = time.time()
logging.info("%d epochs took %fs, or %fs/epoch", args.num_epochs, t1 - t0, (t1-t0)/args.num_epochs)

# todo below: I don't yet understand the Lightning checkpoint schema
# logging.info("testing...")
# ckpt_path = os.path.join(args.default_root_dir, 'model.pt')
# model.load_from_checkpoint(ckpt_path) # load the best checkpoint we found
# trainer.test(test_dataloader=test_dataloader)

Some people always need something, which is why frameworks are so hard. Next thing you know, you can't use a list of data loaders and have to introduce a DataLoaderSetManager object.

karpathy

comment created time in 25 days

pull request comment karpathy/minGPT

Feature/lightning

Ok I think things have improved quite a bit. In particular, my "fake lightning" has now been reduced all the way to

class LightningModule(nn.Module):
    pass

class Callback:
    pass

which is fun :) And I can train with either the fake trainer or the Lightning trainer, and the code looks decent-ish.

karpathy

comment created time in 25 days

push event karpathy/minGPT

Andrej Karpathy

commit sha 492b79fb31994e406ad8d778835c36c42d650071

get rid of spurious function for the model

view details

Andrej Karpathy

commit sha a796899f656345ac541aba49eccb368f49b7d730

reorg the bench code to support multigpu training, have to indent properly under __main__

view details

push time in 25 days

push event karpathy/minGPT

Andrej Karpathy

commit sha 4817231b2341e675136a8b26cc07ec118df32782

testing now works with both lightning and minLightning

view details

Andrej Karpathy

commit sha d91bb1c0beb5bdf4463f661ec4b195cdce3e2387

make labels non-blocking transfer to overlap them, but i don't really expect this to do too much to latency

view details

push time in 25 days

Pull request review comment karpathy/minGPT

Feature/lightning

Review diff context: the same bench.py script reconstructed in full above; this excerpt ends at the commented-out testing block:

# ckpt_path = os.path.join(args.default_root_dir, 'model.pt')
# model.load_from_checkpoint(ckpt_path) # load the best checkpoint we found
# trainer.test(test_dataloader=test_dataloader)

Got it. It looks like test_dataloader is not a kwarg; it's test_dataloaders, with an 's'. Similar to val_dataloaders, but not the same as train_dataloader, which has no 's'. Some of the docs are inconsistent on the use of the 's', btw, I think.
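
For concreteness, a hypothetical sketch of the calls based on the kwarg names described above (the Lightning API of that era; names may have changed since):

trainer.fit(model, train_dataloader=train_dataloader, val_dataloaders=val_dataloader)  # no 's' / with 's'
trainer.test(test_dataloaders=test_dataloader)  # with 's'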

karpathy

comment created time in 25 days

Pull request review comment karpathy/minGPT

Feature/lightning

Review diff context (mingpt/fake_lightning.py as added in this PR):

"""
A manual, minimal and non-full-featured implementation of boilerplate training loop.
Intentionally made to have the same API as PyTorch Lightning, giving two benefits:
1) Everyone can inspect/hack this simple implementation for educational purposes
2) Everyone can run the full Lightning implementation when they just want to go FAST
"""

import os
import math
import logging

from tqdm import tqdm
import torch
import torch.nn as nn

logger = logging.getLogger(__name__)

# -----------------------------------------------------------------------------

class Result:
    """ very thin wrapper around a result of a train/val/test step of the model """
    def __init__(self, minimize=None, checkpoint_on=None):
        self.minimize = minimize
        self.checkpoint_on = checkpoint_on

    def log(self, key, val):
        setattr(self, key, val)

class TrainResult(Result):
    pass

class EvalResult(Result):
    pass

class LightningModule(nn.Module):

    def load_from_checkpoint(self, checkpoint_path):
        logger.info("loading the best model checkpoint from %s", checkpoint_path)
        state_dict = torch.load(checkpoint_path)
        self.load_state_dict(state_dict)

class Callback:
    pass

Got it. OK, converted to use of dicts with the latest commit.
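
A rough sketch of what "use of dicts" means here (illustrative, not the exact commit; it assumes the model's forward returns (logits, loss)):

def training_step(self, batch, batch_idx):
    x, y = batch
    logits, loss = self(x, y)
    return {'loss': loss}  # a plain dict instead of a TrainResult/EvalResult object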

karpathy

comment created time in 25 days

push event karpathy/minGPT

Andrej Karpathy

commit sha 9b1e5a461f1c85a6c609502ee5293a283fda7b10

delete Result structs in favor of dicts

view details

push time in 25 days

pull request comment karpathy/minGPT

Feature/lightning

Okay, I merged one more big refactor. Honestly, I am starting to think this branch was a very bad idea. I thought I could make things clean, but there is a lot of baggage that Lightning "leaks" in a number of places, e.g. w.r.t. model checkpointing and the use of Training/Eval Result structures, forcing me into relatively odd-looking abstractions and half-measures.

Anyway, thank you for your help @williamFalcon , I'll have to sleep on this a few days, read the Lightning docs more, and then maybe give it another shot some other time.

karpathy

comment created time in 25 days

Pull request review comment karpathy/minGPT

Feature/lightning

Review diff context: mingpt/fake_lightning.py, a minimal stand-in for the Lightning API (Result/TrainResult wrappers, a LightningModule exposing a derived device property, Callback and LightningDataModule stubs, and a simple single-GPU Trainer with checkpoint save/load and a fit method). The excerpt ends where Trainer.fit ships the model to the GPU:

        # ship model to gpu if possible
        device = 'cpu'
        if self.gpus > 0 and torch.cuda.is_available():
            logger.info("found CUDA device, shipping model to GPU")
            device = 'cuda'
            self.model = self.model.to(device)

Thank you, yes, it's cleaner. I like the more general .to syntax because we'll have many different XPUs etc.; it feels a bit more future-proof.
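
A tiny sketch of the distinction being made (accelerator strings other than 'cuda' and 'cpu' are hypothetical future examples):

model = model.cuda()      # CUDA-specific method
model = model.to(device)  # device can be 'cpu', 'cuda', or whatever accelerator string comes next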

karpathy

comment created time in 25 days

Pull request review comment karpathy/minGPT

Feature/lightning

Review diff context (bench.py reworked around a CharDataset and a CharDataModule; the excerpt ends at the DataLoader construction under review):

"""
Temporary benchmarking script while integrating Lightning, will remove before merge to master
"""

import os
import time
import math
import logging
import argparse

import numpy as np
import torch
from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader
import torch.backends.cudnn as cudnn

from mingpt.model import GPT
from mingpt.lr_decay import WarmupCosineLearningRateDecay
from mingpt.utils import sample

logger = logging.getLogger(__name__)
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",
    level=logging.INFO,
)

# -----------------------------------------------------------------------------
if int(os.environ.get('USE_LIGHTNING', 0)):
    logging.info("USING LIGHTNING!!")
    import pytorch_lightning as pl
else:
    import mingpt.fake_lightning as pl
    logging.info("using our humble trainer")
# -----------------------------------------------------------------------------

class CharDataset(Dataset):
    """
    e.g. Text8 dataset is often used: http://mattmahoney.net/dc/textdata.html
    Vocabulary is lowercase English characters and space for total of 27.
    Training data: First 90M characters.
    Validation data: First 5M characters out of the last 10M characters.
    Testing data: Last 5M characters.
    """

    def __init__(self, data_path, block_size, split, override_vocab=None):

        # load the data and crop it appropriately
        with open(data_path, 'r') as f:
            data = f.read()

        crop = {
            'train': (0, 90000000),
            'val': (90000000, 95000000),
            'test': (95000000, 100000000),
        }[split]
        data = data[crop[0]:crop[1]]

        # build a vocabulary from data or inherit it
        vocab = sorted(list(set(data))) if override_vocab is None else override_vocab
        data_size, vocab_size = len(data), len(vocab)
        logging.info('data of crop %s has %d characters, vocab of size %d.' % (str(crop), data_size, vocab_size))

        self.stoi = { ch:i for i,ch in enumerate(vocab) }
        self.itos = { i:ch for i,ch in enumerate(vocab) }
        self.block_size = block_size
        self.vocab_size = vocab_size
        self.data = data
        self.vocab = vocab

    def __len__(self):
        return math.ceil(len(self.data) / self.block_size)

    def __getitem__(self, idx):
        i = np.random.randint(0, len(self.data) - (self.block_size + 1))
        chunk = self.data[i:i+self.block_size+1]
        dix = torch.tensor([self.stoi[s] for s in chunk], dtype=torch.long)
        return dix[:-1], dix[1:]

class CharDataModule(pl.LightningDataModule):

    def __init__(self, batch_size=64, block_size=128, pin_memory=0, num_workers=0):
        super().__init__()
        self.batch_size = batch_size
        self.block_size = block_size
        self.num_workers = num_workers
        self.pin_memory = pin_memory

        self.train_dataset = CharDataset('text8', self.block_size, 'train')

    def prepare_data(self): # called only on 1 GPU/machine
        pass # could technically download text8 here...

    def setup(self, stage): # called for every GPU/machine
        if stage == 'train' or stage == 'fit':
            pass # nothing to do, the train_dataset is initialized in the constructor
        elif stage == 'val':
            self.val_dataset = CharDataset('text8', self.block_size, 'val', override_vocab=self.train_dataset.vocab)
        elif stage == 'test':
            self.test_dataset = CharDataset('text8', self.block_size, 'test', override_vocab=self.train_dataset.vocab)
        else:
            raise ValueError(f"stage {stage} is not recognized")

    def train_dataloader(self):
        loader = DataLoader(self.train_dataset, batch_size=self.batch_size,
                            shuffle=True, pin_memory=bool(self.pin_memory),
                            num_workers=self.num_workers)
Yes, subtle point!

karpathy

comment created time in 25 days

Pull request review comment karpathy/minGPT

Feature/lightning

Review diff context: the same CharDataset/CharDataModule version of bench.py shown above, here continuing through val_dataloader and test_dataloader and ending at the start of the argparse block:

parser = argparse.ArgumentParser()
parser.add_argument('-x', '--num-epochs', type=int, default=5, help="number of epochs to train for")
Neat! I'll have to read more of the docs

karpathy

comment created time in 25 days

Pull request review comment karpathy/minGPT

Feature/lightning

Review diff context: the same CharDataset/CharDataModule version of bench.py shown above, here ending inside CharDataModule.setup(stage), at the branch that constructs the validation dataset:

    def setup(self, stage): # called for every GPU/machine
        if stage == 'train' or stage == 'fit':
            pass # nothing to do, the train_dataset is initialized in the constructor
        elif stage == 'val':

Sorry, I'm pretty sure I totally misunderstood what a "stage" is and thought it referred to splits.

karpathy

comment created time in 25 days

push event karpathy/minGPT

Andrej Karpathy

commit sha 1aa67ca5277107b406304c6b35e915c686ad09f6

switch to a faster version of zero_grad()

view details

Andrej Karpathy

commit sha 452a5ab9a0a0f9b0e32f947f34f46ac8b425a0e4

massive refactor yet again. this was all probably a pretty bad idea

view details

push time in 25 days

pull request comment karpathy/minGPT

Feature/lightning

Alright, calling it here for today; I'm tired and still have some actual work to do. I'm pretty sure I don't understand how the pl.LightningDataModule API is supposed to be used properly between the init, prepare, setup, and dataloader calls, what stage=None means, how it is expected to be called from the Trainer, etc. To be continued...

karpathy

comment created time in a month

push event karpathy/minGPT

Andrej Karpathy

commit sha ebd40f112c249efdbfad7ae9d3e5d885a5d0661e

support fp16/32 precision in bench

view details

push time in a month

push event karpathy/minGPT

Andrej Karpathy

commit sha 0ed3376b3f44ce5d018b60328cbb01de52a9383b

move instantiation of text dataset into the constructor so we don't have to create it twice

view details

push time in a month

push event karpathy/minGPT

Andrej Karpathy

commit sha fa10298a8de5a20f387c9eec035afc9d6a30b420

use a standard benchmark (text8) and implement train/val/test splits

view details

push time in a month

pull request comment karpathy/minGPT

Feature/lightning

@williamFalcon OK, done. A few things I find a bit gross:

  • it feels like I have to manually create a spurious train_dataset, because I need to know the vocab size to calculate the learning rate decay, to sample at the end, etc.
  • see data_module.setup('fit') # fit... should be 'train'? ;\. Based on https://pytorch-lightning.readthedocs.io/en/latest/new-project.html, is 'fit' the currently sanctioned name for the training stage, or should it maybe be 'train'?
karpathy

comment created time in a month

push event karpathy/minGPT

Andrej Karpathy

commit sha fb37e03cd17d546ff3b01ecbdabcebf2bb17f395

refactor into a datamodule, attempt number 1

view details

push time in a month

pull request comment karpathy/minGPT

Feature/lightning

OK, with the last commit I feel a bit better about things. mingpt/fake_lightning.py is now basically a minLightning :D, which could ideally be imported instead of Lightning to get a strict subset of just the very basic functionality. It's not functionally equivalent and is a little bit hardcoded to minGPT's purposes, but that's okay.

karpathy

comment created time in a month

push event karpathy/minGPT

Andrej Karpathy

commit sha 81650ae4d7162fbb01ef2adbb46fab87c8952a1d

one more refactor, this is better because the equivalence to lightning is now much cleaner and all of lightning functionality is in one file

view details

push time in a month

PR opened karpathy/minGPT

Feature/lightning

I'm trying, @williamFalcon, but I have somewhat mixed feelings about it. The APIs are now matched up and I can train the basic loop with either:

$ USE_LIGHTNING=0 python bench.py
5 epochs took 33.225314s, or 6.645063s/epoch

or

$ USE_LIGHTNING=1 python bench.py
5 epochs took 30.068728s, or 6.013746s/epoch

Some overhead is incurred, not that it matters too much at the stage of a single GPU.

To merge, I would still have to:

  • clean up a bit further and ideally make it even more transparent; needs a bit more thought
  • delete bench.py
  • uprev all notebooks
  • uprev Readme file
+262 -129

0 comment

4 changed files

pr created time in a month

create branch karpathy/minGPT

branch : feature/lightning

created branch time in a month

issue closed karpathy/arxiv-sanity-preserver

This site can’t be reached

http://www.arxiv-sanity.com/ ERR_CONNECTION_REFUSED

closed time in a month

Anonymous-so

issue comment karpathy/arxiv-sanity-preserver

This site can’t be reached

The server just died for no reason; I restarted it.

Anonymous-so

comment created time in a month

pull request comment karpathy/minGPT

Fix typo in comment in play_char.ipynb

:) ty

brchristian

comment created time in a month

push event karpathy/minGPT

brchristian

commit sha 4b5d96b99c144bf02042e0af2448afc8c230c2c5

Fix typo in comment in play_char.ipynb

view details

Andrej

commit sha 4050db60409b5bbaaa3302cee1e49847fc145c65

Merge pull request #32 from brchristian/patch-1 Fix typo in comment in play_char.ipynb

view details

push time in a month

PR merged karpathy/minGPT

Fix typo in comment in play_char.ipynb
+2 -2

0 comment

1 changed file

brchristian

pr closed time in a month

push event karpathy/minGPT

fpgaminer

commit sha a7b13e02ffef5a6569b6ea7cac1b0566665d9e75

fix CharDataset::__len__ off by one error

view details

Andrej

commit sha c43600576e37b422357c29131d6c6837ad8482bf

Merge pull request #31 from fpgaminer/master fix CharDataset::__len__ off by one error

view details

push time in a month

PR merged karpathy/minGPT

fix CharDataset::__len__ off by one error

I made an off-by-one mistake in my comments on issue #22, which unfortunately got rolled into 339f4e7. Sorry about that!

Verified by example:

>>> data = list(range(8))
>>> len(data)
8
>>> block_size = 4
>>> 
>>> 
>>> for i in range(len(data) - block_size):
...     print(i)
...     print(data[i:i+block_size+1])
... 
0
[0, 1, 2, 3, 4]
1
[1, 2, 3, 4, 5]
2
[2, 3, 4, 5, 6]
3
[3, 4, 5, 6, 7]
>>> for i in range(len(data) - (block_size+1)):
...     print(i)
...     print(data[i:i+block_size+1])
... 
0
[0, 1, 2, 3, 4]
1
[1, 2, 3, 4, 5]
2
[2, 3, 4, 5, 6]
+1 -1

1 comment

1 changed file

fpgaminer

pr closed time in a month

pull request comment karpathy/minGPT

fix CharDataset::__len__ off by one error

An easy way to see this is that when block_size + 1 == len(data), the call to len should return 1. I actually spotted the issue in your PR and then copied it into the code anyway and forgot about it; I think it was just too late.
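
A concrete check of that boundary case (illustrative only):

data = list(range(5))
block_size = 4                        # block_size + 1 == len(data)
print(len(data) - block_size)         # 1: exactly one (x, y) example, the corrected __len__
print(len(data) - (block_size + 1))   # 0: the off-by-one version would yield no examples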

fpgaminer

comment created time in a month

PR closed karpathy/minGPT

Adding CODE_OF_CONDUCT
+76 -0

1 comment

1 changed file

gaushikmr

pr closed time in a month

pull request comment karpathy/minGPT

Adding CODE_OF_CONDUCT

maybe if this project grows up more. for now it's a weekend project i'm still rapidly iterating on

gaushikmr

comment created time in a month

issue closed karpathy/minGPT

play_char training is broken; CharDataset is not multiprocessing compatible

I discovered that the CharDataset implementation is broken and returns the same batch of data multiple times in a row. This causes massive overfitting and wasted cycles.

The root issue is that CharDataset is not multiprocessing compatible, but num_workers is >1 so it's used in multiprocessing mode.

Details

The source of the problem is this line in __getitem__:

        i = np.random.randint(0, len(self.data) - (self.block_size + 1))

CharDataset is fed into a DataLoader during training, with num_workers set greater than 1. This puts DataLoader into multiprocessing mode where it distributes the Dataset to multiple processes. The crux of the issue is that in doing so it copies over local state, including for example the state of random number generators. So that line above will return the exact same sequence of "random" indexes in every worker process. This results in the same batch of data being repeated four times in a row, before repeating the next batch of data four times, and so on.

Here is a notebook that simplifies play_char to demonstrate the issue: https://gist.github.com/fpgaminer/7737a9377e3379fe17dc5bb83d4db69c

In the simplified notebook __getitem__ returns i directly. In the last cell it iterates the loader and prints out the batches. As can be seen, batches are repeated four times.
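
A minimal standalone repro of this failure mode (not the linked gist, just an illustrative sketch; the duplication shows up with the fork start method, e.g. on Linux):

import numpy as np
from torch.utils.data import Dataset, DataLoader

class RandomIndexDataset(Dataset):
    def __len__(self):
        return 16
    def __getitem__(self, idx):
        # ignores idx and uses numpy's RNG, like the original CharDataset
        return np.random.randint(0, 1000)

loader = DataLoader(RandomIndexDataset(), batch_size=4, num_workers=4)
for batch in loader:
    print(batch)  # each batch of "random" numbers is repeated once per worker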

Workaround

The workaround for me was to set num_workers to 1. Before the workaround the model showed signs of overfitting on WebText2 which shouldn't be possible. After the workaround, the model started to train correctly and test loss began dropping as expected.

Fix

I haven't worked with raw PyTorch much, so I don't know the idiomatic fix. I'm happy to research and propose a pull request if you would like. Perhaps the easiest fix is to use the workaround and drop an assert into CharDataset to throw if multiprocessing gets used. Since the dataset is in-memory there's little reason to use multiple workers. Larger datasets would really need a different Dataset implementation anyway.
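
One way the suggested assert could look (hypothetical, not the fix that was eventually merged): torch.utils.data.get_worker_info() returns None in the main process and a worker-info object inside DataLoader worker processes.

from torch.utils.data import get_worker_info

def __getitem__(self, idx):  # fragment, inside CharDataset
    assert get_worker_info() is None, \
        "CharDataset is not multiprocessing-safe; construct the DataLoader with num_workers=0"
    ...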

closed time in a month

fpgaminer

issue comment karpathy/minGPT

play_char training is broken; CharDataset is not multiprocessing compatible

addressed in https://github.com/karpathy/minGPT/commit/339f4e7ad39558bfd7e99d916b9fdd6c6827f807

As mentioned in the commit, I'm still not happy with this demo, and I'm not happy that epochs will now take a super long time. Closing the issue for now.

fpgaminer

comment created time in a month

push event karpathy/minGPT

Andrej Karpathy

commit sha 339f4e7ad39558bfd7e99d916b9fdd6c6827f807

fix dataloader issue pointed out by @fpgaminer in #28 and introduce shuffle=True and pin_memory=True as defaults. That said I'm still not very happy with this demo because we're likely overfitting a massive model to tiny text and nothing is really tuned at all. This needs a real train/test dataset and a tiny bit of hyperparameter search, todo.

view details

push time in a month

issue comment karpathy/minGPT

play_char training is broken; CharDataset is not multiprocessing compatible

"epochs are just acting as delimiters for when to log training and testing loss progress"

Yes, exactly. For the char demo it may be just fine to use the i = idx; shuffle=True fix with no other changes. The epoch -> iterations logging change can be thought of as a separate issue. Let me think through the details here, since it's a little bit gnarly and would suddenly involve all the other demos too, etc. Bleh.
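
A sketch of that fix, based on the __getitem__ shown earlier in this thread (the Trainer would then create the DataLoader with shuffle=True):

def __getitem__(self, idx):
    # use the index handed in by the sampler instead of an internal np.random call
    chunk = self.data[idx : idx + self.block_size + 1]
    dix = torch.tensor([self.stoi[s] for s in chunk], dtype=torch.long)
    return dix[:-1], dix[1:]

The dataset's __len__ would then need to be len(self.data) - self.block_size, per the off-by-one discussion above.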

fpgaminer

comment created time in a month

issue comment karpathy/minGPT

play_char training is broken; CharDataset is not multiprocessing compatible

Got it, yes, agreed on all points. The only downside with my proposal above seems to be that epochs will last a long time and are over-estimated by a factor of block size. But the correct thing would be happening as far as training goes. If you squint, you can actually see that as correct, because technically every window of data is a different example: each output is conditioned on a slightly different-sized input. So TLDR, I'm not too averse to that as the interpretation of "epoch". I'm kind of an enemy of the concept of epochs anyway; I much prefer to think about everything in raw iterations and multiples thereof.

fpgaminer

comment created time in a month

issue comment karpathy/minGPT

play_char training is broken; CharDataset is not multiprocessing compatible

why doesn't shuffle=True suffice?

fpgaminer

comment created time in a month

issue comment karpathy/minGPT

play_char training is broken; CharDataset is not multiprocessing compatible

Idiomatic raw PyTorch, and the "correct" solution, is what I described above: returning the correct idx'th example. The workaround is not a good idea.

fpgaminer

comment created time in a month

issue comment karpathy/minGPT

play_char training is broken; CharDataset is not multiprocessing compatible

(However, note that this change is then coupled to the Trainer code, where we have to use an appropriate sampler or set shuffle=True when we create the DataLoader object. That actually sounds like a good default setting, I believe.)

fpgaminer

comment created time in a month

issue comment karpathy/minGPT

play_char training is broken; CharDataset is not multiprocessing compatible

Ouch, doh. The correct thing is to not ignore the idx that comes in, and to return the example appropriate to it.

fpgaminer

comment created time in a month

pull request comment karpathy/minGPT

Weight decay exclusions

@morganmcg1 the purpose of L2 regularization is to "spread out" the weights in dot products, ensuring that more "independent measurements" (dimensions of the input) get used more equally, instead of any one feature dominating the computation. This only makes sense for matrix-multiply layers, which the embeddings and the layernorm parameters are not. L2 regularization is covered, e.g., in CS231n.
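
A minimal sketch of that kind of split (illustrative, not the exact minGPT implementation; it assumes a model nn.Module in scope, and the name filters 'ln'/'emb' are assumptions about typical parameter names):

import torch

decay, no_decay = [], []
for name, p in model.named_parameters():
    # decay only the 2D matmul weights; skip biases, LayerNorms, embeddings
    if p.ndim >= 2 and 'ln' not in name and 'emb' not in name:
        decay.append(p)
    else:
        no_decay.append(p)

optimizer = torch.optim.AdamW([
    {'params': decay, 'weight_decay': 0.1},
    {'params': no_decay, 'weight_decay': 0.0},
], lr=6e-4)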

michaellavelle

comment created time in a month

pull request comment karpathy/minGPT

play_math.ipynb: remove hints for the model

It's correct the way it is. I think I could definitely document this better, I apologize; it is a bit tricky. Basically, minGPT always just predicts the single next integer in the sequence. The input and target x = [1, 2, 3, 7, 0, 4], y = [-100, -100, -100, 0, 4, 9] are actually 3 separate examples that will all be trained on independently and simultaneously, with a single forward/backward pass:

given [1,2,3,7] predict 0
given [1,2,3,7,0] predict 4
given [1,2,3,7,0,4] predict 9

Then at test time, when minGPT sees [1,2,3,7], it will decide the first digit (0), then the second (4), then the third (9), one by one, creating the prediction in 3 steps.
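
A small concrete illustration of the masking (a sketch; the vocabulary size of 10 and the random logits are stand-ins):

import torch
import torch.nn.functional as F

x = torch.tensor([[1, 2, 3, 7, 0, 4]])
y = torch.tensor([[-100, -100, -100, 0, 4, 9]])
logits = torch.randn(1, 6, 10)  # stand-in for the model output over 10 digit classes
loss = F.cross_entropy(logits.view(-1, 10), y.view(-1), ignore_index=-100)
# positions with target -100 (the "question" digits) contribute nothing to the loss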

j-planet

comment created time in a month

issue comment karpathy/minGPT

play_image notebook doesn't work.

minGPT isn't a library you install just yet, so mingpt has to be on your PYTHONPATH, e.g. is your notebook opened in the repo root directory?
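
For example (the path below is hypothetical), one workaround if the notebook lives elsewhere:

import sys
sys.path.append('/path/to/minGPT')  # repo root containing the mingpt/ package
from mingpt.model import GPT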

n00mkrad

comment created time in a month

issue closed karpathy/minGPT

Layer norm weights not excluded from weight decay

Currently weight decay is being applied to the LayerNorm weights.

I've raised https://github.com/karpathy/minGPT/pull/18 and https://github.com/karpathy/minGPT/pull/16 to resolve this issue.

closed time in a month

michaellavelle

pull request comment karpathy/minGPT

Correcting the Bard's name

haha cool ty :)

michaellavelle

comment created time in a month

push event karpathy/minGPT

Michael Lavelle

commit sha effa35fd93721f43cbd5fb36c5ade5747fa5737f

Correcting the Bard's name

view details

Andrej

commit sha 94187b944cf9f91468893d7dd26c57eea9de1b39

Merge pull request #25 from michaellavelle/shakespeare Correcting the Bard's name

view details

push time in a month

PR merged karpathy/minGPT

Correcting the Bard's name
+2 -2

0 comment

1 changed file

michaellavelle

pr closed time in a month

push event karpathy/minGPT

Andrej Karpathy

commit sha 63902c8d092277ac2fdaf7b7c75f7810b4166e8a

remove passive aggressive comment. control yourself andrej.

view details

push time in a month

push event karpathy/minGPT

Andrej Karpathy

commit sha 38d7327dfdb671040b55624a90f92d2cfc54ff37

instead of -1e10 use float -inf, which I think will play nicer with fp16 down the line

view details

push time in a month

PR closed karpathy/minGPT

More generic fix for HuggingFace LayerNorm.weight reference - replacing with PyTorch equivalents

I read on the README that "mingpt/trainer.py is (GPT-independent) PyTorch boilerplate" and realised the previous fix (https://github.com/karpathy/minGPT/pull/16 ) for the layer norm weight exclusion was coupled to the GPT structure.

This PR should resolve that by allowing a set of module types to be configured whose parameters are to be excluded from weight decay.

+11 -1

2 comments

1 changed file

michaellavelle

pr closed time in a month

pull request comment karpathy/minGPT

More generic fix for HuggingFace LayerNorm.weight reference - replacing with PyTorch equivalents

fixed in https://github.com/karpathy/minGPT/commit/bbbdac74fa9b2e55574d70056163ffbae42310c1

michaellavelle

comment created time in a month

PR closed karpathy/minGPT

Weight decay exclusions

Excluding Parameters from weight decay by default which are:

  • Freestanding ( not nested with a Module)
  • Within LayerNorm or Embedding modules

A list of parameter name overrides can be configured with TrainerConfig that allow specific Parameters to be included with weight decay

+28 -1

1 comment

1 changed file

michaellavelle

pr closed time in a month

pull request comment karpathy/minGPT

Weight decay exclusions

Fixed in https://github.com/karpathy/minGPT/commit/bbbdac74fa9b2e55574d70056163ffbae42310c1

michaellavelle

comment created time in a month

PR closed karpathy/minGPT

Fixing HuggingFace LayerNorm.weight reference - replacing with PyTorch equivalents

A reference to "LayerNorm.weight" looks like it may have been ported from a HuggingFace implementation - when running locally the values that I need to exclude from weight decay in this implementation are "ln1.weight", "ln2.weight" and "ln_f.weight".

Update: I've created another PR with a more generic fix for this issue: https://github.com/karpathy/minGPT/pull/18

Thanks for putting out this awesome code - I'm working on a Java implementation of gpt and couldn't find any existing code to reference that contained both the sampling and training steps without introducing a lot of complexity or superfluous frameworks. This is a really helpful project - many thanks!

+1 -1

2 comments

1 changed file

michaellavelle

pr closed time in a month

pull request comment karpathy/minGPT

Fixing HuggingFace LayerNorm.weight reference - replacing with PyTorch equivalents

Fixed in https://github.com/karpathy/minGPT/commit/bbbdac74fa9b2e55574d70056163ffbae42310c1

michaellavelle

comment created time in a month

issue comment karpathy/minGPT

Layer norm weights not excluded from weight decay

OK, here is my solution. Basically, I couldn't find a good generic way to make model-free assumptions about which params should or should not be decayed. In that case it makes most sense to couple this decision to the model code itself. And if we're doing that, we might as well look into making things incrementally more compatible with Lightning, to which we may want to switch later.

https://github.com/karpathy/minGPT/commit/bbbdac74fa9b2e55574d70056163ffbae42310c1

michaellavelle

comment created time in a month

push event karpathy/minGPT

Andrej Karpathy

commit sha 0d9d098cd208a0de1e7a456d62462c8ce352136b

first commit, able to multigpu train fp32 GPTs on math and character-level data, but have done barely any tuning.

view details

Shivam Tawari

commit sha 25c2ad25dd393fa8cc2eca5be2959a9ff5b96588

Update README.md

view details

Andrej Karpathy

commit sha d708b1e5e292473430e5d074d42d0c8990a3373a

fix a dumb bug, intended to use -1e10 instead of 1e-10. thank you @fpgaminer for spotting and bringing to my attention

view details

Andrej

commit sha c97efac9a936912a058c635aa46661d1331b0b1e

Merge pull request #6 from shivamtawari/patch-1 Update README.md

view details

Jaka Kravanja

commit sha 004b807eb21b45344eb5be2c4545f8fba8e88db7

Sort characters to always return same mapping

view details

Andrej

commit sha ebcc03ec7e55ba70d54946c51eebc0ab9351bcac

Merge pull request #21 from jkravanja/master Sort characters to always return same mapping

view details

Andrej Karpathy

commit sha d15a85719ecef0f7585d5aa45d8823fd075be99c

add demo of image gpt trained on CIFAR-10

view details

Andrej Karpathy

commit sha eca27f631653b5a6365ff213b481074ae4aaeef8

Merge branch 'master' of github.com:karpathy/minGPT

view details

Andrej

commit sha 23982656dffca3ed2839a65ea10e80216001b430

add early stopping logic

view details

Andrej

commit sha bbbdac74fa9b2e55574d70056163ffbae42310c1

properly separate params that should be weight decayed, and make a small incremental step towards Lightning compatibility by creating the optimizer object inside the model's configure_optimizers

view details

Andrej

commit sha 421caf8b201c278f535243a982f78143611bb0a3

mit license file

view details

Andrej

commit sha 5a67ab913dff1746bf1192954a71f6ec1e4d75a8

add early stopping to cifar10 image demo

view details

Andrej

commit sha a8835cfebcfb8cae7eee3c8515a35f1361dfda4f

bleh resolve merge conflicts

view details

Andrej

commit sha f6830858927ee0abee6bf09f1850d452f3f62e11

resolve merge conflict, this is not going well at all

view details

push time in a month

issue comment karpathy/minGPT

Layer norm weights not excluded from weight decay

Yes, the "naked" pos_emb is a bit tricky. I have a specific, not amazing but okay, solution in mind that I'll push soon as a quick fix.

michaellavelle

comment created time in a month

pull request comment karpathy/minGPT

Adapted this awesome minGPT to use PyTorch lightning!

cool, looking forward to 1.0!

williamFalcon

comment created time in a month

pull request comment karpathy/minGPT

Adapted this awesome minGPT to use PyTorch lightning!

By the way, I am still making my way through the docs/code, but I actually like what Lightning is trying to do, and I think I will try to incrementally restructure the code to meet its API. At that point it will be trivial to either use the included trainer object for full explicit flexibility, or just the Lightning trainer.

williamFalcon

comment created time in a month

issue comment karpathy/minGPT

Layer norm weights not excluded from weight decay

Also, while we're fixing weight decay, I'm not sure that we should be decaying the positional/token embeddings; I'm trying to find some references on this.

michaellavelle

comment created time in a month

issue comment karpathy/minGPT

Layer norm weights not excluded from weight decay

Do you by any chance know what the history is of this sneaking through to the HuggingFace repo / it being fixed there?

michaellavelle

comment created time in a month

issue comment karpathy/minGPT

Layer norm weights not excluded from weight decay

Yes ty, will try to find some time to take a look today.

michaellavelle

comment created time in a month

push event karpathy/minGPT

Andrej Karpathy

commit sha 4e152c7aeed35224f2ccf91a8f44d9b767044f48

add demo of image gpt trained on CIFAR-10

view details

Andrej Karpathy

commit sha d100e2251a258ea6c72e59eeba83539567e8fc8c

Merge branch 'master' of github.com:karpathy/minGPT

view details

push time in a month

pull request comment karpathy/minGPT

Replace tqdm with labml for nicer outputs

You're right; filing under "won't merge", but I can see how some would find labml useful. Thank you for the PR!

vpj

comment created time in a month

pull request comment karpathy/minGPT

Sort characters to always return same mapping

excellent PR, thank you so much for the fix!

jkravanja

comment created time in a month

push event karpathy/minGPT

Jaka Kravanja

commit sha 88bf19a869b8b5524fc220099d0d9a78e7fe0172

Sort characters to always return same mapping

view details

Andrej

commit sha 382ac70290d1461b40fd9734e047a5e64723941e

Merge pull request #21 from jkravanja/master Sort characters to always return same mapping

view details

push time in a month

PR merged karpathy/minGPT

Sort characters to always return same mapping

list(set()) does not always return characters in the same order. If this code is reused to create the character mapping when loading a saved model and sampling from it, the generated text does not make any sense.
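A minimal sketch of the idea behind the one-line fix (the file path is illustrative, not the actual diff):

data = open('input.txt').read()                # path is illustrative
chars = sorted(set(data))                      # list(set(data)) has no stable order
stoi = {ch: i for i, ch in enumerate(chars)}   # deterministic char -> index
itos = {i: ch for i, ch in enumerate(chars)}   # deterministic index -> char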

+1 -1

0 comment

1 changed file

jkravanja

pr closed time in a month

pull request commentkarpathy/minGPT

[WIP] codestyle and Catalyst example

My default is to dislike monolithic (Keras-like) frameworks, and I also dislike forced coding styles that expand code into a billion lines of code and whitespace. I don't find it more readable at all.

That said I understand this is a strong matter of preference. I will not be merging this PR into this repo but please definitely feel free to adapt any of this code in whatever way you like, build on it, include it into Catalyst examples, etc.

Thank you!

Scitator

comment created time in a month

issue closedkarpathy/minGPT

Use PyTorch Lightning to handle the training (free checkpointing + logging + 16-bit precision)

Awesome repo!

However, I'm not sure why you'd go through the effort of implementing your own trainer again...

In lightning we already support:

  • automatic checkpoint loading/saving
  • multi-cpu
  • multi-gpu
  • multi-tpu core
  • 16-bit precision (amp and native)
  • accumulated gradients
  • and about 40+ more features.

Not to mention it's maintained by a team of 20+ full-time engineers and 200+ open-source contributors, and has been adopted by over 400 companies and research labs.

https://pytorch-lightning.readthedocs.io/en/latest/new-project.html
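For reference, a minimal illustrative sketch of the kind of Trainer call being advertised here; `model` is assumed to be some LightningModule subclass and `train_loader` a DataLoader, neither defined in this snippet:

import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=1,                     # or tpu_cores=8 for TPU training
    precision=16,               # 16-bit precision
    accumulate_grad_batches=4,  # accumulated gradients
    max_epochs=10,
)
trainer.fit(model, train_loader)  # checkpointing and logging come for free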

closed time in a month

williamFalcon

issue commentkarpathy/minGPT

Use PyTorch Lightning to handle the training (free checkpointing + logging + 16-bit precision)

@williamFalcon your link above doesn't work btw. Closing the Issue and will look at the PR shortly.

williamFalcon

comment created time in a month

issue closedkarpathy/minGPT

nn.MHA and MultiheadAttentionContainer in torchtext

Just saw the comment here. Please let us know if you have any feedback for nn.MHA. At the same time, torchtext has released a new MHA container, which is shorter and more flexible for new research variants (link).
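For context, a minimal sketch of causal self-attention using the built-in nn.MultiheadAttention (sequence-first (T, B, C) layout; the sizes are illustrative):

import torch
import torch.nn as nn

T, B, C, n_head = 16, 4, 64, 8
mha = nn.MultiheadAttention(embed_dim=C, num_heads=n_head)
x = torch.randn(T, B, C)
# additive mask: -inf above the diagonal blocks attention to future positions
causal_mask = torch.triu(torch.full((T, T), float('-inf')), diagonal=1)
y, _ = mha(x, x, x, attn_mask=causal_mask)  # y has shape (T, B, C)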

closed time in a month

zhangguanheng66

issue commentkarpathy/minGPT

nn.MHA and MultiheadAttentionContainer in torchtext

ty. it's just so verbose, i like mine a lot better. and it would be a 3rd party dependency, which I'm trying to avoid. Right now it's only PyTorch.

zhangguanheng66

comment created time in a month

pull request commentkarpathy/minGPT

Update README.md

Haha ok ty :)

shivamtawari

comment created time in a month

push eventkarpathy/minGPT

Shivam Tawari

commit sha 31cd9896105af613836a95a57312b64d2f78d0d2

Update README.md

view details

Andrej

commit sha 3f1d1036d77cd68aa4c730583cc1f777015f45b0

Merge pull request #6 from shivamtawari/patch-1 Update README.md

view details

push time in a month

PR merged karpathy/minGPT

Update README.md

Corrected a few typos.

+2 -2

0 comment

1 changed file

shivamtawari

pr closed time in a month

PR closed karpathy/minGPT

Porting minGPT to Hydra

This PR ports minGPT to Hydra. In a nutshell, you can override everything from the command line, and you can use composition to swap out whole nodes of your config (in this example, model, dataset, and trainer are all pluggable).

Hydra has its own plugins for launching Hydra applications to various environments (think launching to SLURM without writing extra scripts), for hyperparameter optimization (think optimizing with Ax or Nevergrad without changing your code), or just for a simple grid search (--multirun) to find the best model/dataset combination, for example.

This example also uses Structured Configs, which are a way to get type-safe config that is validated both statically (duck typing) and at runtime. Try to override a parameter to the wrong type to see what I mean. The config structure is all in mingpt/conf/__init__.py in the PR; take a look.

This example demonstrates what I currently consider best practices for Hydra 1.0.0rc3 (released today), but Hydra is very flexible and you can change many things.

There is a whole lot more to Hydra, the best way to learn about it is to go through the website tutorials.
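For flavor, a minimal sketch of a Hydra 1.0 entry point of the kind this PR adds (the config path and name here are illustrative, not necessarily the PR's exact layout):

import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    # every field can be overridden on the command line,
    # e.g. `python main.py model.config.n_layer=12 trainer.config.batch_size=256`
    print(OmegaConf.to_yaml(cfg))

if __name__ == "__main__":
    main()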

$ python mingpt/main.py
train_dataset:
  _target_: mingpt.chardataset.CharDataset
  url: https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
  filename: tinyshakespeare.txt
  block_size: 128
  cache_dir: ${hydra:runtime.cwd}/datasets_cache
test_dataset: null
model:
  _target_: mingpt.model.GPT
  config:
    embd_pdrop: 0.1
    resid_pdrop: 0.1
    attn_pdrop: 0.1
    n_layer: 8
    n_head: 8
    n_embd: 512
    block_size: ${train_dataset.block_size}
    vocab_size: ???
seed: 42
trainer:
  _target_: mingpt.trainer.Trainer
  config:
    max_epochs: 200
    batch_size: 512
    learning_rate: 0.0006
    betas:
    - 0.9
    - 0.95
    grad_norm_clip: 1.0
    weight_decay: 0.1
    lr_decay: true
    warmup_tokens: 10240
    final_tokens: ???
    ckpt_path: checkpoint.t7
    num_workers: 4

[2020-08-17 16:27:51,115][mingpt.chardataset][INFO] - Loading /home/omry/dev/minGPT/datasets_cache/tinyshakespeare.txt
[2020-08-17 16:27:51,124][mingpt.chardataset][INFO] - data has 1115394 characters, 65 unique.
[2020-08-17 16:27:51,436][mingpt.model][INFO] - number of parameters: 2.535219e+07
[2020-08-17 16:27:51,524][mingpt.trainer][INFO] - Using device : 0
  0%|                                                                                                                                                                                                                                                                                                                                                         | 0/17 [00:00<?, ?it/s]
/home/omry/miniconda3/envs/mingpt/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
epoch 1 iter 16: train loss 3.34601. lr 5.999637e-04: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 17/17 [00:16<00:00,  1.04it/s]
[2020-08-17 16:28:09,451][mingpt.trainer][INFO] - saving checkpoint.t7
+498 -403

4 comments

14 changed files

omry

pr closed time in a month

pull request commentkarpathy/minGPT

Porting minGPT to Hydra

Will take a look, thank you! Filing under "won't merge" and closing.

omry

comment created time in a month

issue closedkarpathy/minGPT

Possible Improvement to top_k_logits

I have a possible improvement to the mingpt.utils.top_k_logits function:

def top_k_logits(logits, k):
    v, ix = torch.topk(logits, k)
    out = logits.clone()
    out[out < v[:, [-1]]] = -float('Inf') # changed from 1e-10
    return out

I was using the play_char notebook to train against the IMDB dataset, but was getting really terrible samples out of it after training unless I set the temperature very low. Looking into the sampling code, I noticed the odd choice of 1e-10 in top_k_logits. It seemed odd since most logits are negative, so clamping to 1e-10 may actually make many characters more probable, not less probable (or impossible). Replacing it with negative infinity vastly improved sampling for me. A demonstration follows below. I'm happy to open a pull request, just let me know.
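A tiny standalone illustration of the effect (made-up logits, top_k threshold of 0.5), separate from the notebook demo below:

import torch
import torch.nn.functional as F

logits = torch.tensor([[-2.0, -1.0, 0.5, 3.0]])
bad = logits.clone()
bad[bad < 0.5] = 1e-10            # "masked" logits are *raised* toward 0
good = logits.clone()
good[good < 0.5] = -float('Inf')  # masked logits get exactly zero probability
print(F.softmax(bad, dim=-1))     # masked characters still get noticeable mass
print(F.softmax(good, dim=-1))    # masked characters get 0.0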

Demo Code:

print("Original top_k:")
for _ in range(10):
    context = "O God  O God!"
    x = torch.tensor([train_dataset.stoi[s] for s in context], dtype=torch.long)[None,...].to(trainer.device)
    y = sample(model, x, 200, temperature=0.9, sample=True, top_k=5)[0]
    completion = ''.join([train_dataset.itos[int(i)] for i in y])
    print(completion)
    print()

print("\n\n\nModified top_k:")
for _ in range(10):
    context = "O God  O God!"
    x = torch.tensor([train_dataset.stoi[s] for s in context], dtype=torch.long)[None,...].to(trainer.device)
    y = my_sample(model, x, 200, temperature=0.9, sample=True, top_k=5)[0]
    completion = ''.join([train_dataset.itos[int(i)] for i in y])
    print(completion)
    print()

Output:

Original top_k:
O God  O God!  The DANGER of TERROR RATS GARDEN, 1959 5 79. The movie starts vZERO on quite 

ewtGPX79 end24? Was ChaveBoQwell 6 deserves gKX02 80101 HOURS OF THE WORST POSSIBLE SHOW ON THE LAST? Was 4 7 ratings o

O God  O God!" 9 !!! :  ? ;  #I stumbled aJes DansUn0 JeUS into the 1930 ; So'Pa#AeF7 7 I was so judged.' I'm not sure what tJ2" , The Harri4 X YeR S2ND  20010  Jackson 3 X 44 LD 703010. AT THE TRAIN   THE TRIED O

O God  O God! Don't 5 Dw8becks!#Thekno84 #Watched this movie, like great zombie 30's year Ze7 30 AQUATER EVIL ;1958 tSiff and 3.59Ja :  S47 episodes a5L60 Years and I didn't own REAL WAY very much 'squatreWeLs of 

O God  O God! DoWn Knox  has a WaSTE OF THE STAR TIME!!!  has a stellar point VHS ! No! knowing the 6 Hours.#I Am not The Best Picture or knowing #This Way ToP" do5 POST was a

1970D Very beautifulHbO" Midnight in

O God  O God!! JQ DfH lCnah just a; x's 80X year old Diw4od jokes z? fFf0 to video America4 Z Cou#This was onRIGHT 60 and by far the most past P!! 9 6 I was NOT the, and ;princeOus which made X Le4crophiles pick u

O God  O God!  could ma's 

? Some of you #I , Thed7R 3 not#I really enjoyed this movie how long I didn't...U: The x Newsu? It was '2001'.sLugQleJKxliE : Z63 piece

A 00 ,Vef BoIIled BX 700022 ,: artM dodgy788 pro

O God  O God! Tockzie both , you 8, tAUgues and the...just rjaZzro#; Finally Unstained #I'm 5 SR Movies 8PM keeping ! F" movies Thne? Firstly, I caught this film OK, 4. I QUEST 6 and struck to 9.j. ;3 Out of 10.  

O God  O God!!! a4. I VIETE p6H! iW's not a5 years old as daggerMib0kfijijSS noO.diedk#Wek4 :X m, because the actEMPTION AND The resolution Nazi LiRA ! When seeing Daniels kUNG! , DrNNG of Friends , "The Black Clu

O God  O God!!!! Very lUckK!#IN THE SECOND MANNERS #Izo ons61 is a hiQ :  u havegottCh?  Actually, bI'm the only one xing 50's Naked, 'Calpda7 kids1...iB has watched it WITTY THE BAD MOVIE withvy HIGHER 60WOODS 60

O God  O God!  Un35 U4 END THIS MOVIE . The USA ...Bw??? The sKW, X 1954, 1983I's BETTER....8 10 comes kids.Unless you're not 9 out of 10.!!!QUALITIES!!!!? just when it you a5PM now!#This coulder DVD ;Iun75 was #I




Modified top_k:
O God  O God!!  This is a mediocre film! I was wondering if you are that big fan of the show that you will like to watch.#I saw this movie and I had to be surprised by all the gay movies that take my excellent pla

O God  O God!!  Then, in fact, the movie dies work well together. The plot is all too bad. It is nice to see the movie without a lot of standards to change these comedies, but that is a masterpiece to the film's c

O God  O God!!!!! This is not a good film. It is a good movie, which is no established as a movie with a childhood that is a story that is no more sub than a horror film.#With the best part of the movie the words,

O God  O God!!!!!! ! ! That movie would be bad every silent and the one thing is so good... it is not the best. it's a simple show..it's not a better film. and there is one reason that the child's plots holes are 

O God  O God!

Another reviewer will say that this film is a big fan of my favorite actors. It was too long. I can't really believe the film together but the movie was a stinker. And the script also had always bee

O God  O God!  and then the scary plus in this movie with a bunch of serious comics they are not even mentioned in the cast. I have not seen summarise that they were not expecting a gun or a bad story. The plot wa

O God  O God!  They don't live them up a little bit, and there was much more to tell them. I was also the waters are all over again. They were so cool by the end of the movie because I don't really know what to do

O God  O God!!   And when I was a fan of Sonny Bruce and I was inspired by the scene where she is so stupid and not to mention these scenes with Sanji and her parents trying to stand out, she stays away for the so

O God  O God!  Another show is a meal through her mother and she was tried to be somewhat of her brain.

I highly disagree with her. I didn't even go back to the energy of the movie, and in the movie that it was m

O God  O God!!!!!!!! !!!!!!!!?  THIS SHOUT OF HER OUT!!!! . It is a great film. It is not a good movie but not a good film. I have no idea of hearing the man in a comet. It is a great movie to be a star. I would r

EDIT: Slight addendum: I just have to say how impressive the results of this model are with the fixed sampling, given that I only trained it for a few hours on a 2070.

closed time in a month

fpgaminer

issue commentkarpathy/minGPT

Possible Improvement to top_k_logits

I believe https://github.com/karpathy/minGPT/commit/8909e1b646d6fd5235ec33259fb22fdc2c91037c is the fix, ty.

fpgaminer

comment created time in a month

push eventkarpathy/minGPT

Andrej Karpathy

commit sha 8909e1b646d6fd5235ec33259fb22fdc2c91037c

fix a dumb bug, intended to use -1e10 instead of 1e-10. thank you @fpgaminer for spotting and bringing to my attention

view details

push time in a month

issue commentkarpathy/minGPT

Possible Improvement to top_k_logits

Omg this is a bug, I'm pretty sure I meant to use -1e10 instead of 1e-10. Nice find thank you!

fpgaminer

comment created time in a month

pull request commentkarpathy/minGPT

automatic mixed precision training

Thank you, half precision training was def on my list of todos, will take a look when I get a chance!

nebw

comment created time in a month

issue commentkarpathy/minGPT

Question about the CharDataset

Yep! This is exactly what I meant by "over sequence length" when I say "being clever with batching (both across examples and over sequence length) so that training is efficient." in the readme. We're amortizing the forward pass to the backward pass at each point in the sequence as well, but with a reduced context. This is helpful when the test time prompt is shorter than block size, as an example, because the model is trained to work for variably sized context windows.
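A minimal sketch of what "over sequence length" means here (toy data, block_size of 8): one chunk of block_size + 1 characters yields block_size training examples, each with a different context length:

block_size = 8
chunk = "hello wor"                      # block_size + 1 characters
x, y = chunk[:-1], chunk[1:]
for t in range(block_size):
    context, target = x[:t + 1], y[t]
    print(f"given {context!r} -> predict {target!r}")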

gcoter

comment created time in a month

pull request commentkarpathy/minGPT

Porting minGPT to Hydra

Whoa. I certainly won't be able to merge this because it would violate the "min" in minGPT, but it's fun to see!

omry

comment created time in a month

issue closedkarpathy/minGPT

Any experiment results?

Does this project reproduce the results of papers?

Thank you very much.

closed time in a month

guotong1988

issue commentkarpathy/minGPT

Any experiment results?

Please read the README file.

guotong1988

comment created time in a month

issue commentkarpathy/minGPT

What hardware is supported?

Decrease the batch size until it runs. E.g. halve it.

jtrakk

comment created time in a month

PublicEvent

issue openedteddykoker/image-gpt

hidden size of mlp layer is 4X the bottleneck size

https://github.com/teddykoker/image-gpt/blob/92ce11653864730263a161b642dd1b5387f07381/src/gpt2.py#L13

Btw, the GPT-3 paper, as well as its older versions, says: "we always have the feedforward layer four times the size of the bottleneck layer, dff = 4 ∗ dmodel". So this Linear should probably have 4*embed_dim as its second argument.

(I'm also not certain that the PyTorch MultiheadAttention has a linear projection at the end of it, the way OpenAI uses "c_proj" in their code.)
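A minimal sketch of the transformer MLP block the issue is describing, with the feedforward hidden width at 4 * d_model and a projection back down (analogous to OpenAI's c_proj); the sizes are illustrative:

import torch.nn as nn

n_embd = 512
mlp = nn.Sequential(
    nn.Linear(n_embd, 4 * n_embd),   # expand to 4x the bottleneck size
    nn.GELU(),
    nn.Linear(4 * n_embd, n_embd),   # project back down (c_proj-style)
)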

created time in a month

more