If you are wondering where the data on this site comes from, please visit https://api.github.com/users/jaketae/events. GitMemory does not store any data; it only uses NGINX to cache data for a period of time. The idea behind GitMemory is simply to give users a better reading experience.

bigscience-workshop/Megatron-DeepSpeed 55

Ongoing research training transformer language models at scale, including: BERT & GPT-2

jaketae/koclip 50

KoCLIP: Korean port of OpenAI CLIP, in Flax

bigscience-workshop/evaluation 29

Code and Data for Evaluation WG

jaketae/g-mlp 16

PyTorch implementation of Pay Attention to MLPs

jaketae/param-share-transformer 15

PyTorch implementation of Lessons on Parameter Sharing across Layers in Transformers

jaketae/fnet 12

PyTorch implementation of FNet: Mixing Tokens with Fourier transforms

jaketae/deep-malware-detection 11

A neural approach to malware detection in portable executables

jaketae/mlp-mixer 11

PyTorch implementation of MLP-Mixer: An all-MLP Architecture for Vision

jaketae/image-classifier 3

Image classifier web application based on MobileNet, built using Flask, TensorFlow, and Matplotlib

jaketae/realformer 3

PyTorch implementation of RealFormer: Transformer Likes Residual Attention

issue comment bigscience-workshop/Megatron-DeepSpeed

[tensorboard] log grad norm for individual layers

Am I correct that the grad norms come from DeepSpeed? I did a quick scan, and it ultimately boils down to:

grad_norm = model[0].get_global_grad_norm()

And model[0] is an instance of deepspeed.PipelineEngine, which inherits from deepspeed.DeepSpeedEngine. This class pulls the value returned by get_global_grad_norm from its optimizer:

if hasattr(self.optimizer, '_global_grad_norm'):
    self._global_grad_norm = self.optimizer._global_grad_norm

But I got lost after this. I assumed that the optimizer we're using is from apex, but at least based on my cursory search of the apex repo, I wasn't able to verify that apex optimizers have the _global_grad_norm attribute. Is this the right direction, or am I missing something?
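For what it's worth, here is a minimal sketch (plain PyTorch, not the DeepSpeed or Megatron code path) of what logging per-layer grad norms to tensorboard could look like; the writer and log directory names are placeholders:

import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/grad_norms")  # placeholder log directory

def log_layer_grad_norms(model: torch.nn.Module, iteration: int) -> None:
    # Record the L2 norm of each parameter's gradient under its own tag.
    for name, param in model.named_parameters():
        if param.grad is not None:
            writer.add_scalar(f"grad_norm/{name}", param.grad.norm(2).item(), iteration)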

stas00

comment created time in 19 hours

issue opened jaketae/wordwise

GitHub templates

Add GitHub templates for PRs and issues.

created time in a day

issue opened jaketae/wordwise

Leverage GPU

Currently, all computations are done on the CPU. Allow users to specify the device to speed up computation when using larger models. Note that this might mean replacing sklearn's cosine similarity function with PyTorch's.
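As a rough sketch (the function and argument names here are hypothetical, not the current wordwise API), the PyTorch version could look something like this:

import torch
import torch.nn.functional as F

def cosine_scores(doc_emb: torch.Tensor, cand_embs: torch.Tensor, device: str = "cuda") -> torch.Tensor:
    # Move both embeddings to the requested device before computing similarities.
    doc_emb = doc_emb.to(device)
    cand_embs = cand_embs.to(device)
    # Cosine similarity between the document embedding and each candidate embedding.
    return F.cosine_similarity(doc_emb.unsqueeze(0), cand_embs, dim=-1)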

created time in a day

push event jaketae/wordwise

Alex Saad

commit sha 53ea0e2b584c8f052677eca5648b1298e129a04f

Change BERT model to all-MiniLM-L6-v2

view details

Alex Saad

commit sha f650ae6c47a8b3469bce74fde22b5d705418c2e0

Truncate tokenization to prevent IndexError

view details

Alex Saad

commit sha 484d164e66ad05b869c57759e53ce01b4cd3b0af

Revert "Truncate tokenization to prevent IndexError" This reverts commit f650ae6c47a8b3469bce74fde22b5d705418c2e0.

view details

Alex Saad

commit sha 760fb7b00240bb9fd3bd95c77c926ba8108ea108

Change BERT model to all-MiniLM-L12-v2

view details

Jake Tae

commit sha 8a07d5997aa92114dda748ace36fca70f827b962

Merge pull request #6 from xesaad/asaad_bert_model Update BERT model

view details

push time in a day

PR merged jaketae/wordwise

Update BERT model

Context

The default Sentence-BERT model used by the Extractor class is deprecated.

After some exploring, I found that a good alternative model is sentence-transformers/all-MiniLM-L6-v2. The embedding space dimension is half that of the previous default sentence-transformers/distilbert-base-nli-stsb-mean-tokens, which makes it significantly faster.

The top performing Sentence-BERT model (according to Sentence Transformers) is sentence-transformers/all-mpnet-base-v2, which could be used instead. Note, however, that its runtime is about 5 times longer than that of sentence-transformers/all-MiniLM-L6-v2.

Contribution

This PR changes the default BERT model to sentence-transformers/all-MiniLM-L6-v2.
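For illustration only (not taken from the wordwise source), loading and using the new default looks roughly like this:

from sentence_transformers import SentenceTransformer

# Swap in "sentence-transformers/all-mpnet-base-v2" for higher quality at roughly 5x the runtime.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(["keyword extraction from long documents"])
print(embeddings.shape)  # (1, 384): half the 768-dim embeddings of the old default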

+1 -1

3 comments

1 changed file

xesaad

pr closed time in a day

pull request comment jaketae/wordwise

Update BERT model

Thanks for updating the default model. LGTM!

xesaad

comment created time in a day

issue comment jaketae/wordwise

Question about removing specific words from output

I'm glad you like WordWise and found it useful! I definitely agree that it's always better to build things up incrementally, in piecemeal fashion.

The PRs are great! I personally think the whole point of open source is to accept contributions and feedback from others, so I'm glad you took the time to open them. Appreciate your time and effort!

xesaad

comment created time in a day

issue comment jaketae/wordwise

Question about removing specific words from output

Hey @xesaad, sorry I wasn't able to look into this. I assume we can accept an additional parameter (maybe blacklist), store it as a class variable in the Extractor class, and remove the blacklisted words from the final output. What requires more thinking is whether we want the top_k parameter to always be respected.

I'm a full-time student and admittedly it's difficult to make time during the semester, but I'd be more than happy to reopen this issue and discuss it further with you either here or in a separate PR!
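A rough sketch of that idea (the parameter and method names here are hypothetical, not the actual wordwise API):

class Extractor:
    def __init__(self, blacklist=None):
        # Hypothetical parameter: words that should never appear in the output.
        self.blacklist = set(blacklist or [])

    def filter_keywords(self, keywords, top_k):
        # Drop blacklisted words first, then trim so top_k is still respected.
        kept = [word for word in keywords if word not in self.blacklist]
        return kept[:top_k]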

xesaad

comment created time in a day

issue comment jaketae/wordwise

ModuleNotFoundError: No module named 'core'

Closed via #3.

DrRaja

comment created time in a day

issue closed jaketae/wordwise

ModuleNotFoundError: No module named 'core'

Hi, I was trying to test your package, but I keep getting the error ModuleNotFoundError: No module named 'core' when I try to import the extractor.

Any ideas?

closed time in a day

DrRaja

issue closed jaketae/wordwise

ValueError related to nlp.max_length: wordwise 0.0.4

I was trying out the library and ran into the following error:

ValueError: [E088] Text of length 4290144 exceeds maximum of 1000000. The parser and NER
models require roughly 1GB of temporary memory per 100,000 characters in the input. This
means long texts may cause memory allocation errors. If you're not using the parser or NER,
it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters,
so you can check whether your inputs are too long by checking `len(text)`.

Just wondering where in the code to fix nlp.max_length.
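For reference, raising the limit in spaCy generally looks like the sketch below (the model name and text are just placeholders); whether wordwise should expose this is a separate question:

import spacy

# Disabling the parser and NER makes it safe to raise the limit, per the error message above.
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
nlp.max_length = 5_000_000

long_text = "..."  # placeholder for the 4M+ character input
doc = nlp(long_text)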

closed time in a day

gsalfourn

issue comment jaketae/wordwise

ValueError related to nlp.max_length: wordwise 0.0.4

Closed via #7 for now.

gsalfourn

comment created time in a day

pull request comment jaketae/wordwise

Update BERT model

Hello @xesaad, thanks for looking into this!

I'm looking at this table, and I'm wondering whether the default should be all-MiniLM-L12-v2 instead of L6. DistilRoBERTa is also an option, but it seems to be slightly larger and slower. Ultimately it's a matter of tradeoff, but I'm curious to hear what you think.

xesaad

comment created time in a day

push event jaketae/wordwise

Alex Saad

commit sha a5f7f5ace457aac48a3c4bc0ee55fed92ea2bf67

Truncate tokenization to model_max_length to prevent IndexError

view details

Jake Tae

commit sha 04e60fa5f82847890d62cc95a204227d83b2f9e1

Merge pull request #7 from xesaad/asaad_truncation Truncate tokenization

view details

push time in a day

PR merged jaketae/wordwise

Truncate tokenization

Context

As discussed in this issue, an IndexError can be raised if too many tokens are passed to the BERT model at once. One initial solution is to truncate the tokenized output to the maximum sequence length.

Contribution

Truncated the tokenized output. I think the tokenizer will automatically truncate to the max length of the associated BERT model, but for clarity (and to allow users to look this up within the code) we can also get the max length from the attribute self.tokenizer.model_max_length.
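A minimal sketch of the truncation idea with a Hugging Face tokenizer (the model name is just an example, not necessarily the one wordwise uses):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
text = "..."  # placeholder for the document being embedded
encoded = tokenizer(
    text,
    truncation=True,                        # drop tokens past the limit
    max_length=tokenizer.model_max_length,  # explicit, so readers can see where the limit comes from
    return_tensors="pt",
)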

+7 -1

1 comment

1 changed file

xesaad

pr closed time in a day

pull request comment jaketae/wordwise

Truncate tokenization

Thank you for looking into this. LGTM!

xesaad

comment created time in a day

started hyunwoongko/large-scale-lm-tutorials

started time in 2 days

Pull request review comment bigscience-workshop/Megatron-DeepSpeed

Fix glu activation

     def __init__(self, init_method, output_layer_init_method):
         super(ParallelMLP, self).__init__()
         args = get_args()

-        # Project to 4h.
+        # Project to ffn_hidden_size
         self.dense_h_to_4h = mpu.ColumnParallelLinear(
             args.hidden_size,
-            args.ffn_hidden_size,
+            # GLU is a special activation that divides the dimension by a factor 2.
+            2 * args.ffn_hidden_size if args.glu_activation else args.ffn_hidden_size,

The most salient change seems to be this, and this modification is necessary because GLU activations halve the hidden size. Am I seeing this right? (Other changes are of course important as well, but they seem to concern arguments and/or testing.)
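For context, a tiny illustration (not the Megatron code) of why the factor of 2 appears: a GLU-style activation splits its input in half along the last dimension and uses one half to gate the other, so the output has half as many features as the input.

import torch

def glu(x: torch.Tensor) -> torch.Tensor:
    # x: (..., 2 * ffn_hidden_size) -> output: (..., ffn_hidden_size)
    a, b = x.chunk(2, dim=-1)
    return a * torch.sigmoid(b)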

thomasw21

comment created time in 2 days

PullRequestReviewEvent

PullRequestReviewEvent

delete branch bigscience-workshop/Megatron-DeepSpeed

delete branch : style-formatter

delete time in 5 days

push event bigscience-workshop/Megatron-DeepSpeed

Jake Tae

commit sha d0799e7c451fb810c522475c03d209a113be36f7

Configure code style formatters (#130)
* chore: add formatter config files, add make cmds
* feature: add make help cmd
Co-authored-by: Jake Tae <>

view details

push time in 5 days

PR merged bigscience-workshop/Megatron-DeepSpeed

Configure code style formatters

This PR fixes #129 by implementing the following:

  • Add black and isort to dependencies
  • Add formatting Make command
  • Add pyproject.toml and setup.cfg for consistent formatting options
+40 -3

0 comments

4 changed files

jaketae

pr closed time in 5 days

issue closed bigscience-workshop/Megatron-DeepSpeed

Consistency in code convention

As discussed in #128, maintain code style consistency by setting up some basic formatters.

  • Makefile
  • pyproject.toml
  • setup.cfg

Benchmark HF transformers as a point of reference.

closed time in 5 days

jaketae

push event bigscience-workshop/Megatron-DeepSpeed

Jake Tae

commit sha 23dded0831d5e31585809457e4b9facc5ac22b5e

Save tokenizer in conversion script (#128)
* feature: save tokenizer based on script args
* chore: use none instead of empty str for consistency
* fix: rm duplicate args, save `tokenizer_class` key
* Update tools/convert_checkpoint/deepspeed_to_transformers.py
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Jake Tae <>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

view details

Jake Tae

commit sha b416c8d05fcb2c87f27aea9edcef2fd9ac779896

fix: only trigger ci on .py file changes (#131) Co-authored-by: Jake Tae <>

view details

Conglong Li

commit sha 0035f06a0ee6df41156ec60fd9c2664d2643deed

Curriculum learning support (#132)
* CL initial commit
* CL+PP support
* update
* Apply suggestions from code review
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
* apply code review comments
* make it easier to read large numbers
* add a cl test
* apply review comments
* Update examples/curriculum_learning/README.md
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
* update
* fix
* new requirement
* Update megatron/learning_rates.py
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
* Update megatron/learning_rates.py
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
* fix samples and tokens - thank you Conglong
* fix truncation
* switch to deepspeed@master
* extend the doc
* Trigger CI
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas@stason.org>

view details

Stas Bekman

commit sha b5098e68ddca323d8adf14d2a8c53d132e2c1dda

[CL] fix default placement (#133)

view details

Thomas Wang

commit sha da31db6431b3f3b93da75b3d0d48753e0826ecd8

Fix deepspeed prefix-lm (#107)
* Fix pretrain prefix lm using deepspeed
* Fix: self._args to args
* First set attn_mask in model and then build model
* Fix: enforce that we pass down tuple instead of generator
* Attention mask does not need to be transposed
* BIGGEST HACK EVER
* Remove BIGGEST HACK
* Skip prefix test as PP>1 doesn't work yet on deepspeed
* Unskip prefix test
* Merge branch 'main' into thomas/fix_deepspeed_prefix

view details

Stas Bekman

commit sha 93ab439ab6cea9df9d9248e689e063f50323879c

[codecarbon] switch to master (#135) The frozen sha is breaking the test suite, trying with master.

view details

Stas Bekman

commit sha e1574ce34b98f2a7d8cbcac51ae185d6a28a5207

[codecarbon] switch to master (#135) The frozen sha is breaking the test suite, trying with master.

view details

Stas Bekman

commit sha 162ffdb0cc156962cf195e429558619d9802524d

run on pull_request branch (#141)

view details

Stas Bekman

commit sha 3b22aa212698014aae8559700cf1b2abac149b74

print number of params only on rank 0 (#140)
* print number of params only on rank 0
* update requirements
* failing test with 4gpus, try 2 gpus

view details

Jake Tae

commit sha 4046aa68f9719b58c07cb265c25de32df5b2dc6e

chore: add formatter config files, add make cmds

view details

Jake Tae

commit sha d00c9d67aa3986b97112e9ac6e6a0ee941592903

feature: add make help cmd

view details

push time in 5 days

Pull request review comment bigscience-workshop/Megatron-DeepSpeed

adding scalenorm, attention_init_method and relu^2

 def custom_forward(*inputs):

 def set_input_tensor(self, input_tensor):
     """Set input tensor to be used instead of forward()'s input.
-

FYI hopefully #130 will resolve issues like this!

ontocord

comment created time in 5 days

PullRequestReviewEvent

Pull request review comment bigscience-workshop/Megatron-DeepSpeed

Configure code style formatters

-.PHONY: test
+.PHONY: test style

-# Run tests for the library
+check_dirs := tests tools/convert_checkpoint

+# this target tests for the library

Sorry for getting back to you late. I checked out the link and it's very cool! I copied and pasted the help part, and it seemed to work as intended. I've somewhat simplified the comments, so now it gives:

(base) jt856@ziva:~/bigscience/Megatron-DeepSpeed$ make

Usage:
  make <target>
  help                    this help
  test                    run tests
  style                   checks for code style and applies formatting
jaketae

comment created time in 5 days