If you are wondering where the data of this site comes from, please visit https://api.github.com/users/navdeepkk/events. GitMemory does not store any data, but only uses NGINX to cache data for a period of time. The idea behind GitMemory is simply to give users a better reading experience.
Navdeep Kumar · navdeepkk · Indian Institute of Science, Bangalore · Bangalore, India

akshaybaviskar/LeetCode 0

LeetCode practice problems

navdeepkk/academicpages.github.io 0

Github Pages template for academic personal websites, forked from mmistakes/minimal-mistakes

navdeepkk/ACM-ICPC-Algorithms 0

Algorithms used in Competitive Programming

navdeepkk/algorithms_and_data_structures 0

180+ Algorithm & Data Structure Problems using C++

navdeepkk/C-Plus-Plus 0

All Algorithms implemented in C++

navdeepkk/create_ap 0

[NOT MAINTAINED] This script creates a NATed or Bridged WiFi Access Point.

navdeepkk/cs344 0

Introduction to Parallel Programming class code

push event navdeepkk/dotfiles

Navdeep Kumar

commit sha 8a89f0fcdd2e4699a62434f592df1e8794d5199e

Update vimrc to include vimtex conf

view details

push time in 24 days

fork navdeepkk/vimtex

VimTeX: A modern Vim and neovim filetype plugin for LaTeX files.

fork in 25 days

push event navdeepkk/cuda-pointwise

Navdeep Kumar

commit sha 0e0f260b28bfa9c1df8177c6ba806cc23f1def46

verifying tests added

view details

push time in a month

push event navdeepkk/cuda-pointwise

navdeepkk

commit sha da517fa04fa24715ee0cb3bb740962c07bd0a2bc

add relu_gemm prologue

view details

push time in a month

push event navdeepkk/cuda-pointwise

Navdeep Kumar

commit sha 861cc7fccbfea07f156181f1d2d65d12f30b4298

correction in second test

view details

push time in a month

push event navdeepkk/cuda-pointwise

Navdeep Kumar

commit sha ca8d6dce89e3922a5931c75c50bc25729b85afc6

addition of new scripts and gencode option

view details

push time in a month

push event navdeepkk/cuda-pointwise

Navdeep Kumar

commit sha b81badcbb224a4c3e061fb6c01710b7dfaa41378

correction in relu

view details

push time in a month

create branch navdeepkk/cuda-pointwise

branch : master

created branch time in a month

created repository navdeepkk/cuda-pointwise

created time in a month

push event navdeepkk/CUDALibrarySamples

navdeepkk

commit sha fd5685c44bb595aaf53809ed8992e7828ebd0fb9

add profiling scripts

view details

push time in a month

create branch navdeepkk/CUDALibrarySamples

branch : testing

created branch time in a month

issue comment openai/triton

Poor matmul-square-nn bench performance

@daadaada oh okay. Were they different for Volta? Also, are these swizzling functions standard? Is there a reference I can look at? Thanks!

axeln31459

comment created time in a month

issue comment openai/triton

Poor matmul-square-nn bench performance

Hi @daadaada, thanks for the reply. Please correct me if I am wrong. Doesn't the swizzling of data in shared memory depend on the PTX instruction that will finally use the data? Each PTX instruction may have a different thread-to-data mapping, and to get minimum bank conflicts the swizzling function might also need some tweaking, i.e., a swizzling function tuned for one PTX instruction may not result in the same bank conflicts for a different PTX instruction. Is that correct?

Thanks!

axeln31459

comment created time in a month
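The bank-conflict concern in the comment above can be illustrated with a small model. The sketch below assumes NVIDIA's 32 four-byte shared-memory banks and uses a simple XOR-with-row swizzle as a hypothetical example; it is not Triton's actual swizzling function, only a demonstration of why column accesses into a row-major tile conflict and how a swizzle spreads them across banks.

```python
# Model shared memory as 32 four-byte banks. A warp of 32 threads
# reads one float from each of 32 consecutive rows of a 32x32 tile
# (a column access), with and without a hypothetical XOR swizzle.
BANKS = 32
ROWS = COLS = 32

def bank(row, col, swizzle=False):
    # Hypothetical swizzle: XOR the column index with the row index,
    # permuting elements within the row so columns fan out over banks.
    if swizzle:
        col = col ^ (row % BANKS)
    addr = row * COLS + col          # word index in shared memory
    return addr % BANKS

def max_conflicts(col, swizzle):
    # Thread t of the warp reads element (t, col); count how many
    # threads land in the most-contended bank.
    hits = {}
    for t in range(32):
        b = bank(t, col, swizzle)
        hits[b] = hits.get(b, 0) + 1
    return max(hits.values())

print(max_conflicts(0, swizzle=False))  # 32 (fully serialized access)
print(max_conflicts(0, swizzle=True))   # 1 (conflict-free)
```

This also hints at the point being made in the comment: a swizzle that is conflict-free for one access pattern (here, a column read) is only guaranteed for that pattern, so a different thread-to-data mapping may need a different swizzling function.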

issue comment openai/triton

Poor matmul-square-nn bench performance

Hi all, @daadaada. I am trying to understand what is meant by adding support for Turing in Triton. Do you mean that you have to introduce the PTX instruction in Triton for code generation? Does this also mean that you will have to implement the functions that load data from shared memory into registers using some modified ldmatrix instruction? Will you have to swizzle data in shared memory buffers to prevent bank conflicts? What about the WMMA API?

Is adding the inline PTX instruction the only thing, or is it much more than that?

Thanks!

axeln31459

comment created time in a month
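For context on the ldmatrix question above: for an 8x8 f16 tile, ldmatrix takes a row address from each participating thread and leaves every thread of the warp holding one 4-byte pair of the tile. The sketch below models the per-thread fragment layout for the m8n8 shape as documented in the PTX ISA (thread t holds row t // 4, columns 2*(t % 4) and 2*(t % 4) + 1); treat it as an illustrative mapping, not generated code.

```python
# Sketch of the per-thread fragment layout produced by an 8x8 f16
# ldmatrix: thread t of the warp ends up holding the two f16
# elements at (t // 4, 2*(t % 4)) and (t // 4, 2*(t % 4) + 1).
def fragment_elements(t):
    row = t // 4
    col = 2 * (t % 4)
    return [(row, col), (row, col + 1)]

# Collect every (row, col) pair claimed by the 32 threads and check
# that together they cover the full 8x8 tile exactly once.
covered = [e for t in range(32) for e in fragment_elements(t)]
assert sorted(covered) == [(r, c) for r in range(8) for c in range(8)]
```

This is exactly the kind of thread-to-data mapping the earlier swizzling comment refers to: the shared-memory layout has to be chosen so that these per-thread reads avoid bank conflicts.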

issue comment openai/triton

Testing triton when C matrix has some initial values. Possible?

Oh okay. I have some generated matmul kernels that load C from global memory, and I wanted to compare them against Triton. I'll get on that when this is fixed. Thanks!

navdeepkk

comment created time in a month

issue comment openai/triton

Testing triton when C matrix has some initial values. Possible?

Thanks! I'll try this.

navdeepkk

comment created time in a month

issue comment openai/triton

Testing triton when C matrix has some initial values. Possible?

@ptillet any help would be appreciated. Thanks!

navdeepkk

comment created time in 2 months

issue opened openai/triton

Testing triton when C matrix has some initial values. Possible?

Hi, is it possible to benchmark Triton when the C matrix is not assumed to be all zeros, i.e., when it is loaded from global memory rather than initializing the accumulator tile with zeros?

Thanks!

created time in 2 months
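The benchmark the issue asks about corresponds to the full GEMM update D = A @ B + C, where the accumulator starts from values loaded for C instead of zeros. A minimal pure-Python sketch of those semantics (not Triton's API, just the reference computation a kernel like this would be checked against):

```python
# Plain GEMM semantics with a non-zero initial C: each output
# element starts from the loaded C value rather than from zero.
def gemm(A, B, C):
    m, k = len(A), len(A[0])
    n = len(B[0])
    # Copy C so the accumulator tile begins with its values.
    D = [row[:] for row in C]
    for i in range(m):
        for j in range(n):
            for p in range(k):
                D[i][j] += A[i][p] * B[p][j]
    return D

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[100, 0], [0, 100]]
print(gemm(A, B, C))  # [[119, 22], [43, 150]]
```

The only difference from the usual zero-initialized benchmark is the extra global-memory load of C before accumulation, which is precisely why timing the two cases separately is interesting.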

started NVIDIA/cutlass

started time in 2 months