
push event Leslie-Fang/kaggle

Leslie-Fang

commit sha 2cc2952301f6b0c0605dfbee40e2f88cae9c1bfb

modified readme

view details

push time in 9 hours

push event Leslie-Fang/kaggle

Leslie-Fang

commit sha 945af8207779dc89720b8cd3d9dba74b1c509bec

add alex fp32/BF16 train fp32/bf16/int8 inference

view details

Leslie-Fang

commit sha f915e187cadf9952d6c1dd9d0a6cfa2ea604a6b3

Merge branch 'master' of https://github.com/Leslie-Fang/kaggle

view details

push time in 9 hours

push event Leslie-Fang/C-Solution2Leetcode

leslie-fang

commit sha 0ac54b510cef7aefb3182f853b41bacf62703d6f

647 accept

view details

leslie-fang

commit sha 8ea9ffb528fca5399f1991718f52ad9fe45fd780

560 accept

view details

push time in 2 days

started microsoft/vcpkg

started time in 4 days

issue closed horovod/horovod

load and pre-process training data when doing the sync of weight

Hi all, I am using Horovod with TF. My dataset is a little bit large, and I am using Horovod to train on a CPU cluster. After each training step, Horovod syncs the data among the workers with ring-allreduce. My question is: is it possible for a worker to load and pre-process the training data for the next step while the weights of the current training step are being synced among the workers?

closed time in 4 days

Leslie-Fang

issue comment horovod/horovod

load and pre-process training data when doing the sync of weight

@tgaddair Thanks, that's what I was looking for.

Leslie-Fang

comment created time in 4 days

issue comment horovod/horovod

How horovod loads data into memory

https://github.com/horovod/horovod/issues/1840 Thanks, this reply resolves my problem.

sudheerExperiments

comment created time in 4 days

issue comment horovod/horovod

How horovod loads data into memory

@sudheerExperiments In each step of training, each rank creates a batch from its slice of the data and calculates local gradients, and then Horovod calculates the global gradients and distributes them back to the ranks. The weights on all the ranks are updated together in synchronous steps.

Hi @abditag2, thanks for the reply. I have another question: is it possible for each rank to create a batch of data while it updates the weights? I think these two steps are independent, right? Once both steps have finished, the next step could start its forward pass.

sudheerExperiments

comment created time in 4 days
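The overlap asked about above is ordinary input-pipeline prefetching: prepare batch N+1 on a background thread while batch N is still being trained on (and its gradients allreduced). A minimal standard-library sketch of the idea, with a made-up `load_batch` loader standing in for real loading/pre-processing:

```python
import queue
import threading

def prefetching_loader(load_batch, num_batches, depth=1):
    """Yield batches while a background thread loads the next one.

    This is the idea behind overlapping data loading/pre-processing with
    the training step (e.g. tf.data's .prefetch()): batch N+1 is prepared
    while the consumer is still busy with batch N.
    """
    q = queue.Queue(maxsize=depth)

    def producer():
        for i in range(num_batches):
            q.put(load_batch(i))  # loading happens off the main thread
        q.put(None)  # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = q.get()
        if batch is None:
            return
        yield batch

# Illustration with a trivial "loader"; batches still arrive in order.
batches = list(prefetching_loader(lambda i: i * i, 4))
print(batches)  # [0, 1, 4, 9]
```

The bounded queue (`depth`) is what keeps the producer from running arbitrarily far ahead of training.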

issue opened horovod/horovod

load and pre-process training data when doing the sync of weight

Hi all, I am using Horovod with TF. My dataset is a little bit large, and I am using Horovod to train on a CPU cluster. After each training step, Horovod syncs the data among the workers with ring-allreduce. My question is: is it possible for a worker to load and pre-process the training data while the weights are being synced among the workers after each training step?

created time in 4 days

started mindspore-ai/mindspore

started time in 7 days

push event Leslie-Fang/kaggle

Leslie-Fang

commit sha 422686cd0ad7f85509f165bdfe6a3a755cb288c4

update readme

view details

push time in 7 days

push event Leslie-Fang/kaggle

Leslie-Fang

commit sha 766d565cfe5ef1cf57b2ecdf7b14287aae0a1e5a

update readme add 2 process each node 1 process

view details

push time in 7 days

issue opened tensorflow/tensorflow

Binary add op BF16 has lower performance than FP32

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): CentOS 7.6
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 2.1.0 (pip install tensorflow==2.1.0)
  • Python version:
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version:
  • GPU model and memory:

Describe the current behavior
Running a binary add op on two tensors: changing the tensor dtype to bfloat16 increases the running time by 2x.

Describe the expected behavior
Bfloat16 has lower memory consumption, so it should be a little bit quicker.

Standalone code to reproduce the issue (pip install tensorflow==2.1.0):

import tensorflow as tf
tf.compat.v1.disable_eager_execution()

if __name__ == "__main__":
  input_shape = [1, 224, 224, 1024]
  images = tf.random.uniform(input_shape, 0.0, 255.0, dtype=tf.float32, name='images')
  images2 = tf.random.uniform(input_shape, 0.0, 255.0, dtype=tf.float32, name='images2')
  images_bf16 = tf.random.uniform(input_shape, 0.0, 255.0, dtype=tf.bfloat16, name='images_bf16')
  images_bf16_2 = tf.random.uniform(input_shape, 0.0, 255.0, dtype=tf.bfloat16, name='images_bf16_2')
  mysum = tf.add(images, images2)
  mysum_bf16 = tf.add(images_bf16, images_bf16_2)
  with tf.compat.v1.Session() as sess:
    def run():
      res = sess.run(mysum)
      #print(res)
    def run2():
      res = sess.run(mysum_bf16)
    import timeit
    cost = timeit.timeit(stmt=run, number=200)
    print("fp32 cost time: {}".format(cost))
    cost_bf16 = timeit.timeit(stmt=run2, number=200)
    print("bf16 cost time: {}".format(cost_bf16))

Other info / logs

Current result:

fp32 cost time: 15.485223675030284
bf16 cost time: 30.486599242896773

created time in 10 days
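For context on what bfloat16 is (and why one would expect it to be cheap): it is simply the top 16 bits of an IEEE float32, keeping the full exponent range but only ~3 significant decimal digits. A standard-library sketch of the conversion (truncation only; hardware typically rounds to nearest even):

```python
import struct

def to_bfloat16_bits(x):
    """Take the upper 16 bits of the float32 encoding of x (truncation)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def from_bfloat16_bits(b):
    """Re-expand 16 stored bits by zero-filling the low mantissa bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x

# Same exponent range as float32, much coarser mantissa:
print(from_bfloat16_bits(to_bfloat16_bits(3.14159)))  # 3.140625
```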

started mlperf/training

started time in 11 days

push event Leslie-Fang/kaggle

Leslie-Fang

commit sha 86c83de6203b6dc58e3a7eef29dc2ddaadbc4c1a

add BF16 demo

view details

push time in 11 days

started MegEngine/MegEngine

started time in 12 days

started PaddlePaddle/Paddle

started time in 13 days

started imsheridan/DeepRec

started time in 13 days

issue closed Jittor/jittor

Failed to build from source

Thanks for open-sourcing this project. I am following the readme on the main page to build jittor from source (commit id: 5f84bf11f393ab08342caa23c416d4461ee7243b). However, I got this error message when I tested the installation:

export cc_path="/usr/bin/g++"
python -m jittor.test.test_example
....
  File "/usr/local/lib/python3.7/dist-packages/jittor/compiler.py", line 720, in check_cache_compile
    assert jit_utils.cc
AssertionError

With some preliminary debugging, I suspect the import of jit_utils_core failed: https://github.com/Jittor/jittor/blob/5f84bf11f393ab08342caa23c416d4461ee7243b/python/jittor_utils/init.py#L91-L97 I found that jit_utils_core is defined in jit_utils.cc with pybind. Any thoughts on how to debug this problem?

closed time in 14 days

Leslie-Fang

issue comment Jittor/jittor

Failed to build from source

Thanks for the quick response. It solves the problem.

Leslie-Fang

comment created time in 14 days

issue opened Jittor/jittor

Failed to build from source

Thanks for open-sourcing this project. I am following the readme on the main page to build jittor from source (commit id: 5f84bf11f393ab08342caa23c416d4461ee7243b). However, I got this error message when I tested the installation:

export cc_path="/usr/bin/g++"
python -m jittor.test.test_example
....
  File "/usr/local/lib/python3.7/dist-packages/jittor/compiler.py", line 720, in check_cache_compile
    assert jit_utils.cc
AssertionError

With some preliminary debugging, I suspect the import of jit_utils_core failed: https://github.com/Jittor/jittor/blob/5f84bf11f393ab08342caa23c416d4461ee7243b/python/jittor_utils/init.py#L91-L97 I found that jit_utils_core is defined in jit_utils.cc with pybind. Any thoughts on how to debug this problem?

created time in 14 days

started Jittor/jittor

started time in 14 days

started tensorflow/model-optimization

started time in 14 days

issue comment tensorflow/tensorflow

Python crashes when computing max of a complex tensor

Hi @zaccharieramzi, thanks for the suggestions. I think this PR (https://github.com/tensorflow/tensorflow/pull/37483) has been merged. Please take a look when feasible.

zaccharieramzi

comment created time in 14 days

issue closed horovod/horovod

Example to run training on multi-sockets CPU with TF2.0

Environment:

  1. Framework: (TensorFlow, Keras, PyTorch, MXNet) TF2.0
  2. Framework version:
  3. Horovod version: 0.19.0
  4. MPI version: OPENMPI 3.0.4
  5. CUDA version:
  6. NCCL version:
  7. Python version:
  8. OS and version:
  9. GCC version:

Checklist:

  1. Did you search issues to find if somebody asked this question before?
  2. If your question is about hang, did you read this doc?
  3. If your question is about docker, did you read this doc?
  4. Did you check if you question is answered in the troubleshooting guide?

Your question: Is there an example of using Horovod to run training on multi-socket CPUs with TF2.0? I mean running each training instance on one CPU socket.

closed time in 15 days

Leslie-Fang

issue comment horovod/horovod

Example to run training on multi-sockets CPU with TF2.0

Thanks all for the valuable comments.

Leslie-Fang

comment created time in 15 days

issue closed tensorflow/hub

Support for SSD-Mobilenet fine-tune

Dear tensorflow-hub developers, as frequently reported in https://github.com/tensorflow/models/issues/4881, the object detection API (https://github.com/tensorflow/models/tree/master/research/object_detection) is not stable. Could we support object detection fine-tuning in TensorFlow Hub?

closed time in 15 days

Leslie-Fang

issue comment tensorflow/hub

Support for SSD-Mobilenet fine-tune

Closing; keeping an eye on https://github.com/tensorflow/hub/issues/509

Leslie-Fang

comment created time in 15 days

push event Leslie-Fang/tensorflowjs_demo

Leslie-Fang

commit sha 47ba66a8de62c642c921a4d7f91b681c0b5502f4

add image bbox plot

view details

Leslie-Fang

commit sha 789c3ad8dafe2d4cc7fcf02b4cc42565d1f41a84

add plot text

view details

push time in 16 days

push event Leslie-Fang/models-1

Leslie-Fang

commit sha f595f4bcf7e49b5098a9f9e8877ce40f1748bd2f

add jayz demo

view details

push time in 16 days

push event Leslie-Fang/kaggle

Leslie-Fang

commit sha 9d9047995cb33ed1fea330de45ed23d8abaca96f

add ssd jayz tensorflowjs

view details

push time in 16 days

create branch Leslie-Fang/tensorflowjs_demo

branch : self_trained_jayz_ssd

created branch time in 16 days

push event Leslie-Fang/models-1

Leslie-Fang

commit sha 6a2138b977c6a1d6157d200ddac390075dc01bbc

add debug

view details

push time in 16 days

push event Leslie-Fang/models-1

Leslie-Fang

commit sha 3e16b7d495495078fa9d3e56e8e228aac183d292

add dataset jayz

view details

push time in 16 days

push event Leslie-Fang/kaggle

Leslie-Fang

commit sha 878382b6f25749243ee01d1b4b83062bdd53af00

add example of jayz ssd-mobilenet

view details

push time in 16 days

push event Leslie-Fang/models-1

Leslie-Fang

commit sha 28ae2a9ac193c6fb1074b08132ee1e392e081e4d

add example jayz

view details

push time in 16 days

push event Leslie-Fang/models-1

Leslie-Fang

commit sha 6dac87b41be821ff5630939ac8d561df7760c367

add csv2tfrecord

view details

push time in 16 days

create branch Leslie-Fang/models-1

branch : lesliefang/ssd

created branch time in 16 days

fork Leslie-Fang/models-1

Models and examples built with TensorFlow

fork in 17 days

push event Leslie-Fang/kaggle

Leslie-Fang

commit sha 0ea79f3bc45c629f8f5db5e17fad54fe1085a9b0

add readme for caffe 2 tf error

view details

push time in 17 days

started HiKapok/SSD.TensorFlow

started time in 17 days

push event Leslie-Fang/caffe

Leslie-Fang

commit sha 3d077c3a92eaa09c87f291c7d3d96775c70a19ba

add single image inference in coco pretrained model

view details

push time in 17 days

push event Leslie-Fang/caffe

Leslie-Fang

commit sha 9470e83348d77a9f8161ee0466c64e179a60e86c

add single image inference

view details

push time in 17 days

push event Leslie-Fang/kaggle

Leslie-Fang

commit sha 47c61f4001d0289c1c099ebf6292893c00f780c7

add build and run multi-node caffe

view details

push time in 17 days

create branch Leslie-Fang/caffe

branch : ssd

created branch time in 17 days

delete branch Leslie-Fang/caffe

delete branch : ssd

delete time in 17 days

create branch Leslie-Fang/caffe

branch : ssd

created branch time in 17 days

delete branch Leslie-Fang/caffe

delete branch : ssd

delete time in 17 days

push event Leslie-Fang/caffe

Leslie-Fang

commit sha ec71ac0cf9e1ebd56bbb101f56b26865b4d7768f

change readme

view details

push time in 17 days

push event Leslie-Fang/kaggle

Leslie-Fang

commit sha bf01813715ce1e0879d421fee5ab555f5fb8817d

add caffe training readme

view details

push time in 17 days

push event Leslie-Fang/caffe

leslie-fang-intel

commit sha 7cafd12ea636f9701e7ee6ea688058285cd48be9

add Makefile and model

view details

push time in 17 days

issue comment tensorflow/hub

Support for SSD-Mobilenet fine-tune

@rmothukuru Thanks for the reply. Yes, it is a similar question: I am looking for an ssd-mobilenet model trained on the COCO dataset for object detection. I tried https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md, but as https://github.com/tensorflow/models/issues/4881 reported, the loss is not stable when we fine-tune those models.

Leslie-Fang

comment created time in 18 days

fork Leslie-Fang/caffe

Caffe: a fast open framework for deep learning.

http://caffe.berkeleyvision.org/

fork in 18 days

issue opened tensorflow/hub

Support for SSD-Mobilenet fine-tune

Dear tensorflow-hub developers, as frequently reported in https://github.com/tensorflow/models/issues/4881, the object detection API (https://github.com/tensorflow/models/tree/master/research/object_detection) is not stable. Could we support object detection fine-tuning in TensorFlow Hub?

created time in 18 days

pull request comment tensorflow/tensorflow

fix the core_dump in reduce_max with input type tf.complex64

Hi @sanjoy, thanks for the review. Fixed the minor nits.

Leslie-Fang

comment created time in 18 days

push event Leslie-Fang/tensorflow

Leslie-Fang

commit sha ec799ba3e10e9ba1508f09c9bc0372795a28d691

fix the method name and input name

view details

push time in 18 days

pull request comment tensorflow/tensorflow

fix the core_dump in reduce_max with input type tf.complex64

Added more unsupported types.

Leslie-Fang

comment created time in 18 days

Pull request review comment tensorflow/tensorflow

fix the core_dump in reduce_max with input type tf.complex64

 REGISTER_XLA_OP(Name("Min").CompileTimeConstantInput("reduction_indices"),

 class MaxOp : public XlaReductionOp {
  public:
   explicit MaxOp(OpKernelConstruction* ctx)
-      : XlaReductionOp(ctx, ctx->input_type(0)) {}
+      : XlaReductionOp(ctx, ctx->input_type(0)) {
+    OP_REQUIRES_OK(ctx, TypeCheck(xla_reduction_type_));
+  }
+
+  Status TypeCheck(xla::PrimitiveType xla_reduction_type_) {
+    if (xla_reduction_type_ == xla::C64) {

Hi @sanjoy, thanks for the review. Resolved this by adding checks for the unsupported types based on https://github.com/tensorflow/tensorflow/blob/c48d2a48bbbb9036a9b86194a8ac6bd3160ade00/tensorflow/compiler/xla/literal_util.cc#L207-L250

Leslie-Fang

comment created time in 18 days

push event Leslie-Fang/tensorflow

Leslie-Fang

commit sha 22b65b2d1b2abb60055966b7acd5d7042902c666

add more unsupport type

view details

push time in 18 days

push event Leslie-Fang/kaggle

Leslie-Fang

commit sha 552228faa393dd8598982a2b1a80fd93e0795a48

add coco inference

view details

Leslie-Fang

commit sha 7956b01f5f57007efc3bb6452294e64480ab152d

modified readme

view details

push time in 18 days

started weiliu89/caffe

started time in 19 days

started tiny-dnn/tiny-dnn

started time in 19 days

issue opened tensorflow/models

Docs or help to set up the object detection training pipeline file

Hi all, are there any docs that explain how to set up the object detection training pipeline configuration file, especially what each configuration parameter means? Here is the example pipeline doc, without detailed explanation: https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/configuring_jobs.md

created time in 19 days
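For reference, a heavily trimmed sketch of the fields such a pipeline file typically sets when fine-tuning; the paths and values below are placeholders for illustration, not recommendations:

```
model {
  ssd {
    num_classes: 1          # number of object classes in your dataset
    # ... feature extractor, anchor generator, loss, post-processing ...
  }
}
train_config {
  batch_size: 24
  fine_tune_checkpoint: "path/to/model.ckpt"  # pre-trained weights to start from
  num_steps: 20000
}
train_input_reader {
  label_map_path: "path/to/label_map.pbtxt"
  tf_record_input_reader {
    input_path: "path/to/train.record"
  }
}
eval_input_reader {
  label_map_path: "path/to/label_map.pbtxt"
  tf_record_input_reader {
    input_path: "path/to/eval.record"
  }
}
```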

started apache/incubator-mxnet

started time in 20 days

started Cartucho/mAP

started time in 21 days

push event Leslie-Fang/kaggle

Leslie-Fang

commit sha d4d66443e78dee525d18b7ab8186261879250bb9

fix typo in plot boox

view details

push time in 21 days

push event Leslie-Fang/kaggle

Leslie-Fang

commit sha e54fd634c235dffa09162f8efd10fb7b5d587988

add inference from pre-trained model

view details

push time in 21 days

push event Leslie-Fang/tensorflowjs_demo

Leslie-Fang

commit sha 54d5e91822d3650eef0c2d3e796e82b47ae3e381

add comments

view details

push time in 21 days

create branch Leslie-Fang/tensorflowjs_demo

branch : pre_trained_ssd

created branch time in 22 days

create branch Leslie-Fang/tensorflowjs_demo

branch : dog_cat_self_trained

created branch time in 22 days

started bazelbuild/intellij

started time in 22 days

push event Leslie-Fang/tensorflowjs_demo

leslie-fang

commit sha 519768729225758e72134fd5cfe651a1ecd24a34

add top5 calculation

view details

push time in 22 days

create branch Leslie-Fang/tensorflowjs_demo

branch : master

created branch time in 22 days

issue comment tensorflow/tfjs

mobilenet2.0.2 set modelURL:gstaticcnapps.cn,has cors error

Setting requestInit: {mode: 'no-cors'} avoids the CORS error, but the fetched model is empty...

allanguys

comment created time in 23 days

started tensorflow/tfjs

started time in 23 days

issue comment tensorflow/tfjs

mobilenet2.0.2 set modelURL:gstaticcnapps.cn,has cors error

Hi, any update on this issue? I have got the same error...

allanguys

comment created time in 23 days

created repository Leslie-Fang/tensorflowjs_demo

created time in 23 days

started tensorflow/tfjs-examples

started time in 23 days

started tensorflow/tfjs-wechat

started time in 23 days

push event Leslie-Fang/C-Solution2Leetcode

leslie-fang

commit sha 0c0ffcbd2ed003552ed96bb0a2c948a50744fe0c

238 accept

view details

leslie-fang

commit sha 458b60d7f5a7703cbbc11ddb5b9b226b0f9dbfac

295 accept

view details

push time in 24 days

pull request comment tensorflow/tensorflow

fix the core_dump in reduce_max with input type tf.complex64

Hi @nickdesaulniers, sorry to interrupt, but any comments?

Leslie-Fang

comment created time in 25 days

issue comment tensorflow/tensorflow

Python crashes when computing max of a complex tensor

@jvishnuvardhan Thanks for the reminder. I think it's already in the doc: https://github.com/tensorflow/tensorflow/blob/78b09356f5c1a17d869a2fbc571bd720b3450d9b/tensorflow/python/ops/math_ops.py#L2409

input_tensor: The tensor to reduce. Should have real numeric type.
zaccharieramzi

comment created time in 25 days

issue comment tensorflow/tensorflow

Master branch failed to build the debug version

Thanks @Saduf2019. Since I don't have GCC 7.x available, I tried GCC 6.3, which gives the same error. Is GCC 7.x specifically required to build the debug version?

Leslie-Fang

comment created time in a month

issue opened tensorflow/tensorflow

Master branch failed to build the debug version

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Centos7.6
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary): source
  • TensorFlow version: master, commit-id: b45adaf6efeeb8e4acf8517a01f7dc01bdf21db9
  • Python version: 3.6
  • Installed using virtualenv? pip? conda?:
  • Bazel version (if compiling from source): bazel 2.0.0
  • GCC/Compiler version (if compiling from source): 8.3

Describe the problem
The debug version failed to build with the command:

bazel build -c dbg //tensorflow/tools/pip_package:build_pip_package

Error:

external/aws-checksums/source/intel/crc32c_sse42_asm.c: In function 's_crc32c_sse42_clmul_256':
external/aws-checksums/source/intel/crc32c_sse42_asm.c:61:5: error: 'asm' operand has impossible constraints
     asm volatile("enter_256_%=:"
     ^~~
Target //tensorflow/tools/pip_package:build_pip_package failed to build

created time in a month

push event Leslie-Fang/tensorflow

Leslie-Fang

commit sha 614e211481339c2f2917613d9642cb0ebbe4e92c

format error message in create op

view details

push time in a month

issue comment tensorflow/tensorflow

Python crashes when computing max of a complex tensor

@zaccharieramzi You will fail to create the max op and see an error message like this:

OP_REQUIRES failed at reduction_ops.cc:88 : Invalid argument: Unsupported PrimitiveType of MaxOp: 'C64'
zaccharieramzi

comment created time in a month

issue comment tensorflow/tensorflow

Python crashes when computing max of a complex tensor

Hi @zaccharieramzi, I think the input type tf.complex64 is not supported in reduce_max. I have made a PR to catch this condition and avoid the core dump.

zaccharieramzi

comment created time in a month
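The underlying reason reduce_max rejects complex inputs is that complex numbers have no natural ordering; plain Python's built-in max rejects them too, and any "max of complex" needs an explicit definition such as largest magnitude:

```python
zs = [1 + 2j, 3 + 0j, 4j]

# Complex numbers are unordered, so comparison-based max is rejected:
try:
    max(zs)
except TypeError as e:
    print("no ordering:", e)

# A well-defined alternative: the element with the largest magnitude.
print(max(zs, key=abs))  # 4j  (|4j| = 4 > |3| > |1+2j| ≈ 2.24)
```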

PR opened tensorflow/tensorflow

fix the core_dump in reduce_max with input type tf.complex64

As reported in this issue: https://github.com/tensorflow/tensorflow/issues/37446, when the input type is tf.complex64, the reduce_max op will core dump. This PR adds a condition check to avoid the core dump.

+12 -1

0 comment

1 changed file

pr created time in a month

create branch Leslie-Fang/tensorflow

branch : lesliefang/fix_reduce_mean_coredump

created branch time in a month

issue comment tensorflow/tensorflow

Building bazel on OpenSuSE Leap 15.1 fails due to unrecognized option.

Please use bazel-2.0; refer to this PR: https://github.com/tensorflow/tensorflow/pull/36913

jlturriff

comment created time in a month

push event Leslie-Fang/C-Solution2Leetcode

leslie-fang

commit sha 3220b12b120ab4900ac72e8b785e9227929ff026

234 accept

view details

leslie-fang

commit sha 760ef77c0d34ba6b792beecd3748973f9846fd15

617 accept

view details

leslie-fang

commit sha 6d2ba469bedbf3bb7bbe2b5db7b0186d7a6abf33

739 accept

view details

push time in a month

push event Leslie-Fang/C-Solution2Leetcode

leslie-fang

commit sha 92aed0a73de6a9d1213560d169f61261968aeaf0

remove test

view details

push time in a month

issue comment horovod/horovod

TensorFlow master branch (tf-nightly) compatibility

@tgaddair Yes, I set experimental_run_tf_function=False. Downgrading TF to 2.0 fixed this problem.

tgaddair

comment created time in a month

issue comment horovod/horovod

Example to run training on multi-sockets CPU with TF2.0

Thanks, guys. I will try it.

Leslie-Fang

comment created time in a month

started bazelbuild/bazel

started time in a month

issue comment bazelbuild/bazel

Illegal reflective access by com.google.protobuf.UnsafeUtil

I have used bazel-1.2.1 and got the same error:

ERROR: Unrecognized option: --experimental_repo_remote_exec
buchgr

comment created time in a month

push event Leslie-Fang/kaggle

Leslie-Fang

commit sha 9de84aad4729b0c11bbfa03275b659e25b2474db

add cat_or_dog horovod

view details

push time in a month

push event Leslie-Fang/kaggle

Leslie-Fang

commit sha 8d185df1d8b717f72f03bdfc90dd9631294c11bc

modified format

view details

push time in a month

push event Leslie-Fang/kaggle

Leslie-Fang

commit sha e3b08916a7b445ea71a60cdce3695bbc484909e7

add multi-node training

view details

push time in a month

issue opened open-mpi/ompi

What's the difference between map-by and bind-to

Open-MPI version: 3.0.4

Question: When I use mpirun, what is the difference between --map-by and --bind-to?

created time in a month
