
tensorflow/lingvo 1911

Lingvo

gdhungana/desisim 0

DESI simulations

samikama/amazon-dsstne 0

Deep Scalable Sparse Tensor Network Engine (DSSTNE) is an Amazon-developed library for building Deep Learning (DL) machine learning (ML) models

samikama/benchmarks 0

Benchmark code

samikama/caffe 0

Caffe: a fast open framework for deep learning.

samikama/DeepLearningExamples 0

Deep Learning Examples

samikama/desispec 0

DESI spectral pipeline

samikama/DIGITS 0

Deep Learning GPU Training System

samikama/kamerka 0

Build interactive map of cameras from Shodan

pull request comment tensorflow/serving

Add command-line flags to set model memory limit

@googlebot I signed it!

samikama

comment created time in 4 hours

issue comment tensorflow/tensorflow

[Bug] Wrong device placement for tf.constant with int32

I believe this is intentional as can be seen in #28051.

VoVAllen

comment created time in a day

fork samikama/TensorRT

TensorRT is a C++ library for high performance inference on NVIDIA GPUs and deep learning accelerators.

https://developer.nvidia.com/tensorrt

fork in a day

PR opened tensorflow/serving

Add command-line flags to set model memory limit

This PR adds a command-line flag exposing the total model memory limit parameter to the user. For ease of use, the limit is expressed in MB and converted to bytes internally before being passed to the model server.
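
A minimal sketch of the flag semantics described above; the flag name and wiring here are hypothetical, not the actual tensorflow/serving code:

```cpp
#include <cstdint>

// Hypothetical stand-in for the new command-line flag (value in megabytes).
std::int64_t total_model_memory_limit_mb = 4096;

// Converted once, internally, to the byte value the model server consumes.
std::int64_t TotalModelMemoryLimitBytes() {
  return total_model_memory_limit_mb * 1024 * 1024;  // MB -> bytes
}
```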

+13 -2

0 comment

3 changed files

pr created time in 3 days

create branch samikama/serving

branch : model_memory_limit

created branch time in 3 days

fork samikama/serving

A flexible, high-performance serving system for machine learning models

https://www.tensorflow.org/serving

fork in 3 days

pull request comment tensorflow/tensorflow

Introducing BatchedNMS and CombinedNMS GPU ops

@gbaned The build errors are unrelated to this PR. @sanjoy Will there be a review of this?

samikama

comment created time in 7 days

issue comment tensorflow/addons

Implement fast-nms

@fsx950223 how does this algorithm compare against existing GPU NMS implementations in TF? Do you have any idea?
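
For readers unfamiliar with fast-nms (the YOLACT-style variant the issue title refers to), here is a minimal single-threaded sketch of its keep rule; the names and layout are hypothetical, and boxes are assumed pre-sorted by descending score.

```cpp
#include <algorithm>
#include <array>
#include <vector>

using Box = std::array<float, 4>;  // x1, y1, x2, y2

// Standard intersection-over-union for axis-aligned boxes.
float Iou(const Box& a, const Box& b) {
  const float ix = std::max(0.0f, std::min(a[2], b[2]) - std::max(a[0], b[0]));
  const float iy = std::max(0.0f, std::min(a[3], b[3]) - std::max(a[1], b[1]));
  const float inter = ix * iy;
  const float uni = (a[2] - a[0]) * (a[3] - a[1]) +
                    (b[2] - b[0]) * (b[3] - b[1]) - inter;
  return inter / uni;
}

// Fast NMS keeps box i iff its max IoU with every higher-scoring box is below
// the threshold. Unlike classic NMS, boxes that are themselves suppressed can
// still suppress others, trading a little accuracy for a pairwise pass that
// parallelizes trivially (one IoU-matrix reduction on GPU).
std::vector<int> FastNms(const std::vector<Box>& sorted_boxes, float thresh) {
  std::vector<int> keep;
  for (int i = 0; i < static_cast<int>(sorted_boxes.size()); ++i) {
    float max_iou = 0.0f;
    for (int j = 0; j < i; ++j)  // only higher-scoring boxes can suppress i
      max_iou = std::max(max_iou, Iou(sorted_boxes[j], sorted_boxes[i]));
    if (max_iou <= thresh) keep.push_back(i);
  }
  return keep;
}
```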

fsx950223

comment created time in 16 days

pull request comment tensorflow/tensorflow

Re-enable GPUNMSv3 and v4 for small box count users

@tatianashp It depends on the input parameters, but it is on the order of 10x to several 100x.

samikama

comment created time in 21 days

create branch samikama/tensorflow

branch : FP16BatchedNMS

created branch time in a month

pull request comment tensorflow/tensorflow

Fix Validation error on GenerateBoxProposals op

@sanjoy FYI

samikama

comment created time in 2 months

PR opened tensorflow/tensorflow

Fix Validation error on GenerateBoxProposals op

This PR fixes an argument validation error introduced during the review process by an inversion of logic between the CHECK* and OP_REQUIRES macros. I suggest cherry-picking this to other branches whenever possible.
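
To illustrate the pitfall (a hedged sketch, not the actual one-line diff): both macros take the condition that must hold, so converting one into the other has to preserve the predicate's sense.

```cpp
#include "tensorflow/core/framework/op_kernel.h"  // OP_REQUIRES, CHECK_GT
#include "tensorflow/core/lib/core/errors.h"      // errors::InvalidArgument

// Inside an OpKernel constructor (context and post_nms_topn_ are assumed):
//
//   CHECK_GT(post_nms_topn_, 0);  // aborts the process when the check fails
//
// becomes, keeping the same predicate sense,
//
//   OP_REQUIRES(context, post_nms_topn_ > 0,
//               errors::InvalidArgument("post_nms_topn should be > 0"));
//
// The bug being fixed is the mechanical inversion of that predicate, e.g.
// OP_REQUIRES(context, post_nms_topn_ <= 0, ...), which rejects every valid
// input instead of the invalid ones.
```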

+1 -1

0 comment

1 changed file

pr created time in 2 months

create branch samikama/tensorflow

branch : GBBPFix

created branch time in 2 months

push event samikama/tf_tensor_dumper

Sami Kama

commit sha c6b91612e995c7f2de2b65aca09dd5c577f42013

Add forgotten items()

view details

push time in 2 months

push event samikama/tensorflow

Sami

commit sha 25467bd50124529beea35a12d084b0f1a8ef0307

Debugging

view details

push time in 2 months

create branch samikama/tensorflow

branch : MultiStream

created branch time in 2 months

push event samikama/tensorflow

Alexandre Passos

commit sha 3aa42f1cbb0e5c6f555d7ed59a278fe391e2ed38

Avoid passing a number as a dtype PiperOrigin-RevId: 285859917 Change-Id: I0176671ce2833cc3d589cbff1b84c9ee89a99361

view details

Rohan Jain

commit sha 42afc3e5acae4c76afef5338e090a4809275000b

Adding unicode / PY3 support for feature column vocab files. PiperOrigin-RevId: 285862836 Change-Id: I2eec29c2300dfbc99f29b30b56e3e7dfea6d047e

view details

Smit Hinsu

commit sha f36cd515e9bfc029d8c10aec915b9f9b1008c022

Add type constraints in patterns using BinBroadcastDimensions and ConstantSplat helpers ConstantSplat requires static shaped types BinBroadcastDimensions requires ranked types PiperOrigin-RevId: 285863722 Change-Id: Ia2e1220568ab4eae8683b4c7b74ab1c4a38a1240

view details

Dero Gharibian

commit sha d3f32b0c698c30f03dbde033c81a0cc4f6566789

Update snapshot_dataset_op.cc to use tensorflow::tstring. PiperOrigin-RevId: 285863781 Change-Id: Id8c15e610cd15b4ae78d89efe1a5aed9956d6e11

view details

Mihai Maruseac

commit sha ebd8e27075f46e0f2400c5bc095cde6a0169248a

Add h5py dependency to setup.py. It used to be transitively brought in by keras_applications but we no longer depend on that. PiperOrigin-RevId: 285864272 Change-Id: Iae2de4f9b8686602d661cff716ff592788b43b09

view details

TensorFlower Gardener

commit sha a6b7f76f4c7973c2efc1be8b3150b730e32796d5

Merge pull request #35166 from tensorflow:jvishnuvardhan-patch-2 PiperOrigin-RevId: 285864348 Change-Id: I63ba578cbb1a23f776570cc8a4cdb7fe3312a7d4

view details

Yunxing Dai

commit sha f0fe35b3a23325328860744c5c821b0e67dc9adb

Set schedule when parsing a single instruction as a module. PiperOrigin-RevId: 285867393 Change-Id: Ia9bbb5933a157c053cd26357160bdb729b7b5f90

view details

Andy Ly

commit sha d7fac809606f526f573ead77fb73a69591f2743d

Update StridedSlice TF -> HLO legalization to support non zero begin_mask and end_mask. This reuses helper functions for calculating begin indices, end indices, and strides from StridedSliceGrad. The unit test for StridedSlice with shrink_axis_mask is removed until helper functions are updated to support shrink_axis_mask. PiperOrigin-RevId: 285870165 Change-Id: I278c023770b767d287bc0de994e7301800638821

view details

Yifei Feng

commit sha b125fe25420b9cf589a817174840f728a040b76b

Add docker file for custom op for cuda10.1 (release 2.1+). PiperOrigin-RevId: 285871263 Change-Id: I10b5eacb8861ffc07384ea76e4e4597283f69fcb

view details

Derek Murray

commit sha d124074145a11645632f6c7f866f158c1adaba89

Split sparse_tensor.h into separate .h and .cc files. PiperOrigin-RevId: 285880109 Change-Id: Ifa5b65fe41bea505d1454a8c43b56959eaa04431

view details

A. Unique TensorFlower

commit sha 8f3c029a38b46ffbe292cdb299c3d4a7f82401a5

move third_party/tensorflow/core/tfrt_delegate to third_party/tensorflow/compiler/mlir/tfrt/tfrt_delegate. PiperOrigin-RevId: 285880398 Change-Id: I716a73a66d418a58a6a930175e7673b9b59ae671

view details

Zhenyu Tan

commit sha cdb4ad114704ad08b6a4fe70db0ccaa745e61fcb

Update Xception URL. PiperOrigin-RevId: 285882699 Change-Id: If4d365661a0376ee815fd02fe2a16dd823fca13d

view details

Mihai Maruseac

commit sha f7f7b48e4230885b1000e05d56252fa78e8bac95

Fix typo in setup.py PiperOrigin-RevId: 285882855 Change-Id: I252217310d589d4c331b0287f5fc328d7bb05b75

view details

Karim Nosir

commit sha 3d42bdcff7dbf7f216786fcb70c56017d89bfcb7

- Update build rules for tflite diff to compile for iOS, and avoid compiling flex delegate for iOS. - Add new method for GenerateTestSpec to generate testspec from TFLite. PiperOrigin-RevId: 285884997 Change-Id: I803bd619013f7410bd56283a715e46c8719d4810

view details

A. Unique TensorFlower

commit sha 10c882dfbd41863b20eb711a76f7b4a2aa1076b8

Go: Update generated wrapper functions for TensorFlow ops. PiperOrigin-RevId: 285891169 Change-Id: I20b55062c56c5b04c8dc0fc03ad6df6f5f02c8f5

view details

Karim Nosir

commit sha 8abcbc77a0bc2af38b1f2b0b270504cf1bc9a225

Add new methods for diff tests that compares TFLite vs any custom TestRunner PiperOrigin-RevId: 285891574 Change-Id: I67da00e81f6ca60e939b5f77502be1249d59a930

view details

Brian Atkinson

commit sha 2a7c103ffea38b7ef074ff1468676f11a191a554

proto2::TestFormat should be google::protobuf::TextFormat. PiperOrigin-RevId: 285896522 Change-Id: I2116497da6a683c5fb62e574794e4b1e04dee571

view details

Brian Atkinson

commit sha 3dc81a18dae23148db8178afa75ad9c239c43e69

Fix link to point to external documentation. PiperOrigin-RevId: 285898531 Change-Id: I0404ae4a9759ca01042db4684bc2282062137418

view details

A. Unique TensorFlower

commit sha b7bd9b0950b65618516f173e52285ee41821607e

Go: Update generated wrapper functions for TensorFlow ops. PiperOrigin-RevId: 285899949 Change-Id: Idfe131765d1b691937b26692623dbfb9c125f066

view details

A. Unique TensorFlower

commit sha 8e42b57fc4497948d4dccb6525216c139e89c63f

Add MetadataMatcher to help processing xplane. PiperOrigin-RevId: 285905624 Change-Id: I68e174c654ee975067065dbfcb8b59d08d1fdb00

view details

push time in 2 months

issue comment tensorflow/tensorflow

non_max_suppression is full of bugs!

@MoussaMM NMS is defined on boxes, not lines or points. I don't think there is any issue with the code there; I believe your inputs are simply not valid. Please point me to a reference where NMS is defined on points or lines. Otherwise this can be closed.
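
To see why such inputs are invalid (an illustrative sketch, not the TF kernel): the IoU at the heart of NMS divides by a union area that is zero when a "box" degenerates to a line or a point.

```cpp
#include <cstdio>

int main() {
  // A "box" that is actually a point: x1 == x2 and y1 == y2.
  const float x1 = 5, y1 = 5, x2 = 5, y2 = 5;
  const float area = (x2 - x1) * (y2 - y1);      // 0 for a point or a line
  // IoU = intersection / (area_a + area_b - intersection); for two such
  // degenerate inputs every term is zero, so the score is 0/0 = NaN.
  const float iou = 0.0f / (area + area - 0.0f);
  std::printf("area=%f iou=%f\n", area, iou);    // prints iou=nan
  return 0;
}
```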

MoussaMM

comment created time in 2 months

fork samikama/Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

fork in 2 months

pull request comment tensorflow/tensorflow

[WIP] Adding generic TRT plugin support

As per private discussions, I am closing this. If anybody requires this functionality, you can contact me.

samikama

comment created time in 3 months

PR closed tensorflow/tensorflow

[WIP] Adding generic TRT plugin support

cla: yes comp:gpu size:L

This PR enables generic plugin support for TFTRT, allowing custom TRT plugins to be included in the TRT conversion process and increasing TFTRT coverage.

+886 -34

9 comments

9 changed files

samikama

pr closed time in 3 months

delete branch samikama/tensorflow

delete branch : NCCL_cuda10.2

delete time in 3 months

pull request comment tensorflow/tensorflow

[WIP] Adding generic TRT plugin support

@aaroey While sorting out the merge conflict I noticed that some changes were made to how TRT plugins are initialized. The current changes, especially passing the logger to the plugin initialization function, cause ownership or duplication problems. In the current scheme, the caller of Converter::Create() owns the logger, and it is passed down to the plugins. However, the plugin library will require initialization without the presence of a converter, which in turn requires another copy of the logger. I believe passing the logger to the converter and then to the plugins is not a good choice. How should we solve this?

Thanks, Sami
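
A condensed sketch of the ownership conflict described above; all names are hypothetical apart from Converter::Create:

```cpp
#include <memory>

struct Logger {};  // stand-in for the TRT logger type

// Current scheme: the caller of Converter::Create() owns the logger, and the
// converter passes the raw pointer down to plugins during conversion.
struct Converter {
  static std::unique_ptr<Converter> Create(Logger* logger);
};

// Problem: a plugin library may need to initialize before any Converter
// exists, which forces a second Logger instance or an unclear owner.
void InitializePlugins(Logger* logger);

// One possible resolution (a design suggestion, not code from the PR):
// a process-wide logger accessor that both paths share.
Logger* GetGlobalLogger() {
  static Logger logger;
  return &logger;
}
```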

samikama

comment created time in 3 months

pull request comment tensorflow/tensorflow

Make nccl bindings compilable with cuda 10.2

@gunan please assign appropriately if you are not the right person. Thanks.

samikama

comment created time in 3 months

PR opened tensorflow/tensorflow

Make nccl bindings compilable with cuda 10.2

This PR fixes the Bazel NCCL bindings and enables TF compilation with CUDA 10.2 or later.

+25 -9

0 comment

2 changed files

pr created time in 3 months

create branch samikama/tensorflow

branch : NCCL_cuda10.2

created branch time in 3 months

push event samikama/tensorflow

Frank Chen

commit sha ff8d2709e24077efc5a87e5078e0d0e022a90f34

Add support for recording TPU driver in the Python TPU driver. PiperOrigin-RevId: 281351702 Change-Id: Iddfd6312a26ac4a0856fab7e0a5b0796ffbca0a5

view details

Nicolas Vasilache

commit sha f949bf8e29073395ad5a9f70ebb9b30ada9d2a6f

Add VectorOps.StridedSliceOp The `vector.strided_slice` takes an n-D vector, k-D `offsets` integer array attribute, a k-D `sizes` integer array attribute, a k-D `strides` integer array attribute and extracts the n-D subvector at the proper offset. Returns an n-D vector where the first k-D dimensions match the `sizes` attribute. The returned subvector contains the elements starting at offset `offsets` and ending at `offsets + sizes`. Example: ``` %1 = vector.strided_slice %0 {offsets : [0, 2], sizes : [2, 4], strides : [1, 1]}: vector<4x8x16xf32> // returns a vector<2x4x16xf32> ``` This op will be useful for progressive lowering within the VectorOp dialect. PiperOrigin-RevId: 281352749 Change-Id: I8987b3ac76058ad5e4336a54015a20822f98d96b

view details

Andrew Selle

commit sha 5eea888e6f42784d629a0a3bff54b3ae0906084d

Give more helpful error when toco_from_protos binary is not found. PiperOrigin-RevId: 281362420 Change-Id: I46d58f8859909da5121175e5595c4b61b79c9bd3

view details

Russell Power

commit sha 14b36622431f695c17cb7576bff64b38de33a3a2

TpuDriver: Add client_version field to provide better feedback in the event of non-backwards-compatible protocol changes. PiperOrigin-RevId: 281362786 Change-Id: If7de70f60c5bfacd9971003ceb9ab0e6db973066

view details

Nat Jeffries

commit sha 52be12a34111757ee7de8176b761e7e5bd6c58af

As a first step to synchronizing our op versions with TFLite, update our currently supported ops based on lite/tools/versioning/op_version.cc. This is imperfect, since the TFLite op versioning scheme does not perfectly fit the types of ops we have on Micro. For edge cases that we do not handle, expect to see a runtime error. PiperOrigin-RevId: 281362815 Change-Id: I2db5671cd9cc7e6a77601a5e1e0d3d0258de8b3c

view details

Christian Sigg

commit sha 15bb9848c68d710e8d1331354fb28c3305426ccc

Make type and rank explicit in mcuMemHostRegister function. Fix registered size of indirect MemRefType kernel arguments. PiperOrigin-RevId: 281362940 Change-Id: I99c3fbbc4cfc22129be7e24b8dcdef458c1ad996

view details

Feng Liu

commit sha 0a793b9dfd60ebfacda33cb52f0a12f089d41428

Create a flag to disable per-channel quantized weights even if the ops have the support. If the user prefers performance, they might want to disable per-channel quantization. When per-channel quantization is disabled, the weight uses asymmetric quantization with a narrow range. PiperOrigin-RevId: 281365699 Change-Id: Ie5154e7bc3d3c1d7893531c308e0e0989b5ff117

view details

Abdurrahman Akkas

commit sha be937a3290223f926fe50684f1344569a573ed4b

Fixing per axis quantization bug in flatbuffer importer. PiperOrigin-RevId: 281367969 Change-Id: Iade9732ca349b81ddd7601717e6c558dfd32c723

view details

Wenhao Jia

commit sha c90a24463ded1d22fc7b029a37d481dc3a626da6

Make GRPC client events wait for requirement events before waiting for itself. PiperOrigin-RevId: 281374353 Change-Id: I84483a658683b52091f488b7ed47a5a4d7833f3b

view details

Brian Atkinson

commit sha 98c671e8649ed815bbe932376c10483a525d4661

Internal change PiperOrigin-RevId: 281375332 Change-Id: I3c7efce7889b6b3129ad37a531e435ba46ad61b5

view details

George Karpenkov

commit sha adefdd3e986f88130581ac272aac363d6971010c

[XLA] Fix the use-after-free of device assignment on XLA client codepath PiperOrigin-RevId: 281375460 Change-Id: I819517a103732d2c0567bee7b0f3ce093bac6e7e

view details

A. Unique TensorFlower

commit sha e355d46cee1dd793bc7d4a435ab0b3d88d980366

Restore op compatibility for older models. PiperOrigin-RevId: 281375549 Change-Id: I5202c47bd41c9cd34626279f17129df98e499dfd

view details

Frank Chen

commit sha b008bafe594386e7f8ecb6a923b786aa3c7ec59a

Add a force method to skip the caching of existing clients. PiperOrigin-RevId: 281376671 Change-Id: I33dc80b9c1f0ee57d7e9937ffa3acb80f42a2844

view details

TensorFlower Gardener

commit sha 26acde8b73122b9cd0b190e3cf1275078c928847

Merge pull request #34400 from frreiss:issue-data-unique-test PiperOrigin-RevId: 281389907 Change-Id: I944acbf73f3cc9685fc428b8b45a1acf1d282843

view details

Shanqing Cai

commit sha fc0e31111abf27bf1e1facda5846a7cfc05ed7f4

[tfdbg] tf.debugging.enable_check_numerics() uses CheckNumericsV2 op - Add the CPU and GPU implementations of CheckNumericsV2Op - CheckNumericsV2Op inherits from CheckNumericsOp, but has the new feature of distinguishing +/- infinities. PiperOrigin-RevId: 281390258 Change-Id: Id45ef975da95104ef9b482bbb7034790b314262e

view details

Yifei Feng

commit sha bc45d196c39ffb29b59951237d88e09b1b158c2b

Implement CollectData(XSpace*) in HostTracer PiperOrigin-RevId: 281393075 Change-Id: I9ec369157e17cb1bbb5fb01bbc3b3914b19e464f

view details

A. Unique TensorFlower

commit sha 0231137ed043ebc01517856f5472d26bc71a6fa8

Allow parsing HLO constant without literals. This feature is used when parsing HLO for evaluating cost model instead of running the HLO graph. PiperOrigin-RevId: 281393128 Change-Id: I6e986f9b34304844942731cb364c9479a8edaf00

view details

Prakalp Srivastava

commit sha 15d103c40933ad8c89a5ae4882fdc05c19b0f78d

Add support for PadOp in HLO exporter. Generic support cannot be used for PadOp because hlo dialect representation breaks the `PaddingConfig` Pad HLO instruction into three separate attributes of int64 vectors in hlo dialect. PiperOrigin-RevId: 281399246 Change-Id: Iab6f9b5fff925c9a3936ad8714d68c49b8893c00

view details

TensorFlower Gardener

commit sha ba63d9082a2265da91ec4daefecfa4cd47fcf07f

Merge pull request #24796 from mrTsjolder:unify_doccitations PiperOrigin-RevId: 281399355 Change-Id: Icec56ac2937bf6abba6149243c7759c7c6e244ec

view details

Frank Chen

commit sha 1c53cc6943caa5fc22b63dcc0d778a8396645ac8

Use absl::optional instead of std::optional to ensure C++14 compatibility PiperOrigin-RevId: 281399770 Change-Id: If502ba2e74f6b83de34cf41164cfbd94e1509ded

view details

push time in 3 months

pull request comment tensorflow/tensorflow

Adding NMSv4 GPU implementation

@qinyao-he check #34852

samikama

comment created time in 3 months

pull request comment tensorflow/tensorflow

Introducing BatchedNMS and CombinedNMS GPU ops

@pkanwar23 and @aaroey please ask your internal teams to give these ops a try.

samikama

comment created time in 3 months

PR opened tensorflow/tensorflow

Introducing BatchedNMS and CombinedNMS GPU ops

This PR adds the BatchedNMS op, which uses significantly less memory than the GPUNMSV[234] ops at the expense of some performance for very small batch sizes. A GPU implementation of the CombinedNMS op using BatchedNMS kernels is also included. GPU CombinedNMS improves SSD images/s by about 50% with respect to the multi-class CPU NMS ops and by about 4x vs the CPU CombinedNMS implementation.

+2358 -44

0 comment

7 changed files

pr created time in 3 months

push event samikama/tensorflow

A. Unique TensorFlower

commit sha f5866078ee935cad8e0be0879316c22fdd379777

Documentation update to reference learning rate schedules in the optimizer documentation. PiperOrigin-RevId: 281104367 Change-Id: Id814018dfb8f21b4d1b46b7d675838c56765b975

view details

A. Unique TensorFlower

commit sha 837b26bd578d33cc44ca449f5d3b33357fddac36

Update Eigen to: https://bitbucket.org/eigen/eigen/commits/66be6c76fc01ad8efbb6aadacf21b32f3246e7c5 PiperOrigin-RevId: 281106977 Change-Id: I05c7b39677ab04620d0a5e8830621346e2b2a9dc

view details

Mihai Maruseac

commit sha c455ab45559806831d74333a61e8c873de9cfc9d

Implement modular POSIX filesystem support for deleting files and directories. We also provide tests to make sure all API requirements are satisfied. Just a small sized part of work for modular filesystem plugins. For more details, consult the RFC at https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md PiperOrigin-RevId: 281107956 Change-Id: I0be89ff20d2c865fe552a9eeee12573eff63eb01

view details

Russell Power

commit sha c85d9b13696e0a3a462a93ce0a6c027a57c6a5d9

TpuDriver: Add more detail to connection errors and make timeout configurable. PiperOrigin-RevId: 281108883 Change-Id: I6bbeb951815816b00d9591ba2fed27879edcd51c

view details

Mihai Maruseac

commit sha de2127833462b2d2b2915c72565601d1ceb798ff

Implement modular POSIX filesystem support for testing if paths exist. We also provide tests to make sure all API requirements are satisfied. Just a small sized part of work for modular filesystem plugins. For more details, consult the RFC at https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md PiperOrigin-RevId: 281109338 Change-Id: I060ff60ce6502770d43798910a10b2d0d0a2a601

view details

Mihai Maruseac

commit sha 87451d7147d69cc571256e9795ce4c68c514e2c1

Allow modular POSIX filesystem to get file/dir statistics. We also provide tests to make sure all API requirements are satisfied. Just a small sized part of work for modular filesystem plugins. For more details, consult the RFC at https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md PiperOrigin-RevId: 281110604 Change-Id: I644b8d6da400e1639f7e8ec91ceaa440d4cb193d

view details

Gaurav Jain

commit sha 309a3c7964d14306da73196f536f33eebd333d0f

Avoid allocating ScopedStepContainer for each Run We avoid recreating a ScopedStepContainer by storing one for reuse in the KernelAndDeviceOp & KernelAndDeviceFunc classes. Further, we can avoid doing a resource manager lookup to perform a clean by adding a dirty flag to indicate the ScopedStepContainer was accessed. In addition, we simplify the signature of MakeResourceHandle by avoiding the need to pass in the entire OpKernelContext object. PiperOrigin-RevId: 281110991 Change-Id: I0a186583a1ff50b08bf68c18cfb99c912e05386d

view details

Derek Murray

commit sha ad703bf932e454fb9f2eea7228b7affea7058a9d

Use std::atomic<CancellationToken> to store the next token in CancellationManager. This avoids a lock acquisition for each token. In addition, the method is moved to the header to enable inlining, and a CHECK is replaced with a DCHECK. PiperOrigin-RevId: 281111766 Change-Id: I2f39446ec38bcebf264cba2381e8ff98ee38f172

view details

A. Unique TensorFlower

commit sha 7f2d946f9b5841e32862ecc9bdaec7befbd6b8ca

Fix Affine Loop Fusion test case reported on github. This CL utilizes the more robust fusion feasibility analysis being built out in LoopFusionUtils, which will eventually be used to replace the current affine loop fusion pass. PiperOrigin-RevId: 281112340 Change-Id: I6ec7d189f1444d1a3ce1b8f8b01ce54d98338e47

view details

Russell Power

commit sha ede4ce2ac82a6c2d51733fa8434b5362ff0966d7

TpuDriver: Adjust gRPC keepalive settings to be more aggressive. PiperOrigin-RevId: 281113288 Change-Id: I04437740488880b99a5707a7ea7ffc46e792857f

view details

A. Unique TensorFlower

commit sha 9403dfbb776109a74178764e5a3107b9623ea0e0

ConvertStandardToLLVM: replace assertion with graceful failure The assertion was introduced in the early days of dialect conversion infrastructure when we had the matching function separate from the rewriting function. The infrastructure evolved to have a common matchAndRewrite function and the separate matching function was dropped without changing the rewriting that became matchAndRewrite. This has led to the assertion being triggered. Return a matchFailure instead of failing an assertion on unsupported types. Closes #230 PiperOrigin-RevId: 281113741 Change-Id: Ifccd6230c88f961f3d746dc6ab2870c64f78cd24

view details

A. Unique TensorFlower

commit sha 335dec7961a0f0f3323265a7d91c4c5df2b3870b

Replace dependency on //third_party/tensorflow/core/platform:env with //third_party/tensorflow/core:lib. This was introduced in https://github.com/tensorflow/tensorflow/commit/fbfc8d55c72ead2c39a3374a0ae74d268fb31324 and currently works fine because when --incompatible_legacy_whole_archive flag is set to true, Bazel will allow the linker to drop symbols that it considers unused. However we'll be reverting the value of this flag back to false given that it has caused multiple issues. Once the flag is set to false, having //third_party/tensorflow/core/platform:env as a dep causes issues because the linker ends up linking static symbols multiple times. PiperOrigin-RevId: 281113891 Change-Id: Ic137d03ab8544ba11d2b6b3611fe56f76a55908e

view details

Srihari Humbarwadi

commit sha 23928f94276ad9bb4cf3fe277d95fb55ef759704

Update builder_impl.py

view details

frreiss

commit sha 0684124a58f66e92d79ad607e5c21711dd36c597

Merge branch 'master' of https://github.com/tensorflow/tensorflow into issue-data-unique-test

view details

Jing Pu

commit sha 1cfe93ac9be931af595ffb7bb932130a46c67e76

Also elide large array attribute in OpGraph Dump PiperOrigin-RevId: 281114034 Change-Id: I6f3c8bc9c8f5950a712aead8ade9a2321b77c536

view details

Dan Moldovan

commit sha 0af3daec153d100acd5a60d53956df00c23a8cb3

Fix bug causing the converted function context manager to reset the conversion status to ENABLED in local functions. This bug can lead to incorrect behavior when an explicit toggle to DISABLED (e.g. as done with @do_not_convert) is being inserted in the call stack. PiperOrigin-RevId: 281120324 Change-Id: I7f7fe7fa5bf909e5d252d5bec81399e3ebc3588e

view details

A. Unique TensorFlower

commit sha dc31ec1dadfc0963dd7d5ded39fb5653fce852ca

When executing a TRTEngineOp fails, release the output tensors. Otherwise ExecuteTrtEngine() may check-fail if any outputs are already allocated. PiperOrigin-RevId: 281121980 Change-Id: I3ff788a7dc2db58a5adda3c0c2466685bcf7df1a

view details

Yifei Feng

commit sha 7a98805cb0d8c5398990456e815ff9b9d51e42e6

Export the GetPythonWrappers functions from C++ to Python with pybind11 instead of swig. This is part of a larger effort to deprecate swig and eventually with modularization break pywrap_tensorflow into smaller components. It will also make exporting C++ ops to Python significantly easier. XLA is using the pybind11 macros already. Please refer to https://github.com/tensorflow/community/blob/master/rfcs/20190208-pybind11.md for more information. PiperOrigin-RevId: 281122988 Change-Id: I47eb899954f8e4728fb9f69e8f0b3eb95fc33257

view details

Bixia Zheng

commit sha 612a496602a0c4d166152cc1ba5a2b4cdd9414c2

[TF:MLIR] Add an option for requesting full conversion from TF dialect to XLA_HLO dialect. Add a boolean argument to legalizeTF pass for requesting full conversion. The default behavior of the pass is to perform a full conversion. That is, the pass will report an error if there is any operation that can't be legalized. Add test cases. PiperOrigin-RevId: 281123858 Change-Id: I6f3b3e66c14696fae3517ce1f4532747d0da27ff

view details

Blake Hechtman

commit sha ce452bd9966da35ad27de1d3a670eb7fa3b46253

[XLA] Assume that CustomCall instructions created with dynamic dimensions from the client are valid with any padding value. PiperOrigin-RevId: 281124054 Change-Id: Ic9402eabf1b44bf00702bbe7c324d721e34bc7e6

view details

push time in 3 months

create branch samikama/tensorflow

branch : BatchedAndCombinedNMSOps

created branch time in 3 months

push event samikama/tf_tensor_dumper

Sami Kama

commit sha 320491772334a74b287a04c96ed22f6abf621311

py3 compat

view details

push time in 3 months

push event samikama/tf_tensor_dumper

Sami Kama

commit sha 66275824d8147b0a83618fbfea720e2502977977

change default name

view details

push time in 3 months

create branch samikama/tensorflow

branch : BatchedNMS

created branch time in 3 months

pull request comment tensorflow/tensorflow

Re-enable GPUNMSv3 and v4 for small box count users

@mihaimaruseac There is no reason to leave this op out of 2.1. Even though Google-internal models seem to be using >=256k boxes, most object detection/segmentation networks use O(10k) boxes, where the memory implications are negligible. Google-internal models can trivially keep using the CPU implementation with a device scope if they don't want to use this op. I don't believe punishing other users because of a trivially solvable issue in Google-internal models is the right approach.

samikama

comment created time in 3 months

PR opened tensorflow/tensorflow

Re-enable GPUNMSv3 and v4 for small box count users

Reopening #34331 for master, to be CP'ed to 2.1. @mihaimaruseac @aaroey

This PR enables the GPU implementation of NonMaxSuppression ops v3 and v4, which was disabled due to high memory utilization at extremely large input counts (~8GB for 256,000 boxes). The ops have modest memory requirements for most models, which have about O(10k) inputs.
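
For scale, a back-of-envelope check of the quoted figures, assuming memory is dominated by an N x N pairwise-overlap bit mask (a common layout for GPU NMS kernels; the actual kernel may differ):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  // N boxes -> N*N mask bits -> N*N/8 bytes.
  auto mask_bytes = [](std::uint64_t n) { return n * n / 8; };
  std::printf("256000 boxes -> %.2f GB\n", mask_bytes(256000) / 1e9);  // ~8.19 GB
  std::printf(" 10000 boxes -> %.2f MB\n", mask_bytes(10000) / 1e6);   // ~12.5 MB
  return 0;
}
```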

+14 -14

0 comment

1 changed file

pr created time in 3 months

create branch samikama/tensorflow

branch : EnableGPUNMSv3Master

created branch time in 3 months

pull request comment tensorflow/tensorflow

Enable GPUNMS v3 and v4

@mihaimaruseac I thought you didn't want this in master because of your internal models.

samikama

comment created time in 3 months

PR opened tensorflow/tensorflow

Enable GPUNMS v3 and v4

This PR enables the GPU implementation of NonMaxSuppression ops v3 and v4, which was disabled due to high memory utilization at extremely large input counts (~8GB for 256,000 boxes). The ops have modest memory requirements for most models, which have about O(10k) inputs.

+14 -14

0 comment

1 changed file

pr created time in 3 months

create branch samikama/tensorflow

branch : EnableGPUNMSv3

created branch time in 3 months

push event samikama/tensorflow

Davide Libenzi

commit sha 0f059539a09672fb7ca85cec9bdd46b76827c12b

Fix linker error for missing constexpr initialization. PiperOrigin-RevId: 277589327 Change-Id: Ib9d493679ed3b77950360879b809f9f1fda9635d

view details

Smit Hinsu

commit sha fc42bd38ace7a77a8978db7e9aba1d28e6649d56

Drop AnyTensor operand constraint for TensorFlow patterns TensorFlow dialect ops verifies that the operands and results are of Tensor types so this constraint is unnecessary. PiperOrigin-RevId: 277594471 Change-Id: I227710ad587a4f4a8f0393424d6431e7f557ab27

view details

Berkin Ilbeyi

commit sha d962c80c855a691328cb5e9337fbe3000c9c6e28

[XLA] Use cost analysis to order buffers to be assigned in mem space assmt Using cost analysis, we can estimate the "memory boundedness" of an HLO instruction and prioritize those that will benefit the most of being placed in alternate (fast) memory space. PiperOrigin-RevId: 277595877 Change-Id: I78d186e518195c73edfb7e56fad805ac5bbdbb56

view details

Robert David

commit sha 09667f1c8c1f8b5b9a3a6d6c795eb470bf14814d

Change EvalUsingLookupTable to always run using unsigned chars. PiperOrigin-RevId: 277596039 Change-Id: I575c583e03f4844fc919351d18d893f7f9640c02

view details

River Riddle

commit sha cee141cd719683314709357e1bb8bed78485c2f7

NFC: Simplify UseRange::iterator to just be a std::vector::const_iterator. At some point the implementation of UseRange was more complex, but now it is just a simple wrapper around a std::vector<SymbolUse>. PiperOrigin-RevId: 277597294 Change-Id: I20f665bfa4a39fdd623280853f3e39f30a674b3a

view details

Akshay Modi

commit sha 6d93363b7aa17df9ec435d4adbbbad8989eb028e

Disable linear_operator_kronecker_test in XLA PiperOrigin-RevId: 277598054 Change-Id: I45c697a1d1e40b283c769e184dcb10dd17f2d3e6

view details

A. Unique TensorFlower

commit sha 068fc51ed65a9df8a8c80ac8a04eff179219c5e4

TF SquareOp and SquaredDifferenceOp lowering to xla_hlo PiperOrigin-RevId: 277598948 Change-Id: Ib80bbc4bc27795bc74dcd840b96c0fca09567cd5

view details

Robert David

commit sha a525e50ae6c0db74f6eee1b51cd4da4dff05f4a0

Change reinterpret_cast to static_cast when converting between void* and T* pointers. PiperOrigin-RevId: 277601066 Change-Id: Ia5c0e410fa7785246ded10529246058409a8a742

view details

Jian Li

commit sha f949b5a3e64cdc9cc451a7d405911f9a94346b7f

Use reference AveragePool kernel when the filter size is bigger than 16 * 16. PiperOrigin-RevId: 277602834 Change-Id: Iad00006ea31de143c9337918de2413b50249b3fe

view details

Jacques Pienaar

commit sha 778862fdebb99fc0e52e560a209ed702ea4a37a6

Exclude g3docs exported to website already PiperOrigin-RevId: 277603984 Change-Id: Ife0d742a8df16c1761fed944944fce8665499d3f

view details

A. Unique TensorFlower

commit sha 13760fd760993777321f1191175a83e3aa1fa66d

TF Pow lowering to HLO Pow with HloFunctionImporter work PiperOrigin-RevId: 277604178 Change-Id: I2c37ffe6239cf0f0ef206ec4a670f361d9d73087

view details

A. Unique TensorFlower

commit sha e505d6ab4c160b7a7f3248190a727d6559ea7784

updating performance benchmarks for tflite PiperOrigin-RevId: 277604213 Change-Id: I16079519fedbebaaf230608e1fabe6176ebd01af

view details

Jian Li

commit sha daf7bef491323006b7274bdb54587c6cf752811b

Move the calculation of scale outside the utility functions. This refactor is to support LSTM where the scales can be from non-tensor values. PiperOrigin-RevId: 277606316 Change-Id: I444f82a56229b70d5826ae81684b947f208d5cc2

view details

Andy Ly

commit sha 692ac0a77f65be48e4a6ea295342562493b031c9

Add pass that lowers single `tf_device.replicate` `tf_executor.island` ops. This pass creates individual islands for each replica of a `tf_device.replicate`. Devices on the replicate op are assigned to each respective replica inner op. PiperOrigin-RevId: 277608979 Change-Id: I5e4e679dda18cd66faf54529f8d49f84c527dae8

view details

A. Unique TensorFlower

commit sha 7b18cacc33087621c363e533b4a9daa4e6680e30

TF Sin lowering to xla_hlo Sin, with HloFunctionImporter PiperOrigin-RevId: 277612864 Change-Id: Ie3949af4e6c15c0863acceb40e7dfba6f7089150

view details

Andrew Audibert

commit sha dcee68688c6fcca71144a258ec80ad304b2d7785

Fix issue with statically cached window dataset types and shapes. Previously Dataset.window() would store its type/shape to static variables the first time they were accessed. If future window datasets have different types or shapes, we would hit an error along the lines of InvalidArgumentError: Number of components does not match: expected 1 types but got 4. PiperOrigin-RevId: 277615274 Change-Id: I407a5d9156b7367c880bcac2ed2e0f80b7740fce

view details

Parker Schuh

commit sha bbea50494ef629d50cfbc53687399a5279381f76

Add missing builder for PowOp. PiperOrigin-RevId: 277625541 Change-Id: Ib663fb84954854084d6b75d3edaf7c1ec6220238

view details

Lu Wang

commit sha 8514696cf70f5dddae405989b0e9e86cdaa6d197

Migrate the image classification reference app with the support library. PiperOrigin-RevId: 277627575 Change-Id: Id2edc6be1290c13105b91ab08a932657179c99ba

view details

Skye Wanderman-Milne

commit sha 7325f617ed06a483037c3c8b8712c6480444dffe

[XLA:Python] Allow D2D transfers to be enqueued on the src or dst stream. PiperOrigin-RevId: 277627820 Change-Id: Ie9bc02182ec686aee28501184450bee9de11da88

view details

Eugene Zhulenev

commit sha e0b101b88eae011d66d3b2784d5313523f3cd05c

[Grappler] Small optimization for HasRegularOutputs PiperOrigin-RevId: 277628086 Change-Id: Iee238a8de7749983c0ac44e604ed82730bb65997

view details

push time in 3 months

push event samikama/tensorflow

Sami

commit sha 05aa39e2fa9d5ff2bdcce61b90d70b653ae763f8

max->std::max

view details

push time in 4 months

push event samikama/tensorflow

Sami

commit sha 920b543dd0de83823efc5ad9dae5d5d4733ec79e

Mark GenerateBoxProposalsOp non-differentiable

view details

push time in 4 months

push event samikama/tensorflow

Sami

commit sha d62161283e443ed9f3e7cd73bd27f4cd3a24e2e3

Add RoIAlign Op

view details

push time in 4 months

push event samikama/tensorflow

Sami

commit sha 899b5e0770c8e5096df8db5a947423f2e38c7466

Mark attributes converted to inputs as host memory

view details

push time in 4 months

push event samikama/tensorflow

Ruoxin Sang

commit sha 5f47ceeaa050a4cd74c259783d4f9300037a7307

Avoid re-evaluating batch_size in BatchDataset. This will help the startup time performance in remote eager, as each evaluation is underlying calling EagerTensor.numpy(), which is a blocking point. PiperOrigin-RevId: 275493018 Change-Id: I08db359dbaf56fe361c3c13d7edbbb15f58d05be

view details

Nick Kreeger

commit sha df95d758d836fc6c259b15be72b0d43d7b4dc856

Fix micro SVDF unit test for hybrid-quantized scratch tensor dims. This patch fixes incorrect logic setup for hybrid-quant tests of SVDF for TF Micro. The previous logic incorrectly set the hybrid-quant scratch input tensor to have the wrong dimensions. This patch fixes that problem and cleans up variable names in the test.

view details

A. Unique TensorFlower

commit sha 07b5ca443887fd6d739ef6055e6cd3f2c02ba226

Moved Workgroup Selection generation to common, for use with other engines. PiperOrigin-RevId: 275498053 Change-Id: I6c446a39db909a51908f2edad93e1fc73c3dbbe5

view details

Frank Chen

commit sha 92ece5c37e22d47fbb5c7da4c8d0723999e7e288

Adds auto_shard_policy option to control how auto-sharding should work (AUTO, FILE, DATA). Auto-sharding can now be controlled via specifying whether it should shard by files, shard by data, or attempt to first shard by file and fallback on sharding by data (auto). PiperOrigin-RevId: 275502876 Change-Id: I68f3ebc74692baa78a109a4e807075f844403216

view details

Robert David

commit sha 4fb493d4c1bcb6783ef1fd0168a97087f4b556d0

Remove usings to avoid a symbol clash with TFLite types. PiperOrigin-RevId: 275504394 Change-Id: I387463293fe50cfc1538032a1c5359c7d409a80b

view details

Blake Hechtman

commit sha 02d2a2e0720bd326d9a920f9b89fdb2c983ae53a

[XLA] Speedup some special cases of handle convolution. PiperOrigin-RevId: 275506040 Change-Id: I27d8ccf78c003b902c46bead9998e3cc5a978984

view details

Robert David

commit sha 455998d5179a0df690f852c02da144d630c60a8f

Guard populating FP16 tensors with TFLITE_ENABLE_FP16_CPU_BENCHMARKS macro in benchmarks, to avoid missing __gnu_f2h_ieee symbol error in default builds. PiperOrigin-RevId: 275507507 Change-Id: I450e36eec749bb58b6a8843fff039c06e5813c01

view details

A. Unique TensorFlower

commit sha ba96c40cb452c68ba08eb30f4e34795f4774cb92

Log full status information when failing to load a saved model. PiperOrigin-RevId: 275507819 Change-Id: I232f32998d179c0a66963a5039ad1db7d21862f0

view details

A. Unique TensorFlower

commit sha ca3b71e5f529d6e6b093988036db72e80cbbc838

Update ops-related pbtxt files. PiperOrigin-RevId: 275510866 Change-Id: I867ef329664f463822581bce462e40e7b32c60e3

view details

A. Unique TensorFlower

commit sha d6e481667d1dae4c7ef01f2f5c880fa338294f1e

optimization on zeros_like_impl: instead of calling convert_to_tensor all the time, we first check whether the input is already a tensor; if so, there is no need to call it. Otherwise, call. This can help a lot on variable input. In benchmarks the performance is improved by ~30%. PiperOrigin-RevId: 275515727 Change-Id: I677a98406d1a4af9af4a5f80db0fcfd59afb79cd

view details

Haoyu Zhang

commit sha bcb615c42a6215037ed7f1c81316bc0960e76269

Reduce verbosity of rendezvous initialize logging. PiperOrigin-RevId: 275516202 Change-Id: I0ea0620332e6b1e059abd65bc7fbdb978e0a928f

view details

Brian Zhao

commit sha 7e688a7e77518e5d7950c10de8f1f143623bcb57

Wiring tensorflow/core/lib/hash into tensorflow/core/BUILD. PiperOrigin-RevId: 275517262 Change-Id: I4405cc4eb93f9899a96559aa9fb9c103de5a7fb0

view details

Pete Warden

commit sha 8b7ae8a92cced7e80912e2252719a565216c530a

Fix for Mbed CI Build PiperOrigin-RevId: 275518680 Change-Id: Ic7e709f7903cdf1afca9b0a8d0bea881ed61d91b

view details

Brian Zhao

commit sha 6f77ca3849a833e1dcf764619369dc7c3e165d5c

Adding remaining non-test targets to tensorflow/core/gtl/BUILD, and wiring them into tensorflow/core/BUILD. PiperOrigin-RevId: 275523153 Change-Id: Ib0dd74c3d216a58577ab6c661dcb3b17304b9648

view details

A. Unique TensorFlower

commit sha ba71f0ab613e8f9a7e2912b331626d4bf77ef198

Make TFLite GPU link against native GLES in Chrome OS. PiperOrigin-RevId: 275523523 Change-Id: I7cadc58ae3084a0fef06afee5bac3cb544650d51

view details

Yanhua Sun

commit sha e682dba0061a195505cdd10a5337d1b214804b2a

minor change to update BUILD PiperOrigin-RevId: 275527036 Change-Id: I3fdbe661269f4db8fe7e7f31b161ac659f2305c6

view details

Yanhua Sun

commit sha 47a05381a11ccbe7a2bdb3d6afc7b684afcba764

Add a comment and example to show how to run a subset of eager microbenchmarks PiperOrigin-RevId: 275530589 Change-Id: I90199233a35e3d824f7471b80d296df927fb0c0d

view details

Yanhua Sun

commit sha 78db1466a77f34e0c62ceab81ad4a3fd598763a2

Fewer iterations of warmup calls in benchmarks Some benchmarks can be slow, warming up for too many iterations can lead the benchmark to run for too long. PiperOrigin-RevId: 275531902 Change-Id: I1cae118d0432ce6ef7199abd9593a5bbbf1f6f75

view details

Anna R

commit sha a450c64d09e9a034e00409e38f15f39be0c829c6

Update renames_v2.py file and fix a few renaming inconsistencies. PiperOrigin-RevId: 275535416 Change-Id: I1261abe2b97aedc786a35236c5af422b3f4544d3

view details

Haoyu Zhang

commit sha 479131ef80cb852af110dd7db4376e5b76c8b2c2

Fix a bug where it sets context_id_ to 0 when closing any remote contexts. The context_id_ should only be set to an invalid value (0) if it's closing and clearing all remote contexts. PiperOrigin-RevId: 275535526 Change-Id: I069e4724c64ef0f3ca1782cf8b94bcf98e14a36a

view details

push time in 4 months

pull request comment tensorflow/tensorflow

Generate box proposals op

Hi @alextp, I don't intend to move this to tf-addons, especially not after more than 6 months of going back and forth on this PR.

samikama

comment created time in 4 months

push event samikama/tensorflow

Sami

commit sha dfc8a1e7f366becf7e988d008e0743424741bb18

Replaced CHECK_OK calls with context returns even though most of the failures were already FATAL.

view details

push time in 4 months

Pull request review comment tensorflow/tensorflow

Generate box proposals op

[Quoted review diff: a new GPU kernel file for the GenerateBoundingBoxProposals op, adding the GeneratePreNMSUprightBoxesKernel and WriteUprightBoxesOutput CUDA kernels plus temp-tensor allocation helpers. The flagged lines in the op's constructor are:]

    OP_REQUIRES_OK(context, context->GetAttr("post_nms_topn", &post_nms_topn_));
    CHECK_GT(post_nms_topn_, 0);

That is strange, since I can see commits as recent as last month that have the same CHECK_* ops in the trunk. Are you sure it is blocking the merge? If so, how did some of the links below pass these tests? Pretty much all of them were created after this PR was opened. Perhaps there should be a PR requirements document that everybody adheres to, since code written in the style of what is already in TF or in recent PRs is being marked as a blocker. Could somebody either fix these in TF so that other contributors can't use them as a reference, or create a requirements document for everybody to follow? That would prevent confusion, ease non-Google contributors' lives, and improve their enthusiasm.

Thanks,

https://github.com/tensorflow/tensorflow/blob/6c32a22396b7ac9f9b9d77b147113281b689b345/tensorflow/stream_executor/rocm/rocm_gpu_executor.cc#L421 https://github.com/tensorflow/tensorflow/blob/6c32a22396b7ac9f9b9d77b147113281b689b345/tensorflow/stream_executor/host/host_gpu_executor.cc#L45 https://github.com/tensorflow/tensorflow/blob/6c32a22396b7ac9f9b9d77b147113281b689b345/tensorflow/stream_executor/gpu/redzone_allocator.cc#L275 https://github.com/tensorflow/tensorflow/blob/6c32a22396b7ac9f9b9d77b147113281b689b345/tensorflow/python/lib/core/py_func.cc#L193 https://github.com/tensorflow/tensorflow/blob/6c32a22396b7ac9f9b9d77b147113281b689b345/tensorflow/python/eager/pywrap_tensor.cc#L121

samikama

comment created time in 4 months

Pull request review commenttensorflow/tensorflow

Generate box proposals op

+/* Copyright 2019 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+// An example Op.
+
+#if GOOGLE_CUDA
+#define EIGEN_USE_GPU
+
+#include <algorithm>
+#include <vector>
+#include "tensorflow/core/kernels/non_max_suppression_op.h"
+#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
+
+#include "tensorflow/core/framework/numeric_types.h"
+#include "tensorflow/core/framework/op_kernel.h"
+#include "tensorflow/core/framework/tensor_types.h"
+#include "tensorflow/core/lib/core/errors.h"
+#include "tensorflow/core/platform/logging.h"
+#include "tensorflow/core/platform/stream_executor.h"
+#include "tensorflow/core/platform/types.h"
+#include "tensorflow/core/util/gpu_kernel_helper.h"
+#include "tensorflow/core/util/gpu_launch_config.h"
+#include "third_party/cub/device/device_radix_sort.cuh"
+#include "third_party/cub/device/device_segmented_radix_sort.cuh"
+#include "third_party/cub/device/device_select.cuh"
+
+namespace tensorflow {
+typedef Eigen::GpuDevice GPUDevice;
+#define TF_RETURN_IF_CUDA_ERROR(result)                   \
+  do {                                                    \
+    cudaError_t error(result);                            \
+    if (!SE_PREDICT_TRUE(error == cudaSuccess)) {         \
+      return errors::Internal("Cuda call failed with ",   \
+                              cudaGetErrorString(error)); \
+    }                                                     \
+  } while (0)
+
+#define TF_OP_REQUIRES_CUDA_SUCCESS(context, result)                   \
+  do {                                                                 \
+    cudaError_t error(result);                                         \
+    if (!SE_PREDICT_TRUE(error == cudaSuccess)) {                      \
+      context->SetStatus(errors::Internal("Cuda call failed with",     \
+                                          cudaGetErrorString(error))); \
+      return;                                                          \
+    }                                                                  \
+  } while (0)
+
+namespace {
+
+// Decode d_bbox_deltas with respect to anchors into absolute coordinates,
+// clipping if necessary.
+// prenms_nboxes maximum number of boxes per image to decode.
+// d_boxes_keep_flags mask for boxes to consider in NMS.
+// min_size is the lower bound of the shortest edge for the boxes to consider.
+// bbox_xform_clip is the upper bound of encoded width and height.
+__global__ void GeneratePreNMSUprightBoxesKernel(
+    const Cuda2DLaunchConfig config, const int* d_sorted_scores_keys,
+    const float4* d_bbox_deltas, const float4* d_anchors, const int height,
+    const int width, const int num_anchors, const float min_size,
+    const float* d_img_info_vec,  // Input "image_info" to the op [N,5]
+    const float bbox_xform_clip, float4* d_out_boxes,
+    const int prenms_nboxes,  // leading dimension of out_boxes
+    char* d_boxes_keep_flags) {
+  // constants to calculate offsets in to the input and output arrays.
+  const int anchor_stride = height * width;              // Stride of Anchor
+  const int height_stride = width * num_anchors;         // Stride of height
+  const int image_stride = anchor_stride * num_anchors;  // Stride of image
+  CUDA_AXIS_KERNEL_LOOP(image_index, config.virtual_thread_count.y, Y) {
+    CUDA_AXIS_KERNEL_LOOP(ibox, config.virtual_thread_count.x, X) {
+      // box_conv_index : # of the same box, but indexed in the
+      // scores from the conv layer, of shape (height,width,num_anchors) the
+      // num_images dimension was already removed box_conv_index =
+      // a*image_stride + h*width
+      // + w
+      const int box_conv_index =
+          d_sorted_scores_keys[image_index * image_stride + ibox];
+
+      // We want to decompose box_conv_index in (h,w,a)
+      // such as box_conv_index = h*width*num_anchors + width*num_anchors + a
+      // (avoiding modulos in the process)
+      int remaining = box_conv_index;
+      const int delta_height = height_stride;  // stride of height
+      const int h = remaining / delta_height;
+      remaining -= h * delta_height;
+      const int delta_width = num_anchors;  // stride of width
+      const int w = remaining / delta_width;
+      remaining -= w * delta_width;
+      // Loading the anchor a
+      // float4 is a struct with float x,y,z,w
+      const float4 anchor = d_anchors[box_conv_index];
+      // x1,y1,x2,y2 :coordinates of anchor a, shifted for position (h,w)
+      float x1 = anchor.y;
+      float x2 = anchor.w;
+      float y1 = anchor.x;
+      float y2 = anchor.z;
+
+      // TODO use fast math when possible
+
+      // Deltas of shape (N,height,width,num_anchors x 4)
+      int deltas_idx = box_conv_index + image_index * image_stride;
+      float4 deltas = d_bbox_deltas[deltas_idx];
+      float dx = deltas.y;
+      float dy = deltas.x;
+      float dw = deltas.w;
+      float dh = deltas.z;
+      // Upper bound on dw,dh
+      dw = fmin(dw, bbox_xform_clip);
+      dh = fmin(dh, bbox_xform_clip);
+
+      // Applying the deltas
+      float width = x2 - x1;
+      const float ctr_x = x1 + 0.5f * width;
+      const float pred_ctr_x = ctr_x + width * dx;  // TODO fuse madd
+      const float pred_w = width * expf(dw);
+      x1 = pred_ctr_x - 0.5f * pred_w;
+      x2 = pred_ctr_x + 0.5f * pred_w;
+
+      float height = y2 - y1;
+      const float ctr_y = y1 + 0.5f * height;
+      const float pred_ctr_y = ctr_y + height * dy;
+      const float pred_h = height * expf(dh);
+      y1 = pred_ctr_y - 0.5f * pred_h;
+      y2 = pred_ctr_y + 0.5f * pred_h;
+
+      // Clipping box to image
+      const float img_height = d_img_info_vec[5 * image_index + 0];
+      const float img_width = d_img_info_vec[5 * image_index + 1];
+      const float min_size_scaled =
+          min_size * d_img_info_vec[5 * image_index + 2];
+      x1 = fmax(fmin(x1, img_width), 0.0f);
+      y1 = fmax(fmin(y1, img_height), 0.0f);
+      x2 = fmax(fmin(x2, img_width), 0.0f);
+      y2 = fmax(fmin(y2, img_height), 0.0f);
+
+      // Filter boxes
+      // Removing boxes with one dim < min_size
+      // (center of box is in image, because of previous step)
+      width = x2 - x1;  // may have changed
+      height = y2 - y1;
+      bool keep_box = fmin(width, height) >= min_size_scaled;
+
+      // We are not deleting the box right now even if !keep_box
+      // we want to keep the relative order of the elements stable
+      // we'll do it in such a way later
+      // d_boxes_keep_flags size: (num_images,prenms_nboxes)
+      // d_out_boxes size: (num_images,prenms_nboxes)
+      const int out_index = image_index * prenms_nboxes + ibox;
+
+      d_boxes_keep_flags[out_index] = keep_box;
+      d_out_boxes[out_index] = {x1, y1, x2, y2};
+    }
+  }
+}
+
+// Copy the selected boxes and scores to output tensors.
+//
+__global__ void WriteUprightBoxesOutput(
+    const CudaLaunchConfig nboxes, const float4* d_image_boxes,
+    const float* d_image_scores, const int* d_image_boxes_keep_list,
+    const int n_rois, float* d_image_out_rois, float* d_image_out_rois_probs) {
+  CUDA_1D_KERNEL_LOOP(i, nboxes.virtual_thread_count) {
+    if (i < n_rois) {  // copy rois to output
+      const int ibox = d_image_boxes_keep_list[i];
+      const float4 box = d_image_boxes[ibox];
+      const float score = d_image_scores[ibox];
+      // Scattered memory accesses
+      // postnms_nboxes is small anyway
+      d_image_out_rois_probs[i] = score;
+      const int base_idx = 4 * i;
+      d_image_out_rois[base_idx + 0] = box.y;
+      d_image_out_rois[base_idx + 1] = box.x;
+      d_image_out_rois[base_idx + 2] = box.w;
+      d_image_out_rois[base_idx + 3] = box.z;
+    } else {  // set trailing entries to 0
+      d_image_out_rois_probs[i] = 0.;
+      const int base_idx = 4 * i;
+      d_image_out_rois[base_idx + 0] = 0.;
+      d_image_out_rois[base_idx + 1] = 0.;
+      d_image_out_rois[base_idx + 2] = 0.;
+      d_image_out_rois[base_idx + 3] = 0.;
+    }
+  }
+}
+
+template <typename T>
+void ResetTensor(Tensor* t, const Eigen::GpuDevice& d) {
+  CudaLaunchConfig zconfig = GetCudaLaunchConfig(t->NumElements(), d);
+  TF_CHECK_OK(GpuLaunchKernel(
+      SetZero<T>, zconfig.block_count, zconfig.thread_per_block, 0, d.stream(),
+      zconfig.virtual_thread_count, (*t).flat<T>().data()));
+}
+// Allocate scratch spaces that are needed for operation
+//
+
+Status AllocateGenerationTempTensors(
+    OpKernelContext* context, Tensor* d_conv_layer_indexes,
+    Tensor* d_image_offset, Tensor* d_cub_temp_buffer,
+    Tensor* d_sorted_conv_layer_indexes,
+    Tensor* d_sorted_scores, Tensor* dev_boxes, Tensor* dev_boxes_keep_flags,
+    int num_images, int conv_layer_nboxes, size_t cub_temp_storage_bytes,
+    int num_boxes_to_generate,
+    int box_dim) {
+  auto d = context->eigen_gpu_device();
+  TF_RETURN_IF_ERROR(context->allocate_temp(
+      DataType::DT_INT32, TensorShape({num_images, conv_layer_nboxes}),
+      d_conv_layer_indexes));
+  ResetTensor<int>(d_conv_layer_indexes, d);
+  TF_RETURN_IF_ERROR(context->allocate_temp(
+      DataType::DT_INT32, TensorShape({num_images + 1}), d_image_offset));
+  ResetTensor<int>(d_image_offset, d);
+  TF_RETURN_IF_ERROR(context->allocate_temp(
+      DataType::DT_INT8, TensorShape({(int64)cub_temp_storage_bytes}),
+      d_cub_temp_buffer));
+  TF_RETURN_IF_ERROR(context->allocate_temp(
+      DataType::DT_INT32, TensorShape({num_images, conv_layer_nboxes}),
+      d_sorted_conv_layer_indexes));
+  ResetTensor<int32>(d_sorted_conv_layer_indexes, d);
+  TF_RETURN_IF_ERROR(context->allocate_temp(
+      DataType::DT_FLOAT, TensorShape({num_images, conv_layer_nboxes}),
+      d_sorted_scores));
+  ResetTensor<float>(d_sorted_scores, d);
+  TF_RETURN_IF_ERROR(context->allocate_temp(
+      DataType::DT_FLOAT,
+      TensorShape({num_images, box_dim * num_boxes_to_generate}), dev_boxes));
+  ResetTensor<float>(dev_boxes, d);
+  TF_RETURN_IF_ERROR(context->allocate_temp(
+      DataType::DT_INT8, TensorShape({num_images, num_boxes_to_generate}),
+      dev_boxes_keep_flags));
+  ResetTensor<int8>(dev_boxes_keep_flags, d);
+  return Status::OK();
+}
+
+// Allocate workspace for NMS operation
+Status AllocatePreNMSTempTensors(
+    OpKernelContext* context, Tensor* dev_image_prenms_boxes,
+    Tensor* dev_image_prenms_scores, Tensor* dev_image_boxes_keep_list,
+    Tensor* dev_postnms_rois, Tensor* dev_postnms_rois_probs,
+    Tensor* dev_prenms_nboxes, int num_images, int num_boxes_to_generate,
+    int box_dim, int post_nms_topn, int pre_nms_topn) {
+  auto d = context->eigen_gpu_device();
+  TF_RETURN_IF_ERROR(context->allocate_temp(
+      DataType::DT_FLOAT, TensorShape({box_dim * num_boxes_to_generate}),
+      dev_image_prenms_boxes));
+  ResetTensor<float>(dev_image_prenms_boxes, d);
+
+  TF_RETURN_IF_ERROR(context->allocate_temp(
+      DataType::DT_FLOAT, TensorShape({num_boxes_to_generate}),
+      dev_image_prenms_scores));
+  ResetTensor<float>(dev_image_prenms_scores, d);
+
+  TF_RETURN_IF_ERROR(context->allocate_temp(
+      DataType::DT_INT32, TensorShape({num_boxes_to_generate}),
+      dev_image_boxes_keep_list));
+  ResetTensor<int32>(dev_image_boxes_keep_list, d);
+
+  const int max_postnms_nboxes = std::min(num_boxes_to_generate, post_nms_topn);
+  TF_RETURN_IF_ERROR(context->allocate_temp(
+      DataType::DT_FLOAT,
+      TensorShape({box_dim * num_images * max_postnms_nboxes}),
+      dev_postnms_rois));
+  ResetTensor<float>(dev_postnms_rois, d);
+
+  TF_RETURN_IF_ERROR(context->allocate_temp(
+      DataType::DT_FLOAT, TensorShape({num_images * max_postnms_nboxes}),
+      dev_postnms_rois_probs));
+  ResetTensor<float>(dev_postnms_rois_probs, d);
+
+  TF_RETURN_IF_ERROR(context->allocate_temp(
+      DataType::DT_INT32, TensorShape({num_images}), dev_prenms_nboxes));
+  ResetTensor<int32>(dev_prenms_nboxes, d);
+
+  return Status::OK();
+}
+
+// Initialize index and offset arrays.
+// num_images is the batch size.
+__global__ void InitializeDataKernel(const Cuda2DLaunchConfig config,
+                                     int* d_image_offsets,
+                                     int* d_boxes_keys_iota) {
+  const int image_size = config.virtual_thread_count.x;
+  const int num_images = config.virtual_thread_count.y;
+  CUDA_AXIS_KERNEL_LOOP(img_idx, config.virtual_thread_count.y, Y) {
+    CUDA_AXIS_KERNEL_LOOP(box_idx, config.virtual_thread_count.x, X) {
+      d_boxes_keys_iota[img_idx * image_size + box_idx] = box_idx;
+
+      // One 1D line sets the 1D data
+      if (box_idx == 0) {
+        d_image_offsets[img_idx] = image_size * img_idx;
+        // One thread sets the last+1 offset
+        if (img_idx == 0) d_image_offsets[num_images] = image_size * num_images;
+      }
+    }
+  }
+}
+
+}  // namespace
+
+class GenerateBoundingBoxProposals : public tensorflow::OpKernel {
+ public:
+  explicit GenerateBoundingBoxProposals(
+      tensorflow::OpKernelConstruction* context)
+      : OpKernel(context) {
+    OP_REQUIRES_OK(context, context->GetAttr("post_nms_topn", &post_nms_topn_));
+    CHECK_GT(post_nms_topn_, 0);

@aaroey Is this a new requirement? I still see a lot of references to CHECKs in the TF code base. Can you be more specific about which tests it is breaking?
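
For context, a minimal sketch of the distinction under discussion (the class name ExampleProposalsOp and the exact error message are illustrative assumptions, not quotes of the merged code): CHECK_GT aborts the entire process when the condition fails, while OP_REQUIRES records an error status on the kernel context and returns, letting the runtime surface the failure gracefully.

#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/lib/core/errors.h"

namespace tensorflow {

// Hypothetical kernel skeleton showing the recoverable-validation pattern.
class ExampleProposalsOp : public OpKernel {
 public:
  explicit ExampleProposalsOp(OpKernelConstruction* context)
      : OpKernel(context) {
    OP_REQUIRES_OK(context, context->GetAttr("post_nms_topn", &post_nms_topn_));
    // Instead of CHECK_GT(post_nms_topn_, 0), which would abort the process:
    OP_REQUIRES(context, post_nms_topn_ > 0,
                errors::InvalidArgument("post_nms_topn must be positive, got ",
                                        post_nms_topn_));
  }

 private:
  int post_nms_topn_;
};

}  // namespace tensorflow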

samikama

comment created time in 4 months

Pull request review comment tensorflow/tensorflow

Generate box proposals op

 tf_module {
     name: "gather_nd"
     argspec: "args=[\'params\', \'indices\', \'name\', \'batch_dims\'], varargs=None, keywords=None, defaults=[\'None\', \'0\'], "
   }
+  member_method {
+    name: "generate_bounding_box_proposals"

Done

samikama

comment created time in 4 months

push event samikama/tensorflow

Sami

commit sha 4dc3be29621bac333cb4a46135aef0a46ec5d5f0

Fix buildifier issue

view details

push time in 4 months

pull request comment tensorflow/tensorflow

Generate box proposals op

@aaroey @pkulzc This should be good to go now.

samikama

comment created time in 4 months

push event samikama/tensorflow

Sami

commit sha d411e31e70bf666170bec818aadbbfc9b2f5c2ae

Fix review comments and remove obsolete flag that is removed by earlier review modifications

view details

push time in 4 months

create branch samikama/tensorflow

branch : JoCUpdate

created branch time in 4 months

pull request comment tensorflow/tensorflow

Generate box proposals op

Hi, yes, this is still being worked on. We are waiting for some internal updates.

samikama

comment created time in 5 months

fork samikama/kamerka

Build interactive map of cameras from Shodan

fork in 5 months

pull request comment tensorflow/tensorflow

Nms stability

@aaroey I force-pushed a fix for the CPU op bug that was blocking these GPU ops.

samikama

comment created time in 5 months

push event samikama/tensorflow

Nathan Wells

commit sha 4380f1c762850a2e9557752c934bcdb5207ab3b7

Updated speech_commands example to work with TensorFlow 2.0+

view details

frreiss

commit sha b40f25eefa82ed83c03378e17748acfe1f8a8d13

Add API docs for LMDBDataset Add API docs for LMDBDataset Additional corrections to API docs Fix missing curly brace

view details

Yuxin Wu

commit sha bb7509d4af3b8412dd5da2d01dc0ecbcecd5e5e0

Register flops for BatchMatMulV2 fix #22071

view details

Yongfeng Gu

commit sha eb3d6d3d88958772dfea7563fd62ff80d2d02ee3

Phase 3 of XLA int8 convolution on CUDA 1. Allow convolution with integer input/kernel and float output/bias/scaling/side and disallow int8-to-int8 convolution node in XLA. 2. Add a new traversal to cudnn_fused_convolution_rewriter to fuse clamping and data type conversion nodes with convolution custom-call node for int8 convolution. 3. Set convolution layout constraints to NHWC for integer convolution.

view details

Yongfeng Gu

commit sha cd63197f6190176f45dad47b8b63ff0b625c0b50

Fix a bug in the previous commit by returning Unimplemented.

view details

Yongfeng Gu

commit sha 403f811fdd64b4d9d884cd3f7c2cbec36fb256c8

Pass error status to the caller from RunOnInstruction().

view details

Yongfeng Gu

commit sha 793dd9b351f920f978d87b967a0c20509f61f1f9

Use pattern matching to identify and map integer convolutions to CuDNN. See comments in cudnn_conv_rewriter.h and cudnn_fused_conv_rewriter.h for details.

view details

Yongfeng Gu

commit sha 89ba54f5cd734a247ed42cc686a309b739fd598e

1. Update int8-to-float convolution patterns by removing the clamp. 2. Remove tracked_custom_calls.

view details

Yongfeng Gu

commit sha 7c87d1b019c594ae3ff99fd283eeb6c008f9293c

1. Update function names for those rewrite functions that always succeed. 2. Remove a unused variable.

view details

Yongfeng Gu

commit sha 6e2c5f61d659e0a16d5b7f6e3eec8019f6dad061

Move changes from Phase 4 of XLA int8 convolution on CUDA to this PR.

view details

Trent Lo

commit sha a82ec8a0726adf2a180dc6bb746edc01b738644b

Disable the Grappler memory opt when GlobalJitLevel is on.

view details

Trent Lo

commit sha cb3f976f5fe5886857c266f848977e2b5ff2e3da

Check GlobalJitLevel only for DEFAULT_MEM_OPT. This relaxes the disable check and should be a slightly better behavior, as users still have some ways to enable the memory optimizer when they want to.

view details

Frederic Bastien

commit sha ad0062e0d38f357e11f850e3de846e849ee8929c

Add an utility function HasOverlappinWindow.

view details

frreiss

commit sha 1c57c863e851f8c7712c92639e2a8ebe7a0c2a1c

Document ParallelInterleaveDatasetV2 op Document ParallelInterleaveDatasetV2 op Fix missing curly brace

view details

Frederic Bastien

commit sha 1a7a495f527a0709a2ebe858323ed93748a66dfe

Fix review comment

view details

Frederic Bastien

commit sha f53913a9abcee04d0747967ca9cd88efb64168be

Fix review comment

view details

frreiss

commit sha a34c7481e9c84a0da5b6361a2422384727580aa8

Removed mentions of 'sloppy' attribute from doc

view details

Frederic Bastien

commit sha df5e0c2d75ed7213e7cb74507063cda4fd14d50a

Fix buildifier

view details

frreiss

commit sha 9dbd330c9e0cde7839c00fe30db2cf637202beb4

Improve documentation of deterministic flag. Add note that deterministic flag is True by default More-complete explanation

view details

Trent Lo

commit sha 28d774bde1348d73f01d0de9a84ec3f06ae82b59

Add a XlaConfigProxy class. Callbacks can be registered to this class so that runtime environment flags can be parsed to change configs in the Tensorflow core. A primary use of this is for the Tensorflow core to query the XLA JIT level, which can be configured by some runtime environment flags in addition to ConfigProto.

view details

push time in 5 months

push event samikama/tensorflow

Sami

commit sha 4c09e4bbc832c92179ecd928debecfb6a4d51cf7

Fix allocation Size

view details

push time in 5 months

create branch samikama/tensorflow

branch : NewNMS

created branch time in 5 months

push event samikama/mask-rcnn-tensorflow

Sami

commit sha a3e7d26f57b601fbc6cc263e576ea904b406ad70

Add patch file remove Nsight deb reference

view details

push time in 5 months

create branch samikama/mask-rcnn-tensorflow

branch : DLProf

created branch time in 5 months

fork samikama/mask-rcnn-tensorflow

Fork of Tensorpack to make breaking performance improvements to the Mask RCNN example. Training is approximately 2x faster than the original implementation on AWS.

fork in 5 months

pull request comment tensorflow/tensorflow

[r1.15Cherrypick]: Cherry-pick #30893 to r1.15

@mihaimaruseac I didn't create the cherry-pick; @ppwwyyxx did. I expected you would verify that it was a cherry-pick before merging.

ppwwyyxx

comment created time in 5 months

push event samikama/tensorflow

Sami

commit sha eb6ba10c6cfd1d117e63170b3a3c3eb5bfeb1c7a

Address review comments

view details

push time in 5 months

pull request comment tensorflow/tensorflow

[r1.15Cherrypick]: Cherry-pick #30893 to r1.15

@mihaimaruseac I would have preferred that the original commits from #30893 had been kept. After all, you already had them in master, and cherry-picking would have been better than creating a new commit with a different author.

ppwwyyxx

comment created time in 5 months

PR opened tensorflow/tensorflow

Reviewers
Nms stability

The current CPU implementation uses an unstable sort, which can lead to significant differences in the results. This PR adds an ordering criterion for boxes with identical scores, stabilizing the algorithm. The GPU selection criterion has also been changed from greater-than to greater-or-equal to match the CPU implementation.
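
A minimal standalone sketch of the stabilization idea (plain C++, not the PR's actual kernel code; the function name is illustrative): when two boxes have the same score, break the tie by box index so every run, and every device, orders the candidates identically.

#include <algorithm>
#include <numeric>
#include <vector>

// Returns box indices ordered by descending score; equal scores are ordered
// by ascending index, which makes the sort result deterministic.
std::vector<int> SortBoxIndicesByScore(const std::vector<float>& scores) {
  std::vector<int> indices(scores.size());
  std::iota(indices.begin(), indices.end(), 0);
  std::sort(indices.begin(), indices.end(), [&scores](int a, int b) {
    if (scores[a] == scores[b]) return a < b;  // tie-break keeps order stable
    return scores[a] > scores[b];              // otherwise higher score first
  });
  return indices;
}

With a deterministic ordering in place, changing the GPU suppression test from greater-than to greater-or-equal makes both implementations keep the same box out of a tied pair.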

+181 -20

0 comment

3 changed files

pr created time in 5 months

create branch samikama/tensorflow

branch : NMSStability

created branch time in 5 months

push event samikama/tensorflow

vcarpani

commit sha 8302a824dd95e6ea4e291b15924d37dc892afa85

Substitute array_ops.where with array_ops.whare_v2.

view details

jim.meyer

commit sha 129c2c08690759d9d4ad0093c5b9f15e86330073

Improved the docs for tensorflow.keras.callbacks.ModelCheckpoint's filepath argument

view details

vcarpani

commit sha 63247cad5300c4eae85cb04d7bd2f83b5fa1d89d

Merge branch 'master' into math-grad-test-deprecated

view details

Ivan Habernal

commit sha 1eea021bda150e9ea7c3945d82fb198c08939f76

Fixing warning from array_ops.where()

view details

Koan-Sin Tan

commit sha 5dcec0bb67a51d2c582d3b0c6e49b97438ea054f

[tflite] fix allow fp16 in_tflite benchmark model interpreter_->SetAllowFp16PrecisionForFp32() should be called before setting delegate

view details

Ivan Habernal

commit sha 9524480eac0a3c0d5a825a23dfd750423a8cfed9

Fixing pylint failures

view details

Zaccharie Ramzi

commit sha c3d766b6aa1f4afd491abd454f1e93d1fe19ad92

made fftshift and ifftshit accept tensors with shape not specified in more cases

view details

Jeff Daily

commit sha a1dd689ecb9d3eab4b4f072dbfe8eff20173fb88

move nccl stream to member of StreamGroup This allows the compute and nccl stream to be force-initialized as immediate siblings which is necessary for ROCm performance.

view details

habernal

commit sha 1d48bbf52cc672e9badc5b78580753308c0a2d29

Merge branch 'master' of github.com:tensorflow/tensorflow

view details

vcarpani

commit sha 1da5b8e09d24b40d849347464878a3c02921ce0f

Merge branch 'master' into math-grad-test-deprecated

view details

Wen-Heng (Jack) Chung

commit sha 58b78b318fefd2c4bfa280d8bf75607f09326c4e

[ROCm] enable Xlogy op on ROCm.

view details

Ivan Habernal

commit sha b969af7a2482ff80a77f2f3afe370f8f40870841

Adding example

view details

Jeff Daily

commit sha ac3fa819c2d56d285a80904438f08aaadae2556f

Merge branch 'master' into google_upstream_nccl_stream_early

view details

Ivan Habernal

commit sha d0c9230d7b87cbee5a165cf68c568a774627c515

Merge branch 'master' of github.com:tensorflow/tensorflow

view details

Wen-Heng (Jack) Chung

commit sha 3188e95fe06dc35f720e568a6f25803edacc99f9

[ROCm] enable nextafter op on ROCm.

view details

Wen-Heng (Jack) Chung

commit sha d03dd82231ce3e7a7763b01a4a87de19a164a5fb

Update cwise_op_gpu_xlogy.cu.cc

view details

candy.dc

commit sha 2a72e9993018ad0dc7c50c99e280742539ec6cea

Fix: slot and primary can be different shape

view details

vcarpani

commit sha 15edd58ece4eb83f42c2b2a1a58523eee7d5b6b2

Solve build failures.

view details

Anton Kachatkou

commit sha fb9abc66e44b01b3c678c6f93766fad9b9ef591c

Bug fix for Tensorflow Lite quantized ops tests Use division instead of multiplication when applying the quantisation scale in PerChannelQuantizeBias()

view details

Siddhartha Bagaria

commit sha c3cbf5c0415886747e39ba21047b5486d36d5d71

Go: allow larger C array backed slices on 64 bit machines

view details

push time in 5 months

issue comment tensorflow/tensorflow

Bug in NonMaxSuppressionV3 GPU op added by PR #30893

@aaroey Can you add a bit more detail to help the investigation on our side as well? Does it lead to a crash, fewer boxes, more boxes, or a different set of boxes than expected?

aaroey

comment created time in 6 months

pull request comment tensorflow/tensorflow

Generate box proposals op

@rthadur please wait until I make another commit. During the review process there were some API changes in the consumers, and in the current state it would require unnecessary operations to wire it up. I will reformat the data layouts to match the consumers.

samikama

comment created time in 6 months

PR opened tensorflow/tensorflow

Adding NMSv4 GPU implementation

This PR adds a GPU implementation of NonMaxSuppressionOpV4.
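
For context, a hedged sketch of the kernel-registration side of such a change (NonMaxSuppressionV4GPUOp is a placeholder class name, and the exact HostMemory arguments are assumptions; the real PR may differ): scalar inputs such as the thresholds are pinned to host memory so the op can read them without a device round-trip.

// Sketch only: registering a GPU kernel for the existing NonMaxSuppressionV4
// op. The kernel class definition is omitted here.
REGISTER_KERNEL_BUILDER(Name("NonMaxSuppressionV4")
                            .TypeConstraint<float>("T")
                            .Device(DEVICE_GPU)
                            .HostMemory("iou_threshold")
                            .HostMemory("score_threshold")
                            .HostMemory("max_output_size"),
                        NonMaxSuppressionV4GPUOp);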

+176 -14

0 comment

2 changed files

pr created time in 6 months

create branch samikama/tensorflow

branch : NMSv4

created branch time in 6 months

push event samikama/tensorflow

Sami

commit sha 178ed66d97ae36e7a0cf8d9d4e9626699f193bfb

Fix numeric_limits::min()->lowest() and clang-format

view details

push time in 6 months

push event samikama/tensorflow

Sami

commit sha 736eba374e2976d1e8bd415dd13a0d547ab0c8c9

Addressing review comments

view details

push time in 6 months

Pull request review comment tensorflow/tensorflow

Add NMSv3 GPU op

 limitations under the License.
 #if GOOGLE_CUDA
 #define EIGEN_USE_GPU
 #include "absl/strings/str_cat.h"
-#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
-#include "third_party/cub/device/device_radix_sort.cuh"
-#include "third_party/cub/device/device_segmented_radix_sort.cuh"
-#include "third_party/cub/device/device_select.cuh"
 #include "tensorflow/core/framework/numeric_types.h"
 #include "tensorflow/core/framework/op_kernel.h"
 #include "tensorflow/core/framework/tensor_types.h"
 #include "tensorflow/core/kernels/non_max_suppression_op.h"
 #include "tensorflow/core/util/gpu_kernel_helper.h"
 #include "tensorflow/core/util/gpu_launch_config.h"
 #include "tensorflow/stream_executor/stream_executor.h"
+#include "third_party/cub/device/device_radix_sort.cuh"

Hi @aaroey, these were changed by clang-format. If I revert them, the clang-format check will complain.

samikama

comment created time in 6 months
