Igor Ganichev iganichev @Google Mountain View, CA

google/cog 34

Code for the COG dataset and network

iganichev/TensorFlow-Examples 0

TensorFlow Tutorial and Examples for beginners

issue comment tensorflow/tensorflow

Tensorflow predict call crashes when loading a model with gevent enabled

Great. I will make the patch and send it out internally (a little easier). It should make it to GitHub tomorrow night or a bit later.

Intellicode

comment created time in a month

issue comment tensorflow/tensorflow

Tensorflow predict call crashes when loading a model with gevent enabled

WeakKeyDictionary is definitely needed. There are a couple of issues here and I can submit a fix. Can somebody quickly test if something like this would work with gevent?

diff --git a/google3/third_party/tensorflow/python/keras/backend.py b/google3/third_party/tensorflow/python/keras/backend.py
--- a/google3/third_party/tensorflow/python/keras/backend.py
+++ b/google3/third_party/tensorflow/python/keras/backend.py
@@ -110,7 +110,14 @@ py_any = any
 # _DUMMY_EAGER_GRAPH is used as a key in _GRAPH_LEARNING_PHASES.
 # We keep a separate reference to it to make sure it does not get removed from
 # _GRAPH_LEARNING_PHASES.
-_DUMMY_EAGER_GRAPH = threading.local()
+class DummyEagerGraph(threading.local):
+  class Foo(object):
+    pass
+  def __init__(self):
+    self.key = DummyEagerGraph.Foo()
+
+
+_DUMMY_EAGER_GRAPH = DummyEagerGraph()
 
 # This boolean flag can be set to True to leave variable initialization
 # up to the user.
@@ -295,17 +302,17 @@ def learning_phase():
     # will always execute non-eagerly using a function-specific default
     # subgraph.
     if context.executing_eagerly():
-      if _DUMMY_EAGER_GRAPH not in _GRAPH_LEARNING_PHASES:
+      if _DUMMY_EAGER_GRAPH.key not in _GRAPH_LEARNING_PHASES:
         # Fallback to inference mode as default.
         return 0
-      return _GRAPH_LEARNING_PHASES[_DUMMY_EAGER_GRAPH]
+      return _GRAPH_LEARNING_PHASES[_DUMMY_EAGER_GRAPH.key]
     learning_phase = symbolic_learning_phase()
     _mark_func_graph_as_unsaveable(graph, learning_phase)
     return learning_phase
 
 
 def global_learning_phase_is_set():
-  return _DUMMY_EAGER_GRAPH in _GRAPH_LEARNING_PHASES
+  return _DUMMY_EAGER_GRAPH.key in _GRAPH_LEARNING_PHASES
 
 
 def _mark_func_graph_as_unsaveable(graph, learning_phase):
@@ -356,7 +363,7 @@ def set_learning_phase(value):
     if context.executing_eagerly():
       # In an eager context, the learning phase values applies to both the eager
       # context and the internal Keras graph.
-      _GRAPH_LEARNING_PHASES[_DUMMY_EAGER_GRAPH] = value
+      _GRAPH_LEARNING_PHASES[_DUMMY_EAGER_GRAPH.key] = value
     _GRAPH_LEARNING_PHASES[get_graph()] = value
 
 
@@ -384,7 +391,7 @@ def learning_phase_scope(value):
   with ops.init_scope():
     if context.executing_eagerly():
       previous_eager_value = _GRAPH_LEARNING_PHASES.get(
-          _DUMMY_EAGER_GRAPH, None)
+          _DUMMY_EAGER_GRAPH.key, None)
     previous_graph_value = _GRAPH_LEARNING_PHASES.get(get_graph(), None)
 
   try:
@@ -395,9 +402,9 @@ def learning_phase_scope(value):
     with ops.init_scope():
       if context.executing_eagerly():
         if previous_eager_value is not None:
-          _GRAPH_LEARNING_PHASES[_DUMMY_EAGER_GRAPH] = previous_eager_value
-        elif _DUMMY_EAGER_GRAPH in _GRAPH_LEARNING_PHASES:
-          del _GRAPH_LEARNING_PHASES[_DUMMY_EAGER_GRAPH]
+          _GRAPH_LEARNING_PHASES[_DUMMY_EAGER_GRAPH.key] = previous_eager_value
+        elif _DUMMY_EAGER_GRAPH.key in _GRAPH_LEARNING_PHASES:
+          del _GRAPH_LEARNING_PHASES[_DUMMY_EAGER_GRAPH.key]
 
       graph = get_graph()
       if previous_graph_value is not None:
@@ -427,14 +434,14 @@ def eager_learning_phase_scope(value):
   if global_learning_phase_was_set:
     previous_value = learning_phase()
   try:
-    _GRAPH_LEARNING_PHASES[_DUMMY_EAGER_GRAPH] = value
+    _GRAPH_LEARNING_PHASES[_DUMMY_EAGER_GRAPH.key] = value
     yield
   finally:
     # Restore learning phase to initial value or unset.
     if global_learning_phase_was_set:
-      _GRAPH_LEARNING_PHASES[_DUMMY_EAGER_GRAPH] = previous_value
+      _GRAPH_LEARNING_PHASES[_DUMMY_EAGER_GRAPH.key] = previous_value
     else:
-      del _GRAPH_LEARNING_PHASES[_DUMMY_EAGER_GRAPH]
+      del _GRAPH_LEARNING_PHASES[_DUMMY_EAGER_GRAPH.key]
 
 
 def _current_graph(op_input_list):
Intellicode

comment created time in a month

started google/benchmark

started time in 2 months

issue comment tensorflow/tensorflow

DatasetVariantWrapper "No unary variant device copy function found"

Do you know how this is achieved?

Probably placer needs to understand and respect this constraint. Maybe @iganichev knows?

As @lindong28 mentioned, the placer does not know the type inside a DT_VARIANT at placement time, so it can't automatically colocate nodes based on it. If the relevant nodes always take the same underlying types, and those types are not copyable, it should be straightforward to add logic similar to https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/common_runtime/colocation_graph.cc#L722

mwalmsley

comment created time in 2 months

issue closed tensorflow/tensorflow

Large Python call overhead in eager mode

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): OSX Mojave
  • TensorFlow installed from (source or binary): pip
  • TensorFlow version (use command below): 1.12.0
  • Python version: 3.6
  • GCC/Compiler version (if compiling from source): 4.2.1 v1.12.0-rc2-3-ga6d8ffae09 1.12.0

We are currently working with TensorFlow Probability as part of PyMC4. The model logp graph is static and rather cheap, but the samplers are in Python. Thus, for every logp evaluation, we have to evaluate the TF graph. Compared to Theano, this is rather slow (about 20x slower). I suspect that the Python call overhead is just very high, especially in graph mode (see https://github.com/tensorflow/tensorflow/issues/120).

Eager mode makes things better, but it's still about 10x slower, even when using defun. I guess I'm asking how much additional work needs to be done beyond providing a call hook directly into compiled code, and whether that part is optimized at all.

Here is a notebook with some very basic speed comparisons to Theano: https://gist.github.com/twiecki/43d6b78455ef5812bb90b5522fe7686c

The difference grows more dramatic in a real-world scenario but the notebook is more involved: https://github.com/aseyboldt/pymc4/blob/london/pymc4_experiment.ipynb
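
The kind of comparison in the linked notebooks can be reproduced framework-free with the standard library. This is not the author's benchmark, just a sketch of the methodology: time many calls of a trivial function standing in for a cheap logp evaluation, and report microseconds per call.

```python
import timeit

# A trivial stand-in for a cheap logp graph evaluation; what dominates
# for a function this small is the per-call dispatch overhead itself.
def cheap_logp(x):
    return -0.5 * x * x

# Time many repetitions and normalize to microseconds per call.
n = 100_000
seconds = timeit.timeit(lambda: cheap_logp(1.0), number=n)
per_call_us = seconds / n * 1e6
print(f"{per_call_us:.3f} us per call")
```

Running the same loop against a TF graph `session.run` or an eager/defun call in place of `cheap_logp` gives the apples-to-apples overhead numbers the issue is about.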

closed time in 3 months

twiecki

issue comment tensorflow/tensorflow

Large Python call overhead in eager mode

Here is an option to request XLA compilation of tf.functions: https://github.com/tensorflow/tensorflow/blob/e3b2203323c578cc9a3e1a5bca51d00c050cb18e/tensorflow/python/eager/def_function.py#L977

There is a bit more documentation here: https://github.com/tensorflow/tensorflow/blob/e3b2203323c578cc9a3e1a5bca51d00c050cb18e/tensorflow/python/eager/def_function.py#L344
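
As a hedged sketch of the flag referenced above: on TF releases from around the linked commit it is spelled `experimental_compile=True`, and newer releases rename it to `jit_compile=True`. The `logp` body below is an illustrative stand-in, not code from the issue.

```python
import tensorflow as tf

# Request XLA compilation of a tf.function. Older TF versions spell the
# flag `experimental_compile=True`; TF >= 2.5 uses `jit_compile=True`.
# The body is a made-up stand-in for a model logp.
@tf.function(jit_compile=True)
def logp(x):
  return -0.5 * tf.reduce_sum(tf.square(x))

print(logp(tf.constant([1.0, 2.0])).numpy())  # -0.5 * (1 + 4) = -2.5
```

XLA clusters the whole function into one compiled executable, which is exactly the "call hook directly into compiled code" the issue asks for, at the cost of a one-time compilation on first call per input signature.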

twiecki

comment created time in 3 months

started norvig/pytudes

started time in 3 months

Pull request review commenttensorflow/tensorflow

Reduce if-else branch,code do nothing in if branch

int Member::FindAndUpdateRoot(std::vector<Member>* tree, int node_id) {
  Member& member = (*tree)[node_id];
-  if (member.parent_ == node_id) {
-    // member.parent is the root of this disjoint tree.  Do nothing.
-  } else {
+  if (member.parent_ != node_id) {

Please add a comment.
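
For readers unfamiliar with the C++ snippet, the logic under review is the find step of a union-find (disjoint-set) structure. A rough Python equivalent of the refactored function, carrying the comment the reviewer asks for, might look like this (a sketch only; the real code lives in colocation_graph.cc):

```python
# Rough Python analogue of FindAndUpdateRoot: walk parent pointers up to
# the root of the disjoint-set tree, compressing the path along the way.
def find_and_update_root(parent, node_id):
    if parent[node_id] != node_id:
        # Not the root: recurse, then point this node directly at the
        # root (path compression). When parent[node_id] == node_id the
        # node already is the root and there is nothing to do -- the
        # empty if-branch the reviewed change folds away.
        parent[node_id] = find_and_update_root(parent, parent[node_id])
    return parent[node_id]

parent = [0, 0, 1, 2]                    # chain: 3 -> 2 -> 1 -> 0
print(find_and_update_root(parent, 3))   # -> 0
print(parent)                            # path compressed: [0, 0, 0, 0]
```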

leike666666

comment created time in 4 months

pull request comment tensorflow/tensorflow

Force synchronization for GPU ops accessing host memory (issue #33294)

The underlying issue does look like a race condition, but the description in https://github.com/tensorflow/tensorflow/issues/33294 does not seem accurate.

PackOp will run only after _HostRecv's callback is invoked. The callback passed to _HostRecv is this https://github.com/tensorflow/tensorflow/blob/aec60bd94b1ef6b0a65abf6a9699e0cfe149ae26/tensorflow/core/common_runtime/executor.cc#L1850. No consumer of its outputs can be invoked until the PropagateOutputs runs.

Now, maybe when _HostRecv callback is invoked, the copy is not done yet and the destination buffer contains garbage. This should also not be the case, because the Recv logic ends up getting here: https://github.com/tensorflow/tensorflow/blob/e2c703019e78f1e34bf2e7e628f2ebcdc9e62445/tensorflow/core/common_runtime/gpu/gpu_util.cc#L294. This callback will run only after the memcpy, scheduled 10 lines above, actually finishes.

In light of the above, and the fact that this change will likely cause large performance regressions, I will close this PR. It would be great if you could provide a simple repro of the race condition you are seeing.
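
The ordering argument above can be illustrated with a toy model: if a GPU stream behaves like a single FIFO worker, then a callback enqueued after a memcpy necessarily observes the copy's result. This is illustrative only, using a Python thread and queue in place of a real CUDA stream.

```python
import queue
import threading

# Toy model of a GPU stream: one worker thread draining a FIFO queue.
stream = queue.Queue()
dst = {"buf": None}           # destination buffer for the "memcpy"
done = threading.Event()
result = {}

def worker():
    while True:
        task = stream.get()
        if task is None:      # sentinel: shut the stream down
            break
        task()

t = threading.Thread(target=worker)
t.start()

# Enqueue the "memcpy" first, then the _HostRecv-style callback.
stream.put(lambda: dst.update(buf=[1, 2, 3]))

def callback():
    # FIFO ordering guarantees the copy above has already run.
    result["seen"] = dst["buf"]
    done.set()

stream.put(callback)
stream.put(None)
done.wait()
t.join()
print(result["seen"])  # -> [1, 2, 3]
```

If the callback could jump the queue (the scenario the PR worried about), `result["seen"]` could be `None`; with FIFO semantics it cannot.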

ekuznetsov139

comment created time in 4 months

PR closed tensorflow/tensorflow

Reviewers
Force synchronization for GPU ops accessing host memory (issue #33294) cla: yes size:S

Forcibly synchronize the stream when BaseGPUDevice::Compute() is called with a kernel with inputs in host memory, since the kernel may be executed on the host and therefore may not automatically wait for completion of other kernels in the stream (which may be generating its inputs).

+12 -0

3 comments

1 changed file

ekuznetsov139

pr closed time in 4 months

issue comment tensorflow/tensorflow

XLA bug w/ Keras: "Node name contains invalid characters"

That error is indeed non-fatal. It is a warning that some optimizations could not be performed.

@sanjoy, maybe some Keras folks can take a look at what produces these badly named nodes.
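
The warning typically means a Keras-produced name falls outside the conservative character set graph tooling expects. A hypothetical sanitizer along those lines might look like the following; the function name and regex are assumptions for illustration, not TensorFlow's actual implementation.

```python
import re

# Hypothetical sketch: map an arbitrary node name onto the conservative
# charset [A-Za-z0-9_.\-/]. For example, ':' (the output-slot separator
# in tensor names like "op:0") is not valid inside a node name itself.
def sanitize_node_name(name):
    return re.sub(r"[^A-Za-z0-9_.\-/]", "_", name)

print(sanitize_node_name("dense_1/kernel:0"))  # -> dense_1/kernel_0
```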

nsmetanin

comment created time in 5 months

started google/marl

started time in 6 months
