
allenlavoie/topic-pov 1

Code for fitting a topic and point of view (POV) model of collective intelligence processes, using parallel Gibbs sampling.

allenlavoie/bazel 0

a fast, scalable, multi-language and extensible build system

allenlavoie/clustered-controversy 0

Code for computing a "clustered controversy" score, which measures how concentrated a user is on a focused, controversial topic in a collective intelligence environment.

allenlavoie/community 0

Stores documents used by the TensorFlow developer community

allenlavoie/docs 0

TensorFlow documentation

allenlavoie/models 0

Models built with TensorFlow

allenlavoie/tensorflow 0

Computation using data flow graphs for scalable machine learning

Pull request review comment tensorflow/tensorflow

[S3] Skip using temp files/folders when writing to S3

 def _AddShardedSaveOpsForV2(self, checkpoint_prefix, per_device):
     #     <train dir>/
     #        myckpt{.index, .data-?????-of-?????}
     #
+    #   Filesystems with eventual consistency (such as S3), don't need a temporary
+    #   location. Using a temporary directory in those cases
+    #   might cause situations where files are not available during copy.
+    #
     # Users only need to interact with the user-specified prefix, which is
     # "<train dir>/myckpt" in this case.  Save() and Restore() work with the
     # prefix directly, instead of any physical pathname.  (On failure and
     # subsequent restore, an outdated and orphaned temporary directory can be
     # safely removed.)
-    _SHARDED_SUFFIX = "_temp_%s/part" % uuid.uuid4().hex
+    _SHARDED_SUFFIX = control_flow_ops.cond(
+        string_ops.regex_full_match(checkpoint_prefix, '^s3://.*'),
+        lambda: ".part",
+        lambda: "_temp_%s/part" % uuid.uuid4().hex)

Now that I look closely, we're not actually running different operations on the cond branches, so tf.where works here, just switching between the two constants. I think that will be much less traumatic; unfortunately control flow is a bit special.
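
For reference, a minimal sketch of the tf.where version (written against the same internal modules as the diff above; an illustration, not necessarily the exact change that will land):

import uuid

from tensorflow.python.framework import constant_op
from tensorflow.python.ops import array_ops
from tensorflow.python.ops import string_ops

# Both branch values are plain constants, so a simple select is enough.
# checkpoint_prefix is the same tensor as in the diff above.
_SHARDED_SUFFIX = array_ops.where(
    string_ops.regex_full_match(checkpoint_prefix, "^s3://.*"),
    constant_op.constant(".part"),
    constant_op.constant("_temp_%s/part" % uuid.uuid4().hex))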

Without objection I'll just make that change on import rather than messing with the unit tests. There are some lint nits I ended up changing anyway. That also lets us keep the saved_model reference cycle test.

rahul003

comment created time in 12 hours

Pull request review comment tensorflow/tensorflow

[S3] Skip using temp files/folders when writing to S3

 def _AddShardedSaveOpsForV2(self, checkpoint_prefix, per_device):
     #     <train dir>/
     #        myckpt{.index, .data-?????-of-?????}
     #
+    #   Filesystems with eventual consistency (such as S3), don't need a temporary
+    #   location. Using a temporary directory in those cases
+    #   might cause situations where files are not available during copy.
+    #
     # Users only need to interact with the user-specified prefix, which is
     # "<train dir>/myckpt" in this case.  Save() and Restore() work with the
     # prefix directly, instead of any physical pathname.  (On failure and
     # subsequent restore, an outdated and orphaned temporary directory can be
     # safely removed.)
-    _SHARDED_SUFFIX = "_temp_%s/part" % uuid.uuid4().hex
+    _SHARDED_SUFFIX = control_flow_ops.cond(
+        string_ops.regex_full_match(checkpoint_prefix, '^s3://.*'),
+        lambda: ".part",
+        lambda: "_temp_%s/part" % uuid.uuid4().hex)

I'll try arguing with the unit test or its owner. I don't think it should block import of your change.

rahul003

comment created time in 14 hours

Pull request review comment tensorflow/tensorflow

[S3] Skip using temp files/folders when writing to S3

 def _AddShardedSaveOpsForV2(self, checkpoint_prefix, per_device):
     #     <train dir>/
     #        myckpt{.index, .data-?????-of-?????}
     #
+    #   Filesystems with eventual consistency (such as S3), don't need a temporary
+    #   location. Using a temporary directory in those cases
+    #   might cause situations where files are not available during copy.
+    #
     # Users only need to interact with the user-specified prefix, which is
     # "<train dir>/myckpt" in this case.  Save() and Restore() work with the
     # prefix directly, instead of any physical pathname.  (On failure and
     # subsequent restore, an outdated and orphaned temporary directory can be
     # safely removed.)
-    _SHARDED_SUFFIX = "_temp_%s/part" % uuid.uuid4().hex
+    _SHARDED_SUFFIX = control_flow_ops.cond(
+        string_ops.regex_full_match(checkpoint_prefix, '^s3://.*'),
+        lambda: ".part",
+        lambda: "_temp_%s/part" % uuid.uuid4().hex)

One option is smart_cond: https://github.com/tensorflow/tensorflow/blob/7f0b31af816a4f9936b35c02544e383f1b8c090a/tensorflow/python/framework/smart_cond.py#L27
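
For reference, a rough sketch of what using it would look like here (the import path comes from the link above; treat the details as an assumption rather than a tested change):

import uuid

from tensorflow.python.framework import smart_cond as smart_module
from tensorflow.python.ops import string_ops

# smart_cond picks a branch at graph-construction time when the predicate is
# statically known, and falls back to building a tf.cond otherwise.
# checkpoint_prefix is the same tensor as in the diff above.
suffix = smart_module.smart_cond(
    string_ops.regex_full_match(checkpoint_prefix, "^s3://.*"),
    lambda: ".part",
    lambda: "_temp_%s/part" % uuid.uuid4().hex)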

But I'm not sure that'll actually fix most GraphDef changes, since in 1.x the checkpoint prefix is a placeholder when writing a SaverDef, and we actually feed different values at runtime when constructing a new Saver. It's worth trying, but we'll probably need to fix the GraphDef tests even if we use smart_cond.

The other option would be to make something unconditional... my impression was that always changing the temporary directory structure in this way would not work, but maybe there's something we could do unconditionally?

rahul003

comment created time in 15 hours

issue comment tensorflow/tensorflow

Getting SavedModel format from Checkpoint automatically

Frozen GraphDefs don't really fit into SavedModels. Un-frozen GraphDefs reference a training checkpoint, which a SavedModel has.

For non-frozen GraphDefs you can load (into Python) and re-save as a SavedModel along with a training checkpoint. For frozen GraphDefs this won't work, and I don't know of any efforts in that direction.
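
A hedged sketch of that load-and-re-save path (TF 1.x-style APIs; the checkpoint prefix, export directory, and tensor names below are placeholders, not values from this issue):

import tensorflow.compat.v1 as tf

with tf.Session(graph=tf.Graph()) as sess:
  # The .meta file carries the GraphDef plus the Saver needed to map
  # checkpoint entries back onto the graph's variables.
  saver = tf.train.import_meta_graph("/tmp/train/model.ckpt.meta")
  saver.restore(sess, "/tmp/train/model.ckpt")
  tf.saved_model.simple_save(
      sess, "/tmp/exported_savedmodel",
      inputs={"x": sess.graph.get_tensor_by_name("input:0")},
      outputs={"y": sess.graph.get_tensor_by_name("output:0")})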

mankeyboy

comment created time in 2 days

issue comment tensorflow/tensorflow

OP_REQUIRES failed due to missing tempstate or invalid argument

To clarify, "Databricks 4xGPU cluster X8" means there are 8 machines each with 4 GPUs attached? Or one machine with 4 GPUs and "cluster X8" refers to something else?

So presumably DBFS doesn't like moving and/or deleting files, which we do when saving checkpoints. S3 has possibly related issues, so you could try waiting for https://github.com/tensorflow/tensorflow/pull/36388 to get in and applying that "don't use temporaries" logic to DBFS too.

Or as a workaround you can save to a local directory and copy the resulting checkpoint to DBFS if there's only one machine involved.
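
A rough sketch of that workaround (single machine; the paths and the checkpoint object are placeholders for whatever you're already saving):

import os
import tensorflow as tf

local_dir = "/tmp/ckpt_local"       # placeholder: a real local path
dbfs_dir = "/dbfs/my/checkpoints"   # placeholder: your DBFS path

# `checkpoint` is a tf.train.Checkpoint wrapping your model/optimizer.
prefix = checkpoint.write(os.path.join(local_dir, "ckpt"))
for fname in tf.io.gfile.glob(prefix + "*"):
  tf.io.gfile.copy(
      fname, os.path.join(dbfs_dir, os.path.basename(fname)), overwrite=True)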

birdmw

comment created time in 6 days

pull request comment tensorflow/tensorflow

[S3] Skip using temp files/folders when writing to S3

Maybe we just don't have S3 registered, so it's falling back to the default? In which case maybe we should skip the test if s3 is not registered. Or if that's only a google thing, we could skip that one internally somehow.

rahul003

comment created time in 6 days

issue comment tensorflow/tensorflow

Getting SavedModel format from Checkpoint automatically

If the GraphDef is frozen, it doesn't refer to a training checkpoint anymore. It just has one set of values for the variables hard-coded. Making a SavedModel from an un-frozen GraphDef that has pointers to variables plus a training checkpoint is easy, but making a SavedModel from a frozen GraphDef and a new training checkpoint isn't really feasible.

mankeyboy

comment created time in 8 days

issue comment tensorflow/tensorflow

Getting SavedModel format from Checkpoint automatically

So the GraphDef here is already frozen?

mankeyboy

comment created time in 12 days

issue comment tensorflow/tensorflow

Getting SavedModel format from Checkpoint automatically

What do you mean by "checkpoint graph"? A GraphDef and a training checkpoint?

mankeyboy

comment created time in 13 days

Pull request review comment tensorflow/tensorflow

[S3] Skip using temp files/folders when writing to S3

 def call(self, x, y):
     return {"loss": loss}
 
-class MemoryTests(test.TestCase):

Yeah that was my suggestion since the test was failing (I wrote the test). tf.cond creates reference cycles, so if we use tf.cond when saving then the test is broken.

The satisfying fix is to make tf.cond not do that. It's not intractable; we've done it for tf.function. But I don't think we need to block this change on that fix.

I think the reference cycles should just be symbolic tensors, so not holding on to device memory waiting for Python's GC. Although it's a slight risk.

rahul003

comment created time in 14 days

Pull request review comment tensorflow/tensorflow

Implement Hessian for sparse softmax cross entropy

 bool OpGradientDoesntRequireInputIndices(
           {"SparseSegmentSumWithNumSegments", {false, {3}}},
           {"SparseSlice", {false, {2, 4}}},
           {"SparseSoftmax", {false, {1}}},
-          {"SparseSoftmaxCrossEntropyWithLogits", {true, {}}},
+          {"SparseSoftmaxCrossEntropyWithLogits", {false, {1}}},

It uses op.inputs[0], and isn't this a "drop indices" list? So of [0, 1] we should drop 1 if we're using 0.

MichaelKonobeev

comment created time in 19 days

Pull request review comment tensorflow/tensorflow

Implement Hessian for sparse softmax cross entropy

 bool OpGradientDoesntRequireOutputIndices(
           {"SparseSegmentSum", {true, {}}},
           {"SparseSegmentSumWithNumSegments", {true, {}}},
           {"SparseSlice", {false, {1, 2}}},
-          {"SparseSoftmaxCrossEntropyWithLogits", {false, {0}}},
+          {"SparseSoftmaxCrossEntropyWithLogits", {false, {}}},

Isn't this still only using the second output? I'm not sure this needs to change.

It might be cleaner if you can just re-run the update script: https://github.com/tensorflow/tensorflow/blob/392249966a19d0f1923dc1e1b743ccdfb2887ee6/tensorflow/python/eager/gradient_input_output_exclusions.py#L19

If that runs into issues we can ping @saxenasaurabh for help. Sorry for the major merge conflict.

MichaelKonobeev

comment created time in 19 days

pull request comment tensorflow/tensorflow

[S3] Skip using temp files/folders when writing to S3

It's because control flow (tf.cond) creates reference cycles and we currently have a check that tf.saved_model.save doesn't create reference cycles. Unfortunate, but the easy thing is just to delete this test: https://github.com/tensorflow/tensorflow/blob/8aa4fe59eb98147609dc690c1244dde315312cd4/tensorflow/python/saved_model/save_test.py#L675

rahul003

comment created time in 20 days

Pull request review comment tensorflow/tensorflow

[S3] Skip using temp files/folders when writing to S3

 def is_directory_v2(path):
     return False
 
+@tf_export(v1=["gfile.NeedsTempLocation"])
+def needs_temp_location(path):
+  """ Returns whether or not writing to the given path needs to use
+      a temporary location for safety
+  Args:
+    path: string, path to a file
+  Returns:
+    True, if the path is on a file system that needs to use a temporary
+          location to write safely. In such cases it is recommended to write to
+          a temporary location and then do (atomic) move to the final location.
+    False, if it is safe to write to the path without a temp location
+  """
+  return needs_temp_location_v2(path)
+
+
+@tf_export("io.gfile.needstemp")
+def needs_temp_location_v2(path):
+  """ Returns whether or not writing to the given path needs to use
+      a temporary location for safety

If you just leave off the tf_export decorator this is still usable inside TensorFlow, it just won't be in the tf.* API. We don't use tf.* in TF's Python implementation or unit tests anyway.

rahul003

comment created time in 21 days

issue comment tensorflow/tensorflow

Incompatible checkpoint being restored on estimator with keras model

No, and AFAIK nobody is working on it. Since it only affects Estimator (in 2.x), maybe @karmel has someone who could take a look. I have a queue of other things that look like higher priorities at the moment.

Most of the work is going to be migrating existing code that breaks with the new shape checking (or maybe we get lucky and it's fine). As @k-w-w mentioned, the Saver flag change itself is a pretty simple fix to try.

kurka

comment created time in 21 days

issue comment tensorflow/tensorflow

Loading a tf.Module with tf.saved_model.load missing attributes

One option is to add attributes before save. So if you know you'll want a list of variables, you can collect that and assign it to a different attribute. We do save lists, tuples, dicts with string keys, and namedtuples, just not @propertys or methods.
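
A minimal sketch of that first option (the module and attribute names here are made up for illustration):

import tensorflow as tf

class MyModule(tf.Module):
  def __init__(self):
    super().__init__()
    self.w = tf.Variable(tf.zeros([3, 2]))

module = MyModule()
# Lists of variables assigned to a plain attribute are tracked and saved.
module.all_trainable = list(module.trainable_variables)
tf.saved_model.save(module, "/tmp/my_module")

restored = tf.saved_model.load("/tmp/my_module")
print(restored.all_trainable)  # available on the loaded object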

Another option is to use GradientTape to collect trainable variables implicitly (tape.watched_variables(), I think).

hartikainen

comment created time in 22 days

issue comment tensorflow/tensorflow

Loading a tf.Module with tf.saved_model.load missing attributes

We have discussed reviving Modules as Modules instead of a minimal type. For Keras Models @k-w-w has implemented it.

I don't have plans to do this for Module, and I think the Module maintainer @tomhennigan was mildly opposed. But if someone felt strongly it's probably something they could get done. The implementation itself would be quite simple, basically adding one of these for Module.

hartikainen

comment created time in 23 days

Pull request review comment tensorflow/tensorflow

[S3] Skip using temp files/folders when writing to S3

 Status S3FileSystem::RenameFile(const string& src, const string& target) {
   return Status::OK();
 }
 
+Status S3FileSystem::NeedsTempLocation(const string& path) {
+  return Status(tensorflow::error::FAILED_PRECONDITION,

Can this return a status and a boolean (as a last-argument pointer)? Looks like it can fail to fetch the status, and asking callers to figure out which statuses to bubble up and which are normal seems like too much.

rahul003

comment created time in a month

Pull request review comment tensorflow/tensorflow

[S3] Skip using temp files/folders when writing to S3

 def is_directory_v2(path):
     return False
 
+@tf_export(v1=["gfile.NeedsTempLocation"])

Not sure adding to the v1 API gets us anything at this point; is there a reason to add it to v1 separately?

rahul003

comment created time in a month

issue comment tensorflow/tensorflow

No example exporting a model from an Estimator on TensorFlow.org

There's a section in the SavedModel guide: https://www.tensorflow.org/guide/saved_model#savedmodels_from_estimators

But maybe we should have an example in the Estimator guide too.

hanmeng31

comment created time in a month

issue comment tensorflow/tensorflow

@tf.custom_gradient does not behave as expected when used with @tf.function (for 2nd order derivatives)

I commented out the two @tf.function lines and I see the same "Input not invertible" error (from MatrixInverse via LogMatrixDeterminantGrad). Adding back the stop_gradient does avoid the error in either case. This is with 2.2.0-dev20200128.

It's totally possible that tf.function changes the gradient graphs for higher-order gradients in some cases. They're still correct (assuming the original registrations are correct), but there are cases where we could/should do extra pruning or avoid building some extra graphs in the first place. I don't see evidence of that issue from this repro, though.

Is the argument that the stop_gradient should be implicit here?

maxiwelian

comment created time in a month

issue comment tensorflow/tensorflow

`files_io.get_matching_files` fails for valid filenames that contain globs

Do you have a fix in mind? I doubt I'll get to this soon, but if you want to make changes I'm happy to discuss/review.

silky

comment created time in a month

issue comment tensorflow/tensorflow

tf.keras.Model overrides 'self' behaviour with lists

@Danmou thanks for the report. I've sent a fix for review; I think we can just be more lazy about re-wrapping the result of operators in ListWrapper.

jpatts

comment created time in 2 months

issue comment tensorflow/tensorflow

ListWrapper does not support insert method for nested lists of layers

Checkpointing doesn't support insert, so it'll throw an exception on save. The Layer is added to the list, though. Keras probably still wants to collect layers through lists in this case. Hopefully someone who works on Keras can iron out that policy. Scott, can you find an owner?

fjean

comment created time in 2 months

issue closed tensorflow/tensorflow

Reading checkpoint file fail will ignore all checkpoints

When restoring a checkpoint, if reading the checkpoint file fails then all checkpoints are ignored and training starts from scratch. I think this should throw an exception rather than being ignored; otherwise training silently starts from scratch. Read failures are especially likely when using distributed storage.


Link to the code: https://github.com/tensorflow/tensorflow/blob/r2.1/tensorflow/python/training/checkpoint_management.py#L287

closed time in 2 months

zakar

issue comment tensorflow/tensorflow

Reading checkpoint file fail will ignore all checkpoints

I'd just check whether <dir>/checkpoint exists. If it does and that function returns None, you can throw an exception. Otherwise you can use the parsed proto.
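
A rough caller-side sketch of that check (the exception type and message are up to you; this assumes tf.train.get_checkpoint_state is the function in question):

import os
import tensorflow as tf

def checkpoint_state_or_raise(train_dir):
  state = tf.train.get_checkpoint_state(train_dir)
  if state is None and tf.io.gfile.exists(os.path.join(train_dir, "checkpoint")):
    # The metadata file exists but could not be read/parsed: surface that
    # instead of silently training from scratch.
    raise RuntimeError("Could not read the 'checkpoint' file in %s" % train_dir)
  return state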

Throwing new exceptions here isn't backward compatible.

zakar

comment created time in 2 months

issue comment tensorflow/tensorflow

Incompatible checkpoint being restored on estimator with keras model

Ah, I see. The shapes that come from RestoreV2 are completely undefined (None) so they're compatible with the variable's shape when in a tf.Graph().

Given that we always run RestoreV2 eagerly in 2.x except for Estimator, we should probably just make whichever tweak fixes this for Estimator. Maybe you're right that reshape=True is an easy workaround.

kurka

comment created time in 2 months

issue comment tensorflow/tensorflow

Incompatible checkpoint being restored on estimator with keras model

I don't think we want to reshape. shape_safe_assign_variable_handle does a shape check, so if we passed a restored_shapes argument we'd be doing less checking than we are now (exact shape match vs. only checking the number of elements). I've seen that error come up a bunch when refactoring models...

kurka

comment created time in 2 months

issue comment tensorflow/tensorflow

Incompatible checkpoint being restored on estimator with keras model

RestoreV2 just returns a Tensor with the value of the variable. Then that feeds into an assignment here: https://github.com/tensorflow/tensorflow/blob/55e68264ba670301fb861aa4d28f9642b251c6fc/tensorflow/python/training/saving/saveable_object_util.py#L113

There is a reshape, but in this case there'd be a different number of elements, so even if restored_shapes wasn't None (I think it is) that'd be an error.

kurka

comment created time in 2 months

issue comment tensorflow/tensorflow

Incompatible checkpoint being restored on estimator with keras model

Assignment ops definitely do shape checking, though. Restores into variables with different shapes do fail.

kurka

comment created time in 2 months

issue comment tensorflow/tensorflow

Incompatible checkpoint being restored on estimator with keras model

Just an educated guess, but Keras Layers are numbered globally within the Python process. So if you re-create the Model you'll get totally new variable names each time. Then there's no conflict: it just looks like there are a bunch of new variables and the old ones don't match. Sad but a known incompatibility between Keras and name-based variable matching.
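
A quick illustration of that numbering (assuming a fresh Python process; the exact names depend on what else has been built):

import tensorflow as tf

def build():
  return tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])

print([v.name for v in build().weights])  # e.g. ['dense/kernel:0', 'dense/bias:0']
print([v.name for v in build().weights])  # e.g. ['dense_1/kernel:0', 'dense_1/bias:0']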

I'm somewhat more confused as to why Estimator isn't complaining or at least warning about variables that aren't in the checkpoint (given that it is restoring a checkpoint). Is that expected behavior? @k-w-w @karmel

kurka

comment created time in 2 months

Pull request review comment tensorflow/tensorflow

Implement Hessian for sparse softmax cross entropy

 def _constant_if_small(value, shape, dtype, name):
   return None
 
-def _eager_mark_zeros_tensor(tensor):
-  if context.executing_eagerly():
-    setattr(tensor, "_is_zeros_tensor", True)
-  return tensor
+def _tag_zeros_tensor(fun):
+  """ Tags the result of function by setting _is_zeros_tensor attribute.
+
+  This is useful to compute Hessians of fused ops such as cross_entropy.
+  """
+  def wrapped(*args, **kwargs):
+    tensor = fun(*args, **kwargs)
+    tensor._is_zeros_tensor = True
+    return tensor
+  return wrapped
 
 
 @tf_export("zeros")
+@_tag_zeros_tensor

I think as-is this will hide the tf.zeros docstring/args (and use the decorator's docstring instead). Can you use tf_decorator.make_decorator to make sure it gets copied over? https://github.com/tensorflow/tensorflow/blob/94adea511f3f8fcc64ab89569fce0ddbc0c1c82c/tensorflow/python/util/tf_decorator.py#L67

Example usage: https://github.com/tensorflow/tensorflow/blob/94adea511f3f8fcc64ab89569fce0ddbc0c1c82c/tensorflow/python/eager/function.py#L3184
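
For illustration, a hedged sketch of the decorator wrapped with make_decorator (based on the diff above, not the final code):

from tensorflow.python.util import tf_decorator

def _tag_zeros_tensor(fun):
  """Tags the result of `fun` by setting the _is_zeros_tensor attribute."""
  def wrapped(*args, **kwargs):
    tensor = fun(*args, **kwargs)
    tensor._is_zeros_tensor = True
    return tensor
  # make_decorator copies fun's name, docstring, and signature onto the
  # wrapper, so tf.zeros keeps its documentation.
  return tf_decorator.make_decorator(fun, wrapped)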

MichaelKonobeev

comment created time in 2 months

Pull request review comment tensorflow/tensorflow

Implement Hessian for sparse softmax cross entropy

 def _constant_if_small(value, shape, dtype, name):
   return None
 
+def _eager_mark_zeros_tensor(tensor):
+  if context.executing_eagerly():
+    setattr(tensor, "_is_zeros_tensor", True)

tensor._is_zeros_tensor = True? Not sure what the setattr is bringing here

MichaelKonobeev

comment created time in 2 months

Pull request review comment tensorflow/tensorflow

Implement Hessian for sparse softmax cross entropy

 def _constant_if_small(value, shape, dtype, name):
   return None
 
+def _eager_mark_zeros_tensor(tensor):
+  if context.executing_eagerly():

Is there any harm to tagging non-EagerTensors? If not, we should just tag unconditionally. Then below we can check the property unconditionally too.

MichaelKonobeev

comment created time in 2 months

Pull request review comment tensorflow/tensorflow

Implement Hessian for sparse softmax cross entropy

 def testGradient(self):
       x = nn_ops.sparse_softmax_cross_entropy_with_logits(
           labels=l, logits=f, name="xent")
       err = gradient_checker.compute_gradient_error(f, [3, 4], x, [3])
+
+      # Check that no extra computation performed. When only first derivative is
+      # requested, second derivative must not be computed. So when there is no
+      # second derivative, there is no `BatchMatMul` op in the graph.
+      op_names = [
+          op.op_def.name for op in sess.graph.get_operations() if op.op_def
+      ]
+      self.assertNotIn("BatchMatMul", op_names)
+      self.assertNotIn("BatchMatMulV2", op_names)
+
     print("cross entropy gradient err = ", err)
     self.assertLess(err, 5e-8)
 
   @test_util.run_deprecated_v1
   def testSecondGradient(self):
-    images_placeholder = array_ops.placeholder(dtypes.float32, shape=(3, 2))
-    labels_placeholder = array_ops.placeholder(dtypes.int32, shape=(3))
-    weights = variables.Variable(random_ops.truncated_normal([2], stddev=1.0))
-    weights_with_zeros = array_ops.stack([array_ops.zeros([2]), weights],
-                                         axis=1)
-    logits = math_ops.matmul(images_placeholder, weights_with_zeros)
-    cross_entropy = nn_ops.sparse_softmax_cross_entropy_with_logits(
-        labels=labels_placeholder, logits=logits)
-    loss = math_ops.reduce_mean(cross_entropy)
-
-    # Taking ths second gradient should fail, since it is not
-    # yet supported.
-    with self.assertRaisesRegexp(LookupError,
-                                 "explicitly disabled"):
-      _ = gradients_impl.hessians(loss, [weights])
+    with self.session() as sess:
+      l = constant_op.constant([3, 0, 1], name="l")
+      f = constant_op.constant(
+          [0.3, 0.4, 0.1, 1.2, 0.1, 1.9, 0.1, 0.7, 0.8, 0.2, 1.3, 1.3],
+          shape=[3, 4],
+          dtype=dtypes.float64,
+          name="f")
+      x = nn_ops.sparse_softmax_cross_entropy_with_logits(
+          labels=l, logits=f, name="xent")
+
+      gradients = gradients_impl.gradients(x, [f])[0]
+      err = gradient_checker.compute_gradient_error(f, [3, 4], gradients,
+                                                    [3, 4])
+
+      # Check that second derivative is calculated.
+      # (it is equivalent to being `BatchMatMul` op in the graph because of
+      # implementation of xentropy grad)
+      op_names = [
+          op.op_def.name for op in sess.graph.get_operations() if op.op_def
+      ]
+      if compat.forward_compatible(2019, 4, 25):

I think this is well and truly expired at this point; do we need the condition?

MichaelKonobeev

comment created time in 2 months

issue comment tensorflow/tensorflow

tf.keras.backend.zeros implementation ends up tracking tensors as well in graph mode

Pavithra, looks like you added the tracking. Should it just be made conditional?

ksachdeva

comment created time in 2 months

issue comment tensorflow/tensorflow

Gradient of matmul in while_loop works when run eagerly but not as tf.function

We have not yet fixed higher-order gradients for cond/while under a tape AFAIK. I believe they need a similar fix to the one we applied to tf.function for recording/accepting gradients of side-outputs. CC @saxenasaurabh. tf.gradients likely works.

That said, I'm not sure why this specific error is the one we see first. If we're not going to fix/test these soon we should probably start throwing an exception (we've had an internal bug open for a while).

mjwatkins2

comment created time in 2 months

pull request comment tensorflow/tensorflow

Implement Hessian for sparse softmax cross entropy

I think in this case persistent=True is changing the way function gradients are generated, preparing to possibly take higher-order gradients. Whereas with persistent=False it's using the first-order-only path. This case for first-order vs. this case for higher-order.

The difference is just that it goes over the function with tf.gradients multiple times until it doesn't need more side-outputs for persistent tapes. It adds side outputs in the (successful) first iteration, then tries again to take gradients this time with respect to those side outputs too and that fails for some reason. I'm guessing that the side output at issue is grad itself, which of course doesn't have a gradient.

If that's a correct diagnosis, then even without the SavedModel issue we won't be able to use custom_gradients which capture undeclared outputs since that's incompatible with our higher-order function gradient code. There are hand-wavey plans to generate function gradients with tapes which might fix this, but I wouldn't hold your breath.

So basically that's a good find. Maybe we should reconsider whether custom_gradient is useful for library code as things are; it seems like we can't use it.

MichaelKonobeev

comment created time in 3 months

issue comment tensorflow/tensorflow

Forward-mode-by-double-backprop fails on tf.square ops

Even if the original NaN tensor is referenced for a shape or control dependency, the first tf.gradients call can still be pruned out. I expect that would actually be fine, although inspecting the optimized graph would be the way to tell for sure.

Having the NaN show up in the computation itself does sound bad. If you can isolate that I'd be interested. The MWE does seem to work fine.

cooijmanstim

comment created time in 3 months

issue comment tensorflow/tensorflow

Forward-mode-by-double-backprop fails on tf.square ops

That is a bit of an odd difference between x**2 and tf.square(x) I agree. Likely because of the control dependency here: https://github.com/tensorflow/tensorflow/blob/d1265d5e7897898b30b53f9337d0fc7c64d87258/tensorflow/python/ops/math_grad.py#L583

But like I say, even if we changed that gradient there will be others that run operations on the gradients to get their size and such. I don't think placeholders will work, but using NaNs should be pretty robust.

cooijmanstim

comment created time in 3 months

issue comment tensorflow/tensorflow

Forward-mode-by-double-backprop fails on tf.square ops

This is expected if a bit ugly. Gradient functions can use tf.shape and similar operations on the will-be-pruned placeholders, and Session can't really tell in advance that the value itself isn't necessary since there's enough shape information.

You can fill the tensor with NaNs to make sure its value isn't used:

def fwd_gradients(ys, xs, d_xs):
  """Forward-mode pushforward analogous to the pullback defined by tf.gradients.
  With tf.gradients, grad_ys is the vector being pulled back, and here d_xs is
  the vector being pushed forward."""
  v = tf.fill(value=float('NaN'), dims=tf.shape(ys))
  g = tf.gradients(ys, xs, grad_ys=v)
  return tf.gradients(g, v, grad_ys=d_xs)

That works for me. But please do comment/reopen if you still think something's wrong.

We do have a new 2.x API for forward-mode autodiff:

import numpy as np
import numpy.random as npr
import tensorflow as tf

A = tf.constant(npr.randn(5, 3), dtype=tf.float32)

def jvp(x, u):
  with tf.autodiff.ForwardAccumulator(x, u) as acc:
    y = tf.square(tf.matmul(x, A))
  return acc.jvp(y)

x_val = tf.constant(npr.randn(1, 5), dtype=tf.float32)
u_val = tf.constant(npr.randn(1, 5), dtype=tf.float32)
print(jvp(x_val, u_val))

It's implemented with the same double-gradient trick, but runs in a tf.function so the pruning works when executed eagerly too. And of course you can wrap jvp in @tf.function if you don't want to execute it eagerly.

cooijmanstim

comment created time in 3 months

issue closed tensorflow/tensorflow

Forward-mode-by-double-backprop fails on tf.square ops

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): colab
  • TensorFlow version (use command below): 1.15
  • Python version: 3.x

The double-backprop trick (https://j-towns.github.io/2017/06/12/A-new-trick.html) for computing jacobian-vector products (JVPs) fails when tf.square ops are in the graph. Recall the double-backprop trick constructs an initial throwaway backward graph, which linearly depends on some dummy variables, and then backpropagates through this throwaway graph with respect to the dummy variables. The result should be constant wrt the dummy variables, and the throwaway graph should be disconnected from the final result.

Below I've taken the example JVP code from https://github.com/renmengye/tensorflow-forward-ad/issues/2#issue-234418055 and changed tf.tanh to tf.square to illustrate the failure. The code crashes with an InvalidArgumentError because the dummy placeholder created in fwd_gradients is not fed a value. The true underlying issue is that the dummy should have disappeared in the second call to tf.gradients inside fwd_gradients (because g is linear in v). Presumably the cause is that the backward op for tf.square somehow depends nonlinearly on the dummy variables.

%tensorflow_version 1.x
import numpy as np
import numpy.random as npr
import tensorflow as tf

def fwd_gradients(ys, xs, d_xs):
  """Forward-mode pushforward analogous to the pullback defined by tf.gradients.
  With tf.gradients, grad_ys is the vector being pulled back, and here d_xs is
  the vector being pushed forward."""
  v = tf.placeholder(ys.dtype, shape=ys.get_shape(), name="dummy")
  g = tf.gradients(ys, xs, grad_ys=v)
  return tf.gradients(g, v, grad_ys=d_xs)

A = tf.constant(npr.randn(5, 3), dtype=tf.float32)
x = tf.placeholder(tf.float32, [1, 5])
y = tf.square(tf.matmul(x, A))
u = tf.placeholder(tf.float32, [1, 5])

jvp = fwd_gradients(y, x, u)

x_val = npr.randn(1, 5)
u_val = npr.randn(1, 5)

init_op = tf.initialize_all_variables()
with tf.Session() as sess:
  sess.run(init_op)
  print(sess.run(jvp, feed_dict={x: x_val, u: u_val}))

closed time in 3 months

cooijmanstim

push event allenlavoie/community

Allen Lavoie

commit sha d5fb5c417b19662cba9d00c8c64ccc3ae0ef1b77

Minor edits

view details

push time in 3 months

push event allenlavoie/community

Allen Lavoie

commit sha 22a37c8c9435a8018dbe255965aa63a0cf014e98

Small tweaks

view details

push time in 3 months

push event allenlavoie/community

Allen Lavoie

commit sha dc95693313d05d6467dd8d9d8126055edeed995d

Vectorized captures RFC

view details

push time in 3 months

push event allenlavoie/community

push time in 3 months

push event allenlavoie/community

Rohan Jain

commit sha 9d5c52f7bb742bdb987cbbd6b94a781993f3755e

Adding a doc to deprecate collections

view details

Martin Wicke

commit sha f8dad65f3e536707add510a939fae2b47a1bc6e0

Add contrib deprecation RFC

view details

Martin Wicke

commit sha 879d13fcfe536371fdcbcc76ad78405ea731f3fa

move contrib/integrate to tensorflow/scientific

view details

Martin Wicke

commit sha 834d6c2e7f03774c874f4ec3239da28cacf2f858

Added ignite and igfs

view details

Martin Wicke

commit sha d615e80262bd1fa49edc3c53029adeeaa9137f5d

Add @terrytangyuan to timeseries

view details

Martin Wicke

commit sha 5c805774c8b45a3a0ce1337ada4a3f1fc3307983

Update 20180907-contrib-sunset.md

view details

Rohan Jain

commit sha 7d23ce4260ea90b9cacf2fd210d88ce69d75adfe

Responding to Karmels comments

view details

Rohan Jain

commit sha 93601a4906a623c0f772030a9158f5da7d26250c

Minor fix to VariableTracker sample code

view details

guptapriya

commit sha 33cda6348de6a2cc9998df022adfe91f49017442

RFC: Distribution Strategy - Revised API (#25)
* Initial version of replicator design
* formatting tweaks
* Update code snippets
* Test updating PR
* Minor language change
* Revised API proposal: Incorporated Replicator API into Distribution Strategy API, and re-organized the latter.
* Added method to compute per replica batch size
* Improve docs for non-deterministic calls to `all_sum`, etc.
* Rename a few things
* Fix formatting
* Decisions on open questions: Updating the design doc with decisions on open questions that were discussed in the in-person design review.
* Update date
* finalize all_reduce methods
* Mark as accepted

view details

Asim Shankar

commit sha feda98330928973afb81297b81fabf9c7669053d

RFC: Functions, not sessions in 2.0 (#20)
* RFC: Functions not Sessions in TensorFlow 2.0
* Formatting tweaks
* Formatting tweaks
* Fix some links
* Fix the "Status" column
* Incorporate some suggestions.
* Shorten the autograph example
* Correct autograph link.
* Additional details on Trace Caches and Input Signatures.
* s/tf.defun/tf.function/
* Collapse tf.method and tf.function. (Prototyped in https://github.com/tensorflow/tensorflow/commit/84ace0358526bb51c04a3bef4b3072b93b9d1bec)
* Fix typo
* Formatting tweak
* Fix typo
* Update note on tracing twice.

view details

François Chollet

commit sha a72b0f84ca695b922b52a6cb25e91ed0c3ff6a57

Add RFC for optimizer unification (#24) Thank you to everyone who commented, and on the Reddit thread. We have heard the comments and various team members are reviewing them. Please note that this PR isn't the only avenue for communication: you can discuss the general direction of TensorFlow on the discuss@tensorflow.org forum - https://groups.google.com/a/tensorflow.org/forum/#!forum/discuss

view details

Amit Patankar

commit sha d1ebe1306ec69d992c14736bc5a0eae505bb9565

RFC: TensorFlow Integration Testing (#30) Merging accepted RFC.

view details

Edd Wilder-James

commit sha 66d2eb83814336521ee55943c44c66955709b66a

Update SIG permissions for TensorBoard cc @manivaradarajan

view details

Mani Varadarajan

commit sha 5cb345073aa4bb415973bbc8415bfedc7bf2fcde

Update Charter for SIG-TensorBoard (#35)
* Update Charter for SIG-TensorBoard
* Update project lead to maniv at google

view details

Edd Wilder-James

commit sha 1abccf752a125226395c9174234b8e4ced3ce2cd

Create SIG Networking, Addons, IO Charters (#33)
* Create SIG Networking Charter
* Update CHARTER.md: some typo tidies
* Create SIG Addons charter
* Add SIG addons, networking leaders into codeowners
* New SIG IO charter. Fixed Addons charter location
* Add sig IO codeowners

view details

Peng Wang

commit sha 161a2726cff4bcd9934da782ac86d58921c1b4cd

RFC for random numbers in TensorFlow 2.0

view details

Yong Tang

commit sha e2099f74cd55d5ec0a38f69d814364f53bd60cde

Add SIG IO release process and release team (#39) This fix update the SIG IO's release process so that anyone could help the release for SIG IO. Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

view details

Martin Wicke

commit sha 6619a9915980425e35143957f5bbb8c264dae0e4

Merge pull request #18 from tensorflow/rfc-contrib RFC: Sunset tf.contrib

view details

Yuan (Terry) Tang

commit sha 1e9014138fcc5f10475c61b6c56c70549c696a50

Added terrytangyuan to SIG IO release team (#40)

view details

Peng Wang

commit sha 03b32bd70a920ad3344cf88c309ad5ce0b18151f

Changes after some feedback

view details

push time in 3 months

issue comment tensorflow/tensorflow

Expand checkpoint + SavedModel file formats to include provenance (origin) meta data

I don't think we have access to most of this from tf.saved_model.save without special-casing Keras there, which is something we don't want to do. But some of this sounds reasonable for Model.save.

CC @k-w-w ; any interest/guidance?

jxtps

comment created time in 3 months

issue comment tensorflow/tensorflow

[TF2.0] GradientTape.gradient raise SystemError when calling embedding_column layer multiple times with tf.function

@jaingaurav could you triage? I'm told the core team owns GradientTape.

GoSz

comment created time in 3 months

issue comment tensorflow/tensorflow

[TF2.0] GradientTape.gradient raise SystemError when calling embedding_column layer multiple times with tf.function

@alextp does GradientTape have an owner other than you? I've looked at most of the code but am probably not going to be giving random issues the attention they deserve...

GoSz

comment created time in 3 months

Pull request review comment tensorflow/tensorflow

Add device annotations to gradient_function

 PyObject* RecordGradient(PyObject* op_name, PyObject* inputs, PyObject* attrs,
   string c_op_name = TFE_GetPythonString(op_name);
 
+  PyObject* device = DeviceFromTensorSeq(results);
+  if (device == Py_None) {
+    device = DeviceFromTensorSeq(inputs);

Maybe I'm missing something, but we have a reference to "device" here. It's incremented when we create a backward function and decremented when that function is deleted. But do we ever get rid of that base reference?

olesalscheider

comment created time in 4 months

pull request comment tensorflow/tensorflow

Add device annotations to gradient_function

You're right, that looks like a concerning change in the fused batch norm gradient test. FWIW that assertion is not testing the forward gradient code at all, it's just comparing the backward gradient to the numeric equivalent. (FWIW https://github.com/tensorflow/tensorflow/pull/33854#issuecomment-550412171 was me, I accidentally impersonated Alex)

If the symbolic gradient is the same either way and the numeric gradient is different, then it sounds like fused batch norm is computing something different on GPU than it is on CPU. So maybe the gradient is correct but the forward pass is incorrect. You can file a bug and skip that test on GPU to unblock this PR...

olesalscheider

comment created time in 4 months

issue comment tensorflow/tensorflow

_num_elements(grad) difference between Tensor and IndexedSlices behavior

That looks like it's only used here: https://github.com/tensorflow/tensorflow/blob/0250bfdb8c8f0dce3e11c91f5969c994660d21d9/tensorflow/c/eager/tape.h#L785

So safe to return 0; it's just used to estimate when to aggregate gradients, so a performance issue rather than a correctness issue. (CC @alextp )

Want to send a PR with that change? Happy to review. Ideally we'd know when this case is triggered and have a unit test.

alexeyr

comment created time in 4 months

issue comment tensorflow/tensorflow

Better (simpler!) model checkpointing

Probably the optimizer's slot variable creation has moved to where the primary variable is created rather than happening during apply_gradients. That shouldn't generally affect placement, but looks like it's breaking something in your code.

Anyway, it's definitely not this issue. Feel free to open another one.

netw0rkf10w

comment created time in 4 months

issue comment tensorflow/tensorflow

GradientTape: Allow to execute backward functions on same device as forward functions

So basically the colocate_gradients_with_ops option to tf.gradients in 1.x. Yeah, seems fine. Device labels leaking into SavedModels might be a consequence, but mostly those will save inference functions.

If you just don't have a scope around the tape.gradient call, isn't this the behavior you'd get? I'd hope the placer would catch on and start placing the operations on the devices their inputs come from (although if backprop started on the wrong device I could see it continuing with bad greedy placements executing eagerly).

olesalscheider

comment created time in 4 months

issue closed tensorflow/tensorflow

Better (simpler!) model checkpointing

System information

  • TensorFlow version (you are using): 2.0.0
  • Are you willing to contribute it (Yes/No): Yes

Describe the feature and the current behavior/state. The current tf.train.Checkpoint is really a bit too complicated and not easy to use. Suppose that I want to save the best model (that has the best validation accuracy) during my custom training loop, and that I want to save only the best model because I don't want to waste disk space. My common sense tells me that this is so simple, just like in all other libraries:

if best_acc < epoch_acc:
    best_acc = epoch_acc
    checkpoint.save(os.path.join(checkpoint_dir, 'best'))

Results after 3 epochs:

best-1.index
best-1.data-00001-of-00002
best-1.data-00000-of-00002
best-2.index
best-2.data-00001-of-00002
best-2.data-00000-of-00002
best-3.index
best-3.data-00001-of-00002
best-3.data-00000-of-00002

Oops... My disk does not like this! And there's no option to turn off the counter and choose the exact file names that I want!

Same problem for saving only the last checkpoint. You would tell me that one can use tf.train.CheckpointManager for this task (by setting max_to_keep = 1), but a checkpoint object plus a checkpoint manager object just to manage checkpoints, seriously? Yet another layer of complexity! (Not to mention that CheckpointManager cannot handle the best-model saving described above.)

I think one of the major limitations of TensorFlow was complexity, really. In TensorFlow 2 this has been significantly reduced, but I believe we can do more.

It would be best to give the users the simplest option to save(filepath) and restore(filepath) in a single file (instead of *.index plus a bunch of data files), with whatever filepath they want. Just like in any other libraries such as PyTorch or Keras! (Keras's usage cannot be applied here because the model may not be compiled in a custom training loop).

Finally: my apologies if such a solution already exists and I am not aware of that because of my lack of knowledge! (I have wrestled with tf.train.Checkpoint over the last days, and have checked the official guide and tutorials but could not find anything like that.)

I am just a TF beginner but will try to contribute, if you are open to changes. Thanks.

Will this change the current api? How? I guess. I don't know how exactly, though.

Who will benefit with this feature? Every user who wants to manually save checkpoints.

closed time in 4 months

netw0rkf10w

issue comment tensorflow/tensorflow

Better (simpler!) model checkpointing

You can use tf.train.Checkpoint.write if you want a no-frills API that just saves to a given prefix.
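
For example, a minimal sketch reusing the variable names from the issue (model, best_acc, epoch_acc, and checkpoint_dir are whatever you already have):

import os
import tensorflow as tf

checkpoint = tf.train.Checkpoint(model=model)

if best_acc < epoch_acc:
  best_acc = epoch_acc
  # write() saves exactly to this prefix, with no counter and no extra
  # bookkeeping; each call overwrites the previous "best" checkpoint files.
  checkpoint.write(os.path.join(checkpoint_dir, "best"))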

netw0rkf10w

comment created time in 4 months

issue comment tensorflow/serving

TFServing Compatibility issue with the TF2.0 model using feature columns

Please assign @rohan100jain; apparently I have no access.

brightcoder01

comment created time in 4 months

issue comment tensorflow/serving

TFServing Compatibility issue with the TF2.0 model using feature columns

@rohan100jain could you look into this? Looks like a change to lookup table initialization has broken Session-based loading, possibly https://github.com/tensorflow/tensorflow/commit/64586f18724f737393071125a91b19adf013cf8a. Presumably related to loading multiple tables. If so we should add an integration test with multiple tables in tensorflow/python/saved_model/load_v1_in_v2_test.py

brightcoder01

comment created time in 4 months

pull request comment tensorflow/tensorflow

Implement Hessian for sparse softmax cross entropy

Yeah the primals= argument has been removed, I don't see a way to support that API (and it was never in a stable release). I've updated the example in the custom_gradient docstring with your suggested fix.

The easiest way to ignore these custom_gradients in SavedModels is probably to set an attribute. I'd make an internal-only version with the extra argument, then add an attribute to this node. Then you can delete the _gradient_op_type attribute here.

MichaelKonobeev

comment created time in 4 months

PR closed tensorflow/tensorflow

Add custom_object argument for keras model
Labels: cla: yes, size:XS

Fixes load_keras_model() not being able to load custom classes and functions.

+4 -2

1 comment

1 changed file

alvinlin-pn

pr closed time in 4 months

pull request comment tensorflow/tensorflow

Add custom_object argument for keras model

I don't think we can accept this PR; 1.13 has already been released, and we're only doing patch releases for it. Contrib is no longer packaged in newer releases.

Please try the new SavedModel format support in tf.keras.Model.save.
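
A hedged sketch of that path (TF 2.x; MyLayer stands in for whatever custom class or function you need):

import tensorflow as tf

model.save("/tmp/my_model", save_format="tf")  # SavedModel format
restored = tf.keras.models.load_model(
    "/tmp/my_model", custom_objects={"MyLayer": MyLayer})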

alvinlin-pn

comment created time in 4 months

pull request comment tensorflow/tensorflow

Implement Hessian for sparse softmax cross entropy

I don't think the SavedModel issue needs to block this. We will need a workaround, possibly some indication that this custom_gradient is safe to ignore (meaning we'll still only be able to take first-order gradients from SavedModels).

MichaelKonobeev

comment created time in 4 months

pull request comment tensorflow/tensorflow

Implement Hessian for sparse softmax cross entropy

You're right, it needs to be outside the inner custom_gradient. And I need to fix that for primals= which will be somewhat tricky.

MichaelKonobeev

comment created time in 4 months

Pull request review comment tensorflow/tensorflow

fix .so file name error on linux and macos

 private static boolean resourceExists(String baseName) {
   private static String getVersionedLibraryName(String libFilename) {
     // If the resource exists as an unversioned file, return that.
-    if (resourceExists(libFilename)) {
-      return libFilename;
-    }
+    // if (resourceExists(libFilename)) {

Thanks for the explanation. Since we'd fall back to the unversioned lib anyway I think we can just delete this snippet rather than commenting out.

18867664853

comment created time in 5 months

Pull request review comment tensorflow/tensorflow

fix .so file name error on linux and macos

 private static boolean resourceExists(String baseName) {
   private static String getVersionedLibraryName(String libFilename) {
     // If the resource exists as an unversioned file, return that.
-    if (resourceExists(libFilename)) {
-      return libFilename;
-    }
+    // if (resourceExists(libFilename)) {

I'm confused then. If the unversioned shared object exists, why isn't it used? This logic did work before we added version numbers to shared objects.

18867664853

comment created time in 5 months

Pull request review comment tensorflow/tensorflow

fix .so file name error on linux and macos

 private static String getVersionedLibraryName(String libFilename) {
    * determined.
    */
   private static String getMajorVersionNumber() {
+    // getImplementationVersion() retrun null.

Can it return null, or does it always return null right now?

18867664853

comment created time in 5 months

Pull request review comment tensorflow/tensorflow

fix .so file name error on linux and macos

 private static boolean resourceExists(String baseName) {
   private static String getVersionedLibraryName(String libFilename) {
     // If the resource exists as an unversioned file, return that.
-    if (resourceExists(libFilename)) {
-      return libFilename;
-    }
+    // if (resourceExists(libFilename)) {

Comment-out left over from debugging?

18867664853

comment created time in 5 months

issue comment tensorflow/tensorflow

[TF 2.0] GRU layer doesn't work when called from tf.function

Saurabh, looks WhileGrad / TensorList related?

David-Mao

comment created time in 5 months
