
tensorflow/estimator 154

TensorFlow Estimator

rchao/data_challenge 0

Data Challenge

rchao/keras 0

Deep Learning for humans

rchao/tracing-framework 0

Web Tracing Framework libraries and extensions.

issue comment tensorflow/tensorflow

Incompatible shapes when using tf.keras.backend.ctc_decode

@faustomorales, would you be open to sending out a PR with your modification that fixes the problem? Thanks in advance.

faustomorales

comment created time in a month

issue comment tensorflow/tensorflow

Impossible to use tf.keras.callbacks.ModelCheckpoint in distributed training

Closing for lack of activity. Please reopen if the issue still exists. Thanks!

netw0rkf10w

comment created time in a month

issue closed tensorflow/tensorflow

Impossible to use tf.keras.callbacks.ModelCheckpoint in distributed training

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Google Colab
  • TensorFlow version (use command below): 2.0.0

Describe the current behavior
It is not possible to use tf.keras.callbacks.ModelCheckpoint in distributed training: RuntimeError: add_update was called in a cross-replica context. This is not expected. If you require this feature, please file an issue.

Code to reproduce the issue
See this Colab notebook.

closed time in a month

netw0rkf10w

issue comment tensorflow/tensorflow

Memory leak

Thanks @MProx for the updates!

MProx

comment created time in a month

issue comment tensorflow/tensorflow

Memory leak

@MProx, about a month ago we added a fix for a memory leak, and there's a possibility it has fixed this. Can you try !pip install tf-nightly and see if it resolves your issue?

MProx

comment created time in a month

Pull request review comment tensorflow/tensorflow

patch-callbacks_test

 def test_EarlyStopping_reuse(self):
           optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy'])
       weights = model.get_weights()
-      stopper = keras.callbacks.EarlyStopping(monitor='acc', patience=patience)

Why remove this block of assertions?

autoih

comment created time in 2 months

Pull request review comment tensorflow/tensorflow

patch-callbacks_test

 def ref_hard_sigmoid(x):
     x = keras.backend.placeholder(ndim=2)
     f = keras.backend.function([x], [keras.activations.hard_sigmoid(x)])
     test_values = np.random.random((2, 5))
-    result = f([test_values])[0]
     expected = hard_sigmoid(test_values)
+    result = f([test_values])[0]

Can you provide the rationale for this change?
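For context, the NumPy reference that this test compares against is, to my understanding, the piecewise-linear approximation clip(0.2 * x + 0.5, 0, 1). A minimal sketch of that reference (the function name mirrors the test's hard_sigmoid helper; this is an illustration, not the test's exact code):

```python
import numpy as np

def hard_sigmoid(x):
    # Piecewise-linear approximation of the sigmoid:
    # 0 for x < -2.5, 1 for x > 2.5, and 0.2 * x + 0.5 in between.
    return np.clip(0.2 * np.asarray(x, dtype=np.float64) + 0.5, 0.0, 1.0)
```

Since both result and expected are computed from the same test_values, reordering the two assignments should not change the comparison, which is presumably why the reviewer asks for the rationale.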

autoih

comment created time in 2 months

Pull request review comment tensorflow/tensorflow

added Layer.add_callback

 def add_metric(self, value, aggregation=None, name=None):
       # Insert layers into the Keras Graph Network.
       self._graph_network_add_metric(value, aggregation, name)
+
+  @property
+  def callbacks(self):
+    return self._callbacks + self._gather_children_attribute('callbacks')

_gather_children_attribute has recently been removed. Can you update the PR so it applies cleanly? Thanks.
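To illustrate the pattern the PR relies on: the gather helper collects an attribute from all nested sublayers, so a layer's callbacks are its own plus those of its children. A toy pure-Python sketch (this Layer class and its attribute names are a simplification for illustration, not the real Keras implementation):

```python
class Layer:
    def __init__(self, callbacks=None):
        self._callbacks = list(callbacks or [])
        self._layers = []  # nested sublayers

    def _gather_children_attribute(self, attribute):
        # Flatten the named attribute across direct sublayers; because the
        # `callbacks` property recurses, nesting is handled transitively.
        gathered = []
        for child in self._layers:
            gathered.extend(getattr(child, attribute))
        return gathered

    @property
    def callbacks(self):
        # Own callbacks first, then those contributed by sublayers.
        return self._callbacks + self._gather_children_attribute('callbacks')
```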

beasteers

comment created time in 2 months

issue comment tensorflow/tensorflow

MultiWorkerMirroredStrategy does not work with Keras + accuracy metric

As I looked into this, I have not been able to repro using the attached code (the only difference is that I've set TF_CONFIG on the two workers). That said, we can add a check before deleting the attr.

vmarkovtsev

comment created time in 2 months

issue comment tensorflow/tensorflow

MultiWorkerMirroredStrategy does not work with Keras + accuracy metric

@rmothukuru, thanks for the report and apologies for the delay. I'm looking into this and will get back as soon as I find something. I was wondering how you set TF_CONFIG: is it set before launching this Python program?
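For reference, TF_CONFIG is a JSON-encoded environment variable that TensorFlow reads at startup, so it has to be in the environment before the cluster is initialized. A sketch of a hypothetical two-worker setup (the hostnames and ports here are placeholders):

```python
import json
import os

tf_config = {
    "cluster": {
        # One address per worker; these are made-up placeholders.
        "worker": ["worker0.example.com:12345", "worker1.example.com:12345"],
    },
    # Each worker sets its own index: 0 on the first worker, 1 on the second.
    "task": {"type": "worker", "index": 0},
}
os.environ["TF_CONFIG"] = json.dumps(tf_config)
```

In practice this is usually exported in the shell that launches each worker process rather than set from inside the script.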

vmarkovtsev

comment created time in 2 months

issue comment tensorflow/tensorflow

Crash with MultiWorkerMirroredStrategy Keras Example from docs

Thanks for the report - we'll update the tutorial to mention that steps_per_epoch needs to be provided to ensure successful training.

leonardoaraujosantos

comment created time in 2 months

pull request comment tensorflow/estimator

Fixes the bug that the first export was ignored by BestExporter

@rthadur, can you help us merge this PR?

mjlaali

comment created time in 2 months

issue comment tensorflow/tensorflow

Memory leak on TF 2.0 with model.predict or/and model.fit with keras

Thanks for reporting, @gdudziuk. Would you mind opening a new issue describing what has been fixed and what has not?

ipsec

comment created time in 2 months

Pull request review comment tensorflow/tensorflow

Use _get_distribution_strategy only when it is available.

 def set_model(self, model):
     """Sets Keras model and writes graph if specified."""
     self.model = model

-    # TensorBoard callback involves writing a summary file in a
-    # possibly distributed settings.
-    self._log_write_dir = distributed_file_utils.write_dirpath(
-        self.log_dir, self.model._get_distribution_strategy())  # pylint: disable=protected-access
+    # In case this callback is used via Keras, _get_distribution_strategy does not exist.

I would disambiguate it by saying "keras-team Keras" or "native Keras".

hgaiser

comment created time in 3 months

issue closed tensorflow/tensorflow

Memory leak on TF 2.0 with model.predict or/and model.fit with keras

System information

  • OS Platform:

System Version: macOS 10.14.6 (18G103)
Kernel Version: Darwin 18.7.0

  • TensorFlow installed from binary using pip install tensorflow

  • Python version:

python -V                                                                                                                                                                                                      
Python 3.7.3
  • GPU model and memory: No GPU

  • TensorFlow version

python -c "import tensorflow as tf; print(tf.version.VERSION)"                                                                                                                                               
2.0.0

Describe the current behavior
This code works fine with TensorFlow 1.14 or the Theano backend. After upgrading to TensorFlow 2.0.0 it stops working: memory usage keeps increasing and the program never finishes.

Describe the expected behavior
Using Theano I get 28 seconds per iteration. Using TensorFlow 2.0.0 I expect the same behavior (or better).

Code to reproduce the issue

import gym
import numpy as np
import matplotlib.pylab as plt

import tensorflow as tf
from tensorflow.keras import layers

env = gym.make('NChain-v0')


def q_learning_keras(env, num_episodes=1000):
    # create the keras model
    model = tf.keras.Sequential()
    model.add(layers.InputLayer(batch_input_shape=(1, 5)))
    model.add(layers.Dense(10, activation='sigmoid'))
    model.add(layers.Dense(2, activation='linear'))
    model.compile(loss='mse', optimizer='adam', metrics=['mae'])
    # now execute the q learning
    y = 0.95
    eps = 0.5
    decay_factor = 0.999
    r_avg_list = []
    for i in range(num_episodes):
        s = env.reset()
        eps *= decay_factor
        if i % 100 == 0:
            print("Episode {} of {}".format(i + 1, num_episodes))
        done = False
        r_sum = 0
        while not done:
            if np.random.random() < eps:
                a = np.random.randint(0, 2)
            else:
                a = np.argmax(model.predict(np.identity(5)[s:s + 1]))
            new_s, r, done, _ = env.step(a)
            target = r + y * np.max(model.predict(np.identity(5)[new_s:new_s + 1]))
            target_vec = model.predict(np.identity(5)[s:s + 1])[0]
            target_vec[a] = target
            model.fit(np.identity(5)[s:s + 1], target_vec.reshape(-1, 2), epochs=1, verbose=0)
            s = new_s
            r_sum += r
        r_avg_list.append(r_sum / 1000)
    plt.plot(r_avg_list)
    plt.ylabel('Average reward per game')
    plt.xlabel('Number of games')
    plt.show()
    for i in range(5):
        print("State {} - action {}".format(i, model.predict(np.identity(5)[i:i + 1])))


if __name__ == "__main__":
    q_learning_keras(env)

closed time in 3 months

ipsec

issue comment tensorflow/tensorflow

Memory leak on TF 2.0 with model.predict or/and model.fit with keras

Closing the issue as it's confirmed fixed. Thanks!

ipsec

comment created time in 3 months

issue comment tensorflow/tensorflow

Impossible to use tf.keras.callbacks.ModelCheckpoint in distributed training

Hello @netw0rkf10w, sorry for the late response. I made a copy of your updated colab and used the nightly instead:

%tensorflow_version 2.x
!pip uninstall -y -q tensorflow
!pip install -U -q tf-nightly

and it worked fine without the problem. Can you take a look and see if that fixes your issue? The TF version I verified to work is 2.1.0-dev20191202.

netw0rkf10w

comment created time in 3 months

issue comment tensorflow/tensorflow

Memory leak on TF 2.0 with model.predict or/and model.fit with keras

Thanks @birdmw! Can you share with us the version of TF you're using?

ipsec

comment created time in 3 months

issue comment tensorflow/tensorflow

tensorflow.python.keras.testing_utils.layer_test breaks when a (custom) layer is returning a list/tuple of tensor

Chatted with @shun-lin and Shun is happy to contribute. Thanks Shun!

shun-lin

comment created time in 3 months

issue comment tensorflow/tensorflow

Memory leak on TF 2.0 with model.predict or/and model.fit with keras

Thanks @SivamPillai for the update. We'll continue to investigate this.

ipsec

comment created time in 3 months

issue comment tensorflow/tensorflow

Memory leak on TF 2.0 with model.predict or/and model.fit with keras

Thanks for the updates! I verified locally that the memory issue has been mitigated with TF nightly. Can you try !pip install tf-nightly and see if it resolves your issue?

ipsec

comment created time in 3 months

issue comment tensorflow/tensorflow

Impossible to use tf.keras.callbacks.ModelCheckpoint in distributed training

Hello @netw0rkf10w, sorry about the confusion. I was trying to repro locally with your example, but unfortunately there was an issue downloading the dataset. When I tried another dataset I have, the program finished just fine. Would you mind providing a minimal repro with some dummy data so I can repro quickly? Thank you.

netw0rkf10w

comment created time in 3 months

issue comment tensorflow/tensorflow

Memory leak on TF 2.0 with model.predict or/and model.fit with keras

Thanks for checking in! We're still verifying that the fix solves the issue and should have updates soon.

ipsec

comment created time in 3 months

issue comment tensorflow/tensorflow

add_update in cross-replica mode is broken (BatchNormalization layer impossible to use)

Recently we submitted a few fixes related to using the batch norm layer with tf.distribute strategies. If you could try the nightly and let us know whether it's fixed, that would be great.

galeone

comment created time in 3 months

pull request comment tensorflow/tensorflow

JSON Serializable checks for array and structs

@rthadur Can you check what else needs to be done for this to be merged? Thanks.

ashahab

comment created time in 4 months

Pull request review comment tensorflow/estimator

Fixes the bug that the first export was ignored by BestExporter

 def export(self, estimator, export_path, checkpoint_path, eval_result,
       self._best_eval_result = self._get_best_eval_result(
           full_event_file_pattern)

-    if self._best_eval_result is None or self._compare_fn(
-        best_eval_result=self._best_eval_result,
-        current_eval_result=eval_result):
+    if (self._best_eval_result is None or
+        # check if this is the first export. Note that
+        # evaluation occurs before exporting models and hence the loaded
+        # _best_eval_result can be the same even when no export had been done.
+        (not self._has_exported and self._best_eval_result == eval_result) or

Thanks, in that case should we make the condition "not self._has_exported and eval_result better than or equal to self._best_eval_result" as opposed to ==?

mjlaali

comment created time in 4 months

Pull request review comment tensorflow/estimator

Fixes the bug that the first export was ignored by BestExporter

 def export(self, estimator, export_path, checkpoint_path, eval_result,
       self._best_eval_result = self._get_best_eval_result(
           full_event_file_pattern)

-    if self._best_eval_result is None or self._compare_fn(
-        best_eval_result=self._best_eval_result,
-        current_eval_result=eval_result):
+    if (self._best_eval_result is None or
+        # check if this is the first export. Note that
+        # evaluation occurs before exporting models and hence the loaded
+        # _best_eval_result can be the same even when no export had been done.
+        (not self._has_exported and self._best_eval_result == eval_result) or

Thanks for the PR! Is there a reason why we need self._best_eval_result == eval_result? I would think that if it's not exported yet, i.e. not self._has_exported, is a sufficient condition.
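The condition under discussion can be sketched in isolation. This is a hypothetical helper (my own names, not the actual Estimator code) that captures the three branches of the proposed if:

```python
def should_export(best_eval_result, eval_result, has_exported, compare_fn):
    # 1. No recorded best yet: always export.
    if best_eval_result is None:
        return True
    # 2. First export: evaluation runs before exporting, so the loaded
    #    best result can equal the current one even though nothing was
    #    ever exported.
    if not has_exported and best_eval_result == eval_result:
        return True
    # 3. Otherwise, export only if the new result beats the recorded best.
    return compare_fn(best_eval_result=best_eval_result,
                      current_eval_result=eval_result)
```

Under this framing, the review question is whether branch 2 should require equality or merely "at least as good as the recorded best".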

mjlaali

comment created time in 4 months

Pull request review comment tensorflow/tensorflow

added min_epochs to EarlyStopping

 def on_epoch_end(self, epoch, logs=None):
     current = self.get_monitor_value(logs)
     if current is None:
       return
-    if self.monitor_op(current - self.min_delta, self.best):
-      self.best = current
-      self.wait = 0
-      if self.restore_best_weights:
-        self.best_weights = self.model.get_weights()
-    else:
-      self.wait += 1
-      if self.wait >= self.patience:
-        self.stopped_epoch = epoch
-        self.model.stop_training = True
+    if epoch > self.min_epochs:

should this be epoch == self.min_epochs?
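The boundary matters because of how it interacts with patience counting. A toy pure-Python simulation (simplified from Keras's EarlyStopping; min_epochs is the hypothetical parameter this PR proposes) shows the off-by-one between the two choices:

```python
def stopped_epoch(losses, patience, min_epochs, strict=True):
    """Return the epoch at which training stops, or None if it never does.

    strict=True mimics `epoch > min_epochs`; strict=False mimics
    `epoch >= min_epochs`.
    """
    best, wait = float("inf"), 0
    for epoch, current in enumerate(losses):
        if current < best:
            # Improvement: record it and reset the patience counter.
            best, wait = current, 0
        elif (epoch > min_epochs) if strict else (epoch >= min_epochs):
            wait += 1
            if wait >= patience:
                return epoch
    return None
```

With a flat loss curve, `epoch > min_epochs` only starts counting patience at epoch min_epochs + 1, so training runs one epoch longer than with `>=`.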

joglekara

comment created time in 4 months

pull request comment tensorflow/tensorflow

added sprite options to Keras Tensorboard Callback

@omalleyt12, adding you as you may have some context about the TensorBoard callback. Thanks!

beasteers

comment created time in 4 months

issue comment tensorflow/tensorflow

Problems running mnist_estimator in distributed mode

Mind sharing what you observed?

shahriar49

comment created time in 4 months

issue comment tensorflow/tensorflow

Problems running mnist_estimator in distributed mode

Hello @shahriar49, if I understand correctly, here we're using model_to_estimator without a tf.distribute strategy. With classic Estimator training, I am not aware of any data splitting among the workers. The same copy of the code runs on the chief and the workers, and if the training input_fn is configured to raise OutOfRangeError after one epoch, the total number of epochs trained will equal the number of workers.

You can find more information here: https://www.tensorflow.org/api_docs/python/tf/estimator/train_and_evaluate

shahriar49

comment created time in 4 months

Pull request review comment tensorflow/tensorflow

JSON Serializable checks for array and structs

 # Copyright 2016 The TensorFlow Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
+# you may not usve this file except in compliance with the License.

Please fix the typo.

ashahab

comment created time in 5 months

issue comment tensorflow/tensorflow

[TF2.0] Checkpoint doesn't store non-TF objects

Reassigning to @k-w-w to take a look or triage. Thanks.

awav

comment created time in 5 months

Pull request review comment tensorflow/tensorflow

JSON Serializable checks for array and structs

 def on_epoch_end(self, epoch, logs=None):
     send = {}
     send['epoch'] = epoch
     for k, v in logs.items():
-      send[k] = v
+      if isinstance(v, (np.ndarray, np.generic)):

Can you add a comment why these two types are special cases?
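For context on why those two types are special: json.dumps rejects both NumPy arrays and NumPy scalar types, so they have to be converted to native Python values first. A sketch of the kind of conversion involved (the helper name here is mine, not the PR's):

```python
import json
import numpy as np

def to_json_value(v):
    # np.ndarray covers arrays; np.generic covers NumPy scalars such as
    # np.float32, which json.dumps also cannot serialize directly.
    if isinstance(v, np.ndarray):
        return v.item() if v.size == 1 else v.tolist()
    if isinstance(v, np.generic):
        return v.item()
    return v
```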

ashahab

comment created time in 5 months

Pull request review comment tensorflow/tensorflow

JSON Serializable checks for array and structs

 def target():
       t.join()
       assert not t.is_alive()

+  def test_RemoteMonitor_np_array(self):
+    if requests is None:
+        return
+
+    monitor = keras.callbacks.RemoteMonitor()
+    a = np.arange(1)  # a 1 by 1 array

Do those two tests cover both the np.ndarray and np.generic cases? Is it possible to assert on the send dict to make sure the items are as expected?

ashahab

comment created time in 5 months

issue closed tensorflow/tensorflow

Chief-worker should wait for all other workers.


System information

  • TensorFlow version (you are using): tensorflow1.12
  • Are you willing to contribute it (Yes/No): No

Describe the feature and the current behavior/state.
We use PS-worker mode. When I use Estimator or MonitoredTrainingSession, I should have a chief worker. The chief worker runs some chief-only hooks, like CheckpointSaverHook. But right now I can't guarantee that the chief worker quits after all other workers. If the chief worker quits before the other workers, nobody runs the CheckpointSaverHook.

Will this change the current API? How? I have no elegant idea so far.

Who will benefit from this feature? All.

Any other info. So how can I fix the problem? Maybe you can offer some good ideas.

Thanks so much :)

closed time in 5 months

shanshanpt

issue comment tensorflow/tensorflow

Chief-worker should wait for all other workers.

@shanshanpt, thanks for the question. If you're using Estimator, you can specify the hooks to run on workers in the tf.estimator.EstimatorSpec object you construct and return from your model_fn. If you're using tf.compat.v1.train.MonitoredTrainingSession, you can pass in the hooks directly. Hopefully this helps.

shanshanpt

comment created time in 5 months

delete branch rchao/keras

delete branch : rchao-patch-1

delete time in 5 months

PR closed rchao/keras

Add an option to provide `cache_subdir` to `tf.keras.datasets.mnist.load_data`.


Summary

Related Issues

PR Overview

  • [n] This PR requires new unit tests [y/n] (make sure tests are included)
  • [y] This PR requires updating the documentation [y/n] (make sure the docs are up-to-date)
  • [y] This PR is backwards compatible [y/n]
  • [y] This PR changes the current API [y/n] (all API changes need to be approved by fchollet)
+6 -2

0 comments

1 changed file

rchao

pr closed time in 5 months

PR opened rchao/keras

Add an option to provide `cache_subdir` to `tf.keras.datasets.mnist.load_data`.


Summary

Related Issues

PR Overview

  • [n] This PR requires new unit tests [y/n] (make sure tests are included)
  • [y] This PR requires updating the documentation [y/n] (make sure the docs are up-to-date)
  • [y] This PR is backwards compatible [y/n]
  • [y] This PR changes the current API [y/n] (all API changes need to be approved by fchollet)
+6 -2

0 comments

1 changed file

pr created time in 5 months

create branch rchao/keras

branch : rchao-patch-1

created branch time in 5 months
