Repositories:

tensorflow/benchmarks (764 stars): A benchmark framework for TensorFlow
reedwm/benchmarks (2 stars): Benchmark code
reedwm/models (2 stars): Models and examples built with TensorFlow
tfboyd/tensorflow (2 stars): Computation using data flow graphs for scalable machine learning
chsigg/training (1 star): Reference implementations of MLPerf benchmarks
jedisith/ovs (0 stars): Open vSwitch
reedwm/caffe2 (0 stars): Caffe2 is a lightweight, modular, and scalable deep learning framework.
reedwm/tensorflow (0 stars): An Open Source Machine Learning Framework for Everyone

issue comment tensorflow/tensorflow

Half precision training very slow and returning nan loss

Can you provide a full example that creates and runs the EfficientNet model? I'm not sure how to run EfficientNet.
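A self-contained snippet along these lines would be ideal (a hypothetical sketch; the model constructor, shapes, and policy call are placeholders, since I don't know which EfficientNet implementation you are using):

import numpy as np
import tensorflow as tf

# TF 2.1-era mixed precision API; substitute however you enable float16.
tf.keras.mixed_precision.experimental.set_policy('mixed_float16')

model = build_efficientnet()  # placeholder for your EfficientNet constructor
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Random data just to exercise the training step.
x = np.random.rand(8, 224, 224, 3).astype('float32')
y = np.random.randint(0, 1000, size=(8,))
model.fit(x, y, epochs=1)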

acmilannesta

comment created time in an hour

Pull request review comment tensorflow/tensorflow

[Grappler] Rewrite handling of list ops in auto_mixed_precision

 Status AutoMixedPrecisionImpl::Optimize() {
   return Status::OK();
 }

-// Finds data structure object ops (e.g., StackV2) and the sets of nodes that
-// write (e.g., StackPushV2) and read (e.g., StackPopV2) from them.
-Status AutoMixedPrecisionImpl::AddDataStructureOpsToMap(
-    const absl::flat_hash_set<string>& data_structure_ops,
-    TypeAttrId data_structure_type_attr,
-    const absl::flat_hash_map<string, TypeAttrId>& write_ops,
-    const absl::flat_hash_map<string, TypeAttrId>& read_ops,
-    DataStructureOpsMap* object_clients_map) const {
-  for (const NodeDef& node : graph_->node()) {
-    const auto write_iter = write_ops.find(node.op());
-    const auto read_iter = read_ops.find(node.op());
-    bool is_writer = write_iter != write_ops.end();
-    bool is_reader = read_iter != read_ops.end();
-    if (is_writer || is_reader) {
-      const NodeDef* object_node = GetTailOfChain(node, data_structure_ops);
-      if (!object_node) {
-        return errors::FailedPrecondition(
-            "No data structure op found upstream of ", node.op(), " node ",
-            node.name());
+// If node is a Tensor List op with a float32 data type attribute then this
+// returns a pointer to the NodeTypeId representing that type attribute. In
+// all other cases this returns nullptr.
+const NodeTypeId* AutoMixedPrecisionImpl::GetTensorListFloat32NodeTypeId(
+    const NodeDef& node) const {
+  if (!IsTensorListOp(node.op())) return nullptr;
+  for (const TypeAttrId& type_attr : node_type_map_.GetTypeAttrs(node)) {
+    const NodeTypeId* node_type =
+        graph_type_view_.GetNode(node.name(), type_attr);
+    if (node_type && node_type->type_attr.fixed_type == DT_INVALID &&
+        node_type->type_attr.type_index == TypeAttrId::kSingleType &&
+        IsFloat32(*node_type)) {
+      return node_type;
+    }
+  }
+  return nullptr;
+}
+
+bool AutoMixedPrecisionImpl::IsSourceOrSinkOp(const string& op) const {

Talked to @rmlarsen offline. We don't think there's a great way of checking this. It might be possible to have some heuristics, e.g. an op that takes a DT_VARIANT as input and has no outputs, or vice versa. But I'm not sure if it's worth implementing this.

Having Python tests to catch this, in addition to the C++ tests, would be nice in case we change which op is used on the Python side. But this doesn't need to be done in this PR.

benbarsdell

comment created time in 2 hours

issue comment tensorflow/tensorflow

Support a generic API for modifying loss / gradients in Keras

Almost, sorry about the delay. Should be ready in a few days.

tgaddair

comment created time in 3 hours

Pull request review comment tensorflow/tensorflow

[Grappler] Rewrite handling of list ops in auto_mixed_precision

(diff context as in the first review comment above, abridged; the comment below refers to this added comment block:)

+// Finds all clusters of float32 Tensor List nodes that are connected via their
+// handle edges. Unsafe clusters (those with edges that cross untraversable

Add a comment stating why this is useful. E.g., "The caller should paint all nodes in a cluster the same color as they may all refer to the same Tensor List".

benbarsdell

comment created time in 19 hours

Pull request review comment tensorflow/tensorflow

[Grappler] Rewrite handling of list ops in auto_mixed_precision

(diff context as in the first review comment above, abridged; the comment below refers to this branch of the DFS callback in FindFloat32TensorListOpClustersAndBlacklistUnsafe:)

+                       } else if (IsSourceOrSinkOp(node->op())) {

This is fairly contrived, but I think if you have a TensorList of TensorList handles, the TensorList itself also acts as a sink and the grappler pass will break.

Not worth addressing IMO, since I cannot think of a simple way of dealing with this case.

benbarsdell

comment created time in 20 hours

Pull request review comment tensorflow/tensorflow

[Grappler] Rewrite handling of list ops in auto_mixed_precision

(diff context as in the first review comment above, abridged; the comment below refers to this line:)

+        /*D*/ CHECK(node_type) << "No float32 type attribute found for "

Why /*D*/ CHECK? Either use a DCHECK or a CHECK. If you use CHECK, make sure to add a // Crash OK comment on the same line as the CHECK. Do the same in the other places you have /*D*/ CHECK.

benbarsdell

comment created time in 21 hours

Pull request review comment tensorflow/tensorflow

[Grappler] Rewrite handling of list ops in auto_mixed_precision

(diff context as in the first review comment above, abridged; the comment below refers to this condition in GetTensorListFloat32NodeTypeId:)

+    if (node_type && node_type->type_attr.fixed_type == DT_INVALID &&

We're in trouble if a TensorList op assumes the tensor list is fp32 and hardcodes a DT_FLOAT output

I highly doubt a TensorList op will ever do that, but can you comment that we make this assumption? The condition node_type->type_attr.fixed_type == DT_INVALID confused me at first.

You should also comment this assumes that the type attribute represents the dtype of the tensor list.

benbarsdell

comment created time in 20 hours

Pull request review comment tensorflow/tensorflow

[Grappler] Rewrite handling of list ops in auto_mixed_precision

(diff context as in the first review comment above, abridged; the comment below refers to this added function:)

+bool AutoMixedPrecisionImpl::IsSourceOrSinkOp(const string& op) const {

@rmlarsen do you know of a better way to tell if an op is a source or sink?

benbarsdell

comment created time in 20 hours

Pull request review comment tensorflow/tensorflow

[Grappler] Rewrite handling of list ops in auto_mixed_precision

 class AutoMixedPrecisionLists {
         "StridedSliceGrad",
         "Switch",
         "TensorListConcat",
+        "TensorListConcatLists",

The comment above starting with "Note: if a data structure op (such as TensorListPopBack)" is no longer correct, as it mentions AddDataStructureOpsToMap.

benbarsdell

comment created time in 20 hours

Pull request review comment tensorflow/tensorflow

[Grappler] Rewrite handling of list ops in auto_mixed_precision

 bool IsFloat32(const NodeTypeId& node_type) {
          DataType::DT_FLOAT;
 }

+bool IsTensorListOp(const string& op) {
+  return op.find("TensorList") != string::npos;
+}
+
+bool IsTensorListReaderOp(const string& op) {
+  const gtl::FlatSet<string> tensor_list_reader_ops = {

Make static? And same for the writer ops.

benbarsdell

comment created time in a day

Pull request review comment tensorflow/tensorflow

[Grappler] Rewrite handling of list ops in auto_mixed_precision

(diff context as in the first review comment above, abridged; the comment below refers to these lines:)

+                         // The cluster crosses an untraversable boundary, so
+                         // mark as unsafe.
+                         cluster_is_safe = false;

Instead of having this cluster_is_safe boolean and then painting all the cluster nodes black, can we just mark the node black here? This will still force all the tensor list nodes in the cluster to be fp32, while making the code a bit simpler.

benbarsdell

comment created time in 20 hours

Pull request review comment tensorflow/tensorflow

[Grappler] Rewrite handling of list ops in auto_mixed_precision

(diff context as in the first review comment above, abridged; the comment below refers to this check:)

+    // First add any non-processable Tensor List nodes to the black set to avoid
+    // them getting forced to white at the end of optimization.
+    if (!ShouldProcess(*root.node) &&

Can you move this check to the same place where you check IsSourceOrSinkOp? Conceptually, it seems you should have the same behavior if the node is a source/sink or if you shouldn't process it.

benbarsdell

comment created time in a day

issue comment tensorflow/tensorflow

when using mixed_precision.Policy('mixed_float16'), training is stuck at saving checkpoints for 0

I don't know what the df_tpu_estimator.py file is. Can you post a self-contained colab or example to reproduce?

Also note that the 'mixed_float16' policy will only have a performance improvement on GPUs, and otherwise there is no reason to use it. However, the policy should still work on CPUs; it will just be slow. Similarly, the 'mixed_bfloat16' policy only has a performance improvement on TPUs.
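For reference, enabling the policy looks roughly like this (a sketch; the API path below is the TF 2.1 experimental namespace and may differ in other versions):

import tensorflow as tf

policy = tf.keras.mixed_precision.experimental.Policy('mixed_float16')
tf.keras.mixed_precision.experimental.set_policy(policy)

# Layers built from here on compute in float16 but keep variables in float32.
model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(4,))])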

petkovacs19

comment created time in 6 days

Pull request review comment tensorflow/tensorflow

optimizer_v2: Improve error when called in cross-replica context

 def apply_gradients(self, grads_and_vars, name=None):
         # Distribution strategy does not support reducing an empty list of
         # gradients
         return control_flow_ops.no_op()
+
+      if distribute_ctx.in_cross_replica_context():
+        raise RuntimeError(
+            "`apply_gradients() cannot be called in cross-replica context. "
+            "Use `_distributed_apply()` instead or enter replica context "

I don't think we should recommend calling a private method. Please remove the "Use _distributed_apply() instead" part of the error, and just recommend calling experimental_run_v2.
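For context, the recommended pattern would look something like the following (a sketch; the model, loss, and data handling are placeholders):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
  model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
  optimizer = tf.keras.optimizers.SGD(0.1)

@tf.function
def train_step(features, labels):
  def step_fn(features, labels):
    with tf.GradientTape() as tape:
      loss = tf.reduce_mean((model(features) - labels) ** 2)
    grads = tape.gradient(loss, model.trainable_variables)
    # apply_gradients now runs in replica context, as the error recommends.
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
  # experimental_run_v2 is the TF 2.1 name; later releases rename it to
  # Strategy.run.
  strategy.experimental_run_v2(step_fn, args=(features, labels))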

hakos

comment created time in 6 days

Pull request review comment tensorflow/benchmarks

Add a command line option to enable automatic mixed precision.

                     'different test runs. This flag is designed for human '
                     'consumption, and does not have any impact within the '
                     'system.')
+flags.DEFINE_boolean('auto_mixed_precision', False,
+                     'enable/disable grappler optimization that implements '
+                     'automatic mixed precision support. Enabling automatic '

Add a sentence stating: "This is equivalent to --use_fp16, but uses a different implementation".

deven-amd

comment created time in 8 days

Pull request review comment tensorflow/benchmarks

Add a command line option to enable automatic mixed precision.

 def __init__(self, params, dataset=None, model=None):
     if self.params.variable_update == 'horovod' and self.params.job_name:
       raise ValueError('job_name should not be specified for Horovod.')

-    if self.params.use_fp16 and self.params.fp16_enable_auto_loss_scale:
+    self.enable_auto_loss_scale = self.params.auto_mixed_precision or (

Instead of automatically using auto loss scaling, I would use the same loss scale settings as if --use_fp16 were used, which uses a static loss scale by default. I think this just requires modifying _maybe_initialize_fp16 to create the loss scale if auto_mixed_precision is used as well.
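Concretely, something like the following (a hypothetical sketch of the suggested change; the method and flag names come from the surrounding benchmark code):

# Hypothetical sketch: reuse the --use_fp16 loss-scale setup for the new flag.
def _maybe_initialize_fp16(self):
  if self.params.use_fp16 or self.params.auto_mixed_precision:
    # ... create the (static by default) loss scale exactly as --use_fp16
    # does today ...
    pass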

deven-amd

comment created time in 8 days

Pull request review comment tensorflow/benchmarks

Add a command line option to enable automatic mixed precision.

                     'different test runs. This flag is designed for human '
                     'consumption, and does not have any impact within the '
                     'system.')
+flags.DEFINE_boolean('auto_mixed_precision', False,

Dynamic loss scaling is only applied when the function tf.train.experimental.enable_mixed_precision_graph_rewrite is called. In your case, you are instead directly enabling the graph rewrite by modifying config.graph_options.rewrite_options.auto_mixed_precision, so loss scaling does not automatically occur.
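In other words, the two ways of enabling the rewrite differ (a sketch, assuming a TF 1.x-style session config as used in this benchmark):

import tensorflow.compat.v1 as tf
from tensorflow.core.protobuf import rewriter_config_pb2

opt = tf.train.GradientDescentOptimizer(0.1)

# Path 1: wraps the optimizer in dynamic loss scaling AND enables the rewrite.
opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)

# Path 2: enables only the graph rewrite; no loss scaling is added.
config = tf.ConfigProto()
config.graph_options.rewrite_options.auto_mixed_precision = (
    rewriter_config_pb2.RewriterConfig.ON)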

deven-amd

comment created time in 9 days

Pull request review comment tensorflow/benchmarks

Add a command line option to enable automatic mixed precision.

                     'different test runs. This flag is designed for human '
                     'consumption, and does not have any impact within the '
                     'system.')
+flags.DEFINE_boolean('auto_mixed_precision', False,

Loss scaling is needed when mixed precision is used. Currently, loss scaling is only used when --use_fp16 is specified, so please change that logic so it's also used when --auto_mixed_precision is used.

deven-amd

comment created time in 9 days

pull request comment tensorflow/tensorflow

Fix typos in the base_layer input casting warning

Thanks for the typo fixes! I didn't realize how often I used the wrong form of "its".

mitchellvitez

comment created time in 14 days

pull request comment tensorflow/tensorflow

[Intel MKL] Automatic BFloat16 converter

I found another issue: MKL bfloat16 ops crash in eager mode. For example, the following results in a segmentation fault:

import tensorflow as tf
i = tf.ones([2, 8, 8, 1], dtype='bfloat16')
f = tf.ones([3, 3, 1, 6], dtype='bfloat16')
print('Doing conv2d:')
x = tf.nn.conv2d(i, f, strides=[1, 1, 1, 1], padding='SAME')

This does not affect the grappler pass, which only runs in graph mode. However, it will prevent bfloat16 from being used with the tf.keras.mixed_precision API. It also makes it harder to debug mixed precision issues. Can you please look into this?

nhasabni

comment created time in 16 days

pull request comment tensorflow/tensorflow

[Intel MKL] Automatic BFloat16 converter

Thank you for looking into this!

I tried to get a reproducer for Conv2D, but in doing so I ran into another bug. When I run a Conv2D and a batch norm in float32, then change the Conv2D to bfloat16 and run again, I sometimes get a crash. For example, the following will crash with the error message "free(): invalid next size (fast)".

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

def build_model(bfloat16_conv):
  x = tf.Variable(tf.truncated_normal([1, 1, 1, 1]))
  f = tf.Variable(tf.truncated_normal([1, 1, 1, 2]))
  if bfloat16_conv:
    x = tf.cast(x, 'bfloat16')
    f = tf.cast(f, 'bfloat16')
  x = tf.nn.conv2d(x, f, strides=[1, 1, 1, 1], padding='SAME')
  if bfloat16_conv:
    x = tf.cast(x, 'float32')
  s = tf.Variable(tf.truncated_normal([2]))
  o = tf.Variable(tf.truncated_normal([2]))
  y, _, _ = tf.nn.fused_batch_norm(x, scale=s, offset=o)
  return y

output32 = build_model(bfloat16_conv=False)
output16 = build_model(bfloat16_conv=True)

with tf.Session() as sess:
  print('Computing with float32')
  sess.run(tf.global_variables_initializer())
  sess.run(output32)

  for _ in range(20):
    print('Computing with bfloat16')
    sess.run(tf.global_variables_initializer())
    sess.run(output16)

Interestingly, the issue occurs even if I replace bfloat16 with float16 or float64, indicating the issue does not involve bfloat16 itself. However, the issue only occurs when I compile with MKL.

This issue is affecting the unit tests, but I can work around it for now.

nhasabni

comment created time in 17 days

pull request comment tensorflow/tensorflow

[Intel MKL] Automatic BFloat16 converter

Do you think you can address this before I continue working on the PR? Most of the Python auto_mixed_precision tests sometimes fail when MKL is enabled. What makes it especially difficult is that the tests only fail some of the time, as the issue is nondeterministic, so it can be hard to tell whether a test is buggy.

I am also worried that users may end up affected by this issue when they enable the grappler pass.

If fixing this is difficult, and you haven't observed this affecting model quality in practice, I can continue working on the PR.

nhasabni

comment created time in 17 days

pull request comment tensorflow/tensorflow

[Intel MKL] Automatic BFloat16 converter

I am not blocked, as I can work around it for now by disabling certain tests.

nhasabni

comment created time in 20 days

issue comment tensorflow/tensorflow

Support a generic API for modifying loss / gradients in Keras

@tgaddair, Optimizer will call _aggregate_gradients if and only if experimental_aggregate_gradients is True (the default). If the user has no distribution strategy, then the default distribution strategy is used, which does nothing when you ask it to all-reduce.

This means if you override _aggregate_gradients, it will be called if the user explicitly calls apply_gradients. In 2.1, apply_gradients does not cause Horovod to aggregate gradients, since that is only done in get_gradients.

Do you think this will be an issue? This will cause problems if a user explicitly calls DistributedOptimizer.apply_gradients in TF 2.0/2.1 and then upgrades to 2.2. This is because apply_gradients previously did not all-reduce, so the user must have been all-reducing using some other mechanism. In 2.2, according to the current plan, DistributedOptimizer.apply_gradients will all-reduce (unless the user changes their code to pass experimental_aggregate_gradients=False), so the gradients will be all-reduced twice.

tgaddair

comment created time in 21 days

issue comment tensorflow/tensorflow

Support a generic API for modifying loss / gradients in Keras

@omalleyt12 and I just discussed this, and we have a slightly different proposal.

Instead of a _transform_unaggregated_gradients method, we'll add an _aggregate_gradients method with essentially the same semantics. These two lines in Optimizer, which aggregate the gradients, will be moved to _aggregate_gradients, and apply_gradients will call _aggregate_gradients. Horovod's optimizer can override _aggregate_gradients to do the all-reduce.
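To make this concrete, Horovod's override could look roughly like the following (a hypothetical sketch; the final signature of _aggregate_gradients is not settled, and here it is assumed to take and return a list of gradients, matching the fit() snippet further down):

import horovod.tensorflow as hvd

class DistributedOptimizer(tf.keras.optimizers.Optimizer):

  def _aggregate_gradients(self, grads):
    # The base implementation would reduce across tf.distribute replicas,
    # which is a no-op under the default strategy.
    grads = super()._aggregate_gradients(grads)
    return [hvd.allreduce(g) for g in grads]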

To support having a LossScaleOptimizer wrap a Horovod optimizer and have the gradients all-reduced in fp16, we need to make sure the gradients are unscaled after being all-reduced. To do this, we will introduce an experimental_aggregate_gradients parameter to Optimizer.apply_gradients, which defaults to True. If False, _aggregate_gradients will not be called. In Model.fit(), the model will explicitly call opt._aggregate_gradients, then call opt.get_unscaled_gradients, then call opt.apply_gradients and pass experimental_aggregate_gradients=False. This way, gradients are unscaled after the all-reduce, allowing the all-reduce to occur in fp16.

More specifically, these four lines, which compute and apply gradients as part of Model.fit(), will be replaced with:

grads = tape.gradient(scaled_total_loss, trainable_weights)
grads = model.optimizer._aggregate_gradients(grads)
if isinstance(model.optimizer,
              loss_scale_optimizer.LossScaleOptimizer):
  grads = model.optimizer.get_unscaled_gradients(grads)
model.optimizer.apply_gradients(zip(grads, trainable_weights), 
                                experimental_aggregate_gradients=False)

This also allows gradients to be all-reduced in fp16 with tf.distribute.Strategy as well. I will probably add an allreduce_in_fp16=True argument to the LossScaleOptimizer constructor to make this behavior customizable.

I think there are two main advantages to this over the previous _transform_unaggregated_gradients short-term proposal for 2.2:

  1. It is unlikely we'll remove the _aggregate_gradients method, as the method arguably makes the Optimizer code cleaner, and so it doesn't just exist for the sake of Horovod. I cannot promise we will keep the method, however.
  2. The name _transform_unaggregated_gradients was somewhat misleading, as it would transform scaled gradients, while users would likely expect the unscaled gradients to be transformed.

@DEKHTIARJonathan, @tgaddair, WDYT?

tgaddair

comment created time in 21 days

issue comment tensorflow/tensorflow

Support a generic API for modifying loss / gradients in Keras

@omalleyt12 can confirm, but I now think it's not feasible to revert c73c99ca3e0bacf2bca313f270bb3eae28869530, the change which introduced this behavior. Assuming we cannot revert it, I don't think we have an alternative other than a short-term fix. This is because we don't have time to go through design review and finalize an API in time for 2.2.

@omalleyt12, I'm not sure we have a tentative plan yet, but certainly one option would be to keep the _transform_unaggregated_gradients method. I think we should be prepared to remove it, however, if a different design is decided on. There is a decent chance we will remove or rename the method.

@DEKHTIARJonathan, I don't think you would need 2.2-specific code. Instead, you could move this code in get_gradients to the _transform_unaggregated_gradients method and call it from get_gradients. You'd also have to move the self._get_gradients_used = True line to that method, and likely rename the field. Then you would support TF 2.0 through TF 2.2, and potentially TF 2.3 if we keep the method without renaming it.

tgaddair

comment created time in 21 days

issue comment tensorflow/tensorflow

Support a generic API for modifying loss / gradients in Keras

Great, I'll implement _transform_unaggregated_gradients for now.

One other question: I noticed that there is some additional gradient clipping logic here in Optimizer.minimize that doesn't exist in training_eager.py. Am I correct, or is it being executed somewhere else?

IIRC Model.fit (incorrectly) never did this gradient clipping. I could be wrong though, @omalleyt12 can you confirm?

tgaddair

comment created time in 22 days

issue comment tensorflow/tensorflow

Support a generic API for modifying loss / gradients in Keras

I just talked to @omalleyt12 and @fchollet. We agreed the best way to solve this for 2.2 is to introduce a _transform_unaggregated_gradients method on Optimizer (but maybe with a different name). By default, this would do nothing, and the only place it would be called is right above this line. Overriding this would allow the scaled gradients to be processed. This method would be private and undocumented, so it would not be subject to TensorFlow's backwards compatibility guarantee.

For 2.3, we plan on designing and implementing the full API, and potentially removing the _transform_unaggregated_gradients method. I think this will make it relatively easy for Horovod to support 2.1, 2.2, and 2.3+ for a long time. The _transform_unaggregated_gradients override could be kept indefinitely in Horovod. If Optimizer removes that method in 2.3, the Horovod override would simply be ignored. For 2.3, Horovod could use the new API, potentially requiring a check of tf.__version__.

If I'm understanding correctly, this issue only affects users who call Model.fit, not those who use a custom training loop. This means users would never have to directly call _transform_unaggregated_gradients, and so it would only be called by Model.fit.

Let me know your thoughts.

tgaddair

comment created time in 22 days

issue comment tensorflow/tensorflow

Support a generic API for modifying loss / gradients in Keras

@tgaddair I was thinking more of calling _aggregate_gradients after get_unscaled_gradients, because I wanted to factor out these two lines into their own aggregate_gradients method, which apply_gradients would call. Then Keras would not have to explicitly call _aggregate_gradients. But I just realized this won't work if you want to all-reduce in fp16, since then you must aggregate gradients before unscaling them.

I liked factoring out the two lines since we didn't have to do anything special to support Horovod. That change makes sense even without taking Horovod into account, since it (arguably) improves readability of the OptimizerV2 class.

How important is supporting LossScaleOptimizer wrapping DistributedOptimizer with fp16 all-reduce in Model.fit()? Does anyone currently do this? Currently, it's impossible to all-reduce in fp16 with tf.distribute.Strategy even with a custom training loop, so Horovod is already an improvement in that regard. We will introduce the ability to all-reduce in fp16 in a custom training loop in 2.2 for Strategy, but still not in fit() until 2.3.

I will keep trying to think of a way to support Horovod with fp16 gradients in fit() for 2.2 without API changes or major implementation changes. Ideally, we wouldn't have to change the core training step in training_eager.py but I am not personally opposed to doing so if there is no other alternative.

tgaddair

comment created time in 22 days

issue comment tensorflow/tensorflow

Support a generic API for modifying loss / gradients in Keras

For our case with Horovod, we cannot implement the aggregation in apply_gradients because of Keras' current use of the LossScaleOptimizer in this code path, which will have get_unscaled_gradients called before apply_gradients.

Is the issue with all-reducing in apply_gradients that you all-reduce in fp16, and therefore must all-reduce the scaled gradients when a LossScaleOptimizer is used?

Prior to TensorFlow 2.2, Keras supported experimental_run_tf_function=False, which allowed us to circumvent this code path and use the Optimizer's get_gradients method instead. Would it be possible to restore this functionality until a solution to this issue is implemented?

@omalleyt12 do you know the answer to this?

If the answer is "no", perhaps we could implement, but not publicly document, the _aggregate_gradients function for 2.2. This could be considered an internal refactor without any API changes, but it would allow Horovod, or any other user, to override it to do arbitrary gradient processing. If we later choose to go a different route, say a callback mechanism in Model, we could delete the function, or keep it, since it's probably good to have that function for readability purposes anyway. If we do delete it, Horovod would have to use whatever mechanism we replace it with, but I think having to keep the _aggregate_gradients override in Horovod for 2.2 support is a lot easier than having to keep the very hacky LossScaleOptimizer subclass.

@omalleyt12 @alextp @tgaddair WDYT?

tgaddair

comment created time in 23 days

pull request comment tensorflow/tensorflow

[Intel MKL] Automatic BFloat16 converter

Thank you for the fix, it worked great! If this is a good permanent fix, can you please make the fixes in a separate PR? That way this PR can be dedicated just to the grappler pass.

Also, when running the auto_mixed_precision tests, I am having trouble passing a test that runs conv3d. In particular, it looks like the gradients with respect to the filter are sometimes incorrect when bfloat16 is used with MKL. I created a stand-alone program to reproduce the conv3d issue:

import numpy as np
import tensorflow.compat.v1 as tf
tf.compat.v1.disable_v2_behavior()

np.random.seed(0)
x_np = np.random.normal(size=(2, 4, 4, 4, 1))
f_np = np.ones((2, 1, 1, 1, 1))
x = tf.Variable(x_np, dtype='float32')
f = tf.Variable(f_np, dtype='float32')

def compute_grad(x, f):
  y = tf.nn.conv3d(x, f, strides=[1, 1, 1, 1, 1], padding='SAME')
  (g,) = tf.gradients(y, [f])
  g = tf.cast(g, 'float32')
  with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    return sess.run(g)

grad = compute_grad(x, f)
grad16 = compute_grad(tf.cast(x, 'bfloat16'), tf.cast(f, 'bfloat16'))
print('f32 grad: %s' % np.reshape(grad, (2,)))
print('bf16 grad: %s' % np.reshape(grad16, (2,)))
tf.test.TestCase().assertAllClose(grad, grad16, atol=1e-1, rtol=1e-1)

With MKL, the output is typically

f32 grad: [ 17.3849678   10.21814537]
bf16 grad: [ 17.375   6.75 ]

Then it fails on the assertion. It occasionally passes with MKL, so the gradient op does not appear to be deterministic. I tried reproducing with smaller shapes, but it would pass if I decreased the shape size.

This passes if TensorFlow is built without MKL. In that case, I think XLA will automatically run the op, since there is no non-XLA non-MKL CPU bfloat16 version of conv3d.

nhasabni

comment created time in 23 days

issue comment tensorflow/tensorflow

Support a generic API for modifying loss / gradients in Keras

My thoughts on the proposal so far:

@DEKHTIARJonathan, thanks for prototyping this. I think this approach is reasonable, except that I'm a bit worried about the "priority" field. If optimizers are nested, it seems unintuitive that the order of nesting would sometimes be ignored in favor of priorities. If the Horovod optimizer overrode _aggregate_gradients (as @omalleyt12 proposed), and the LossScaleOptimizer overrode _transform_aggregated_gradients, the correct order would occur no matter how the Horovod optimizer and LossScaleOptimizer were ordered.

As for using a WrappingInterfaceOptimizer in general: one issue is that this will make all-reducing in fp16 difficult if tf.distribute.Strategy is used. As I mentioned in this comment, if we all-reduce in fp16, we must all-reduce the scaled gradients instead of the unscaled gradients. If we do this by overriding _transform_aggregated_gradients to unscale gradients, then the user will have to explicitly scale the loss, but not unscale the gradients. This asymmetry makes it too easy for the user to forget to scale the loss, or accidentally unscale the gradients twice.

I'm not sure how to solve this asymmetry problem, but I will give this more thought.

tgaddair

comment created time in 23 days

issue comment tensorflow/tensorflow

Support a generic API for modifying loss / gradients in Keras

There was an internal discussion on this issue and we have decided we want some sort of hook mechanism, such as the one @tgaddair proposed and @omalleyt12 expanded on. However, we have not yet decided on a concrete proposal. We could go with the proposal detailed by @omalleyt12 in this issue. Alternatively, to avoid deeply nesting optimizers, the base Optimizer could take a "Callbacks" object or a list of "Callbacks" objects, and optimizers would execute their callbacks in order. A third approach would be to have the callbacks on a Keras Model instead of the Optimizer.
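A rough sketch of what the second option might look like (entirely hypothetical; no such API exists, and all names here are illustrative only):

class GradientCallback:
  """Hypothetical hook object a base Optimizer could accept."""

  def transform_loss(self, loss):
    return loss

  def aggregate_gradients(self, grads_and_vars):
    return grads_and_vars

  def transform_aggregated_gradients(self, grads_and_vars):
    return grads_and_vars

# The base Optimizer would then run each callback in order, e.g.:
# opt = tf.keras.optimizers.SGD(0.1, callbacks=[HorovodAllreduce(), GradientClipper()])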

Unfortunately, it's unlikely we'll have this implemented and submitted in TensorFlow by 2.2. This will probably require a design review.

@tgaddair, would it be possible to override Optimizer.apply_gradients instead of Optimizer.get_gradients? You could perform the all-reduce in apply_gradients, then call super().apply_gradients. If this would not work, subclassing LossScaleOptimizer would work for 2.2. By 2.3, we would (hopefully) have a solution implemented so you could switch to that. Subclassing LossScaleOptimizer is very hacky, but I think doing so for a single release is acceptable if there is no alternative. Let me know your thoughts.

tgaddair

comment created time in 23 days

Pull request review comment tensorflow/tensorflow

Change TrtConversionParams to class from NamedTuple and export it

(diff context abridged; the comment below refers to this replacement of the TrtConversionParams namedtuple with a plain class:)

+@tf_export("experimental.tensorrt.ConversionParams", v1=[])
+class TrtConversionParams(object):

Drive-by comment: can you change this to subclass namedtuple, similar to what is done here? That way this still remains a namedtuple, and you don't have to reimplement part of the namedtuple interface.
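For illustration, the suggested pattern looks roughly like this (a sketch; the field list comes from the removed namedtuple above, and the defaults mirror DEFAULT_TRT_CONVERSION_PARAMS, with the precision mode shown as a plain string):

import collections

class TrtConversionParams(
    collections.namedtuple("TrtConversionParams", [
        "rewriter_config_template", "max_workspace_size_bytes",
        "precision_mode", "minimum_segment_size", "is_dynamic_op",
        "maximum_cached_engines", "use_calibration", "max_batch_size"
    ])):

  def __new__(cls,
              rewriter_config_template=None,
              max_workspace_size_bytes=1 << 30,
              precision_mode="FP32",
              minimum_segment_size=3,
              is_dynamic_op=True,
              maximum_cached_engines=1,
              use_calibration=True,
              max_batch_size=1):
    # Defaults live in __new__, so callers get a namedtuple with the usual
    # interface (indexing, _replace, etc.) without reimplementing it.
    return super(TrtConversionParams, cls).__new__(
        cls, rewriter_config_template, max_workspace_size_bytes,
        precision_mode, minimum_segment_size, is_dynamic_op,
        maximum_cached_engines, use_calibration, max_batch_size)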

pooyadavoodi

comment created time in 24 days

pull request comment tensorflow/tensorflow

[Intel MKL] Automatic BFloat16 converter

When I try to compile with XLA disabled, I get the following error:

Traceback (most recent call last):
  File "/home/reedwm/venvs/mkl_source_noxla/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1367, in _do_call
    return fn(*args)
  File "/home/reedwm/venvs/mkl_source_noxla/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    self._extend_graph()
  File "/home/reedwm/venvs/mkl_source_noxla/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1390, in _extend_graph
    tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'Conv3D' used by {{node Conv3D}} with these attrs: [dilations=[1, 1, 1, 1, 1], padding="SAME", T=DT_BFLOAT16, data_format="NDHWC", strides=[1, 1, 1, 1, 1]]
Registered devices: [CPU]
Registered kernels:
  device='CPU'; T in [DT_DOUBLE]
  device='CPU'; T in [DT_FLOAT]
  device='CPU'; T in [DT_HALF]

	 [[Conv3D]]

In this case, it seems the Conv3D op is not properly being converted to the MKL version.

nhasabni

comment created time in 24 days

issue commenttensorflow/tensorflow

Support a generic API for modifying loss / gradients in Keras

My biggest objection so far is that it makes scaling the loss and unscaling gradients asymmetric, as a user has to explicitly scale the loss, but not unscale the gradients. For example, LossScaleOptimizer would be implemented as follows

class LossScaleOptimizer(Optimizer):
  def __init__(self, optimizer):
    self._optimizer = optimizer
    self._loss_scale = ...
  
  def get_scaled_loss(self, loss):
    return loss * self._loss_scale()

  def get_unscaled_gradients(self, gradients):
    return gradients / self._loss_scale()

  def _transform_loss(self, loss):
    loss = super()._transform_loss(loss)
    loss = self.get_scaled_loss(loss)
    return self._optimizer._transform_loss(loss)

  def _aggregate_gradients(self, gradients):
    gradients = tf.cast(gradients, "float16")  # All-reduce in fp16 for performance
    gradients = super()._aggregate_gradients(gradients)
    return tf.cast(gradients, "float32")

  def _transform_aggregated_gradients(self, gradients):
    gradients = super()._transform_aggregated_gradients(gradients)
    gradients = self.get_unscaled_gradients(gradients)
    return self._optimizer._transform_aggregated_gradients(gradients)

  ... # Delegate all other methods to self._optimizer

In fit(), everything works as expected. But in a custom training loop, if the user calls apply_gradients instead of minimize, they must call get_scaled_loss (or _transform_loss) but not get_unscaled_gradients:

with tf.GradientTape() as tape:
    loss = loss_fn(features, labels)
    scaled_loss = optimizer.get_scaled_loss(loss)
scaled_grads = tape.gradient(scaled_loss, model.trainable_variables)
# apply_gradients will unscale gradients, but not scale loss
optimizer.apply_gradients(list(zip(scaled_grads,
                                   model.trainable_variables)))

Alternatively, the user could have called _transform_loss instead of get_scaled_loss. This removes the obvious asymmetry between scaling loss and unscaling gradients, but it makes it less clear what's going on.

I think this problem is not specific to loss scaling. In general, it's somewhat unclear exactly what callback methods apply_gradients is calling.

Another potential issue is that this approach has no way of implementing a loop for loss scaling, where the gradients are computed with loss scaling and, if they contain NaNs, the loss scale is lowered and the gradients are recomputed until they are NaN-free. This is what the LossScaleGradientTape does, but it currently cannot be used with Keras.fit(). IMO, this isn't a big deal, as I think the benefits from the loop are negligible, but others may disagree. I like the LossScaleGradientTape not because of the loop, but because it presents a very nice API.
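
For illustration, such a loop could look roughly like this in eager mode (a minimal sketch, not the actual LossScaleGradientTape implementation; loss_scale is assumed to be a tf.Variable holding the current scale):

import tensorflow as tf

def gradients_with_dynamic_loss_scale(loss_fn, variables, loss_scale):
  while True:
    with tf.GradientTape() as tape:
      scaled_loss = loss_fn() * loss_scale
    scaled_grads = tape.gradient(scaled_loss, variables)
    grads = [g / loss_scale for g in scaled_grads]
    # If every gradient is finite, we are done.
    if all(bool(tf.reduce_all(tf.math.is_finite(g))) for g in grads):
      return grads
    loss_scale.assign(loss_scale / 2.)  # NaN/Inf found: lower scale, retry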

I don't really see how we can modify the OptimizerV2 class to avoid having to rewrite a lot of boilerplate code each time we want to implement a "wrapping optimizer"

We could still implement a "wrapping optimizer" but keep Tom's proposal. The "wrapping optimizer" would simply delegate to an inner optimizer (and implement the subclassing trick if we can reach consensus on whether that should be done). A user could subclass the wrapping optimizer to implement the callback methods like _transform_loss.

tgaddair

comment created time in 24 days

issue commenttensorflow/tensorflow

Support a generic API for modifying loss / gradients in Keras

This looks reasonable. If every hook-based optimizer uses composition, I don't think we need the CombinedOptimizer, right? I'm not a big fan of CombinedOptimizer, since it can only call apply_gradients on one optimizer. Also for your LossScaleOptimizer example, I think _process_loss has to call super()._process_loss, then pass the return value to _get_scaled_loss

Also, we should probably do the dynamic subclassing trick Horovod does here and expose a helper method to do this automatically. @DEKHTIARJonathan had a PR, #31578, to do this for LossScaleOptimizer. But I couldn't decide if it was the right approach, so I stalled accepting it (sorry about that!)

Also maybe we should simply have Optimizer take a list of callbacks instead? That way we don't have to deal with composition or inheritance.

Anyway this approach is looking pretty good to me, but others may disagree. Let's wait until we get more feedback.

tgaddair

comment created time in a month

issue commenttensorflow/tensorflow

Support a generic API for modifying loss / gradients in Keras

Thanks for the ideas @omalleyt12, @tgaddair, and @DEKHTIARJonathan. I have a few questions:

  1. How would a user combine multiple optimizer subclasses, each defining hooks? For example, I may have two optimizer subclasses, OptimizerA and OptimizerB, each of which overrides _process_unaggregated_gradients. One way would be to define an optimizer subclassing both:

    class MyOptimizer(OptimizerA, OptimizerB):
        pass
    

    But this is somewhat irritating, especially considering the order of the bases matters here. With the WrappedOptimizer concept, instead users could simply nest optimizers. The nesting structure would also make it more obvious the order of nesting matters.

  2. When subclassing to override hooks, would you override the base Optimizer or a concrete optimizer like Adam? If overriding Adam, what if you want your hooks to apply to a variety of optimizers? If overriding Optimizer, users would have to do a similar trick as above with multiple inheritance:

    class MyOptimizer(tf.keras.optimizers.Adam, OptimizerA):
        pass
    
  3. How would users use a LossScaleOptimizer with a CTL? Currently, they directly call LossScaleOptimizer.get_scaled_loss and LossScaleOptimizer.get_unscaled_gradients. Would they now instead have to call LossScaleOptimizer._process_loss and LossScaleOptimizer._process_aggregated_gradients? That seems less intuitive. We could also recommend they call minimize, but I want to make using mixed precision as easy and seamless as possible, and switching to minimize, even in its new proposed form, is a bit of work.

I don't think any of these issues are necessarily deal breakers, but I want to see if you have any ideas that I am not considering.

tgaddair

comment created time in a month

issue commenttensorflow/tensorflow

Support a generic API for modifying loss / gradients in Keras

Thank you for the proposal. Subclassing the experimental LossScaleOptimizer and overriding get_scaled_loss and get_unscaled_gradients is hacky and not what LossScaleOptimizer is intended for, so I agree this needs to be addressed.

One thing I'm confused about is whether Keras would expect a WrappedOptimizer or a non-wrapped optimizer. Would it use isinstance to determine if the optimizer is a WrappedOptimizer, and only call before_compute_gradients_hook, etc, if the optimizer is a wrapped optimizer? Also how would nesting of WrappedOptimizers work, if, e.g., a user wanted a LossScaleDistributedOptimizer?

We also should have an after_allreduce_gradients_hook(grads_and_vars) since Optimizer itself will all-reduce gradients if tf.distribute.Strategy is used. This isn't useful for Horovod which does its own reductions, but would be useful for other purposes, such as unscaling the fp16 gradients and casting them to fp32 after the all-reduce is done.
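
For illustration, such a hook could do something like the following (a sketch only; the hook name matches the proposal above, and where the loss scale comes from is an assumption):

import tensorflow as tf

def after_allreduce_gradients_hook(grads_and_vars, loss_scale):
  # Unscale the (possibly fp16) all-reduced gradients and cast them to fp32.
  return [(tf.cast(g, tf.float32) / loss_scale, v)
          for g, v in grads_and_vars]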

/CC @alextp @zongweiz @omalleyt12, thoughts?

tgaddair

comment created time in a month

pull request commenttensorflow/tensorflow

[Intel MKL] Automatic BFloat16 converter

When I build a bfloat16 graph explicitly, I get a "No registered '_MklConv3DBackpropInputV2' OpKernel" error. For example, when I build a pip package with:

bazel build  --test_output=errors --config=mkl --copt=-DENABLE_INTEL_MKL_BFLOAT16 -c opt //tensorflow/tools/pip_package:build_pip_package

Then run the following code:

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

x = tf.ones([1, 1, 1, 1, 1], dtype='bfloat16')
f = tf.ones([1, 1, 1, 1, 1], dtype='bfloat16')
y = tf.nn.conv3d(x, f, strides=[1, 1, 1, 1, 1], padding='SAME')
g = tf.gradients(y, x)
with tf.Session() as sess:
  sess.run(g)

I get the following error:

2020-02-01 03:26:16.679048: E tensorflow/core/common_runtime/executor.cc:661] Executor failed to create kernel. Not found: No registered '_MklConv3DBackpropInputV2' OpKernel for 'XLA_CPU' devices compatible with node {{node gradients/Conv3D_grad/Conv3DBackpropInputV2}}
	.  Registered:  device='CPU'; label='MklLayoutDependentOp'; T in [DT_BFLOAT16]
  device='CPU'; label='MklLayoutDependentOp'; T in [DT_FLOAT]

	 [[gradients/Conv3D_grad/Conv3DBackpropInputV2]]
Traceback (most recent call last):
  File "/home/reedwm/venvs/mkl_source/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1367, in _do_call
    return fn(*args)
  File "/home/reedwm/venvs/mkl_source/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1352, in _run_fn
    target_list, run_metadata)
  File "/home/reedwm/venvs/mkl_source/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1445, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: No registered '_MklConv3DBackpropInputV2' OpKernel for 'XLA_CPU' devices compatible with node {{node gradients/Conv3D_grad/Conv3DBackpropInputV2}}
	.  Registered:  device='CPU'; label='MklLayoutDependentOp'; T in [DT_BFLOAT16]
  device='CPU'; label='MklLayoutDependentOp'; T in [DT_FLOAT]

	 [[gradients/Conv3D_grad/Conv3DBackpropInputV2]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test.py", line 9, in <module>
    sess.run(g)
  File "/home/reedwm/venvs/mkl_source/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 960, in run
    run_metadata_ptr)
  File "/home/reedwm/venvs/mkl_source/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1183, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/reedwm/venvs/mkl_source/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1361, in _do_run
    run_metadata)
  File "/home/reedwm/venvs/mkl_source/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1386, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: No registered '_MklConv3DBackpropInputV2' OpKernel for 'XLA_CPU' devices compatible with node node gradients/Conv3D_grad/Conv3DBackpropInputV2 (defined at test.py:7) 
	.  Registered:  device='CPU'; label='MklLayoutDependentOp'; T in [DT_BFLOAT16]
  device='CPU'; label='MklLayoutDependentOp'; T in [DT_FLOAT]

	 [[gradients/Conv3D_grad/Conv3DBackpropInputV2]]

Errors may have originated from an input operation.
Input Source operations connected to node gradients/Conv3D_grad/Conv3DBackpropInputV2:
 ones_1 (defined at test.py:5)

I am explicitly using bfloat16 to debug a test from the grappler pass. It's hard to debug effectively if I cannot run graphs where I explicitly use bfloat16. Do you know what the issue is?

nhasabni

comment created time in a month

pull request commenttensorflow/tensorflow

[Intel MKL] Automatic BFloat16 converter

Ah it looks like this is probably solved by #36395.

nhasabni

comment created time in a month

pull request commenttensorflow/tensorflow

[Intel MKL] Automatic BFloat16 converter

@nhasabni I merged with the head of master (419ebe51e023e871590b19eb4df1c1fdbe9da51e) to resolve a different issue, but when I try to compile with MKL, I get the following error

tensorflow/core/kernels/mkl_conv_grad_input_ops.cc: In instantiation of 'tensorflow::TensorShape tensorflow::MklConvCustomBackpropInputOp<Device, T, is_depthwise, eager_mode>::MakeInputTfShape(tensorflow::OpKernelContext*, const tensorflow::Tensor&) [with Device = Eigen::ThreadPoolDevice; T = tensorflow::bfloat16; bool is_depthwise = true; bool eager_mode = false]':
tensorflow/core/kernels/mkl_conv_grad_input_ops.cc:333:50:   required from 'void tensorflow::MklConvCustomBackpropInputOp<Device, T, is_depthwise, eager_mode>::Compute(tensorflow::OpKernelContext*) [with Device = Eigen::ThreadPoolDevice; T = tensorflow::bfloat16; bool is_depthwise = true; bool eager_mode = false]'
tensorflow/core/kernels/mkl_conv_grad_input_ops.cc:613:1:   required from here
tensorflow/core/kernels/mkl_conv_grad_input_ops.cc:526:20: error: 'class tensorflow::MklConvCustomBackpropInputOp<Eigen::ThreadPoolDevice, tensorflow::bfloat16, true, false>' has no member named 'MakeShape'; did you mean 'MakeInputTfShape'?
     CHECK_EQ(this->MakeShape(input_tensor, &input_tf_shape).ok(), true);
              ~~~~~~^
./tensorflow/core/platform/default/logging.h:399:64: note: in definition of macro 'CHECK_OP_LOG'
                  ::tensorflow::internal::GetReferenceableValue(val1), \
                                                                ^~~~
./tensorflow/core/platform/default/logging.h:407:30: note: in expansion of macro 'CHECK_OP'
 #define CHECK_EQ(val1, val2) CHECK_OP(Check_EQ, ==, val1, val2)
                              ^~~~~~~~
tensorflow/core/kernels/mkl_conv_grad_input_ops.cc:526:5: note: in expansion of macro 'CHECK_EQ'
     CHECK_EQ(this->MakeShape(input_tensor, &input_tf_shape).ok(), true);
     ^

This occurs even on a clean branch without the changes of this PR applied. Can you resolve this issue? This might be caused by 271f6bb49d2140b4c1bca88391caedd1791561cf.

nhasabni

comment created time in a month

pull request commenttensorflow/tensorflow

[Intel MKL] Automatic BFloat16 converter

FYI, it looks like MKL cannot be compiled in debug mode. For example, when I run the command

bazel build  --test_output=errors --config=mkl --copt=-DENABLE_INTEL_MKL_BFLOAT16 -c dbg //tensorflow/core/kernels:mkl_fused_batch_norm_op

I get the error:

In file included from ./tensorflow/core/lib/core/arena.h:21:0,
                 from ./tensorflow/core/graph/graph.h:49,
                 from ./tensorflow/core/graph/mkl_graph_util.h:22,
                 from ./tensorflow/core/util/mkl_util.h:31,
                 from tensorflow/core/kernels/mkl_fused_batch_norm_op.cc:22:
./tensorflow/core/util/mkl_util.h: In member function 'bool tensorflow::MklDnnShape::CompareMklDnnLayouts(const mkldnn::memory::desc&, const mkldnn::memory::desc&) const':
./tensorflow/core/util/mkl_util.h:341:32: error: no match for 'operator==' (operand types are 'mkldnn_primitive_kind_t' and 'mkldnn::primitive::kind')
     assert(mdd1.primitive_kind == mkldnn::primitive::kind::memory);
            ~~~~~~~~~~~~~~~~~~~~^~~~
./tensorflow/core/util/mkl_util.h:341:32: note: candidate: operator==(mkldnn::primitive::kind, mkldnn::primitive::kind) <built-in>
./tensorflow/core/util/mkl_util.h:341:32: note:   no known conversion for argument 1 from 'mkldnn_primitive_kind_t' to 'mkldnn::primitive::kind'
./tensorflow/core/util/mkl_util.h:341:32: note: candidate: operator==(mkldnn_primitive_kind_t, mkldnn_primitive_kind_t) <built-in>
./tensorflow/core/util/mkl_util.h:341:32: note:   no known conversion for argument 2 from 'mkldnn::primitive::kind' to 'mkldnn_primitive_kind_t'

Adding --copt=-DNDEBUG to bazel seems to fix this, but this is somewhat inconvenient.

nhasabni

comment created time in a month

pull request commenttensorflow/tensorflow

[Intel MKL] Automatic BFloat16 converter

I have an Intel® Xeon® Processor E5-2690 v4 CPU, which is Broadwell according to this page. That would explain why I cannot run the test. Ideally the error message would be improved.

Is it possible to run the test on Broadwell? If not, I can use a Google Cloud instance.

nhasabni

comment created time in a month

pull request commenttensorflow/tensorflow

[Intel MKL] Automatic BFloat16 converter

@nhasabni I am still getting the same error. It looks like the issue is the "could not create a primitive descriptor iterator" error in the logs in my previous message. I think this is from this line in MKL. Any ideas what could be causing this issue?

nhasabni

comment created time in a month

issue commenttensorflow/tensorflow

model.summary() Does not Work in Some Cases

Ah I see the issue. You never used either of your Concatenate layers. Here is a smaller example to reproduce:

import tensorflow as tf

class MyModel(tf.keras.Model):
  def __init__(self):
    super(MyModel, self).__init__()
    self.dense = tf.keras.layers.Dense(16)
    self.concat = tf.keras.layers.Concatenate(axis=1)

  def call(self, inputs):
    return self.dense(inputs)

model = MyModel()
model.build((16, 16))
model.summary()

The error is

ValueError: You tried to call `count_params` on concatenate, but the layer isn't built. You can build it manually via: `concatenate.build(batch_input_shape)`.

This isn't a bug, but perhaps we could improve the error message?
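
For reference, the error goes away if the unused layer is removed, or once the layer is actually called (and therefore built), e.g.:

  def call(self, inputs):
    x = self.dense(inputs)
    # Calling the layer builds it, so count_params()/summary() then work.
    return self.concat([x, x])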

yourtheron

comment created time in a month

pull request commenttensorflow/tensorflow

[Intel MKL] Automatic BFloat16 converter

I also tried running

bazel test --copt=-DENABLE_INTEL_MKL_BFLOAT16 --config=mkl --config=opt //tensorflow/core/grappler/optimizers:convert_to_bfloat16_test

But the test failed with the following error

[ RUN      ] BFloat16ConverterTest.AlreadyBFloat16
2020-01-28 03:12:21.184809: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at mkl_relu_op.cc:547 : Aborted: Operation received an exception:Status: 5, message: could not create a primitive descriptor iterator, in file tensorflow/core/kernels/mkl_relu_op.cc:544
2020-01-28 03:12:21.185276: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at mkl_relu_op.cc:547 : Aborted: Operation received an exception:Status: 5, message: could not create a primitive descriptor iterator, in file tensorflow/core/kernels/mkl_relu_op.cc:544
2020-01-28 03:12:21.185400: F tensorflow/core/grappler/utils/grappler_test.cc:107] Non-OK-status: session->Run(run_options, inputs, node_names, node_names, &output_tensors, nullptr) status: Aborted: Operation received an exception:Status: 5, message: could not create a primitive descriptor iterator, in file tensorflow/core/kernels/mkl_relu_op.cc:544
	 [[{{node d}}]]
*** Received signal 6 ***
*** BEGIN MANGLED STACK TRACE ***
/usr/local/google/home/reedwm/.cache/bazel/_bazel_reedwm/02a78c87331ebad0fd7841cf0e5b369b/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/core/grappler/optimizers/../../../../_solib_k8/_U_S_Stensorflow_Score_Sgrappler_Soptimizers_Cconvert_Uto_Ubfloat16_Utest___Utensorflow/libtensorflow_framework.so.2(+0xf4d63e)[0x7f8e3f30263e]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x123a0)[0x7f8e32de63a0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x10b)[0x7f8e32623cfb]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x129)[0x7f8e3260e8ad]
/usr/local/google/home/reedwm/.cache/bazel/_bazel_reedwm/02a78c87331ebad0fd7841cf0e5b369b/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/core/grappler/optimizers/../../../../_solib_k8/_U_S_Stensorflow_Score_Sgrappler_Soptimizers_Cconvert_Uto_Ubfloat16_Utest___Utensorflow/libtensorflow_framework.so.2(+0x141a1e7)[0x7f8e3f7cf1e7]
/usr/local/google/home/reedwm/.cache/bazel/_bazel_reedwm/02a78c87331ebad0fd7841cf0e5b369b/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/core/grappler/optimizers/../../../../_solib_k8/libtensorflow_Score_Sgrappler_Sutils_Slibgrappler_Utest.so(_ZNK10tensorflow8grappler12GrapplerTest13EvaluateNodesERKNS_8GraphDefERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaISB_EERKS5_ISt4pairISB_NS_6TensorEESaISI_EE+0x21b)[0x7f8e3e1a97fb]
/usr/local/google/home/reedwm/.cache/bazel/_bazel_reedwm/02a78c87331ebad0fd7841cf0e5b369b/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/core/grappler/optimizers/../../../../_solib_k8/libtensorflow_Score_Sgrappler_Sutils_Slibgrappler_Utest.so(_ZNK10tensorflow8grappler12GrapplerTest13EvaluateNodesERKNS_8GraphDefERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaISB_EE+0x34)[0x7f8e3e1a9924]
/usr/local/google/home/reedwm/.cache/bazel/_bazel_reedwm/02a78c87331ebad0fd7841cf0e5b369b/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/core/grappler/optimizers/convert_to_bfloat16_test.runfiles/org_tensorflow/tensorflow/core/grappler/optimizers/convert_to_bfloat16_test(+0x58ffd5)[0x55abe695dfd5]
/usr/local/google/home/reedwm/.cache/bazel/_bazel_reedwm/02a78c87331ebad0fd7841cf0e5b369b/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/core/grappler/optimizers/../../../../_solib_k8/libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so(_ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc+0x4d)[0x7f8e3bf96fbd]
/usr/local/google/home/reedwm/.cache/bazel/_bazel_reedwm/02a78c87331ebad0fd7841cf0e5b369b/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/core/grappler/optimizers/../../../../_solib_k8/libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so(_ZN7testing4Test3RunEv+0xdb)[0x7f8e3bf9722b]
/usr/local/google/home/reedwm/.cache/bazel/_bazel_reedwm/02a78c87331ebad0fd7841cf0e5b369b/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/core/grappler/optimizers/../../../../_solib_k8/libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so(_ZN7testing8TestInfo3RunEv+0x121)[0x7f8e3bf97561]
/usr/local/google/home/reedwm/.cache/bazel/_bazel_reedwm/02a78c87331ebad0fd7841cf0e5b369b/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/core/grappler/optimizers/../../../../_solib_k8/libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so(_ZN7testing9TestSuite3RunEv+0xc7)[0x7f8e3bf97837]
/usr/local/google/home/reedwm/.cache/bazel/_bazel_reedwm/02a78c87331ebad0fd7841cf0e5b369b/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/core/grappler/optimizers/../../../../_solib_k8/libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so(_ZN7testing8internal12UnitTestImpl11RunAllTestsEv+0x40c)[0x7f8e3bf97cdc]
/usr/local/google/home/reedwm/.cache/bazel/_bazel_reedwm/02a78c87331ebad0fd7841cf0e5b369b/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/core/grappler/optimizers/../../../../_solib_k8/libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so(_ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS0_12UnitTestImplEbEET0_PT_MS4_FS3_vEPKc+0x4d)[0x7f8e3bf97e2d]
/usr/local/google/home/reedwm/.cache/bazel/_bazel_reedwm/02a78c87331ebad0fd7841cf0e5b369b/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/core/grappler/optimizers/../../../../_solib_k8/libexternal_Scom_Ugoogle_Ugoogletest_Slibgtest.so(_ZN7testing8UnitTest3RunEv+0x96)[0x7f8e3bf98066]
/usr/local/google/home/reedwm/.cache/bazel/_bazel_reedwm/02a78c87331ebad0fd7841cf0e5b369b/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/core/grappler/optimizers/../../../../_solib_k8/libtensorflow_Score_Slibtest_Umain.so(main+0xbd)[0x7f8e3ffcca4d]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb)[0x7f8e3261052b]
/usr/local/google/home/reedwm/.cache/bazel/_bazel_reedwm/02a78c87331ebad0fd7841cf0e5b369b/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/core/grappler/optimizers/convert_to_bfloat16_test.runfiles/org_tensorflow/tensorflow/core/grappler/optimizers/convert_to_bfloat16_test(+0x57c6ea)[0x55abe694a6ea]
*** END MANGLED STACK TRACE ***

*** Begin stack trace ***
	tensorflow::CurrentStackTrace[abi:cxx11]()
	
	
	gsignal
	abort
	
	tensorflow::grappler::GrapplerTest::EvaluateNodes(tensorflow::GraphDef const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor> > > const&) const
	tensorflow::grappler::GrapplerTest::EvaluateNodes(tensorflow::GraphDef const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) const
	
	void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)
	testing::Test::Run()
	testing::TestInfo::Run()
	testing::TestSuite::Run()
	testing::internal::UnitTestImpl::RunAllTests()
	bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*)
	testing::UnitTest::Run()
	main
	__libc_start_main
	
*** End stack trace ***
external/bazel_tools/tools/test/test-setup.sh: line 310: 126224 Aborted                 "${TEST_PATH}" "$@" 2>&1
nhasabni

comment created time in a month

pull request commenttensorflow/tensorflow

[Intel MKL] Automatic BFloat16 converter

@nhasabni I will try to merge the two passes. Can you give me instructions for running the tests? I checked out this PR and ran the following

yes '' | ./configure
bazel test --config=mkl --config=opt //tensorflow/core/grappler/optimizers:convert_to_bfloat16_test

But it seems like none of the tests actually run. Even if I add EXPECT_EQ(0, 1); to a test, it still passes.

MKL pass should allow controlling which operators should or should not be rewritten to BFloat16 (I think this is straightforward given that the MKL pass will have separate white, black and grey lists.)

The auto_mixed_precision pass already has environment variables such as TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_WHITELIST_ADD, so this will not require any extra effort once I merge the passes.
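
For example, an op can be added to the whitelist from Python before the session is created (MyOp below is a placeholder op name):

import os

# Must be set before the graph is optimized, e.g. before creating the session.
os.environ["TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_WHITELIST_ADD"] = "MyOp"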

nhasabni

comment created time in a month

issue closedtensorflow/tensorflow

In Matmul, is FP16 x FP16 accumulated to FP16 or FP32?

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 1.7
  • Python version:
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: 9.0
  • GPU model and memory: V100 16GB
  • Exact command to reproduce:

Describe the problem

Describe the problem clearly here. Be sure to convey here why it's a bug in TensorFlow or a feature request.

It is not clear from the documentation whether, in Matmul, FP16 x FP16 is accumulated in FP16 or FP32. This choice affects not only training accuracy, but also Tensor Core performance on Volta GPUs.

closed time in a month

jiazhe0909

issue commenttensorflow/tensorflow

In Matmul, is FP16 x FP16 accumulated to FP16 or FP32?

On GPUs, fp16 Matmul accumulation is done in fp32.

jiazhe0909

comment created time in a month

issue closedtensorflow/tensorflow

keras.layers.concatenate Does Not Work when Saving a Model


System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): A custom model
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows7
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: No
  • TensorFlow installed from (source or binary):
  • TensorFlow version (use command below): 2.0.0
  • Python version: 3.7.5
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: 10.0.1
  • GPU model and memory: Quadro K620M

Describe the current behavior

I built a custom model as follows

	class C3BR(tf.keras.Model):
		''' 3D Convolution + Batch Normalisation + Relu '''
		def __init__(self, filterNum, kSize, strSize, padMode):
			super(C3BR, self).__init__()
			self.conv = layers.Conv3D(filters=filterNum, kernel_size=kSize, strides=strSize, padding=padMode, data_format='channels_first')
			self.BN = layers.BatchNormalization(axis=1)
		
		def call(self, inputs, ifTrain=True):
			x = self.conv(inputs)
			if ifTrain == True:
				x= self.BN(x)
			return activations.relu(x)

		def build_model(self, input_shape):
			''' A work-around to define dimensions of signals through the NN'''
			self.build(input_shape)
			inputs = tf.keras.Input(shape=input_shape[1:])
			_ = self.call(inputs) 

	class SimpleUNet1(tf.keras.Model):
		"""
		Serialise basic units so as to build up a double-layered encoder-decoder U-Net
		Input:
			inDim: (for initialisation) [modality/channel, tensor dimensions]
			classNum: background included
			name: name for the net
			inputs: 5D tf tensor of [mbSize, modality/channel, tensor dimensions]. Inputs must be organised into channel first order
			input_shape: a 1X5 tuple [mbSize, modality/channel, tensor dimensions]
			ifTrain: True for training, and False for validation and testing
		Returns:
			outputs: 5D tf tensor of [mbSize, classNum, tensor dimensions]
		"""
		def __init__(self, inDim, classNum, name='SimpleUNet', **kwarg):
			super(SimpleUNet1, self).__init__(name=name, **kwarg)
			self.inDim = inDim
			self.classNum = classNum
			dimEnSt1End = np.array(inDim)[2:]-2-2
			dimEnSt2Ed = dimEnSt1End/2-2-2
			dimBridgeEnd = (dimEnSt2Ed/2-2-2)*2
			dimDEStd1End = (dimBridgeEnd-2-2)*2
			self.outDim = dimDEStd1End-2-2-2
			temp = ((dimEnSt2Ed - dimBridgeEnd)/2).astype('int32')
			crop3d1 = tuple(np.tile(temp, (2, 1)).T)
			temp = ((dimEnSt1End - dimDEStd1End)/2).astype('int32')
			crop3d2 = tuple(np.tile(temp, (2, 1)).T)

			self.en_st1_cbr1 = C3BR(32, 3, 1, 'valid')
			self.en_st1_cbr2 = C3BR(64, 3, 1, 'valid')
			self.en_st2_mp = layers.MaxPooling3D(pool_size=(2, 2, 2), strides=(2, 2, 2), padding='valid', data_format='channels_first')
			self.en_st2_cbr1 = C3BR(64, 3, 1, 'valid')
			self.en_st2_cbr2 = C3BR(128, 3, 1, 'valid')
			self.bridge_mp = layers.MaxPooling3D(pool_size=(2, 2, 2), strides=(2, 2, 2), padding='valid', data_format='channels_first')
			self.bridge_cbr1 = C3BR(128, 3, 1, 'valid')
			self.bridge_cbr2 = C3BR(256, 3, 1, 'valid')    
			self.bridge_tconv1 = layers.Conv3DTranspose(256, 2, strides=2, padding='valid', data_format='channels_first')
			self.de_3dcrop1 = layers.Cropping3D(crop3d1, data_format='channels_first')
			self.de_st1_concat = layers.Concatenate(axis=1)
			self.de_st1_cbr1 = C3BR(256, 3, 1, 'valid')
			self.de_st1_cbr2 = C3BR(128, 3, 1, 'valid')    
			self.de_st1_tconv1 = layers.Conv3DTranspose(128, 2, strides=2, padding='valid', data_format='channels_first')
			self.de_3dcrop2 = layers.Cropping3D(crop3d2, data_format='channels_first')
			self.de_st2_concat = layers.Concatenate(axis=1)
			self.de_st2_cbr1 = C3BR(64, 3, 1, 'valid')
			self.de_st2_cbr2 = C3BR(64, 3, 1, 'valid') 
			self.final_conv3D = layers.Conv3D(filters=self.classNum, kernel_size=3, strides=1, padding='valid', data_format='channels_first')
					
		#@tf.function
		def call(self, inputs, ifTrain=True):
			x0 = self.en_st1_cbr1(inputs, ifTrain)
			xEnSt1End = self.en_st1_cbr2(x0, ifTrain)
			x1 = self.en_st2_mp(xEnSt1End)
			x2 = self.en_st2_cbr1(x1, ifTrain)
			xEnSt2Ed = self.en_st2_cbr2(x2, ifTrain)
			x3 = self.bridge_mp(xEnSt2Ed)  
			x4 = self.bridge_cbr1(x3, ifTrain)
			x5 = self.bridge_cbr2(x4, ifTrain)    
			xBridgeEnd = self.bridge_tconv1(x5)
			xCrop1 = self.de_3dcrop1(xEnSt2Ed)
			print(xBridgeEnd.shape)
			print(xCrop1.shape)
			x6 = self.de_st1_concat([xBridgeEnd, xCrop1])
			print(x6.shape)
			x7 = self.de_st1_cbr1(x6, ifTrain)
			x8 = self.de_st1_cbr2(x7, ifTrain)
			xDeSt1End = self.de_st1_tconv1(x8)
			xCrop2 = self.de_3dcrop2(xEnSt1End)
			x9 = self.de_st2_concat([xDeSt1End, xCrop2])
			x10 = self.de_st2_cbr1(x9, ifTrain)
			x11 = self.de_st2_cbr2(x10, ifTrain)
			x12 = self.final_conv3D(x11)
			outputs = activations.softmax(x12, axis=1)
			
			return outputs
			
		def build_model(self, input_shape):
			''' A work-around to define dimensions of signals through the NN'''
			self.build(input_shape)
			inputs = tf.keras.Input(shape=input_shape[1:])

			_ = self.call(inputs)
			
		def compute_output_shape(self):
			# Override this function if one expects to use the subclassed model in Keras's fit() method; otherwise, it is optional.
			return tf.TensorShape(np.append(self.classNum, self.outDim))    

Please pay attention to the following two definitions. If I use the Concatenate layer in this way (with a capitalised C), saving the model works

	self.de_st1_concat = layers.Concatenate(axis=1)
	self.de_st2_concat = layers.Concatenate(axis=1)

For instance,

	modelInDim = (4, 64, 64, 64)
	classNum = 2
	mbSize = 2
	TUNet = SimpleUNet1(modelInDim, classNum)
	TUNet.build_model(input_shape=(mbSize,)+modelInDim)
	x=tf.random.uniform((mbSize, 4, 64, 64, 64))
	y=TUNet(x)
	TUNet.summary()
	TUNet._set_inputs(x)
	TUNet.save(r'...\TTweight', save_format='tf') 

But if, as per this page, I use layers.concatenate (small 'c') to generate signals x6 and x9, respectively, that is

	x6 = layers.concatenate([xBridgeEnd, xCrop1], axis=1)
	x9 = layers.concatenate([xDeSt1End, xCrop2], axis=1)

Then save the model in the same way above, it raised an error

	  ValueError: A `Concatenate` layer requires inputs with matching shapes except for the concat axis. Got inputs shapes: [(None, 256, None, None, None), (None, 128, 18, 18, 18)]

The detailed log is here

It took me almost an entire day to figure out the root cause. In conclusion, I think the second form may have to be deprecated; otherwise users like me will be confused and misguided.

closed time in a month

yourtheron

issue commenttensorflow/tensorflow

keras.layers.concatenate Does Not Work when Saving a Model

With your second code sample, with TUNet._set_inputs(x), I was able to get the example to work. Once I replace x6 and x9 though, I get a different error on the TUNet.summary() line. If I comment the summary line out, I can reproduce your original error in TF 2.0 ("A Concatenate layer requires inputs with matching shapes..."). However, I cannot reproduce in TF 2.1. Therefore, I'm assuming the original issue is fixed so I'm closing it.

However, there still may be a bug, as the summary doesn't work. Feel free to file another bug about that if you think that is an issue, with a self-contained example to reproduce. If you do file such a bug, consider CCing me (add the line "/CC @reedwm") as I have some context and can therefore more easily triage to someone else.

For reference, here is the code sample I ran that fails in TF 2.0 but works in TF 2.1. It is the same as the code sample you provided, except I called _set_inputs as you suggested, I commented out TUNet.summary(), and I replaced x6 and x9 as you suggested.

import tensorflow as tf
import numpy as np
from tensorflow.keras import layers, activations

class C3BR(tf.keras.Model):
  ''' 3D Convolution + Batch Normalisation + Relu '''
  def __init__(self, filterNum, kSize, strSize, padMode):
    super(C3BR, self).__init__()
    self.conv = layers.Conv3D(filters=filterNum, kernel_size=kSize, strides=strSize, padding=padMode, data_format='channels_first')
    self.BN = layers.BatchNormalization(axis=1)

  def call(self, inputs, ifTrain=True):
    x = self.conv(inputs)
    if ifTrain == True:
      x= self.BN(x)
    return activations.relu(x)

  def build_model(self, input_shape):
    ''' A work-around to define dimensions of signals through the NN'''
    self.build(input_shape)
    inputs = tf.keras.Input(shape=input_shape[1:])
    _ = self.call(inputs)

class SimpleUNet1(tf.keras.Model):
  """
  Serialise basic units so as to build up a double-layered encoder-decoder U-Net
  Input:
    inDim: (for initialisation) [modality/channel, tensor dimensions]
    classNum: background included
    name: name for the net
    inputs: 5D tf tensor of [mbSize, modality/channel, tensor dimensions]. Inputs must be organised into channel first order
    input_shape: a 1X5 tuple [mbSize, modality/channel, tensor dimensions]
    ifTrain: True for training, and False for validation and testing
  Returns:
    outputs: 5D tf tensor of [mbSize, classNum, tensor dimensions]
  """
  def __init__(self, inDim, classNum, name='SimpleUNet', **kwarg):
    super(SimpleUNet1, self).__init__(name=name, **kwarg)
    self.inDim = inDim
    self.classNum = classNum
    dimEnSt1End = np.array(inDim)[1:]-2-2
    dimEnSt2Ed = dimEnSt1End/2-2-2
    dimBridgeEnd = (dimEnSt2Ed/2-2-2)*2
    dimDEStd1End = (dimBridgeEnd-2-2)*2
    self.outDim = dimDEStd1End-2-2-2
    temp = ((dimEnSt2Ed - dimBridgeEnd)/2).astype('int32')
    crop3d1 = tuple(np.tile(temp, (2, 1)).T)
    temp = ((dimEnSt1End - dimDEStd1End)/2).astype('int32')
    crop3d2 = tuple(np.tile(temp, (2, 1)).T)

    self.en_st1_cbr1 = C3BR(32, 3, 1, 'valid')
    self.en_st1_cbr2 = C3BR(64, 3, 1, 'valid')
    self.en_st2_mp = layers.MaxPooling3D(pool_size=(2, 2, 2), strides=(2, 2, 2), padding='valid', data_format='channels_first')
    self.en_st2_cbr1 = C3BR(64, 3, 1, 'valid')
    self.en_st2_cbr2 = C3BR(128, 3, 1, 'valid')
    self.bridge_mp = layers.MaxPooling3D(pool_size=(2, 2, 2), strides=(2, 2, 2), padding='valid', data_format='channels_first')
    self.bridge_cbr1 = C3BR(128, 3, 1, 'valid')
    self.bridge_cbr2 = C3BR(256, 3, 1, 'valid')
    self.bridge_tconv1 = layers.Conv3DTranspose(256, 2, strides=2, padding='valid', data_format='channels_first')
    self.de_3dcrop1 = layers.Cropping3D(crop3d1, data_format='channels_first')
    self.de_st1_concat = layers.Concatenate(axis=1)
    self.de_st1_cbr1 = C3BR(256, 3, 1, 'valid')
    self.de_st1_cbr2 = C3BR(128, 3, 1, 'valid')
    self.de_st1_tconv1 = layers.Conv3DTranspose(128, 2, strides=2, padding='valid', data_format='channels_first')
    self.de_3dcrop2 = layers.Cropping3D(crop3d2, data_format='channels_first')
    self.de_st2_concat = layers.Concatenate(axis=1)
    self.de_st2_cbr1 = C3BR(64, 3, 1, 'valid')
    self.de_st2_cbr2 = C3BR(64, 3, 1, 'valid')
    self.final_conv3D = layers.Conv3D(filters=self.classNum, kernel_size=3, strides=1, padding='valid', data_format='channels_first')

  #@tf.function
  def call(self, inputs, ifTrain=True):
    x0 = self.en_st1_cbr1(inputs, ifTrain)
    xEnSt1End = self.en_st1_cbr2(x0, ifTrain)
    x1 = self.en_st2_mp(xEnSt1End)
    x2 = self.en_st2_cbr1(x1, ifTrain)
    xEnSt2Ed = self.en_st2_cbr2(x2, ifTrain)
    x3 = self.bridge_mp(xEnSt2Ed)
    x4 = self.bridge_cbr1(x3, ifTrain)
    x5 = self.bridge_cbr2(x4, ifTrain)
    xBridgeEnd = self.bridge_tconv1(x5)
    xCrop1 = self.de_3dcrop1(xEnSt2Ed)
    print(xBridgeEnd.shape)
    print(xCrop1.shape)
    x6 = layers.concatenate([xBridgeEnd, xCrop1], axis=1)
    print(x6.shape)
    x7 = self.de_st1_cbr1(x6, ifTrain)
    x8 = self.de_st1_cbr2(x7, ifTrain)
    xDeSt1End = self.de_st1_tconv1(x8)
    xCrop2 = self.de_3dcrop2(xEnSt1End)
    x9 = layers.concatenate([xDeSt1End, xCrop2], axis=1)
    x10 = self.de_st2_cbr1(x9, ifTrain)
    x11 = self.de_st2_cbr2(x10, ifTrain)
    x12 = self.final_conv3D(x11)
    outputs = activations.softmax(x12, axis=1)

    return outputs

  def build_model(self, input_shape):
    ''' A work-around to define dimensions of signals through the NN'''
    self.build(input_shape)
    inputs = tf.keras.Input(shape=input_shape[1:])

    _ = self.call(inputs)

  def compute_output_shape(self):
    # Override this function if one expects to use the subclassed model in Keras's fit() method; otherwise, it is optional.
    return tf.TensorShape(np.append(self.classNum, self.outDim))

modelInDim = (4, 64, 64, 64)
classNum = 2
mbSize = 2
TUNet = SimpleUNet1(modelInDim, classNum)
TUNet.build_model(input_shape=(mbSize,)+modelInDim)
# TUNet.summary()

x=tf.random.uniform((mbSize, 4, 64, 64, 64))
TUNet._set_inputs(x)
# use your directory in lieu of r'...\TTweight'
TUNet.save(r'TTweight', save_format='tf')
yourtheron

comment created time in a month

issue commenttensorflow/tensorflow

Cannot export keras model to SavedModel if mixed-precision policy is enabled

Unfortunately, the fix did not make it into 2.1 :(. It will be in 2.2, however, and the fix is already in tf-nightly.

I'm not sure why the error message is different. Note that Model.save_weights still works, just not Model.save.

As a workaround, you can call Model.save_weights, rebuild the model in fp32, call Model.load_weights on the fp32 model then call Model.save. But understandably, this is very irritating, and it doesn't save a mixed precision model. Using the old tf.train.experimental.enable_mixed_precision_graph_rewrite API also works but note we plan on deprecating then removing it (it will remain in tf.compat.v1 for a long time though).
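
A sketch of that workaround is below; build_model is a hypothetical function that builds the same architecture under the given dtype policy:

mixed_model = build_model(policy='mixed_float16')
mixed_model.save_weights('ckpt')                   # save_weights still works

fp32_model = build_model(policy='float32')         # rebuild the model in fp32
fp32_model.load_weights('ckpt')
fp32_model.save('saved_model_dir', save_format='tf')  # Model.save now works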

bearsroom

comment created time in a month

issue commenttensorflow/tensorflow

keras.layers.concatenate Does Not Work when Saving a Model

Thanks for the example to reproduce! However, when I try running the new example (without changing the x6 and x9 lines to the layers.concatenate version), I get the error:

ValueError: Model <__main__.SimpleUNet1 object at 0x7f7ea76ec690> cannot be saved because the input shapes have not been set. Usually, input shapes are automatically determined from calling .fit() or .predict(). To manually set the shapes, call model._set_inputs(inputs).

Can you double check the example? I got that above error in TF 2.0, TF 2.1, and the nightly TF.

Has the solution for saving models under mixed precision been submitted to TF 2.1 as well?

Unfortunately no. The fix will only be in TF 2.2 and above.

Finally, I would say as I proceeded through the docs (basically I read all of the guide, the tutorial, as well as the 2.0.0 API except the RNN and reinforcement learning parts that are not my field), I have got a strong feeling that the writing style in different parts differs a lot. Sometimes semi-formal, sometimes informal, even colloquial. I, being a non-native English speaker, am afraid to say there are even some grammatical mistakes and pidgin English, which, while they do not hinder engineers from understanding the material, are flaws. Hope Google will pay some attention to it.

Thanks for the feedback. Ideally all documentation should conform to the Google documentation style guide. Feel free to point out different docs where you think the writing style differs a lot, or where there are grammatical mistakes. You can also file a new issue for this. The docs are written by different people so style may differ, but we try to keep the style consistent.

yourtheron

comment created time in a month


issue commenttensorflow/tensorflow

keras.layers.concatenate Does Not Work when Saving a Model

When I try to run the code in the original post, I get the error:

Traceback (most recent call last):
  File "issue35441.py", line 118, in <module>
    TUNet = SimpleUNet1(modelInDim, classNum)
  File "issue35441.py", line 63, in __init__
    self.de_3dcrop1 = layers.Cropping3D(crop3d1, data_format='channels_first')
  File "/usr/local/google/home/reedwm/venvs/tf2_0/lib/python3.7/site-packages/tensorflow_core/python/keras/layers/convolutional.py", line 2565, in __init__
    'Found: ' + str(cropping))
ValueError: `cropping` should have 3 elements. Found: (array([4, 4], dtype=int32), array([4, 4], dtype=int32))

I am unfamiliar with UNet so I'm not sure what's wrong. Also, it would help if you post a single self-contained example to reproduce the issue. In your original post, you split the code over several code blocks, and you don't have the import statements like import tensorflow as tf.

Can you post a self-contained example to reproduce the original issue?

As for your mixed-precision related questions:

  1. Is it a bug about tf's functionality of saving custom model or mixed precision?

It's probably a bug with the combination of the two features. But this may have been fixed by cd6184047e9e497955c473b88387b54818ff23a0 so try again with the latest nightly build. You can install a nightly build with pip install tf-nightly.

  2. Can I use mixed precision for custom models safely?

Yes, this should work.

Why can't the doc use a more explicit or articulate example to state such a delicate and important case?

Agreed this point is not emphasized much. I did try to make this clear, as earlier in the doc, there is a sentence stating the same thing: "They cast their inputs to float16 in order to do float16 computations, which causes their outputs to be float16 as a result." After that sentence, there is an example showing the output is float16.

It was tough to balance having a lot of explanation on details like that, and having the doc be too long. I will see if I can make this more clear but I also don't want to add another paragraph as the doc is already very long.

The underlined sentence is quite confusing. Will the model do it automatically, or shall the user do it, for example in the call() function of a custom model?

The model will automatically do it. I'll also see if I can add a sentence stating this to the doc. This is useful information, but it's tough to find a good place to mention this.

yourtheron

comment created time in a month

issue commenttensorflow/benchmarks

How to set num_eval_epochs?

Yes, but note only the chief worker saves a checkpoint (the chief is the first worker, with task_index 0). So you would gather the checkpoint from the first worker, since the others don't have a checkpoint. Or just have the first worker do the evaluation after training finishes.

mcanini

comment created time in a month

issue commenttensorflow/tensorflow

Why TF Warns 'Model's Compile Policy is not the same as the dtype policy's loss scale'

Sorry for the late response. I answered your questions below. Let me know if you have any more questions.

The following warning pops out

TUNet.compile(loss=softDiceLoss, optimizer=curOpt)
WARNING:tensorflow:LossScale of LossScaleOptimizer passed to compile (DynamicLossScale(current_loss_scale=32768.0, num_good_steps=0, initial_loss_scale=32768.0, increment_period=2000, multiplier=2.0)) is not the same as the dtype policy's loss scale (DynamicLossScale(current_loss_scale=32768.0, num_good_steps=0, initial_loss_scale=32768.0, increment_period=2000, multiplier=2.0)). Because the dtype policy has a loss scale, you should pass an optimizer that is not wrapped with a LossScaleOptimizer,

What does it mean? The guide requires the optimizer to be wrapped in the way above. I am quite confused...

This warning is poorly worded, and you can safely ignore it. If you want to know the technical details, what is happening is as follows. You set the policy to mixed_float16 and use a LossScaleOptimizer. The LossScaleOptimizer automatically creates a DynamicLossScale object since you passed the string dynamic. However, when you call TUNet.compile, it automatically tries to convert the optimizer to a LossScaleOptimizer with a different DynamicLossScale object, since the policy is mixed_float16 which has DynamicLossScale. Since you already passed a LossScaleOptimizer, it gives a warning.

I will try to see how I can rework this error message, as it's too confusing. If you call Model.fit, you can fix by not wrapping the optimizer with a LossScaleOptimizer. If you do not call fit, you do need to wrap with a LossScaleOptimizer but you (probably) do not need to call compile.

If I use mixed preicision. Suppose my original input is of float32, do I have to manually cast it into float16?

Typically no. The following lines will cause every Keras layer to cast its inputs to float16, so you do not have to do any casting.

policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_policy(policy)

The doc instructs setting the output layer to be float32. layers.Activation(...) does not support setting axis=1 (my data is channel first), and if I use activations.softmax(...), it does not support manually setting dtype=float32, as I mentioned in another issue

I responded to that issue directly.

Is the way I set the input signature of softDiceLoss(...) correct? or shall I write it like @tf.function(input_signature=(tf.TensorSpec(shape=[None], dtype=tf.float32), tf.TensorSpec(shape=[None], dtype=tf.uint8),)) The online doc is a bit confusing Thanks.

Passing shape=None means the shape can be anything, such as [2] or [3, 4]. Passing shape=[None] means the shape must have exactly one dimension, or in other words, be of the form [???] where ??? can be any number. This means [2] would be acceptable but not [3, 4]. Assuming you always pass tensors with one dimension in their shape, I think the two forms have no practical difference, although shape=[None] may have a performance benefit. I'm not an expert on this.
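
To illustrate the difference:

import tensorflow as tf

@tf.function(input_signature=(tf.TensorSpec(shape=None, dtype=tf.float32),))
def f(x):  # accepts a tensor of any rank, e.g. shape [2] or [3, 4]
  return x

@tf.function(input_signature=(tf.TensorSpec(shape=[None], dtype=tf.float32),))
def g(x):  # accepts only rank-1 tensors, e.g. shape [2] but not [3, 4]
  return x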

Can the two code snippets below be performed instead without using tf.GradientTape()? Are they equivalent, with/without mixed precision? I want to know that because it seems the following two provide a unified API for cases of using mixed precision or not.

Yes, all three of those examples are correct. curOpt.get_gradients and curOpt.minimize automatically apply loss scaling when called on a LossScaleOptimizer, so you do not have to do extra work. However, when you use a tf.GradientTape, you must call get_scaled_loss and get_unscaled_gradients (as you correctly did in the first example), since the GradientTape does not know about loss scaling. In general, if you compute gradients through the LossScaleOptimizer, you don't have to manually scale the loss and unscale gradients. If you compute gradients through a GradientTape, you do have to manually scale the loss and unscale gradients.
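
For example, with minimize no manual scaling is needed (a minimal sketch; model, loss_fn, x, and y are assumed to exist):

opt = tf.keras.mixed_precision.experimental.LossScaleOptimizer(
    tf.keras.optimizers.SGD(), loss_scale='dynamic')
# minimize() scales the loss and unscales the gradients internally.
opt.minimize(lambda: loss_fn(model(x), y),
             var_list=model.trainable_variables)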

Another point is that NVIDIA suggests using tf.train.experimental.enable_mixed_precision_graph_rewrite() to wrap the optimizer in lieu of mixed_precision.LossScaleOptimizer(). What is the difference between the two? I am aware that my GPU is too old to use the first one, but TF's doc says using the second one may at least enable me to double my batch size.

This is complicated, and I will definitely clarify this in a doc in the future. The enable_mixed_precision_graph_rewrite function NVIDIA recommends is completely separate from the mixed_precision.set_policy API. You can use one API or the other, but not both, because they both do the same thing. I recommend not using enable_mixed_precision_graph_rewrite, and we will deprecate and remove it in the future (although it will be accessible under the tf.compat.v1 namespace for a long time). Instead use mixed_precision.set_policy.

enable_mixed_precision_graph_rewrite causes the graph to be rewritten under the hood by the TensorFlow C++ backend to use mixed precision. mixed_precision.set_policy('mixed_float16') causes Keras layers to use mixed precision through tf.cast ops. The one shared component between the APIs is the LossScaleOptimizer. enable_mixed_precision_graph_rewrite returns a LossScaleOptimizer, and you have to use it just as you do in the mixed_precision.set_policy API.

As enable_mixed_precision_graph_rewrite rewrites the graph, it does not work in Eager mode as there is no graph. Additionally, as it operates on the C++ backend, querying tensor.dtype will always return float32, since the float16 casts happen in the C++ layer. Finally, you cannot override a dtype of a layer to float32 easily with the graph rewrite. On the other hand, enable_mixed_precision_graph_rewrite will never result in a TypeError, works outside of Keras as well as inside, and can more effectively determine the optimal dtypes of ops since it can look at the whole graph at once. The Keras mixed_precision.set_policy is recommended over the enable_mixed_precision_graph_rewrite.
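
To summarize, these are the two mutually exclusive ways to enable mixed precision (use one or the other, never both):

import tensorflow as tf

opt = tf.keras.optimizers.SGD()

# Option 1 (recommended): the Keras dtype policy.
tf.keras.mixed_precision.experimental.set_policy('mixed_float16')

# Option 2 (to be deprecated): the graph rewrite, which returns a
# LossScaleOptimizer that must then be used for training.
opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)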

yourtheron

comment created time in a month

issue commenttensorflow/tensorflow

Confusion between layers.Activation('softmax'...) and activations.softmax(...)

First note, you only need to do softmax in float32 at the end of the model. Unless the softmax is immediately followed by the loss, it is typically safe to do in float16. This means the softmax in SFM1 and SFM2 is fine to do in float16 unless you plan on having those layers be the last layers of the model. Therefore you do not have to pass dtype='float32' to softmax.

Now to answer the question: What if you did want to pass axis to softmax and do it in float32? This is unlikely as most losses expect the softmax to be done on the final axis, but it can occur with custom losses. Casting the inputs to float32 with tf.dtypes.cast and using the activations.softmax function is a good solution. But your example is slightly incorrect: You want to cast the inputs to softmax, not the outputs:

activations.softmax(tf.dtypes.cast(inputs, dtype=tf.float32), axis=1)

Alternatively, you can directly use a tf.keras.layers.Softmax layer, which accepts both an axis and a dtype argument:

layer = tf.keras.layers.Softmax(axis=1, dtype=tf.float32)
yourtheron

comment created time in a month

issue commenttensorflow/tensorflow

AlphaDropout & mixed_float16 - Op has type float32 that does not match type float16

You're not misunderstanding anything, as any layer which does not work with mixed_float16 is a serious bug. Thanks for the report and short example to reproduce! I will address.

phemmer

comment created time in a month

issue closedtensorflow/benchmarks

How to get communication time between worker and PS in "parameter_server" mode?

Hello, if I didn't get it wrong, in "parameter server" mode, each of the workers must fetch the variables from the PS at the start of each step. So I have two quick questions: when I use "parameter server" mode to run the distributed training benchmark, how can I measure the communication time of that fetch between a worker and the PS? And where can I find the code that handles this communication in each step? Thanks a lot!

closed time in a month

lecoos

issue commenttensorflow/benchmarks

How to get communication time between worker and PS in "parameter_server" mode?

There is no easy way to get the communication time between the worker and the PS, unfortunately. You might be able to tell by passing --trace_file and viewing the resulting file in Chrome.

The code is in this class, although the code is a bit hard to understand. The get_devices function causes variables to be put on the PS. When the variables are used on the workers, TensorFlow automatically transfers them from the PS to the worker.

lecoos

comment created time in a month

issue commenttensorflow/benchmarks

Alexnet training realdata has a bad performance on Multi-GPU with V100

This is somewhat expected, as we never optimized for Alexnet. Its per-step time is very small, making overhead from all-reducing gradients take a greater percentage of the time (although I'm not sure how large the gradients are).

Since tf_cnn_benchmarks is currently unmaintained and not written using modern TF2 features, the performance issue will likely not be addressed.

stevenyslins

comment created time in a month

issue closedtensorflow/benchmarks

Is fp16 a common practice for CNN training now?

Is it common to use fp16 to do CNN training now?

Maybe weird to ask this question here. But I cannot imagine a better place to ask this. Feel free to close this issue.

closed time in a month

daadaada

issue commenttensorflow/benchmarks

Is fp16 a common practice for CNN training now?

Closing as Toby gave a response

daadaada

comment created time in a month

issue closedtensorflow/benchmarks

AttributeError: '_ThreadPoolDataset' object has no attribute 'make_initializable_iterator'

TensorFlow - v1.13.1, benchmarks - cnn_tf_v1.13_compatible. Tried to evaluate resnet50 model by running tf_cnn_benchmarks.py in following configuration:

TensorFlow: 1.13 Model: resnet50 Dataset: imagenet Mode: evaluation SingleSess: False Batch size: 1 global 1 per device Num batches: 100 Num epochs: 0.00 Devices: ['/gpu:0'] NUMA bind: False Data format: NCHW Optimizer: sgd Variables: parameter_server

Got the following error (last line from the call stack): File ".../benchmarks/scripts/tf_cnn_benchmarks/preprocessing.py", line 547, in create_iterator ds_iterator = ds.make_initializable_iterator() AttributeError: '_ThreadPoolDataset' object has no attribute 'make_initializable_iterator'

Was able to fix it and continue evaluation by using: ds_iterator = tf.compat.v1.data.make_initializable_iterator(ds)

closed time in a month

dmyershov

issue closedtensorflow/benchmarks

INT8 inference support on CPU

Is there a way to run the tf_cnn-benchmark (inference only) with INT8 on CPUs? The current implementation only supports INT8 inference on GPUs (with TensorRT). How do we do INT8 inference for all the Tensorflow CNN models on CPUs instead?

closed time in a month

shrutiramesh1988

issue commenttensorflow/benchmarks

Support TF_CONFIG environment variable in distribute cases

I am not very familiar with TF_CONFIG and tf_cnn_benchmarks is unmaintained, so I don't think this will get done.

It looks like TF_CONFIG specifies the same information as --job_name, --worker_hosts, --ps_hosts, and --task_index. If that is all you want from TF_CONFIG, I might have time to add a --use_tf_config boolean flag that reads TF_CONFIG and automatically sets those four flags.
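
A minimal sketch of the mapping I have in mind (the --use_tf_config flag is hypothetical, but TF_CONFIG's JSON layout is the standard one):

import json
import os

tf_config = json.loads(os.environ['TF_CONFIG'])
cluster, task = tf_config['cluster'], tf_config['task']

# Hypothetical translation into tf_cnn_benchmarks' existing flags.
job_name = task['type']                      # --job_name, e.g. 'worker' or 'ps'
task_index = task['index']                   # --task_index, e.g. 0
worker_hosts = ','.join(cluster['worker'])   # --worker_hosts, 'h1:2222,h2:2222'
ps_hosts = ','.join(cluster.get('ps', []))   # --ps_hosts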

xiaozhouX

comment created time in a month

PR closed tensorflow/benchmarks

To solve Horovod's execution error with multi-GPU, multi-server training cla: yes

Fix GPU allocation error when training with multiple GPUs across multiple servers.

+2 -2

1 comment

1 changed file

jayhpark530

pr closed time in a month

pull request commenttensorflow/benchmarks

To solve Horovod's execution error with multi-GPU, multi-server training

Since I do not know how to use Horovod and tf_cnn_benchmarks is unmaintained, I cannot verify that this PR works, so I unfortunately cannot accept it :(. In general, I can no longer address Horovod issues. I apologize.

jayhpark530

comment created time in a month

PR closed tensorflow/benchmarks

change the number of kernels to the settings of the alexnet paper. cla: yes

The number of AlexNet kernels in the benchmark differs from the number described in the AlexNet paper. Is there a special reason? I changed the number of kernels to the configuration in the AlexNet paper.

+2 -2

1 comment

1 changed file

jayhpark530

pr closed time in a month

pull request commenttensorflow/benchmarks

change the number of kernels to the settings of the alexnet paper.

Hmm, not sure why we have such an obvious mistake. However, given that tf_cnn_benchmarks is unmaintained and deprecated, I'm reluctant to make significant changes to the models. Only ResNet has been heavily tested, and AlexNet likely has other issues as well. Since we do not publish benchmark results from tf_cnn_benchmarks, it is not worth ensuring the non-ResNet models are correct.

jayhpark530

comment created time in a month

pull request commenttensorflow/benchmarks

Updated link

The merge process should be working again. @BlockLabTV sorry for the very long delay. I can merge again if you address conflicts (the conflict is simply the file has been moved to a new directory).

BlockLabTV

comment created time in a month

Pull request review commenttensorflow/benchmarks

Fix the Cifar10ImagePreprocessor class to use newer dataset APIs

  def minibatch(self,
                subset,
                params,
                shift_ratio=-1):
-    # TODO(jsimsa): Implement datasets code path
     del shift_ratio, params
     with tf.name_scope('batch_processing'):
       all_images, all_labels = dataset.read_data_files(subset)
       all_images = tf.constant(all_images)
       all_labels = tf.constant(all_labels)
-      input_image, input_label = tf.train.slice_input_producer(
-          [all_images, all_labels])
-      input_image = tf.cast(input_image, self.dtype)
-      input_label = tf.cast(input_label, tf.int32)
-      # Ensure that the random shuffling has good mixing properties.
+      input_image = tf.cast(all_images, self.dtype)
+      input_label = tf.cast(all_labels, tf.int32)
+      dataset_train = tf.data.Dataset.from_tensor_slices(
+          (input_image, input_label))
+
       min_fraction_of_examples_in_queue = 0.4
       min_queue_examples = int(dataset.num_examples_per_epoch(subset) *
                                min_fraction_of_examples_in_queue)
-      raw_images, raw_labels = tf.train.shuffle_batch(
-          [input_image, input_label], batch_size=self.batch_size,
-          capacity=min_queue_examples + 3 * self.batch_size,
-          min_after_dequeue=min_queue_examples)
+
+      dataset_train = dataset_train.shuffle(min_queue_examples).batch(
+          self.batch_size, drop_remainder=True)
+
+      if tf.VERSION > "1.12":

You can safely assume TensorFlow is at least version 1.12. There are branches such as cnn_tf_v1.11_compatible that work with older versions.

chandanjc

comment created time in a month

Pull request review commenttensorflow/benchmarks

Fix the Cifar10ImagePreprocessor class to use newer dataset APIs

  def minibatch(self,
                subset,
                params,
                shift_ratio=-1):
-    # TODO(jsimsa): Implement datasets code path
     del shift_ratio, params
     with tf.name_scope('batch_processing'):
       all_images, all_labels = dataset.read_data_files(subset)
       all_images = tf.constant(all_images)
       all_labels = tf.constant(all_labels)
-      input_image, input_label = tf.train.slice_input_producer(
-          [all_images, all_labels])
-      input_image = tf.cast(input_image, self.dtype)
-      input_label = tf.cast(input_label, tf.int32)
-      # Ensure that the random shuffling has good mixing properties.
+      input_image = tf.cast(all_images, self.dtype)
+      input_label = tf.cast(all_labels, tf.int32)
+      dataset_train = tf.data.Dataset.from_tensor_slices(
+          (input_image, input_label))
+
       min_fraction_of_examples_in_queue = 0.4
       min_queue_examples = int(dataset.num_examples_per_epoch(subset) *
                                min_fraction_of_examples_in_queue)
-      raw_images, raw_labels = tf.train.shuffle_batch(
-          [input_image, input_label], batch_size=self.batch_size,
-          capacity=min_queue_examples + 3 * self.batch_size,
-          min_after_dequeue=min_queue_examples)
+
+      dataset_train = dataset_train.shuffle(min_queue_examples).batch(
+          self.batch_size, drop_remainder=True)
+
+      if tf.VERSION > "1.12":
+        raw_images, raw_labels = tf.compat.v1.data.make_one_shot_iterator(

No need for compat.v1 since we have import tensorflow.compat.v1 as tf at the top. Simply tf.data.make_one_shot_iterator is fine.
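
That is, assuming the dataset_train from the diff above:

# With "import tensorflow.compat.v1 as tf" at the top of the file,
# tf.data.make_one_shot_iterator already resolves to the compat function.
raw_images, raw_labels = tf.data.make_one_shot_iterator(
    dataset_train).get_next()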

chandanjc

comment created time in a month

Pull request review commenttensorflow/benchmarks

Fix the Cifar10ImagePreprocessor class to use newer dataset APIs

  def minibatch(self,
                subset,
                params,
                shift_ratio=-1):
-    # TODO(jsimsa): Implement datasets code path
     del shift_ratio, params
     with tf.name_scope('batch_processing'):
       all_images, all_labels = dataset.read_data_files(subset)
       all_images = tf.constant(all_images)
       all_labels = tf.constant(all_labels)
-      input_image, input_label = tf.train.slice_input_producer(
-          [all_images, all_labels])
-      input_image = tf.cast(input_image, self.dtype)
-      input_label = tf.cast(input_label, tf.int32)
-      # Ensure that the random shuffling has good mixing properties.
+      input_image = tf.cast(all_images, self.dtype)
+      input_label = tf.cast(all_labels, tf.int32)
+      dataset_train = tf.data.Dataset.from_tensor_slices(
+          (input_image, input_label))
+
       min_fraction_of_examples_in_queue = 0.4
       min_queue_examples = int(dataset.num_examples_per_epoch(subset) *
                                min_fraction_of_examples_in_queue)
-      raw_images, raw_labels = tf.train.shuffle_batch(
-          [input_image, input_label], batch_size=self.batch_size,
-          capacity=min_queue_examples + 3 * self.batch_size,
-          min_after_dequeue=min_queue_examples)
+
+      dataset_train = dataset_train.shuffle(min_queue_examples).batch(

You should call .repeat() between shuffle and batch.
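
Concretely, something like this sketch (using the names from the diff above):

# shuffle -> repeat -> batch: repeat() keeps the one-shot iterator from
# raising OutOfRangeError after a single epoch, and shuffling before
# repeating reshuffles the data each epoch.
dataset_train = dataset_train.shuffle(min_queue_examples).repeat().batch(
    self.batch_size, drop_remainder=True)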

chandanjc

comment created time in a month

PR merged tensorflow/benchmarks

Fix `gradient_repacking` flag description cla: yes ready to pull

The gradient_repacking flag had some minor typos in the description: lines were not separated with spaces, and "of" appeared twice in a row. This fixes the description so it prints correctly.

This fixes the description to print correctly.

+4 -4

3 comments

1 changed file

leesharma

pr closed time in a month

push eventtensorflow/benchmarks

Lee Sharma

commit sha 9ee31cb4292549eb01d5fa1e71690a46713aef19

Fix flag description The `gradient_repacking` flag had some minor typos in the description: lines were not separated with spaces, and "of" was repeated twice. This fixes the description to print correctly for `--help`.

view details

Reed

commit sha 21e3757f6c46ae27c334919e0f3461984f537bae

Merge pull request #379 from leesharma:lee/fix-spacing PiperOrigin-RevId: 290186419

view details

push time in a month

PR merged tensorflow/benchmarks

Fix a third typo cla: yes ready to pull
+1 -1

0 comments

1 changed file

reedwm

pr closed time in a month

push eventtensorflow/benchmarks

Reed

commit sha 405c1ee41930018d6b3ad79cbcdb5cadfc25c7ab

Fix a third typo

view details

Reed

commit sha e53409c1d481360e8cabddbbce93f95b0f6339da

Merge branch 'master' into typo3

view details

Reed

commit sha f1379b1ea3df3f56816feff48df0ed9276751dbc

Merge pull request #357 from reedwm:typo3 PiperOrigin-RevId: 290184656

view details

push time in a month

push eventreedwm/benchmarks

Dong Lin

commit sha ce0c732a033bd48538947ff641689476333c5882

Update PerfZero README to use the metrics field in report_benchmark() (#358)

view details

Toby Boyd

commit sha 72c805bd1178f9c0ee8c9ecfa4d206f9937890f4

Change back to tfds-nightly

view details

Dong Lin

commit sha ac830fc902b8743ff2cec83cb103df7d0fffce84

Include default value fields when converting protobuf message to dict (#361)

view details

George Zhang

commit sha 367e23bbeca2dd4eb109e25f9fc6307b5b7132c6

Link should not point to a forked repository (#363) Currently, the `perfzero` link is pointing to a forked repository. This could be really confusing and needs to be corrected. The link should point to the folder `perfzero` inside the repository itself.

view details

Reed Wanderman-Milne

commit sha 63d6db878598bfba211eb23156100e1ab307dab7

Use explicit padding in resnet models. PiperOrigin-RevId: 244894638

view details

Rick Chao

commit sha 2c4a23ad58e823763f400b995f43d51539a8b5ac

Automated rollback of commit 63d6db878598bfba211eb23156100e1ab307dab7 PiperOrigin-RevId: 245065774

view details

A. Unique TensorFlower

commit sha 482ad13d1b51e616d3c9fa4caae51e3d23a5f18c

Update GPU KernelTracker options for TF CNN Benchmarks. Add options to TF CNN Benchmarks to configure use of timestamped allocator and GPUKernelTracker. Note that these both default off, but for some models best performance is obtained by turning on kernel tracking, limiting and/or timestamped allocation in a model-specific way. PiperOrigin-RevId: 245481170

view details

A. Unique TensorFlower

commit sha 053cecbb8aea976fd6ad8554e1c20c97feba580a

Fix line formatting. PiperOrigin-RevId: 245484851

view details

Adrian Kuegel

commit sha 2696206fc01860d7b06fd02a01626f56abbab40a

Adjust ssd_data_loader to recent changes The SSD benchmark was broken by the changes, this CL fixes it. PiperOrigin-RevId: 245785634

view details

Dong Lin

commit sha 2086bd24f21b3c2431246d76f16d995da79cbfd0

Fix link in PerfZero README (#368)

view details

Dong Lin

commit sha 33c1e66be21962fd475833da16b294152a1e3bc7

Allow flag --benchmark_methods to take a comma separated list of patterns (#371)

view details

Ayush Dubey

commit sha c776afc9c31227c919b75a42f1b3922095ae1d53

Ensure we do not miss output when running commands in subprocess (#372)

view details

Reed

commit sha 7da8c07d71b56d01453acaf6d0ad0161cfb27b9b

Do not parse values in "extras" dict as JSON. (#375) Currently, no benchmarks use the "extras" dict, so this change is a no-op. But I plan on using the "extras" dict soon, and I will put values that are not JSON. This change allows arbitrary extra information to be put in the "extras" dict.

view details

Reed

commit sha b16c83573c606d245f1ade90116511dee01b0e2b

Fix sample command in README. (#374) The benchmark method that was in the README no longer exists.

view details

Ayush Dubey

commit sha d17bf4caf0886a4b772ee4b95a474251bb271c5b

Add execution_id command line flag, and add hostname to system info. (#373)

view details

Toby Boyd

commit sha 5e2cb81e1ce307a127d6b659942a7761b902ed8c

Build from source docker (#377) * initial build file * Working python3 build environment CUDA 10 * add comment on installing blaze. * bazel not blaze

view details

Toby Boyd

commit sha 1cfde91c148e3d8da036b2d6b92159e5e1709f30

Python2 fix (#382) * Short-term Python2 support * Add comment about python2

view details

Toby Boyd

commit sha d7974c07f0e58163569b1ca6b0e734ed4b9d9b34

Revert "Python2 fix (#382)" (#383) This reverts commit 1cfde91c148e3d8da036b2d6b92159e5e1709f30.

view details

Toby Boyd

commit sha f5a2d06fe068c9b2405e2639fab1743038fa602f

Python2 fix (#384) * Short-term Python2 support * Add comment about python2 * Empty array to support python2 * Updated comment and fixed issue on tensorflow_profiler.py

view details

Haoyu Zhang

commit sha 50de51b1bfcb6812bfc3ca6d830ae5507adbb3fd

Support Install TensorFlow From Custom GS Path (#385) * Support TF pip package from GCS * Create dockerfile with custom pip package * Fix Dockerfile * Fix pip file path * Change docker build command to include context * Updated default path of pip package * Check tensorflow_pip_spec is not None * Use a separate path for docker context * Resolve comments

view details

push time in a month

issue commenttensorflow/benchmarks

absl import error when using AMD GPU, i.e. RX580

Closing this as an issue is filed on the ROCm fork.

dkgh000

comment created time in a month

issue closedtensorflow/benchmarks

How to get a pb model?

Can the benchmark save a .pb file instead of checkpoint files?

closed time in a month

ghost

issue commenttensorflow/benchmarks

How to get a pb model?

If by "pb file", you mean a saved model, this is impossible. However, you can get a GraphDef by passing --graph_file ~/graph.pbtxt
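
To load the dumped graph back later, something like this sketch should work (assuming the text-format output that the .pbtxt suffix suggests):

import tensorflow.compat.v1 as tf
from google.protobuf import text_format

# Parse the text-format GraphDef written by --graph_file and import it
# into the default graph.
with open('graph.pbtxt') as f:
    graph_def = text_format.Parse(f.read(), tf.GraphDef())
tf.import_graph_def(graph_def, name='')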

ghost

comment created time in a month

issue closedtensorflow/benchmarks

absl import error when using AMD GPU, i.e. RX580

When running python tf_cnn_benchmarks.py --num_gpu=1 --batch_size=32 --model=resnet50 --variable_update=parameter_server, the benchmark exited with an error message saying the absl import failed. The pypi.org repository has no absl pip package, only a similarly named package called absl-py, and installing it did not fix the problem. tf_cnn_benchmarks.py does have "from absl import <>" statements. For some reason, this is not an issue when running on an NVIDIA GPU; it is only seen on an AMD GPU, i.e. RX560.

closed time in a month

dkgh000

issue closedtensorflow/benchmarks

Should we change session manager from tf.train.Supervisor to tf.train.MonitoredTrainingSession?

I always hit this error in distributed_replicated mode: Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.UnavailableError'>, OS Error. It breaks the training program, but sometimes the error doesn't appear, which has impacted my scheduling work a lot. I googled it, and in tensorflow-issue-6780, @yaroslavvb said that when we hit UnavailableError, Supervisor won't restart the session; it just lets the session go down. MonitoredSession/MonitoredTrainingSession improve on Supervisor by adding retries in the presence of UnavailableError. So I wonder whether we are going to change the session manager from Supervisor to MonitoredTrainingSession?

closed time in a month

konnase

issue commenttensorflow/benchmarks

Should we change session manager from tf.train.Supervisor to tf.train.MonitoredTrainingSession?

Unfortunately, as tf_cnn_benchmarks has been unmaintained for a while and switching is difficult, this will likely never be done.

konnase

comment created time in a month

issue closedtensorflow/benchmarks

About the exact meaning of replicated (without all-reduce) mode

From the benchmarks code,

replicated mode for local jobs (without all-reduce):

it says that a regular cross-device aggregation is used to replicate the combined gradients to all towers.

I just want to know what exactly is meant by "regular cross-device aggregation"?

  • Is it using GPU P2P to copy the parameters among the GPUs directly? From the log, it doesn't appear to enable peer access.
  • Is it using the CPU as a parameter server to update the parameters? If yes, what's the difference from the normal parameter server mode?

closed time in a month

thincal

issue commenttensorflow/benchmarks

About the exact meaning of replicated (without all-reduce) mode

Sorry for the very very late response. Not sure where it says "without all-reduce" but that is wrong. It does an (inefficient) all-reduce using what you said in the first bullet point: it copies gradients among the GPUs directly.

As for GPU P2P access, TensorFlow simply calls cuMemcpyDtoDAsync to do the memory transfer. I think this will go through the CPU if the GPUs cannot communicate directly.
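
In TF terms the transfer is just an implicit copy between device placements; a minimal sketch of the pattern:

import tensorflow.compat.v1 as tf

with tf.device('/gpu:0'):
    grads = tf.random_normal([1024, 1024])
with tf.device('/gpu:1'):
    # TF inserts a Send/Recv pair here; on GPUs this becomes the
    # device-to-device memcpy described above.
    grads_copy = tf.identity(grads)

with tf.Session() as sess:
    sess.run(grads_copy)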

thincal

comment created time in a month

issue commenttensorflow/benchmarks

Profiling benchmark_cnn.py using Callgrind

It looks like this error is fairly normal and does not indicate an issue in TensorFlow or tf_cnn_benchmarks.

I'm not familiar with callgrind so I'm not sure why there is no callgrind.out.pid. Perhaps callgrind can sometimes handle the segment overflow, but not always? As for why the command gives an output in a few minutes, are you sure the process is ending normally and not crashing?

A command to reproduce would be helpful, but I cannot guarantee I can help given that tf_cnn_benchmarks is unmaintained and I am not familiar with callgrind.

ShreyaMaheshwari

comment created time in a month

issue commenttensorflow/benchmarks

Horovod discrepancies on eval_during_training_every_epochs

Unfortunately Horovod is not well tested. Since tf_cnn_benchmarks is unmaintained and I don't know how to run with Horovod, this will likely not be fixed.

taipin

comment created time in a month

issue commenttensorflow/benchmarks

Mismatched BROADCAST CPU/GPU

Unfortunately, tf_cnn_benchmarks is unmaintained and I am not very familiar with Horovod. So it is unlikely you'll get answers to Horovod issues :(

x666633

comment created time in a month

issue commenttensorflow/benchmarks

scale_step

I'm not sure what "scale_step" refers to. If you are still curious about this, can you clarify?

melo19890618

comment created time in a month

issue closedtensorflow/benchmarks

global_step

I have a problem: I want to implement a "jump calculation" for gradients (skipping the gradient computation on some steps) in variable_mgr.append_apply_gradients_ops, so I hoped to add a check in build_fetches during construction of the graph: if (global_step % 2 == 0). But global_step is a tensor, and we can only get its concrete value during session.run, so this approach doesn't seem to work well. Does anyone have any good suggestions?

closed time in a month

melo19890618

issue commenttensorflow/benchmarks

global_step

Sorry for the very late response. I'm not sure what a "jump calculation grad" is, but if this is still relevant to you, you can use a tf.cond to build the logic into the graph. Alternatively, you can pass a boolean into the Session.run call here, computed as step % 2 == 0, but it might be hard to pass that boolean around to the place where you want it.

However, since tf_cnn_benchmarks is unmaintained and isn't written in a TF2-style manner, I recommend using the Official Resnet Model instead. In TF2, you can check if global_step % 2 == 0 directly.
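
For the tf.cond route, a minimal TF1-style sketch (optimizer and grads_and_vars are hypothetical stand-ins for what you already have):

import tensorflow.compat.v1 as tf

global_step = tf.train.get_or_create_global_step()

def train_branch():
    # Build the apply op *inside* the branch: ops created outside a tf.cond
    # branch run unconditionally even if they are only referenced inside it.
    apply_op = optimizer.apply_gradients(grads_and_vars)
    with tf.control_dependencies([apply_op]):
        return tf.constant(True)

def skip_branch():
    return tf.constant(False)

did_train = tf.cond(tf.equal(tf.mod(global_step, 2), 0),
                    train_branch, skip_branch)
# Advance the step every iteration, whether or not gradients were applied.
with tf.control_dependencies([did_train]):
    train_op = tf.assign_add(global_step, 1)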

melo19890618

comment created time in a month

issue closedtensorflow/benchmarks

Where can I modify height * width of synthetic data ?

This is rather a question, not a bug report.

I am only a systems guy, not a developer, and my coding skill is very primitive.

I am trying to run a test with much larger images, and since this test suite can run with synthetic data, it is superb for my purpose. Still, I cannot find which part I should modify to enlarge the height and width of the synthetic data.

Can somebody be kind enough to show me which part I should look at?

closed time in a month

nasica88

issue commenttensorflow/benchmarks

Where can I modify height * width of synthetic data ?

Thanks @nasica88 for providing a solution!

nasica88

comment created time in a month

issue closedtensorflow/benchmarks

Synchronous vs Asynchronous training on a single machine

Hi,

For my research work, I would like to run TensorFlow benchmark training in synchronous and asynchronous mode on a single machine equipped with multiple GPUs, but I'm confused about a few things:

  1. Does "variable_update=parameter_server" alone provide asynchronous training, or does "variable_update=parameter_server" with "cross_replica_sync=false" provide asynchronous training?

  2. How can I run Synchronous training with "all_reduce" strategy?

  3. Which mode do "variable_update=replicated" and "variable_update=distributed_replicated" provide?

Could someone also point me to some official documentation for citation purposes?

Setup: TensorFlow Version: 1.13.1 tf_cnn_v1.13 branch 4 AMD Radeon RX470 GPU Ubuntu 18.04

Thanks Kind regards

closed time in a month

mirza03

issue commenttensorflow/benchmarks

Synchronous vs Asynchronous training on a single machine

  1. "variable_update=parameter_server" provides synchronous training by default. You need to additionally pass "cross_replica_sync=false" to get asynchronous training.
  2. Synchronous training is done by default for all variable_updates. Only parameter_server supports "cross_replica_sync=false", which is the only way to do asynchronous training.
  3. "replicated" and "distributed_replicated" do synchronous training by having each GPU keep a copy of the variables. After computing gradients, the GPUs all-reduce the gradients so that each GPU has the same gradients, then each independently applies the gradients to its copy of the variables (see the sketch below).
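
To make the third point concrete, here is a toy sketch of the replicated scheme (not the actual tf_cnn_benchmarks code; tower_grads is a hypothetical per-GPU list of (gradient, variable) pairs and tower_opts a matching list of per-GPU optimizers):

import tensorflow.compat.v1 as tf

num_gpus = len(tower_grads)
train_ops = []
for i in range(num_gpus):
    with tf.device('/gpu:%d' % i):
        # The "all-reduce": average each variable's gradient over all towers,
        # so every GPU applies identical gradients to its own variable copy.
        avg_grads = [
            (tf.add_n([tg[j][0] for tg in tower_grads]) / num_gpus,
             tower_grads[i][j][1])
            for j in range(len(tower_grads[i]))
        ]
        train_ops.append(tower_opts[i].apply_gradients(avg_grads))
train_op = tf.group(*train_ops)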

The official documentation was removed from the TensorFlow website since it was out of date (tf_cnn_benchmarks is no longer maintained and is not written in a clean, idiomatic TensorFlow style). But you can see a copy here: https://github.com/tensorflow/docs/blob/r1.10/site/en/performance/performance_models.md

mirza03

comment created time in a month

issue closedtensorflow/benchmarks

Stable branch for TF1.15?

Hi,

I notice that there are many branches corresponding to the different versions of tensorflow, such as cnn_tf_v1.x_compatible. This goes on until 1.13 but there don't seem to be any after that. Can I use the main master itself for running benchmarks, or is it not yet stable? I needed to clarify this so that I could be sure of my measurements.

Thanks

closed time in a month

keerthanss
more