Yimei Sun yimeisun123 yimei.sun@intel.com San Diego

tensorflow/ngraph-bridge 125

TensorFlow-nGraph bridge

yimeisun123/models 0

Models and examples built with TensorFlow

Pull request review comment tensorflow/community

RFC: Keras Mixed Precision API

# Keras Mixed Precision API

| Status        | Proposed                                                 |
| :------------ | :------------------------------------------------------- |
| **RFC #**     | [293](https://github.com/tensorflow/community/pull/293)  |
| **Author(s)** | Reed Wanderman-Milne (reedwm@google.com)                 |
| **Sponsor**   | Francois Chollet (fchollet@google.com)                   |
| **Updated**   | 2020-09-29                                               |

# Objective

Make mixed precision easy to use in Keras.

# Table of contents
* [Objective](#objective)
* [Motivation](#motivation)
* [User Benefit](#user-benefit)
* [Design Proposal: Dtype policies](#design-proposal-dtype-policies)
* [Design Proposal: Loss scaling](#design-proposal-loss-scaling)
* [Differences between the proposal and the API in TF 2.3](#differences-between-the-proposal-and-the-api-in-tf-23)
* [Alternatives Considered](#alternatives-considered)
* [Appendix](#appendix)

# Motivation

Mixed precision is the use of both 16-bit and 32-bit floating point types in the same model. Modern accelerators can run float16 and/or bfloat16 operations significantly faster than float32 operations, and these 16-bit dtypes take less memory. However, certain parts of the model must be in float32 for numeric stability. Compared to float32, the use of mixed precision allows models to run faster and use less memory while training just as well in terms of evaluation metrics such as accuracy. Having an easy-to-use mixed precision API allows users to achieve this performance benefit with little effort.

Variables and certain losses, including L2 loss, must be float32 for numeric stability. Most other parts of a model can be safely made float16.

When float16 is used (but not bfloat16), a technique called loss scaling is also required to avoid numeric underflow in gradients during the backwards pass. Loss scaling consists of multiplying the loss by a constant called the "loss scale", which causes gradients to be scaled by the same amount. The final float32 gradients are divided by the loss scale to bring them back to their original value.

See [this paper](https://arxiv.org/abs/1710.03740) for more background information on mixed precision.

# User Benefit

Users can improve performance when training models by using mixed precision. In particular, Volta GPUs and above, Google TPUs, and Cooper Lake CPUs have specialized lower-precision hardware that greatly improves mixed precision performance compared to float32.

# Design Proposal: Dtype policies

The mixed precision API consists of two parts: a dtype policy API allowing users to use a mix of float32 and a low-precision dtype (float16 or bfloat16) in their model, and a loss scaling API allowing users to prevent gradients from underflowing when float16 is used. This section describes the dtype policy API.

The API for this RFC is mostly already implemented. There is [a tutorial](https://www.tensorflow.org/guide/mixed_precision) explaining how to use it. The API was initially implemented without an RFC in order to get early feedback and experience with mixed precision before committing to a finalized API. Because the API has been in TensorFlow for a significant number of releases, we unfortunately must also consider how changing the API may break existing users who currently rely on mixed precision in their models. We cannot drastically modify the APIs in ways which would cause many models to break, but we can make minor backwards-incompatible changes and major backwards-compatible changes.

This RFC restates a lot of material in the tutorial, although it goes into much more detail on the API motivation and design, and less detail on the background of mixed precision. This RFC also proposes changes to the API, so the API presented in this RFC is not identical to the API of TensorFlow 2.3.

The mixed precision API is only available to Keras users, not users who only use low-level TensorFlow ops. At a minimum, users must use Keras layers and optimizers to use the mixed precision API, but they do not have to use other Keras classes such as models or losses. Users who do not use Keras can implement mixed precision with manual casting of tensors and manual scaling of the loss and gradients, but this is significantly more difficult than using Keras.

Throughout this proposal, the term "float16" will sometimes be used to refer to either float16 or bfloat16 in order to avoid repeating the phrase "float16 and bfloat16" throughout the doc.

## Dtype policies overview

Every layer will have a special object called a "dtype policy", determining the dtype of the layer's computations and variables. A dtype policy additionally holds a LossScale object, which is described in the [Loss Scaling section](#design-proposal-loss-scaling) of this RFC. The dtype policy class will be exposed as `tf.keras.mixed_precision.Policy`, and is currently exposed in TensorFlow 2.3 as [`tf.keras.mixed_precision.experimental.Policy`](https://www.tensorflow.org/api_docs/python/tf/keras/mixed_precision/experimental/Policy).

Each policy has a `compute_dtype` field and a `variable_dtype` field. The compute dtype specifies the dtype of a layer's operations. A layer's output is typically in the compute dtype. The variable dtype is the default dtype of the layer's variables, although in a few rare cases a layer may create variables of a different dtype.

Despite the word "mixed_precision" in the API name, all layers, even non-mixed precision layers, have a dtype policy. However, the only motivation for layers having a dtype policy instead of a single dtype is to support mixed precision, and only mixed precision users need to directly interact with policies, which is why the word "mixed_precision" appears in the API name.

The constructor for a policy is:

```python
class Policy:

  def __init__(self, name, loss_scale='auto'):
    ...
```

The `name` argument determines the compute dtype and the variable dtype. It also determines the loss scale of the policy if the user does not pass in a loss scale to the constructor. The possible values for `name` are:

*   Any dtype: The compute and variable dtype is `name`. By default, there is no loss scale. `name` can be a DType, or any value convertible to a DType (including strings).
*   `"mixed_float16"`: The compute dtype is float16. The variable dtype is float32. The default loss scale is "dynamic".
*   `"mixed_bfloat16"`: The compute dtype is bfloat16. The variable dtype is float32. There is no default loss scale, as loss scaling is only useful when float16 is used.

Unlike most TensorFlow functions with a `name` argument, the Policy `name` argument has a semantic impact on the TensorFlow program, and is not just used to uniquely identify an op or layer. The word "name" is chosen for lack of a better word to call the argument.

The `loss_scale` argument can override the default loss scale, although this is rarely useful. The primary use case of the `loss_scale` argument is to debug performance by using no loss scale or a fixed loss scale, in order to determine the performance impact of dynamic loss scaling.

`loss_scale` defaults to the string "auto", which means a dynamic loss scale is used for the "mixed_float16" policy and no loss scale is used for other policies. Typically in Keras, arguments default to None, but None is used here to indicate that no loss scaling should be used, so "auto" is the default value instead. Alternatively, `loss_scale` could default to None, and the value `1` or `0` could explicitly disable loss scaling, but this is less clear than the string "auto".

The "mixed_float16" and "mixed_bfloat16" policies are referred to as "mixed precision policies", as they cause layers to use mixed precision. "mixed_float16" will improve performance on Volta GPUs and above, while "mixed_bfloat16" will improve performance on Ampere GPUs (once TensorFlow adds support for bfloat16 on GPUs), Cooper Lake CPUs (if the [Intel MKL TensorFlow builds](https://software.intel.com/content/www/us/en/develop/articles/intel-optimization-for-tensorflow-installation-guide.html) are used), and Google TPUs.

The "float16" and "bfloat16" policies are not very useful, because models do not train well when variables are in those dtypes. A warning will be issued if one of these policies is set as the global policy (the global policy is described in the next section). The "mixed_float16" and "mixed_bfloat16" policies should be used instead, which train just as well as "float32". We must keep the "float16" and "bfloat16" policies for backwards compatibility, however.

The policy name simply determines the `compute_dtype`, the `variable_dtype`, and the default `loss_scale`. The `name` is not directly used by Keras other than to determine these values.

The `compute_dtype` and `variable_dtype` cannot be directly specified, but instead can only be specified through `name`, because most combinations of these two dtypes are not useful and are not supported by some layers. For example, BatchNormalization has special logic to handle the "mixed_float16" and "mixed_bfloat16" policies, and will not work in other cases where `compute_dtype` is different from `variable_dtype`.

A simplified implementation of Policy is shown below. Policy itself does not do any computations, but simply stores a `compute_dtype` field, a `variable_dtype` field, and a `loss_scale` field:

```python
class Policy:

  def __init__(self, name, loss_scale='auto'):
    self.name = name
    if name in ('float16', 'float32', 'int32', ...):  # List all other dtypes here
      self.compute_dtype = self.variable_dtype = name
      loss_scale = None if loss_scale == 'auto' else loss_scale
    elif name == 'mixed_float16':
      self.compute_dtype = 'float16'
      self.variable_dtype = 'float32'
      loss_scale = 'dynamic' if loss_scale == 'auto' else loss_scale
    else:
      assert name == 'mixed_bfloat16'
      self.compute_dtype = 'bfloat16'
      self.variable_dtype = 'float32'
      loss_scale = None if loss_scale == 'auto' else loss_scale
    self.loss_scale = convert_to_loss_scale(loss_scale)

  @property
  def should_cast_variables(self):
    return self.compute_dtype != self.variable_dtype
```

In addition to this simplified Policy class, the actual Policy class will

*   Expose `get_config` and `from_config` methods
*   Have `name`, `compute_dtype` and `variable_dtype` be read-only
*   Raise ValueError instead of using assertions
*   Give a warning if "mixed_float16" or "mixed_bfloat16" is used but the model is not run on supported hardware
*   Allow instances of `tf.dtypes.DType` or anything convertible to a dtype with `tf.dtypes.as_dtype` to be passed to the constructor instead of a string. The `name` property will still be a string. Note for "mixed_float16" and "mixed_bfloat16", a string must be passed as there is no equivalent dtype.

## The global policy

There is a global policy which serves as the default policy for layers. It can be set and retrieved with:

```python
# Proposed API
tf.keras.mixed_precision.set_global_policy(policy)
tf.keras.mixed_precision.global_policy()
```

Currently in TensorFlow 2.3 these are exposed as:

```python
# Existing API in TensorFlow 2.3
tf.keras.mixed_precision.experimental.set_policy(policy)
tf.keras.mixed_precision.experimental.global_policy()
```

`set_policy()` is renamed to `set_global_policy()` to make it clear the policy is global and to be consistent with the `global_policy()` function.

In addition to a policy, `set_global_policy` can take a string or any other type which can be passed to the constructor of Policy. If a non-policy is passed, a new policy will be constructed and set as the global policy. E.g., the following two options are equivalent:

```python
name = ...

# Option 1 (Both options are equivalent)
policy = tf.keras.mixed_precision.Policy(name)
tf.keras.mixed_precision.set_global_policy(policy)

# Option 2
tf.keras.mixed_precision.set_global_policy(name)
```

Option 2 will be preferred and used in tutorials. However, Option 1 is required if a user wants to use a non-default loss scale in the global policy.

The global policy is the default policy for layers. When a layer is constructed, the layer's dtype policy will be set to the global policy by default. For example:

```python
layer = tf.keras.layers.Dense(10)
print(layer.dtype_policy.name)  # float32, the default dtype policy
tf.keras.mixed_precision.set_global_policy("mixed_float16")
layer2 = tf.keras.layers.Dense(10)
print(layer2.dtype_policy.name)  # mixed_float16
print(layer.dtype_policy.name)  # float32. A layer dtype policy never changes.
```

The global policy is only used by Keras to determine the default layer policy, and has no other purpose. The next section describes in detail how a layer users dtype policies.

To use mixed precision in a model, the global policy must be set to "mixed_float16" or "mixed_bfloat16" before the model is constructed. For many models, this is all that is required to use mixed precision.

```python
# Use mixed precision, for Volta+ GPUs
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Use mixed precision, for Cooper Lake CPUs, Ampere GPUs, or Google TPUs
tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")
```

If unset, the global policy defaults to a Policy constructed from the current value of `tf.keras.backend.floatx`, which itself defaults to "float32" unless changed in the `~/.keras/keras.json` file. Until the global policy is explicitly set, it will track the current value of floatx, so changing floatx changes the value of the global policy. Once the global policy is set, it no longer tracks floatx. Calling `set_global_policy(None)` will set the global policy to track floatx again, if it has previously been set to an explicit policy. For example:

```python
# By default, the global policy tracks floatx
print(tf.keras.mixed_precision.global_policy())  # float32
tf.keras.backend.set_floatx("float16")
print(tf.keras.mixed_precision.global_policy())  # float16

# Once set, the global policy no longer matches floatx
tf.keras.mixed_precision.set_global_policy("float64")
print(tf.keras.mixed_precision.global_policy())  # float64
tf.keras.backend.set_floatx("float32")
print(tf.keras.mixed_precision.global_policy())  # float64

# Make the global policy track floatx again
tf.keras.mixed_precision.set_global_policy(None)
print(tf.keras.mixed_precision.global_policy())  # float32
```

These rules for tracking floatx are relatively unimportant and exist primarily for backwards compatibility. For the most part, users do not have to be aware of floatx and can simply assume the global policy defaults to float32.

`set_global_policy` requires the policy to be floating-point. That is, the policy's name must be one of "float16", "bfloat16", "float32", "float64", "mixed_float16", or "mixed_bfloat16". The reason is that most layers do not support non-floating-point policies.

A warning will be issued if the global policy is set to "float16" or "bfloat16", as these policies typically result in substantially worse training results. Also, such policies are typically not useful for inference, as a model with float16 variables cannot load training checkpoints with float32 variables.

### The global policy is not thread local

The global policy will not be thread-local, although this may change in the future. The reason it is not thread-local is that [`tf.distribute.MirroredStrategy`](https://www.tensorflow.org/api_docs/python/tf/distribute/MirroredStrategy) currently spawns threads, each of which has its own version of thread-local variables. MirroredStrategy will explicitly copy all the TensorFlow thread-local variables from the parent thread to its child threads, but since Keras is being moved outside TensorFlow, MirroredStrategy cannot depend on Keras and therefore it cannot copy the global policy.

As a consequence of the global policy not being thread-local, it is impossible to create a float32 model and a mixed precision model in different threads in parallel.

If in the future MirroredStrategy stops using threads or exposes a mechanism to copy over thread-local variables to its child threads, we will consider changing the policy to be thread-local. Making the policy thread-local may break a user's model if they spawn threads and create layers from within the threads, but it is unlikely users do this in practice.

### Where should users set the global policy?

It can be difficult to determine where a user should set the global policy to "mixed_float16" if they want to use mixed precision. For example, suppose they want to add mixed precision to a program that looks like this:

```python
def create_model(num_layers):
  return tf.keras.Sequential([tf.keras.layers.Dense(10, activation="relu")
                              for _ in range(num_layers)])

def main(_):
  model = create_model(num_layers=10)
  model.compile(...)
  dataset = ...
  model.fit(dataset)

if __name__ == "__main__":
  app.run(main)
```

A `dtype` flag can be added which can be "float32" or "mixed_float16", but should the global dtype policy be set in `create_model` or `main`? In this case, we will recommend users set it in `main`. The issue is that `create_model` appears to be stateless, so we should avoid setting state by calling `set_global_policy`. The recommended way to add mixed precision is therefore:

```python
flags.DEFINE_enum("dtype", "float32", ("float32", "mixed_float16"),
                  help="Dtype policy to use")

def create_model(num_layers):
  return tf.keras.Sequential([tf.keras.layers.Dense(10, activation="relu")
                              for _ in range(num_layers)])

def main(_):
  tf.keras.mixed_precision.set_global_policy(flags.FLAGS.dtype)
  model = create_model(num_layers=10)
  model.compile(...)
  dataset = ...
  model.fit(dataset)

if __name__ == "__main__":
  app.run(main)
```

Alternatively, the global policy can be set in `create_model`. This will work in cases where multiple models are not built in parallel using threads, and where every time a model is created, the global policy is set to an appropriate value.

I am a little confused by this case. Is this the approach for the use case "where multiple models are not built in parallel using threads", or for the case "where multiple models are built in parallel using threads"?

reedwm

comment created time in 22 days
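The loss-scaling technique described in the RFC excerpt above can be made concrete with a short sketch: the loss is multiplied by a scale factor before gradients are computed, and the resulting gradients are divided by the same factor before being applied. This is a minimal illustration assuming a fixed, hand-picked loss scale and a toy model; the RFC's proposed API instead attaches a (typically dynamic) loss scale to the dtype policy and handles this arithmetic inside the optimizer.

```python
import tensorflow as tf

# Illustrative fixed loss scale; the RFC's "dynamic" option adjusts this value automatically.
LOSS_SCALE = 1024.0

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
optimizer = tf.keras.optimizers.SGD()
loss_fn = tf.keras.losses.MeanSquaredError()

def train_step(x, y):
  with tf.GradientTape() as tape:
    loss = loss_fn(y, model(x))
    scaled_loss = loss * LOSS_SCALE                       # scale the loss ...
  scaled_grads = tape.gradient(scaled_loss, model.trainable_variables)
  grads = [g / LOSS_SCALE for g in scaled_grads]          # ... then unscale the gradients
  optimizer.apply_gradients(zip(grads, model.trainable_variables))
  return loss

x = tf.random.normal([4, 8])
y = tf.random.normal([4, 10])
print(train_step(x, y))
```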

PullRequestReviewEvent

Pull request review comment tensorflow/community

RFC: Keras Mixed Precision API

The global policy is only used by Keras to determine the default layer policy, and has no other purpose. The next section describes in detail how a layer users dtype policies.

Typo in "... layer users dtype ...": "users" should be "uses".

reedwm

comment created time in 23 days

PullRequestReviewEvent

issue comment tensorflow/models

Run official/nlp/transformer in eager mode

Thanks for your quick reply. In other models, e.g., ResNet50, we can pass the `--run_eagerly=True` option to run the model in pure eager mode, but I don't see this option for Transformer. I am looking for the same behavior for Transformer.
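One way to approximate pure eager behavior while such a flag is missing is to disable `tf.function` compilation globally. A minimal sketch follows; it assumes `tf.config.run_functions_eagerly` (named `tf.config.experimental_run_functions_eagerly` in older TF 2.x releases) is an acceptable workaround, and is not an option provided by the Transformer scripts themselves.

```python
import tensorflow as tf

# Force tf.function-decorated code to run eagerly (useful for debugging).
tf.config.run_functions_eagerly(True)

@tf.function
def square(x):
  # With the setting above, this body executes eagerly on every call.
  print("running eagerly:", tf.executing_eagerly())
  return x * x

print(square(tf.constant(3)))
```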

yimeisun123

comment created time in a month

issue opened tensorflow/models

Run official/nlp/transformer in eager mode

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [ ] I checked to make sure that this feature has not been requested already.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/blob/master/official/nlp/transformer/transformer_main.py

2. Describe the feature you request

Would there be an option to run Transformer in eager mode?

3. Additional context

Add any other context about the feature request here.

4. Are you willing to contribute it? (Yes or No)
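As a sketch of what the requested option could look like, the snippet below wires a hypothetical `--run_eagerly` flag into a training script using absl flags and the same `tf.config` call noted above; the flag name and wiring are illustrative assumptions, not existing transformer_main.py code.

```python
from absl import app, flags
import tensorflow as tf

flags.DEFINE_bool("run_eagerly", False,
                  "Run the model eagerly instead of compiling tf.functions.")
FLAGS = flags.FLAGS

def main(_):
  if FLAGS.run_eagerly:
    # Disable tf.function compilation so the model runs in pure eager mode.
    tf.config.run_functions_eagerly(True)
  # ... build and train the model as usual ...

if __name__ == "__main__":
  app.run(main)
```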

created time in a month

fork yimeisun123/models

Models and examples built with TensorFlow

fork in a month

pull request comment tensorflow/tensorflow

[Intel MKL] Enabling BatchNorm test in AutoMPMkl

@penpornk, @gbaned - thanks for reviewing this PR. Sorry for our team's long-delayed response to the comments. I just pushed a new commit and also responded to the review comments in the comment section. Please let me know if any additional action is needed. Thanks.

nhasabni

comment created time in a month

Pull request review comment tensorflow/tensorflow

[Intel MKL] Enabling BatchNorm test in AutoMPMkl

 class FusedBatchNormOpTest : public OpsTestBase {
     CommonTestUtilities<T>::VerifyTensorsClose(exponential_avg_factor,
                                                is_training, run, run_mkl);
   }
 
+  void VerifyFusedBatchNormGradWithConv2D(const float epsilon) {
+    const GraphRunnerGrad run =
+        [this](const Tensor& input, const Tensor& filter,
+               const Tensor& y_backprop, const Tensor& scale,
+               const Tensor& mean, const Tensor& variance,
+               const Tensor& res_sp3, Tensor* x_backprop_tensor,
+               Tensor* scale_backprop_tensor, Tensor* offset_backprop_tensor,
+               const float epsilon) {
+          auto root = tensorflow::Scope::NewRootScope();
+
+          auto input_op =
+              ops::Const(root.WithOpName("input"), Input::Initializer(input));
+          auto filter_op =
+              ops::Const(root.WithOpName("filter"), Input::Initializer(filter));
+          ops::Conv2D::Attrs conv_attr;
+          conv_attr = conv_attr.DataFormat("NHWC");
+          auto conv = ops::Conv2D(root.WithOpName("Conv"), input_op, filter_op,
+                                  {1, 1, 1, 1}, "SAME", conv_attr);
+          // -------------------------------------------------------------
+          auto y_backprop_op = ops::Const(root.WithOpName("y_backprop"),
+                                          Input::Initializer(y_backprop));
+          auto scale_op =
+              ops::Const(root.WithOpName("scale"), Input::Initializer(scale));
+          auto mean_op =
+              ops::Const(root.WithOpName("mean"), Input::Initializer(mean));
+          auto var_op = ops::Const(root.WithOpName("variance"),
+                                   Input::Initializer(variance));
+          auto res_sp3_op = ops::Const(root.WithOpName("reserve_space_3"),
+                                       Input::Initializer(res_sp3));
+          ops::FusedBatchNormGradV3::Attrs bn_attr;
+          bn_attr = bn_attr.IsTraining(true);
+          bn_attr = bn_attr.Epsilon(epsilon);
+          bn_attr = bn_attr.DataFormat("NHWC");
+          auto bn = ops::FusedBatchNormGradV3(
+              root.WithOpName("FusedBatchNormGrad"), y_backprop_op, conv,
+              scale_op, mean_op, var_op, res_sp3_op, bn_attr);
+
+          auto x_backprop =
+              ops::Identity(root.WithOpName("x_backprop"), bn.x_backprop);
+          auto scale_backprop = ops::Identity(root.WithOpName("scale_backprop"),
+                                              bn.scale_backprop);
+          auto offset_backprop = ops::Identity(
+              root.WithOpName("offset_backprop"), bn.offset_backprop);
+
+          tensorflow::GraphDef graph;
+          TF_ASSERT_OK(root.ToGraphDef(&graph));
+
+          tensorflow::SessionOptions session_options;
+          std::unique_ptr<tensorflow::Session> session(
+              tensorflow::NewSession(session_options));
+          TF_ASSERT_OK(session->Create(graph));
+
+          std::vector<Tensor> output_tensors;
+          TF_ASSERT_OK(session->Run(
+              {}, {"x_backprop", "scale_backprop", "offset_backprop"}, {},
+              &output_tensors));
+
+          *x_backprop_tensor = output_tensors[0];
+          *scale_backprop_tensor = output_tensors[1];
+          *offset_backprop_tensor = output_tensors[2];
+        };
+
+    const GraphRunnerGrad run_mkl =
+        [this](const Tensor& input, const Tensor& filter,
+               const Tensor& y_backprop, const Tensor& scale,
+               const Tensor& mean, const Tensor& variance,
+               const Tensor& res_sp3, Tensor* x_backprop_tensor,
+               Tensor* scale_backprop_tensor, Tensor* offset_backprop_tensor,
+               const float epsilon) {
+          Tensor conv2d_output, conv2d_meta_output;
+          Conv2DOpTest<T> conv2d_test;
+          conv2d_test.RunConv2D(input, filter, &conv2d_output,
+                                &conv2d_meta_output);
+
+          DataType dtype = DataTypeToEnum<T>::v();
+          TF_EXPECT_OK(
+              NodeDefBuilder("MklFusedBatchNorm", "_MklFusedBatchNormGradV3")
+                  .Input(FakeInput(dtype))
+                  .Input(FakeInput(dtype))
+                  .Input(FakeInput(DT_FLOAT))
+                  .Input(FakeInput(DT_FLOAT))
+                  .Input(FakeInput(DT_FLOAT))
+                  .Input(FakeInput(DT_FLOAT))
+                  .Input(FakeInput(DT_UINT8))
+                  .Input(FakeInput(DT_UINT8))
+                  .Input(FakeInput(DT_UINT8))
+                  .Input(FakeInput(DT_UINT8))
+                  .Input(FakeInput(DT_UINT8))
+                  .Input(FakeInput(DT_UINT8))
+                  .Attr("epsilon", epsilon)
+                  .Attr("is_training", true)
+                  .Attr("data_format", "NHWC")
+                  .Attr("_kernel", "MklLayoutDependentOp")
+                  .Finalize(node_def()));
+          TF_EXPECT_OK(InitOp());
+
+          AddInputFromArray<T>(y_backprop.shape(), y_backprop.flat<T>());
+          AddInputFromArray<T>(conv2d_output.shape(), conv2d_output.flat<T>());
+          AddInputFromArray<float>(scale.shape(), scale.flat<float>());
+          AddInputFromArray<float>(mean.shape(), mean.flat<float>());
+          AddInputFromArray<float>(variance.shape(), variance.flat<float>());
+          AddInputFromArray<float>(res_sp3.shape(), res_sp3.flat<float>());
+          AddInputFromArray<uint8>(dummy_shape, dummy_tensor);
+          AddInputFromArray<uint8>(conv2d_meta_output.shape(),
+                                   conv2d_meta_output.flat<uint8>());
+          AddInputFromArray<uint8>(dummy_shape, dummy_tensor);
+          AddInputFromArray<uint8>(dummy_shape, dummy_tensor);
+          AddInputFromArray<uint8>(dummy_shape, dummy_tensor);
+          AddInputFromArray<uint8>(dummy_shape, dummy_tensor);
+          TF_ASSERT_OK(RunOpKernel());
+
+          CommonTestUtilities<T> test_util;
+          test_util.PerformConversion(dtype, *GetOutput(0), *GetOutput(5),

The indices in the code are correct as they are the output indices.

nhasabni

comment created time in a month

PullRequestReviewEvent

Pull request review comment tensorflow/tensorflow

[Intel MKL] Enabling BatchNorm test in AutoMPMkl

 class CommonTestUtilities : public OpsTestBase {
     test::ExpectClose(batch_var, mkl_batch_var, 1e-5);
   }
 
+  static void VerifyTensorsCloseForGrad(const float epsilon,
+                                        const GraphRunnerGrad& run,
+                                        const GraphRunnerGrad& run_mkl) {
+    int batch = 2;
+    int height = 8;
+    int width = 8;
+    int depth = 1;
+    DataType dtype = DataTypeToEnum<T>::v();
+
+    Tensor input(dtype, {batch, height, width, depth});
+    input.flat<T>() = input.flat<T>().template setRandom<random_gen_>();
+    Tensor filter(dtype, {3, 3, 1, 6});

Updated per suggestion in the new commit. Please check.

nhasabni

comment created time in a month

PullRequestReviewEvent

Pull request review comment tensorflow/tensorflow

[Intel MKL] Enabling BatchNorm test in AutoMPMkl

 class MklFusedBatchNormGradOp : public OpKernel {
               ? dnn_shape_diff_dst.GetMklLayout()
               : memory::desc(diff_dst_dims, MklDnnType<T>(), dnn_fmt);
 
+      MklDnnData<T> reorder_src(&cpu_engine_);
+      MklDnnData<T> reorder_diff_dst(&cpu_engine_);
+      T* diff_dst_data =
+          static_cast<T*>(const_cast<T*>(diff_dst_tensor.flat<T>().data()));
+      T* src_data =
+          static_cast<T*>(const_cast<T*>(src_tensor.flat<T>().data()));
+
+#ifdef ENABLE_MKLDNN_V1
+      // Ensure that src and diff_dst are in same blocked memory layout.
+      // As per MKL-DNN doc, this will lead to faster perf.

If both inputs are MKL tensors, they are already in blocked format, so no adjustment is needed for MKL-DNN. If both are in TensorFlow native format, no adjustment is needed either, since MKL-DNN can handle that case. If the two inputs are in different formats, the one in TensorFlow native format is reordered to blocked format, because MKL-DNN performs better with blocked format.

I have updated comments.

nhasabni

comment created time in a month

PullRequestReviewEvent

Pull request review comment tensorflow/tensorflow

[Intel MKL] Enabling BatchNorm test in AutoMPMkl

 tf_cc_test_mkl(
     srcs = ["mkl_fused_batch_norm_op_test.cc"],
     linkstatic = 1,
     deps = [
+        ":mkl_conv_op",

FYI - this change was added in the latest commit rather than the original commits, since the original commits were made a while back and the baseline change of moving the MKL kernels created the need for this dependency.

nhasabni

comment created time in a month

PullRequestReviewEvent

push event Intel-tensorflow/tensorflow

Yimei Sun

commit sha 78ce403921649b9f98f3172d47196c638c8aef0b

Address the review comments

view details

push time in a month

pull request comment tensorflow/tensorflow

Add auto_mixed_precision_mkl to run-once optimizer list

@rmlarsen - have you had a chance to check this PR? Thanks.

yimeisun123

comment created time in 3 months
