Frank Chen (frankchn) · @google · @tensorflow · Mountain View, CA

frankchn/cs143 10

CS143 Decaf Project

frankchn/cs224n 4

CS224N Programming Assignments

abi/Our-Tree 3

Distributed R+ Trees. No, Native Client MapReduce! No, SHA-3 GPU implementation. But no! CPU/GPU RC4/AES implementation!

frankchn/cs108 2

CS108 Quiz Website

frankchn/introduction-to-tensorflow 2

Sample code for Introduction to TensorFlow talks I give.

frankchn/cs145 1

CS145 Cheatsheet

frankchn/cs147 1

CS147 Project for Fall 2011

frankchn/cs229 1

CS 229 Project Report

issue comment tensorflow/custom-op

Custom Op TPU

Yup, working on it. Thanks for bringing that to our attention!

bhack

comment created time in a month

issue comment tensorflow/custom-op

Custom Op TPU

While it is technically possible to write an XLA HLO op and get it to run on TPUs, we currently don't expose any way to load arbitrary user-written HLO ops onto the TPU system itself. This may change in future releases, but we don't have anything to announce today.
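
For reference, the path custom ops take today targets CPU/GPU only: you compile a shared object and load it at runtime. A minimal sketch along the lines of the zero_out example from the custom-op template (the .so path and name are assumptions here):

import tensorflow as tf

# Load the compiled kernel library; the resulting op runs on CPU/GPU, but
# there is no analogous hook for shipping user-written HLO onto a TPU.
zero_out_module = tf.load_op_library('./_zero_out_ops.so')
print(zero_out_module.zero_out([[1, 2], [3, 4]]))  # keeps only the first element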

bhack

comment created time in a month

pull request comment tensorflow/tensorflow

Update SlurmClusterResolver documentation

Thank you!

Flamefire

comment created time in 2 months

pull request comment tensorflow/tensorflow

Add MPI cluster resolver and update documentation of SLURM cluster resolver

Sorry about the back and forth!

Flamefire

comment created time in 2 months

pull request comment tensorflow/tensorflow

Add MPI cluster resolver and update documentation of SLURM cluster resolver

Hi @Flamefire, I just got updated on an internal discussion, and we are actually thinking of moving some of the cluster resolvers out of the TPU repo into a repo owned by a SIG instead (maybe SIG-Networking).

We're still happy to accept the documentation update, though, so could you remove mpi_cluster_resolver.py from the PR, create a new issue re: mpi_cluster_resolver, and assign it to me and @jhseu?

We'll update you on whether we decide to move the cluster resolvers soon. Thanks!

Flamefire

comment created time in 2 months

pull request comment tensorflow/tensorflow

Add MPI cluster resolver and update documentation of SLURM cluster resolver

Yeah that test failure is unrelated.

I think we are getting some internal tests done, and this will be merged in the next day or two. Not sure if this will make it into 2.2 though, since there have already been RC cuts and they are only really willing to accept bug fixes at this point. It will definitely be in 2.3 (cut around May) though.

Flamefire

comment created time in 2 months

pull request comment tensorflow/tensorflow

Add MPI cluster resolver and update documentation of SLURM cluster resolver

Yeah, for the open-source builds we are not supporting Python 2 any more, but internally at Google the migration to Python 3 isn't complete, which is why the pre-submit checks that enforce the compatibility headers are still around.

Your code won't be run internally, so I think it would be fine to leave it as is otherwise.
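
For readers following along: the compatibility headers being discussed are presumably the standard __future__ imports that TensorFlow Python files carried at the time, i.e. something like:

# The usual header, assumed to be what the pre-submit checks look for:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function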

Flamefire

comment created time in 2 months

pull request comment tensorflow/tensorflow

Add MPI cluster resolver and update documentation of SLURM cluster resolver

@Flamefire For consistency, can you just add it in? I think all our files have it, since we still have some code internally on Py2 and need to maintain compatibility for now. Thanks!

Flamefire

comment created time in 2 months

Pull request review comment tensorflow/tensorflow

Add MPI cluster resolver and update documentation of SLURM cluster resolver

+# Copyright 2018-2020 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Implementation of a Cluster Resolver for MPI jobs."""
+
+from socket import gethostname
+from slurm_cluster_resolver import SlurmClusterResolver
+
+from tensorflow.python.util.tf_export import tf_export
+
+
+@tf_export('distribute.cluster_resolver.MPIClusterResolver')
+class MPIClusterResolver(SlurmClusterResolver):
+  """ClusterResolver for programs run via MPI (mpirun, srun, ...)
+
+  This is an implementation of ClusterResolver for MPI programs which can
+  be used for distributed TensorFlow.
+  When no explicit values are set it will retrieve all information from MPI.
+  For rank and number of Tasks the values are gotten from the MPI_COMM_WORLD
+  communicator. It also automatically resolves hostnames which requires some
+  communication on startup.
+  """
+
+  def __init__(self,
+               jobs=None,
+               port_base=8888,
+               gpus_per_node=None,
+               gpus_per_task=None,
+               auto_set_gpu=True,
+               rpc_layer='grpc'):
+    """Creates a new MPIClusterResolver object.
+
+    For any parameter not set it will query MPI_COMM_WORLD for the value.
+    With the number of GPUs per node and per task it allocates GPUs to tasks by
+    setting environment variables.
+    Using the resolver works best (and is easier) with homogeneous tasks but
+    heterogeneous tasks (number of tasks varying per node) are also possible as
+    long as the number of GPUs per task stays constant.
+
+    Args:
+      jobs: Dictionary with job names as key and number of tasks in the job as
+        value. Defaults to as many 'worker's as there are MPI tasks.
+      port_base: The first port number to start with for processes on a node.
+      gpus_per_node: Number of GPUs available on each node. Defaults to the
+        number of GPUs reported by nvidia-smi
+      gpus_per_task: Number of GPUs to be used for each task. Default is to
+        evenly distribute the gpus_per_node to tasks_per_node.
+      auto_set_gpu: Set the visible CUDA devices automatically while resolving
+        the cluster by setting CUDA_VISIBLE_DEVICES environment variable.
+        Defaults to True.
+      rpc_layer: The protocol TensorFlow used to communicate between nodes.
+        Defaults to 'grpc'.
+
+    Returns:
+      A MPIResolver object which can be used with distributed TensorFlow.
+
+    Raises:
+      RuntimeError: If requested more GPUs per node then available or
+        requested more tasks then assigned tasks.
+    """
+    from mpi4py import MPI # pylint: disable=import-outside-toplevel

Yeah, works for me!
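
As background for readers of this thread, here is a minimal usage sketch of the existing SlurmClusterResolver that the proposed MPIClusterResolver subclasses; it assumes a recent TF build and an active SLURM allocation:

import tensorflow as tf

# Inside an salloc/sbatch allocation, the resolver reads the scheduler's
# environment and builds a ClusterSpec for distributed TensorFlow.
resolver = tf.distribute.cluster_resolver.SlurmClusterResolver()
print(resolver.cluster_spec())  # tf.train.ClusterSpec listing the worker hosts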

Flamefire

comment created time in 2 months

Pull request review comment tensorflow/tensorflow

Add MPI cluster resolver and update documentation of SLURM cluster resolver

+    from mpi4py import MPI # pylint: disable=import-outside-toplevel

Maybe catch the ImportError and return a better error message directing users to install mpi4py? Something like https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/distribute/cluster_resolver/tpu_cluster_resolver.py#L32-L34 works.
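
A rough sketch of what that could look like, mirroring the linked tpu_cluster_resolver pattern (the exact error wording here is just an example):

try:
  from mpi4py import MPI  # pylint: disable=import-outside-toplevel
except ImportError:
  raise ImportError('mpi4py must be installed to use MPIClusterResolver. '
                    'To install, run "pip install mpi4py".')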

Flamefire

comment created time in 2 months

pull request comment tensorflow/tensorflow

Enhance SlurmClusterResolver

Indentation is a major difference between our internal style guide and the public one, so you should use 2 spaces here.
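
For example (a purely hypothetical helper, shown only to illustrate the expected 2-space indentation):

def _tasks_per_node(node_list):
  """Counts how many tasks land on each node in the allocation."""
  counts = {}
  for node in node_list:  # note: 2-space indentation throughout
    counts[node] = counts.get(node, 0) + 1
  return counts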

Flamefire

comment created time in 2 months

pull request comment tensorflow/tensorflow

Enhance SlurmClusterResolver

According to the Google Python style guide, yapf with --style google should work.
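
If it helps, the same style can also be applied from Python through yapf's API; a small sketch (equivalent to the --style google CLI invocation mentioned above):

from yapf.yapflib.yapf_api import FormatCode

source = "def f( a,b ):\n    return a+b\n"
formatted, _ = FormatCode(source, style_config='google')
print(formatted)  # prints the source reformatted to the Google style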

Flamefire

comment created time in 2 months

pull request comment tensorflow/tensorflow

Enhance SlurmClusterResolver

Just create PRs and assign me as the reviewer.

Flamefire

comment created time in 2 months

pull request comment tensorflow/tensorflow

Enhance SlurmClusterResolver

@Flamefire Sure, both of those will be super helpful! Thanks for contributing to TensorFlow :)

Flamefire

comment created time in 2 months

Pull request review comment tensorflow/tensorflow

Remove expired forward compatibility horizons

 def testImplicitForm(self):
           np.array([[[1], [3], [2]], [[5], [6], [8]]], dtype=dtype),
           expected=np.array(128, dtype=dtype))
 
-      with compat.forward_compatibility_horizon(2019, 10, 19):

same as above.

lgeiger

comment created time in 2 months

Pull request review comment tensorflow/tensorflow

Remove expired forward compatibility horizons

 def __init__(self, input_dataset, num_workers, index):
     self._input_dataset = input_dataset
 
     self._element_spec = input_dataset.element_spec
-    if (compat.forward_compatible(2019, 11, 25) or

You can remove the compat.forward_compatible part, but you should probably keep both sides of the if statement here, since there is the other condition on the auto_shard_policy setting.
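
For context on why only the compat check can go: compat.forward_compatible(...) just compares against a date horizon, so once that date has passed it is always True and the guarded branch is the only one still reachable. A small illustration:

from tensorflow.python.compat import compat

# An expired horizon: on any build after 2019-11-25 this is always True,
# which is why the check itself can be deleted, while any other condition
# that was OR'ed with it still has to be kept.
print(compat.forward_compatible(2019, 11, 25))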

lgeiger

comment created time in 2 months

Pull request review comment tensorflow/tensorflow

Remove expired forward compatibility horizons

 def __init__(self,
           f=self._map_func.function,
           deterministic=deterministic_string,
           **self._flat_structure)
-    else:

Given the above, this should be retained.

lgeiger

comment created time in 2 months

Pull request review comment tensorflow/tensorflow

Remove expired forward compatibility horizons

 def testReducedIndices(self):
           np.array([3, 2], dtype=dtype),
           expected=np.array(59, dtype=dtype))
 
-      with compat.forward_compatibility_horizon(2019, 10, 19):
-        self._testBinary(
-            lambda x, y: special_math_ops.einsum('ij,j->', x, y),
-            np.array([[1, 3], [2, 5], [6, 8]], dtype=dtype),
-            np.array([3, 2], dtype=dtype),
-            expected=np.array(59, dtype=dtype))
-
   def testUnary(self):
     for dtype in self.float_types:
       self._testUnary(
           lambda x: special_math_ops.einsum('ijk->kji', x),
           np.array([[[1, 3], [2, 5], [6, 8]]], dtype=dtype),
           expected=np.array([[[1], [2], [6]], [[3], [5], [8]]], dtype=dtype))
 
-      with compat.forward_compatibility_horizon(2019, 10, 19):

same as above.

lgeiger

comment created time in 2 months

Pull request review comment tensorflow/tensorflow

Remove expired forward compatibility horizons

 def __init__(self,
           f=self._map_func.function,
           deterministic=deterministic_string,
           **self._flat_structure)
-    elif deterministic is not None or compat.forward_compatible(2020, 2, 20):
+    else:

You should probably retain elif deterministic is not None:

lgeiger

comment created time in 2 months

Pull request review comment tensorflow/tensorflow

Remove expired forward compatibility horizons

 def testReducedIndices(self):
           np.array([3, 2], dtype=dtype),
           expected=np.array(59, dtype=dtype))
 
-      with compat.forward_compatibility_horizon(2019, 10, 19):

same as above

lgeiger

comment created time in 2 months

Pull request review comment tensorflow/tensorflow

Remove expired forward compatibility horizons

 def testMatMul(self):
           np.array([[8]], dtype=dtype),
           expected=np.array([[-2]], dtype=dtype))
 
-      with compat.forward_compatibility_horizon(2019, 10, 19):

You should probably just delete this line and leave the self._testBinary there?

lgeiger

comment created time in 2 months
