Frank Chen frankchn @google @tensorflow Mountain View, CA

frankchn/cs143 10

CS143 Decaf Project

frankchn/cs224n 4

CS224N Programming Assignments

abi/Our-Tree 3

Distributed R+ Trees. No, Native Client MapReduce! No, SHA-3 GPU implementation. But no! CPU/GPU RC4/AES implementation!

frankchn/cs108 2

CS108 Quiz Website

frankchn/introduction-to-tensorflow 2

Sample code for Introduction to TensorFlow talks I give.

frankchn/cs145 1

CS145 Cheatsheet

frankchn/cs147 1

CS147 Project for Fall 2011

frankchn/cs229 1

CS 229 Project Report

pull request comment tensorflow/community

RFC: tf.data Snapshot

Thanks @markemus! The implementation of Snapshot has something similar, but in C++ for higher performance. Save and load is something we are actively considering for tf.data, and we are thinking through ways of saving and loading the dtypes safely and efficiently too.

frankchn

comment created time in an hour

pull request comment tensorflow/tensorflow

tfconfig_cluster_resolver: Fix task index override

Hmm, that doesn't really look related. @hakos, can you re-sync, and @rthadur, can we reimport the PR and try again?

hakos

comment created time in 5 days

pull request comment tensorflow/tensorflow

tfconfig_cluster_resolver: Fix task index override

The merge process is automatic once the internal tests pass -- should be in within a day or so.

hakos

comment created time in 11 days

pull request comment tensorflow/tensorflow

Fix example speech_command: v2 compatibility

Thanks for doing this!

foldl

comment created time in 12 days

issue comment tensorflow/tpu

Error in running CRNN on TPU

The error is because CTCLoss is not supported on TPUs:

CTCLoss (No registered 'CTCLoss' OpKernel for XLA_TPU_JIT devices compatible with node {{node model_1/lambda_1/CTCLoss}}.
  Registered:  device='CPU'; T in [DT_FLOAT]
               device='CPU'; T in [DT_DOUBLE]

The documentation for CTCLoss states that "only dense padded labels are supported", so you might want to check your code to see if that is what you are doing.
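For reference, here is a rough sketch of what dense padded labels look like when calling `tf.nn.ctc_loss` in TF 2.x. The shapes and values below are made up purely for illustration and are not taken from your model:

```python
import tensorflow as tf

# Illustrative sizes only.
batch_size, max_label_len, time_steps, num_classes = 4, 10, 50, 30

# Dense, zero-padded labels of shape [batch, max_label_len] (the form the
# documentation says is supported), rather than a tf.SparseTensor.
labels = tf.zeros([batch_size, max_label_len], dtype=tf.int32)
logits = tf.zeros([time_steps, batch_size, num_classes], dtype=tf.float32)
label_length = tf.fill([batch_size], max_label_len)
logit_length = tf.fill([batch_size], time_steps)

loss = tf.nn.ctc_loss(
    labels=labels,
    logits=logits,
    label_length=label_length,
    logit_length=logit_length,
    logits_time_major=True,
    blank_index=-1)
```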

puneetjindal

comment created time in 12 days

issue comment tensorflow/tensorflow

[TPU Colab] [TF2.1] Many issues only on TPU due to data types, batch sizes and memory (gist provided with explanations)

Thanks for reporting the issues.

The batch size and dtype limitations are known.

Practically speaking, the TPU systems have significant amounts of host memory, so converting INT8 to INT32 data types manually should not make it OOM unless the batch and data sizes are truly enormous.
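If it helps, one way to do the conversion is inside the tf.data pipeline itself, so the INT32 copy only exists for the elements currently in flight. A rough sketch (the dataset below is a placeholder, not your actual input):

```python
import tensorflow as tf

# Placeholder int8 dataset standing in for the real input.
dataset = tf.data.Dataset.from_tensor_slices(tf.zeros([8, 128], dtype=tf.int8))

# Cast element-by-element instead of materializing a full int32 copy up front.
dataset = dataset.map(lambda x: tf.cast(x, tf.int32),
                      num_parallel_calls=tf.data.experimental.AUTOTUNE)
```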

Assigning the rest of this to the current TPU oncall @hongjunChoi.

MichaelKarpe

comment created time in 12 days

push event frankchn/community

Frank Chen

commit sha 41396e8ce17b89d712a8bbda51c009ff8ca8475c

Update status to accepted.

view details

push time in 15 days

pull request comment tensorflow/tensorflow

Enhance SlurmClusterResolver

@Flamefire Thanks for filing the bug! I would prefer that the current changes you have go in first (they are undergoing internal review now), and let's see how #36550 shakes out, since I am not that familiar with the current usages there.

Flamefire

comment created time in 15 days

pull request comment tensorflow/tensorflow

Enhance SlurmClusterResolver

@Flamefire You may have to run the API compatibility test/updater at https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/api/tests/api_compatibility_test.py#L23

Flamefire

comment created time in 19 days

Pull request review comment tensorflow/tensorflow

WIP Enhance SlurmClusterResolver

  class SlurmClusterResolverTest(test.TestCase):
 
-  def mock_resolve_hostnames_output(self):
-    return ['t02n13', 't02n41', 't02n43', 't02n44']
+  def test_expand_hostlist(self):
+    from tensorflow.python.distribute.cluster_resolver.slurm_cluster_resolver import expand_hostlist

Just add the import at the top.

Flamefire

comment created time in 20 days

Pull request review comment tensorflow/tensorflow

WIP Enhance SlurmClusterResolver

-# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+# Copyright 2018-2020 The TensorFlow Authors. All Rights Reserved.

Change to Copyright 2020 The TensorFlow Authors.

Flamefire

comment created time in 20 days

Pull request review comment tensorflow/tensorflow

WIP Enhance SlurmClusterResolver

  class SlurmClusterResolverTest(test.TestCase):
 
-  def mock_resolve_hostnames_output(self):
-    return ['t02n13', 't02n41', 't02n43', 't02n44']
+  def test_expand_hostlist(self):
+    from tensorflow.python.distribute.cluster_resolver.slurm_cluster_resolver import expand_hostlist
+    self.assertEqual(expand_hostlist('n1'), ['n1'])
+    self.assertEqual(expand_hostlist('n[1,3]'), ['n1', 'n3'])
+    self.assertEqual(expand_hostlist('n[1-3]'), ['n1', 'n2', 'n3'])
+    self.assertEqual(expand_hostlist('n[1-2],m5,o[3-4,6,7-9]'),
+                     ['n1', 'n2', 'm5', 'o3', 'o4', 'o6', 'o7', 'o8', 'o9'])
+
+  def test_expand_tasks_per_node(self):
+    from tensorflow.python.distribute.cluster_resolver.slurm_cluster_resolver import expand_tasks_per_node

As above.

Flamefire

comment created time in 20 days

Pull request review comment tensorflow/tensorflow

WIP Enhance SlurmClusterResolver

 import collections
 import os
 import subprocess
+import re
 
 from tensorflow.python.distribute.cluster_resolver.cluster_resolver import ClusterResolver
 from tensorflow.python.distribute.cluster_resolver.cluster_resolver import format_master_url
 from tensorflow.python.training.server_lib import ClusterSpec
 from tensorflow.python.util.tf_export import tf_export
 
+def expand_hostlist(hostlist):
+  """Create a list of hosts out of a SLURM hostlist
+
+  The order of nodes is preserved and no deduplication is done
+  Input: 'n[1-2],m5,o[3-4,6,7-9]')
+  Output: ['n1', 'n2', 'm5', 'o3', 'o4', 'o6', 'o7', 'o8', 'o9']
+  """
+  def split_hostlist(hostlist):
+    """Split hostlist at commas outside of range expressions ('[3-5]')"""
+    in_brackets = False
+    cur_host = ''
+    for c in hostlist:
+      if in_brackets:
+        assert c != '['
+        if c == ']':
+          in_brackets = False
+      elif c == '[':
+        in_brackets = True
+      elif c == ',':
+        assert cur_host != ''
+        yield cur_host
+        cur_host = ''
+        continue
+      cur_host += c
+    if cur_host:
+      yield cur_host
+
+  def expand_range_expression(range_exp):
+    """Expand a range expression like '3-5' to values 3,4,5"""
+    for part in range_exp.split(','):
+      sub_range = part.split('-')
+      if len(sub_range) == 1:
+        sub_range = sub_range * 2
+      else:
+        assert len(sub_range) == 2
+      for i in range(int(sub_range[0]), int(sub_range[1]) + 1):
+        yield i
+
+  hosts = []
+  try:
+    for part in split_hostlist(hostlist):
+      # Match prefix (anything but a range expression) and range expression
+      # Both are optional
+      m = re.match(r'([^,[\]]*)(\[([^\]]+)\])?$', part)
+      if m is None:
+        raise ValueError('Invalid part: %s' % part)
+      prefix = m.group(1) or ''
+      if m.group(3) is None:
+        hosts.append(prefix)
+      else:
+        hosts.extend(prefix + str(i)
+                     for i in expand_range_expression(m.group(3))
+        )

Just leave this on the previous line with the other closing brackets.
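That is, roughly:

```python
hosts.extend(prefix + str(i)
             for i in expand_range_expression(m.group(3)))
```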

Flamefire

comment created time in 20 days

pull request comment tensorflow/tensorflow

WIP Enhance SlurmClusterResolver

Thanks for doing this @Flamefire! I've added some minor comments, and we should be good to go once those are fixed. If the internal linter catches anything else, I'll fix it then.

Flamefire

comment created time in 20 days

pull request comment tensorflow/tensorflow

Fix go proto handling

@JWMHayhurst can you file a bug with what you are running and the error message?

jhseu

comment created time in 20 days

pull request comment tensorflow/tensorflow

WIP Enhance SlurmClusterResolver

@Flamefire The current preferred usage is in the unittest style with assertEqual, assertNotEqual, etc...
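For example, something along these lines (a sketch only, assuming `expand_hostlist` lives in `slurm_cluster_resolver` as proposed in this PR):

```python
from tensorflow.python.distribute.cluster_resolver.slurm_cluster_resolver import expand_hostlist
from tensorflow.python.platform import test


class SlurmClusterResolverTest(test.TestCase):

  def test_expand_hostlist(self):
    # unittest-style assertions rather than bare asserts.
    self.assertEqual(expand_hostlist('n[1-3]'), ['n1', 'n2', 'n3'])
    self.assertNotEqual(expand_hostlist('n1'), [])


if __name__ == '__main__':
  test.main()
```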

Flamefire

comment created time in 22 days

pull request comment tensorflow/community

RFC: tf.data Snapshot

Hi everyone,

Thank you for commenting on this design review. We've recently completed an internal round of design review meetings for this feature, and we have taken into account many of the comments and have revised the design document.

Specifically, the largest changes we have made are (1) consolidating the various options into user-specifiable reader_fn and writer_fn functions, which let users customize how the dataset is read and written, and (2) removing snapshot_name and snapshot_mode, and instead implementing with_snapshot_fingerprint for users to override fingerprinting on parts of their input pipeline graph.
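As a rough sketch of the revised usage (names as currently drafted in the RFC and still subject to change; `my_preprocessing_fn`, `my_reader_fn`, and `my_writer_fn` are placeholders):

```python
dataset = tf.data.Dataset.list_files("/raw/data/*")
dataset = dataset.map(my_preprocessing_fn)

# reader_fn / writer_fn control how snapshot splits are read back and written out.
dataset = dataset.apply(
    tf.data.snapshot("/saved/data",
                     reader_fn=my_reader_fn,
                     writer_fn=my_writer_fn))

# Pin the fingerprint of the preceding pipeline when it varies insignificantly
# from run to run (e.g. due to unordered Python sets).
dataset = tf.data.with_snapshot_fingerprint(dataset, fingerprint="my_fixed_fp")
```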

We appreciate all your comments and are looking forward to implementing this feature soon!

Thanks,

Frank

frankchn

comment created time in a month

push event frankchn/community

Frank Chen

commit sha 9a13bd427158d7da0b4d6e4c3300b00842eed5c5

Update set_snapshot_fingerprint to with_snapshot_fingerprint

view details

push time in a month

push event frankchn/community

Frank Chen

commit sha 19f3a62cecbcf1c82a9dfbfc540e046aa9320505

Change fingerprint function to set_snapshot_fingerprint

view details

push time in a month

Pull request review comment tensorflow/community

RFC: tf.data Snapshot

+# tf.data Snapshot++| Status        | Proposed                                                |+| :------------ | :------------------------------------------------------ |+| **RFC #**     | [193](https://github.com/tensorflow/community/pull/193) |+| **Author(s)** | Frank Chen (frankchn@google.com), Rohan Jain            |+|               | (rohanj@google.com)                                     |+| **Sponsor**   | Jiri Simsa (jsimsa@google.com)                          |+| **Updated**   | 2020-01-07                                              |++## Objective++With ever faster accelerators available in Cloud and hyperparameter tuning+consuming larger chunks of accelerator time, TensorFlow users are increasingly+finding that they don’t have enough CPU resources to keep up with these+accelerators, leaving valuable accelerator resources idle.++To alleviate this problem, we are proposing a `snapshot` API within `tf.data`,+to allow users to transparently persist the output of their preprocessing+pipeline to disk, and materialize the pre-processed data on a different training+run.++This API enables repeated preprocessing steps to be consolidated, and allowing+re-use of already processed data, trading off disk storage and network bandwidth+for freeing up more valuable CPU resources and accelerator compute time.++## Motivation++Large TensorFlow users have indicated that they have complicated input+processing pipelines which saturate their CPUs before saturating their+accelerators (TPUs in particular). Since they often experiment with+hyperparameter tuning or tweaks to existing model without affecting their input+pipeline, they are asking for ways to avoid similar repeated preprocessing of+data by either saving a dataset or caching it to disk.++## User Benefit++Users will be able to transparently persist partially or fully processed data+from `tf.data` input pipelines to disk or Cloud storage systems, and materialize+the pre-processed data during subsequent runs from the same pipeline. This will+cut down on the input pipeline processing overheads during second and subsequent+runs.++## Design Proposal++We propose that we add a new `snapshot` transformation to tf.data. To illustrate+the usage of the transformation, we can start with some sample code:++```python+dataset = Dataset.list_files("/raw/data/*").shard(num_workers, i)+dataset = dataset.parallel_interleave(TFRecordDataset)+dataset = dataset.map(my_preprocessing_fn)+dataset = dataset.apply(tf.data.snapshot("/saved/data", options...))+dataset = dataset.repeat()++model = ...+model.fit(dataset)+```++As we can see, the end user simply has to add this transformation in order to+use this functionality. In essence, the transformation is similar to the+existing `tf.data.Dataset.cache`, with the key difference is being that, unlike+`cache`, `snapshot` is intended to re-used across different executions of the+same input pipelines.++### Proposed API++We are proposing the following API for the snapshot transformation.++```python+def snapshot(path,+             compression=None,+             reader_fn=None,+             writer_fn=None,+             pending_snapshot_expiry_seconds=None):+  pass  # Implementation goes here.+```++1.  `path`: Required. A directory where we want to save our snapshots and/or+    read from a previously saved snapshot.++1.  `compression`: Optional. The type of compression to apply to the snapshot+    written to disk. This will support `GZIP`, `SNAPPY` or None. Defaults to+    AUTO.++1.  `reader_fn`: Optional. 
The input pipeline transformation specified by +    `reader_fn` is executed when the snapshot detects that there is an existing, +    valid snapshot available.++    `reader_fn` is a user specified function that accepts a single argument: +    (1) a Dataset of Datasets, each representing a "splits" of elements of the +    original dataset. The cardinality of the input dataset matches the +    cardinality of the output of `writer_fn` (see below). The function should +    return a Dataset of elements of the original dataset.++    A default `reader_fn` will look like the following:++    ```python+    def default_reader_fn(datasets):+      # shuffle the datasets splits+      datasets = datasets.shuffle(NUM_DATASETS)+      # read datasets in parallel and interleave their elements+      return dataset.interleave(lambda x: x, num_parallel_calls=AUTOTUNE)+    ```++1.  `writer_fn`: Optional. The input pipeline specified by `writer_fn` is +    executed when the snapshot op detects that there are no valid snapshots+    and no other threads are currently attempting to write a snapshot.++    `writer_fn` is a user specified function that accepts a single argument: +    (1) a Dataset of elements to be written out. The function should return+    a Dataset of Datasets, each representing "splits" of elements of the+    original dataset. The tf.data snapshot implementation will then persist+    splits in parallel.++    A default writer_fn will look like the following:++    ```python+    def default_writer_fn(dataset):+      # add a component with element index+      dataset = dataset.enumerate()+      # split input dataset in a round-robin fashion+      return dataset.split(num_splits=NUM_CORES, key_fn=lambda i, _: i % NUM_CORE+    ```++1.  `pending_snapshot_expiry_seconds`: Optional. How long to wait (in seconds)+    before the snapshot op considers a previously unfinished snapshot to be+    stale and starts writing a snapshot from scratch again. Defaults to 86400+    seconds (1 day).++#### Achieving Parallelism++`reader_fn` and `writer_fn` will default to passing the dataset through unchanged+by default. In other words, the default implementation will result in +single-threaded reads and writes on snapshots. Parallelism can be achieved in+`writer_fn` by splitting up the dataset into multiple datasets, and using+`num_parallel_calls` in the `interleave` function of the `reader_fn`.++#### Computing Graph Fingerprints++Snapshot attempts to determine whether a run of an input pipeline is the same+as a previous run by computing the fingerprint of the nodes within the pipeline.++However, some input pipelines might vary in insignificant ways from run to run+that causes the fingerprinting of them to differ. For instance, consider the+following preprocessing function:++```python+features_to_multiply = {"feature1", "feature2", "feature3", "feature4"}++def preprocessing_fn(value):+  keys_to_features = {+    "feature1": tf.FixedLenFeature([], tf.float32, 0.0),+    "feature2": tf.FixedLenFeature([], tf.float32, 0.0),+    "feature3": tf.FixedLenFeature([], tf.float32, 0.0),+    "feature4": tf.FixedLenFeature([], tf.float32, 0.0)+  }++  parsed = tf.parse_single_example(value, keys_to_features)+  combined_feature = 1.0+  for item in features_to_multiply:+    combined_feature *= parsed[item]++  return combined_feature++dataset = ...+dataset = dataset.map(preprocessing_fn)+```++In the above example, our `features_to_multiply` variable uses a `set`, which is +not guaranteed to be ordered in Python 2. 
When we iterate over the set in the +for loop within `preprocessing_fn`, we may get a different graph on each +run (i.e. one run could have us multiplying `feature2` first, then `feature4`, +etc..., while another run may have us multiplying `feature1`, then `feature3`, +and so on).++In cases like these, we can ask fingerprinting to use a fixed value for the+fingerprint of the map function with a new `set_fingerprint`+transformation, which asks the fingerprinting function to not compute the +fingerprint of the previous node but to use a user-specified value instead:++```python+dataset = ...+dataset = dataset.map(preprocessing_fn) +dataset = tf.data.set_fingerprint(dataset, fingerprint="my_fixed_fp")

Changing.

frankchn

comment created time in a month

push event frankchn/community

Frank Chen

commit sha 0b28765eb58e03eeed229c1cca8b6309abeee184

Revision to the API after TensorFlow Design Review

view details

push time in a month

pull request comment tensorflow/tensorflow

WIP Enhance SlurmClusterResolver

Thanks for your comments @steven-chien! It would be good to have documentation written to cover the standard Slurm use cases for this.

Flamefire

comment created time in a month

Pull request review comment tensorflow/tensorflow

WIP Enhance SlurmClusterResolver

 from tensorflow.python.training.server_lib import ClusterSpec
 from tensorflow.python.util.tf_export import tf_export
 
+from hostlist import expand_hostlist, parse_slurm_tasks_per_node

I am not a lawyer, and I am guessing it depends on the interpretation of what "linking" is under GPL, but if calling a function is considered linking, then we may want to avoid this if possible.

https://www.apache.org/licenses/GPL-compatibility.html

Flamefire

comment created time in a month

Pull request review comment tensorflow/tensorflow

WIP Enhance SlurmClusterResolver

 from tensorflow.python.training.server_lib import ClusterSpec
 from tensorflow.python.util.tf_export import tf_export
 
+from hostlist import expand_hostlist, parse_slurm_tasks_per_node
+
+
+def get_slurm_step_var(name):
+  """Get the SLURM step variable from the environment
+
+  Args:
+    name: Name of the step variable
+  Returns:
+    SLURM_STEP_<name> from os.environ
+  Raises:
+    RuntimeError if variable is not found
+  """
+  name = 'SLURM_STEP_' + name
+  try:
+    return os.environ[name]
+  except KeyError:
+    raise RuntimeError('%s not found in environment. Not running inside a SLURM step?' % name)

Imports are excluded from the line-length requirement. Here is the standard Google Python style guide, which we follow: http://google.github.io/styleguide/pyguide.html; it also links to a formatter and a vim settings file.

Flamefire

comment created time in a month

pull request comment tensorflow/tensorflow

WIP Enhance SlurmClusterResolver

cc @steven-chien

Flamefire

comment created time in a month

Pull request review comment tensorflow/tensorflow

WIP Enhance SlurmClusterResolver

 from tensorflow.python.training.server_lib import ClusterSpec
 from tensorflow.python.util.tf_export import tf_export
 
+from hostlist import expand_hostlist, parse_slurm_tasks_per_node
+
+
+def get_slurm_step_var(name):
+  """Get the SLURM step variable from the environment
+
+  Args:
+    name: Name of the step variable
+  Returns:
+    SLURM_STEP_<name> from os.environ
+  Raises:
+    RuntimeError if variable is not found
+  """
+  name = 'SLURM_STEP_' + name
+  try:
+    return os.environ[name]
+  except KeyError:
+    raise RuntimeError('%s not found in environment. Not running inside a SLURM step?' % name)

Python max line length is 80.

Flamefire

comment created time in a month

Pull request review comment tensorflow/tensorflow

WIP Enhance SlurmClusterResolver

 from tensorflow.python.training.server_lib import ClusterSpec
 from tensorflow.python.util.tf_export import tf_export
 
+from hostlist import expand_hostlist, parse_slurm_tasks_per_node

Can we reimplement expand_hostlist and parse_slurm_tasks_per_node here? Given how large TensorFlow already is, we are trying not to include new packages unless absolutely essential.

I note that hostlist is GPL-licensed, so it is not compatible with TensorFlow's Apache license.

Flamefire

comment created time in a month

Pull request review comment tensorflow/tensorflow

WIP Enhance SlurmClusterResolver

 def __init__(self,
       num_tasks = int(os.environ['OMPI_COMM_WORLD_SIZE'])
     else:
       self._rank = int(os.environ['SLURM_PROCID'])
-      num_tasks = int(os.environ['SLURM_NTASKS'])
+      num_tasks = get_num_slurm_tasks()
+
+    if jobs is None:
+      jobs = {'worker': num_tasks}
+    if gpus_per_node is None:
+      gpus_per_node = sum(l.startswith('GPU ') for l in
+                          subprocess.check_output(['nvidia-smi', '--list-gpus']).

Can we abstract this out into another function? The function should also throw an exception if, for instance, nvidia-smi is not found.
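Something along these lines, perhaps (a sketch only; the function name and error type are placeholders):

```python
import subprocess


def _get_gpus_per_node():
  """Count GPUs on this node via `nvidia-smi --list-gpus`.

  Raises RuntimeError if nvidia-smi is missing or fails, rather than letting
  the resolver crash with an unrelated-looking exception.
  """
  try:
    output = subprocess.check_output(['nvidia-smi', '--list-gpus'])
  except (OSError, subprocess.CalledProcessError) as e:
    raise RuntimeError('Could not run nvidia-smi to determine the number of '
                       'GPUs per node: %s' % e)
  return sum(l.startswith(b'GPU ') for l in output.splitlines())
```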

Flamefire

comment created time in a month

issue comment tensorflow/tensorflow

SlurmClusterResolver should use env variables of job step and return correct num_accelerators

@Flamefire, thanks for the bug report. I am happy to review your PR, so just create one and assign it to me. Thanks!

Flamefire

comment created time in a month

Pull request review comment tensorflow/community

RFC: tf.data Snapshot

+# tf.data Snapshot++| Status        | Proposed                                                |+| :------------ | :------------------------------------------------------ |+| **RFC #**     | [193](https://github.com/tensorflow/community/pull/193) |+| **Author(s)** | Frank Chen (frankchn@google.com), Rohan Jain            |+|               | (rohanj@google.com)                                     |+| **Sponsor**   | Jiri Simsa (jsimsa@google.com)                          |+| **Updated**   | 2020-01-07                                              |++## Objective++With ever faster accelerators available in Cloud and hyperparameter tuning+consuming larger chunks of accelerator time, TensorFlow users are increasingly+finding that they don’t have enough CPU resources to keep up with these+accelerators, leaving valuable accelerator resources idle.++To alleviate this problem, we are proposing a `snapshot` API within `tf.data`,+to allow users to transparently persist the output of their preprocessing+pipeline to disk, and materialize the pre-processed data on a different training+run.++This API enables repeated preprocessing steps to be consolidated, and allowing+re-use of already processed data, trading off disk storage and network bandwidth+for freeing up more valuable CPU resources and accelerator compute time.++## Motivation++Large TensorFlow users have indicated that they have complicated input+processing pipelines which saturate their CPUs before saturating their+accelerators (TPUs in particular). Since they often experiment with+hyperparameter tuning or tweaks to existing model without affecting their input+pipeline, they are asking for ways to avoid similar repeated preprocessing of+data by either saving a dataset or caching it to disk.++## User Benefit++Users will be able to transparently persist partially or fully processed data+from `tf.data` input pipelines to disk or Cloud storage systems, and materialize+the pre-processed data during subsequent runs from the same pipeline. This will+cut down on the input pipeline processing overheads during second and subsequent+runs.++## Design Proposal++We propose that we add a new `snapshot` transformation to tf.data. To illustrate+the usage of the transformation, we can start with some sample code:++```python+dataset = Dataset.list_files("/raw/data/*").shard(num_workers, i)+dataset = dataset.parallel_interleave(TFRecordDataset)+dataset = dataset.map(my_preprocessing_fn)+dataset = dataset.apply(tf.data.snapshot("/saved/data", options...))+dataset = dataset.repeat()++model = ...+model.fit(dataset)+```++As we can see, the end user simply has to add this transformation in order to+use this functionality. In essence, the transformation is similar to the+existing `tf.data.Dataset.cache`, with the key difference is being that, unlike+`cache`, `snapshot` is intended to re-used across different executions of the+same input pipelines.++### Proposed API++We are proposing the following API for the snapshot transformation.++```python+def snapshot(path,+             compression=None,+             reader_path_prefix=None,+             writer_path_prefix=None,+             shard_size_bytes=None,+             pending_snapshot_expiry_seconds=None,+             num_reader_threads=None,+             reader_buffer_size=None,+             num_writer_threads=None,+             writer_buffer_size=None,+             shuffle_on_read=None,+             shuffle_seed=None,+             mode=None,+             snapshot_name=None):+  pass  # Implementation goes here.+```++1.  
`path`: Required. A directory where we want to save our snapshots and/or+    read from a previously saved snapshot.++2.  `compression`: Optional. The type of compression to apply to the snapshot+    written to disk. This will support `GZIP`, `SNAPPY` or None. Defaults to+    None.++3.  `reader_path_prefix`: Optional. A prefix to add to the path when reading+    from snapshots. This is useful for filesystems where configuration is passed+    in through the path. Defaults to None.++4.  `writer_path_prefix`: Optional. A prefix to add to the path when writing to+    snapshots. This is useful for filesystems where configuration is passed in+    through the path. Defaults to None.++5.  `shard_size_bytes`: Optional. The maximum size of each data file to be+    written by the snapshot dataset op. Defaults to 10 GiB.++6.  `pending_snapshot_expiry_seconds`: Optional. How long to wait (in seconds)+    before the snapshot op considers a previously unfinished snapshot to be+    stale and starts writing a snapshot from scratch again. Defaults to 86400+    seconds (1 day).++7.  `num_reader_threads`: Optional. Number of threads to parallelize reading+    from snapshot. Especially useful if compression is turned on since the+    decompression operation tends to be intensive. Defaults to 1. If > 1, then+    this might introduce non-determinism i.e. the order in which the elements+    are read from the snapshot are different from the order they're written.++8.  `reader_buffer_size`: Optional. Maximum number of elements we can prefetch+    reading from the snapshot. Defaults to 1. Increasing this might improve+    performance but will increase memory consumption.++9.  `num_writer_threads`: Optional. Number of threads to parallelize writing+    from snapshot. We'll open up `num_writer_threads` files and write to them in+    parallel. Especially useful if compression is turned on since the+    compression operation tends to be intensive. Defaults to 1. If > 1, then+    this might introduce non-determinism i.e. the order in which the elements+    are read from the upstream iterator are different from the order they're+    written.++10. `writer_buffer_size`: Optional. Maximum number of pipeline elements to fill+    up the buffer before writing them out using `num_writer_threads`.++11. `shuffle_on_read`: Optional. If this is True, then the order in which+    examples are produced when reading from a snapshot will be random. Defaults+    to False.++12. `shuffle_seed`: Optional. If shuffle_seed is set, the random number+    generator used for shuffling (when `shuffle_on_read` is turned on) is seeded+    by the given seed. Otherwise, it is seeded by a random seed that differs for+    every run.++13. `mode`: Optional. The mode at which snapshot should operate. Valid options+    are `auto`, `read`, `write`, and `passthrough`. The default mode is `auto`,+    where the snapshot op will automatically determine what mode to operate in.++    1.  `write` mode forces the snapshot transformation to write a new+        materialization to disk, regardless of whether a complete and valid+        materialization currently exists. In other words, we enter the **WRITE**+        state immediately.++    2.  `read` mode forces the snapshot transformation to read from the latest+        version of the materialization on disk, regardless of whether the data+        stored on disk is complete and valid. In other words, we enter the+        **READ** state immediately.++    3.  
`passthrough` mode turns the snapshot transformation into a no-op. In+        other words, we enter the **PASSTHROUGH** state immediately.++    4.  `auto` retains the default behavior of snapshot. See the "Standard+        Kernel Workflow" section for the default behavior.++14. `snapshot_name`: Optional. If set, use the supplied string as a named+    snapshot name instead of introspecting the data pipeline and automatically+    generating a unique identifier for the specific data pipeline.++    1.  Instead of generating a new fingerprint of the input processing graph or+        and `run_id` (see the _Detailed Design_ section for details), we will+        use the `snapshot_name` to uniquely identify the snapshot.++### External API Guarantees++Externally, we guarantee that snapshots written by a particular version of+TensorFlow will be readable by that specific version of TensorFlow. Eventually,+we can also guarantee that snapshots written will be readable by all future+versions of TensorFlow.++We are not currently handling the case where workers do not go through the+entire training set at least once.++### Alternatives Considered++An alternative proposal for an API would be `save()` and `load()`, where the+saving and loading of the input pipeline would be made more explicit, avoiding+some of the logic needed in determining whether to snapshot or read from a+snapshot of a model.++The downside here would be that the user would have to split the preprocessing+and training into potentially different files, and users would be forced to+select whether to train or preprocess on their own, which is not good.++### Performance Implications++*   Do you expect any (speed / memory)? How will you confirm?+*   There should be microbenchmarks. Are there?+*   There should be end-to-end tests and benchmarks. If there are not (since+    this is still a design), how will you track that these will be created?++### Dependencies++No new dependencies will be introduced as part of this project to TensorFlow.+Dependent projects may be able to use this additional op, but there should be no+significant changes otherwise.++### Engineering Impact++Binary sizes increases slightly with the inclusion of this new op, and this code+will be maintained by the `tf.data` team.++### Platforms and Environments++This op will work on all TensorFlow-supported platforms. We do not anticipate+this to work on embedded systems as it is not useful in resource-constrained+environments.++### Best Practices, Tutorials and Examples++A user guide for snapshot will be published to guide new users in using this+feature.++### Compatibility++This introduces a new op, which will impact future backwards compatibility.++### User Impact++A new python function and a new op are the only user-facing changes visible.++## Detailed Design++### Implementation Assumptions++The following implementation is based on the following assumptions that define+the MVP this is designed for:++1.  We assume that at least for one pipeline run, you can go through the entire+    training dataset and be able to store that data on disk. Otherwise, a+    snapshot will never get created.++2.  In case there are multiple workers and the dataset is sharded across+    workers, we assume that the number of workers remains the same from one run+    to another. If the number changes, we’ll trigger another snapshot.++3.  
Any `repeat`s in the dataset should be moved to after the `snapshot` op, to+    avoid writing large (or infinite) amounts of data during a snapshot writing+    run.++### New `SnapshotDatasetOp`++To implement the transformation, we are introducing a new `SnapshotDatasetOp`+dataset kernel that will implement all of the functionality in TensorFlow C++.+Python code is mostly glue code to pass relevant parameters into the op kernel.++### Internal Directory / File Structure++Given a user directory path (e.g. `/path/to/snapshot`), the directory will look+like:++*   /path/to/snapshot+    *   `fingerprint`/+        *   snapshot.metadata+        *   `run-id`/+            *   0000000.snapshot+            *   0000001.snapshot++The `fingerprint` is a hash of the input processing graph. The `run-id` is+unique training run ID generated.

@karmel -- We are going to re-investigate TFRecordReader performance and see if we can bring that up to equivalency with our implementation.

frankchn

comment created time in a month

push event frankchn/community

Frank Chen

commit sha 149a6f9532900224bfb951e5a70a8e75d2e6ecb9

Rename SnapshotDataset to SnapshotReaderDataset

view details

push time in a month

Pull request review comment tensorflow/community

RFC: tf.data Snapshot

+# tf.data Snapshot++| Status        | Proposed                                                |+| :------------ | :------------------------------------------------------ |+| **RFC #**     | [193](https://github.com/tensorflow/community/pull/193) |+| **Author(s)** | Frank Chen (frankchn@google.com), Rohan Jain            |+|               | (rohanj@google.com)                                     |+| **Sponsor**   | Jiri Simsa (jsimsa@google.com)                          |+| **Updated**   | 2020-01-07                                              |++## Objective++With ever faster accelerators available in Cloud and hyperparameter tuning+consuming larger chunks of accelerator time, TensorFlow users are increasingly+finding that they don’t have enough CPU resources to keep up with these+accelerators, leaving valuable accelerator resources idle.++To alleviate this problem, we are proposing a `snapshot` API within `tf.data`,+to allow users to transparently persist the output of their preprocessing+pipeline to disk, and materialize the pre-processed data on a different training+run.++This API enables repeated preprocessing steps to be consolidated, and allowing+re-use of already processed data, trading off disk storage and network bandwidth+for freeing up more valuable CPU resources and accelerator compute time.++## Motivation++Large TensorFlow users have indicated that they have complicated input+processing pipelines which saturate their CPUs before saturating their+accelerators (TPUs in particular). Since they often experiment with+hyperparameter tuning or tweaks to existing model without affecting their input+pipeline, they are asking for ways to avoid similar repeated preprocessing of+data by either saving a dataset or caching it to disk.++## User Benefit++Users will be able to transparently persist partially or fully processed data+from `tf.data` input pipelines to disk or Cloud storage systems, and materialize+the pre-processed data during subsequent runs from the same pipeline. This will+cut down on the input pipeline processing overheads during second and subsequent+runs.++## Design Proposal++We propose that we add a new `snapshot` transformation to tf.data. To illustrate+the usage of the transformation, we can start with some sample code:++```python+dataset = Dataset.list_files("/raw/data/*").shard(num_workers, i)+dataset = dataset.parallel_interleave(TFRecordDataset)+dataset = dataset.map(my_preprocessing_fn)+dataset = dataset.apply(tf.data.snapshot("/saved/data", options...))+dataset = dataset.repeat()++model = ...+model.fit(dataset)+```++As we can see, the end user simply has to add this transformation in order to+use this functionality. In essence, the transformation is similar to the+existing `tf.data.Dataset.cache`, with the key difference is being that, unlike+`cache`, `snapshot` is intended to re-used across different executions of the+same input pipelines.++### Proposed API++We are proposing the following API for the snapshot transformation.++```python+def snapshot(path,+             compression=None,+             shard_size_bytes=None,+             pending_snapshot_expiry_seconds=None,+             num_writer_threads=None,+             reader_fn=None,+             mode=None,+             snapshot_name=None):+  pass  # Implementation goes here.+```++1.  `path`: Required. A directory where we want to save our snapshots and/or+    read from a previously saved snapshot.++1.  `compression`: Optional. The type of compression to apply to the snapshot+    written to disk. 
This will support `GZIP`, `SNAPPY` or None. Defaults to+    AUTO.++1.  `shard_size_bytes`: Optional. The maximum size of each data file to be+    written by the snapshot dataset op. Defaults to AUTO.++1.  `pending_snapshot_expiry_seconds`: Optional. How long to wait (in seconds)+    before the snapshot op considers a previously unfinished snapshot to be+    stale and starts writing a snapshot from scratch again. Defaults to 86400+    seconds (1 day).++1.  `num_writer_threads`: Optional. Number of threads to parallelize writing+    from snapshot. We'll open up `num_writer_threads` files and write to them in+    parallel. Especially useful if compression is turned on since the+    compression operation tends to be intensive. If > 1, then+    this might introduce non-determinism i.e. the order in which the elements+    are read from the upstream iterator are different from the order they're+    written. Defaults to AUTO. ++1.  `reader_fn`: Optional. A user provided reader function to use when reading+    the snapshot back. This allows the user to specify the concurrency and+    randomization required when reading from the snapshot.++    `reader_fn` should be a function that accepts two arguments: (1) a list of+    snapshot file paths, and (2) a reference to a `SnapshotDataset` class.+    The function should return a `Dataset` class.++    The `SnapshotDataset` class is a `Dataset` (similar to other source datasets

Good point. Done.

frankchn

comment created time in a month

issue closed tensorflow/tensorflow

tensorflow2.1 can not initialize colab TPU

import os
import tensorflow as tf

if '2.1' in tf.__version__:
    if 'COLAB_TPU_ADDR' in os.environ:
        resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
        tf.config.experimental_connect_to_cluster(resolver)
        tf.tpu.experimental.initialize_tpu_system(resolver)
        strategy = tf.distribute.experimental.TPUStrategy(resolver)
        print('Running on TPU ', resolver.cluster_spec().as_dict())

NotFoundError                             Traceback (most recent call last)
<ipython-input-1-53f26737ac5e> in <module>()
    160         resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
    161         tf.config.experimental_connect_to_cluster(resolver)
--> 162         tf.tpu.experimental.initialize_tpu_system(resolver)
    163         strategy = tf.distribute.experimental.TPUStrategy(resolver)
    164         print('Running on TPU ', resolver.cluster_spec().as_dict())

3 frames
/usr/local/lib/python3.6/dist-packages/six.py in raise_from(value, from_value)

NotFoundError: '__inference__tpu_init_fn_4' is neither a type of a primitive operation nor a name of a function registered in binary running on n-3ea5ef93-w-0. Make sure the operation or function is registered in the binary running in this process.

closed time in a month

ofpppppppdbfjs

issue comment tensorflow/tensorflow

tensorflow2.1 can not initialize colab TPU

Hello! This seems to be a Colab-specific issue (Colab might not have fully rolled out the TensorFlow TPU 2.1 backend yet). Can you follow the instructions at https://github.com/googlecolab/colabtools#contacting-us and file an issue/send feedback to the Colab team?

ofpppppppdbfjs

comment created time in a month

Pull request review comment tensorflow/community

RFC: tf.data Snapshot

+# tf.data Snapshot++| Status        | Proposed                                                |+| :------------ | :------------------------------------------------------ |+| **RFC #**     | [193](https://github.com/tensorflow/community/pull/193) |+| **Author(s)** | Frank Chen (frankchn@google.com), Rohan Jain            |+|               | (rohanj@google.com)                                     |+| **Sponsor**   | Jiri Simsa (jsimsa@google.com)                          |+| **Updated**   | 2020-01-07                                              |++## Objective++With ever faster accelerators available in Cloud and hyperparameter tuning+consuming larger chunks of accelerator time, TensorFlow users are increasingly+finding that they don’t have enough CPU resources to keep up with these+accelerators, leaving valuable accelerator resources idle.++To alleviate this problem, we are proposing a `snapshot` API within `tf.data`,+to allow users to transparently persist the output of their preprocessing+pipeline to disk, and materialize the pre-processed data on a different training+run.++This API enables repeated preprocessing steps to be consolidated, and allowing+re-use of already processed data, trading off disk storage and network bandwidth+for freeing up more valuable CPU resources and accelerator compute time.++## Motivation++Large TensorFlow users have indicated that they have complicated input+processing pipelines which saturate their CPUs before saturating their+accelerators (TPUs in particular). Since they often experiment with+hyperparameter tuning or tweaks to existing model without affecting their input+pipeline, they are asking for ways to avoid similar repeated preprocessing of+data by either saving a dataset or caching it to disk.++## User Benefit++Users will be able to transparently persist partially or fully processed data+from `tf.data` input pipelines to disk or Cloud storage systems, and materialize+the pre-processed data during subsequent runs from the same pipeline. This will+cut down on the input pipeline processing overheads during second and subsequent+runs.++## Design Proposal++We propose that we add a new `snapshot` transformation to tf.data. To illustrate+the usage of the transformation, we can start with some sample code:++```python+dataset = Dataset.list_files("/raw/data/*").shard(num_workers, i)+dataset = dataset.parallel_interleave(TFRecordDataset)+dataset = dataset.map(my_preprocessing_fn)+dataset = dataset.apply(tf.data.snapshot("/saved/data", options...))+dataset = dataset.repeat()++model = ...+model.fit(dataset)+```++As we can see, the end user simply has to add this transformation in order to+use this functionality. In essence, the transformation is similar to the+existing `tf.data.Dataset.cache`, with the key difference is being that, unlike+`cache`, `snapshot` is intended to re-used across different executions of the+same input pipelines.++### Proposed API++We are proposing the following API for the snapshot transformation.++```python+def snapshot(path,+             compression=None,+             reader_path_prefix=None,+             writer_path_prefix=None,+             shard_size_bytes=None,+             pending_snapshot_expiry_seconds=None,+             num_reader_threads=None,+             reader_buffer_size=None,+             num_writer_threads=None,+             writer_buffer_size=None,+             shuffle_on_read=None,+             shuffle_seed=None,+             mode=None,+             snapshot_name=None):+  pass  # Implementation goes here.+```++1.  
`path`: Required. A directory where we want to save our snapshots and/or+    read from a previously saved snapshot.++2.  `compression`: Optional. The type of compression to apply to the snapshot+    written to disk. This will support `GZIP`, `SNAPPY` or None. Defaults to+    None.++3.  `reader_path_prefix`: Optional. A prefix to add to the path when reading+    from snapshots. This is useful for filesystems where configuration is passed+    in through the path. Defaults to None.++4.  `writer_path_prefix`: Optional. A prefix to add to the path when writing to+    snapshots. This is useful for filesystems where configuration is passed in+    through the path. Defaults to None.++5.  `shard_size_bytes`: Optional. The maximum size of each data file to be+    written by the snapshot dataset op. Defaults to 10 GiB.++6.  `pending_snapshot_expiry_seconds`: Optional. How long to wait (in seconds)+    before the snapshot op considers a previously unfinished snapshot to be+    stale and starts writing a snapshot from scratch again. Defaults to 86400+    seconds (1 day).++7.  `num_reader_threads`: Optional. Number of threads to parallelize reading+    from snapshot. Especially useful if compression is turned on since the+    decompression operation tends to be intensive. Defaults to 1. If > 1, then+    this might introduce non-determinism i.e. the order in which the elements+    are read from the snapshot are different from the order they're written.++8.  `reader_buffer_size`: Optional. Maximum number of elements we can prefetch+    reading from the snapshot. Defaults to 1. Increasing this might improve+    performance but will increase memory consumption.++9.  `num_writer_threads`: Optional. Number of threads to parallelize writing+    from snapshot. We'll open up `num_writer_threads` files and write to them in+    parallel. Especially useful if compression is turned on since the+    compression operation tends to be intensive. Defaults to 1. If > 1, then+    this might introduce non-determinism i.e. the order in which the elements+    are read from the upstream iterator are different from the order they're+    written.++10. `writer_buffer_size`: Optional. Maximum number of pipeline elements to fill+    up the buffer before writing them out using `num_writer_threads`.++11. `shuffle_on_read`: Optional. If this is True, then the order in which+    examples are produced when reading from a snapshot will be random. Defaults+    to False.++12. `shuffle_seed`: Optional. If shuffle_seed is set, the random number+    generator used for shuffling (when `shuffle_on_read` is turned on) is seeded+    by the given seed. Otherwise, it is seeded by a random seed that differs for+    every run.++13. `mode`: Optional. The mode at which snapshot should operate. Valid options+    are `auto`, `read`, `write`, and `passthrough`. The default mode is `auto`,+    where the snapshot op will automatically determine what mode to operate in.++    1.  `write` mode forces the snapshot transformation to write a new+        materialization to disk, regardless of whether a complete and valid+        materialization currently exists. In other words, we enter the **WRITE**+        state immediately.++    2.  `read` mode forces the snapshot transformation to read from the latest+        version of the materialization on disk, regardless of whether the data+        stored on disk is complete and valid. In other words, we enter the+        **READ** state immediately.++    3.  
`passthrough` mode turns the snapshot transformation into a no-op. In+        other words, we enter the **PASSTHROUGH** state immediately.++    4.  `auto` retains the default behavior of snapshot. See the "Standard+        Kernel Workflow" section for the default behavior.++14. `snapshot_name`: Optional. If set, use the supplied string as a named+    snapshot name instead of introspecting the data pipeline and automatically+    generating a unique identifier for the specific data pipeline.++    1.  Instead of generating a new fingerprint of the input processing graph or+        and `run_id` (see the _Detailed Design_ section for details), we will+        use the `snapshot_name` to uniquely identify the snapshot.++### External API Guarantees++Externally, we guarantee that snapshots written by a particular version of+TensorFlow will be readable by that specific version of TensorFlow. Eventually,+we can also guarantee that snapshots written will be readable by all future+versions of TensorFlow.++We are not currently handling the case where workers do not go through the+entire training set at least once.++### Alternatives Considered++An alternative proposal for an API would be `save()` and `load()`, where the+saving and loading of the input pipeline would be made more explicit, avoiding+some of the logic needed in determining whether to snapshot or read from a+snapshot of a model.++The downside here would be that the user would have to split the preprocessing+and training into potentially different files, and users would be forced to+select whether to train or preprocess on their own, which is not good.+

@alextp @rohan100jain I have updated the design doc to add a new reader_fn and remove existing reader parameters. PTAL, thanks!

frankchn

comment created time in a month

push event frankchn/community

Frank Chen

commit sha a7a7c5b553cd7d7264d2f5730a368db5f709ad79

Removed miscellaneous reader options and added a reader_fn parameter

view details

push time in a month

Pull request review comment tensorflow/community

RFC: tf.data Snapshot

+# tf.data Snapshot++| Status        | Proposed                                                |+| :------------ | :------------------------------------------------------ |+| **RFC #**     | [193](https://github.com/tensorflow/community/pull/193) |+| **Author(s)** | Frank Chen (frankchn@google.com), Rohan Jain            |+|               | (rohanj@google.com)                                     |+| **Sponsor**   | Jiri Simsa (jsimsa@google.com)                          |+| **Updated**   | 2020-01-07                                              |++## Objective++With ever faster accelerators available in Cloud and hyperparameter tuning+consuming larger chunks of accelerator time, TensorFlow users are increasingly+finding that they don’t have enough CPU resources to keep up with these+accelerators, leaving valuable accelerator resources idle.++To alleviate this problem, we are proposing a `snapshot` API within `tf.data`,+to allow users to transparently persist the output of their preprocessing+pipeline to disk, and materialize the pre-processed data on a different training+run.++This API enables repeated preprocessing steps to be consolidated, and allowing+re-use of already processed data, trading off disk storage and network bandwidth+for freeing up more valuable CPU resources and accelerator compute time.++## Motivation++Large TensorFlow users have indicated that they have complicated input+processing pipelines which saturate their CPUs before saturating their+accelerators (TPUs in particular). Since they often experiment with+hyperparameter tuning or tweaks to existing model without affecting their input+pipeline, they are asking for ways to avoid similar repeated preprocessing of+data by either saving a dataset or caching it to disk.++## User Benefit++Users will be able to transparently persist partially or fully processed data+from `tf.data` input pipelines to disk or Cloud storage systems, and materialize+the pre-processed data during subsequent runs from the same pipeline. This will+cut down on the input pipeline processing overheads during second and subsequent+runs.++## Design Proposal++We propose that we add a new `snapshot` transformation to tf.data. To illustrate+the usage of the transformation, we can start with some sample code:++```python+dataset = Dataset.list_files("/raw/data/*").shard(num_workers, i)+dataset = dataset.parallel_interleave(TFRecordDataset)+dataset = dataset.map(my_preprocessing_fn)+dataset = dataset.apply(tf.data.snapshot("/saved/data", options...))+dataset = dataset.repeat()++model = ...+model.fit(dataset)+```++As we can see, the end user simply has to add this transformation in order to+use this functionality. In essence, the transformation is similar to the+existing `tf.data.Dataset.cache`, with the key difference is being that, unlike+`cache`, `snapshot` is intended to re-used across different executions of the+same input pipelines.++### Proposed API++We are proposing the following API for the snapshot transformation.++```python+def snapshot(path,+             compression=None,+             reader_path_prefix=None,+             writer_path_prefix=None,+             shard_size_bytes=None,+             pending_snapshot_expiry_seconds=None,+             num_reader_threads=None,+             reader_buffer_size=None,+             num_writer_threads=None,+             writer_buffer_size=None,+             shuffle_on_read=None,+             shuffle_seed=None,+             mode=None,+             snapshot_name=None):

The buffer size configuration here is for the reading and writing thread buffers (basically the same effect as having Dataset.prefetch(1)) rather than file system buffers. For the buffers you are talking about, we offer configurations at the file system level (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/platform/cloud/gcs_file_system.cc#L67) that would be more general.

frankchn

comment created time in a month

push event frankchn/community

Frank Chen

commit sha beff086b7c870c346e30d64562c370d0c80e795d

Updated design doc by removing some unneeded parameters

view details

push time in a month

Pull request review comment tensorflow/community

RFC: tf.data Snapshot

+# tf.data Snapshot
...
+2.  `compression`: Optional. The type of compression to apply to the snapshot
+    written to disk. This will support `GZIP`, `SNAPPY` or None. Defaults to
+    AUTO.
+
+3.  `reader_path_prefix`: Optional. A prefix to add to the path when reading

Good point, removed.

frankchn

comment created time in a month

Pull request review commenttensorflow/community

RFC: tf.data Snapshot

+# tf.data Snapshot
...
+5.  `shard_size_bytes`: Optional. The maximum size of each data file to be
+    written by the snapshot dataset op. Defaults to AUTO.
+
+6.  `pending_snapshot_expiry_seconds`: Optional. How long to wait (in seconds)

This is there so that if a pipeline in write mode crashes halfway without finalizing the snapshot, future runs of the pipeline can start a new writing process rather than be stuck in passthrough mode indefinitely.
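
A rough sketch of how a user could tune that window, assuming the `snapshot` transformation and `pending_snapshot_expiry_seconds` argument proposed in this RFC (`my_preprocessing_fn` is a placeholder):

```python
import tensorflow as tf

# Sketch only: `tf.data.snapshot` is the transformation proposed in this RFC,
# and `my_preprocessing_fn` stands in for the user's map function.
dataset = tf.data.Dataset.list_files("/raw/data/*")
dataset = dataset.interleave(tf.data.TFRecordDataset)
dataset = dataset.map(my_preprocessing_fn)
# If an earlier run crashed mid-write, a new run waits at most one hour for the
# stale snapshot before giving up on it and starting a fresh write.
dataset = dataset.apply(
    tf.data.snapshot("/saved/data", pending_snapshot_expiry_seconds=3600))
```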

frankchn

comment created time in a month

Pull request review commenttensorflow/community

RFC: tf.data Snapshot

+# tf.data Snapshot
...
+    4.  `auto` retains the default behavior of snapshot. See the "Standard
+        Kernel Workflow" section for the default behavior.
+
+14. `snapshot_name`: Optional. If set, use the supplied string as a named

Added clarification about the possibility of collisions.

frankchn

comment created time in a month

Pull request review commenttensorflow/community

RFC: tf.data Snapshot

+# tf.data Snapshot
...
+As we can see, the end user simply has to add this transformation in order to
+use this functionality. In essence, the transformation is similar to the
+existing `tf.data.Dataset.cache`, with the key difference is being that, unlike
+`cache`, `snapshot` is intended to re-used across different executions of the

Cache can be in memory, which is useful for testing and small datasets.
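
For comparison, a minimal sketch of the existing `cache` behavior (standard tf.data API; the file path below is illustrative):

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10).map(lambda x: x * 2)

# With no filename the cache lives in memory -- handy for tests and small
# datasets, but it disappears when the process exits.
in_memory = dataset.cache()

# With a filename the cache is written to disk, but it is still tied to one
# pipeline definition, whereas snapshot is meant to be reused across runs.
on_disk = dataset.cache("/tmp/my_cache")
```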

frankchn

comment created time in a month

Pull request review commenttensorflow/community

RFC: tf.data Snapshot

+# tf.data Snapshot
...
+```python
+def snapshot(path,
+             compression=None,
+             reader_path_prefix=None,
+             writer_path_prefix=None,
+             shard_size_bytes=None,
+             pending_snapshot_expiry_seconds=None,
+             num_reader_threads=None,
+             reader_buffer_size=None,
+             num_writer_threads=None,
+             writer_buffer_size=None,
+             shuffle_on_read=None,
+             shuffle_seed=None,
+             mode=None,
+             snapshot_name=None):

After discussion, going to remove the buffer_size option.

frankchn

comment created time in a month

Pull request review commenttensorflow/community

RFC: tf.data Snapshot

+# tf.data Snapshot
...
+### Alternatives Considered
+
+An alternative proposal for an API would be `save()` and `load()`, where the
+saving and loading of the input pipeline would be made more explicit, avoiding
+some of the logic needed in determining whether to snapshot or read from a
+snapshot of a model.
+
+The downside here would be that the user would have to split the preprocessing
+and training into potentially different files, and users would be forced to
+select whether to train or preprocess on their own, which is not good.

Agree with @rohan100jain that for the next internal implementation we can expose save/load/control where appropriate while retaining this python interface for existing users who might want a simple solution.

frankchn

comment created time in 2 months

Pull request review commenttensorflow/community

RFC: tf.data Snapshot

+# tf.data Snapshot
...
+An alternative proposal for an API would be `save()` and `load()`, where the
+saving and loading of the input pipeline would be made more explicit, avoiding
+some of the logic needed in determining whether to snapshot or read from a
+snapshot of a model.
+
+The downside here would be that the user would have to split the preprocessing
+and training into potentially different files, and users would be forced to
+select whether to train or preprocess on their own, which is not good.

I think this design was chosen so that the user doesn't have to worry about changing their existing pipelines much at all, and can benefit from this just by dropping snapshot in.

With save() and load(), users will have to manage saving and loading by themselves, and this potentially introduces errors (e.g. users may accidentally load an old version of a snapshot without realizing it).

That said, perhaps we can re-use the C++ code that implements this to build more generic Dataset.save and Dataset.load ops for users who want that sort of control?
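
A purely hypothetical sketch of what that more explicit surface could look like (the `save`/`load` names below are illustrative, not part of this proposal; `my_preprocessing_fn` and `model` are placeholders):

```python
import tensorflow as tf

# Hypothetical save()/load() style API as discussed above.
preprocessed = (tf.data.Dataset.list_files("/raw/data/*")
                .interleave(tf.data.TFRecordDataset)
                .map(my_preprocessing_fn))
preprocessed.save("/saved/data")  # hypothetical explicit materialization step

# In a separate training script, the user has to pick the right materialized
# copy to load -- the error-prone step mentioned above.
dataset = tf.data.Dataset.load("/saved/data")  # hypothetical
model.fit(dataset.repeat())
```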

frankchn

comment created time in 2 months

Pull request review commenttensorflow/community

RFC: tf.data Snapshot

+# tf.data Snapshot
...
+3.  `reader_path_prefix`: Optional. A prefix to add to the path when reading
+    from snapshots. This is useful for filesystems where configuration is passed
+    in through the path. Defaults to None.
+
+4.  `writer_path_prefix`: Optional. A prefix to add to the path when writing to
+    snapshots. This is useful for filesystems where configuration is passed in
+    through the path. Defaults to None.

There are internal use cases for this (e.g. readahead).
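
As a sketch of what that looks like from the user's side (the prefix string here is made up purely for illustration):

```python
import tensorflow as tf

# Sketch only: the "/readahead/16M" prefix is illustrative; it shows how
# filesystem configuration could be threaded through the path via the proposed
# `reader_path_prefix` argument.
dataset = tf.data.TFRecordDataset("/raw/data/part-00000")
dataset = dataset.apply(
    tf.data.snapshot("/saved/data", reader_path_prefix="/readahead/16M"))
```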

frankchn

comment created time in 2 months

Pull request review commenttensorflow/community

RFC: tf.data Snapshot

+# tf.data Snapshot
...
+```python
+def snapshot(path,
+             compression=None,
+             reader_path_prefix=None,
+             writer_path_prefix=None,
+             shard_size_bytes=None,
+             pending_snapshot_expiry_seconds=None,
+             num_reader_threads=None,
+             reader_buffer_size=None,
+             num_writer_threads=None,
+             writer_buffer_size=None,
+             shuffle_on_read=None,
+             shuffle_seed=None,
+             mode=None,
+             snapshot_name=None):
Yeah, I don't particularly like how many args there are here either. Perhaps reader/writer threads and buffer sizes can be collapsed into just threads and buffer (cc @rohan100jain)
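
Something along these lines, purely as a sketch of the collapsed signature (not what the RFC currently proposes):

```python
# Hypothetical simplification: one `num_threads` and one `buffer_size` cover
# both the read and write paths.
def snapshot(path,
             compression=None,
             shard_size_bytes=None,
             pending_snapshot_expiry_seconds=None,
             num_threads=None,   # replaces num_reader_threads / num_writer_threads
             buffer_size=None,   # replaces reader_buffer_size / writer_buffer_size
             shuffle_on_read=None,
             shuffle_seed=None,
             mode=None,
             snapshot_name=None):
  pass  # sketch only
```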

frankchn

comment created time in 2 months

push eventfrankchn/community

Frank Chen

commit sha d941c33eb309d23ec79ba891e490da1abc1fb141

Add clarification regarding number of workers being the same


push time in 2 months

Pull request review commenttensorflow/community

RFC: tf.data Snapshot

+# tf.data Snapshot
...
+### Implementation Assumptions
+
+1.  We assume that at least for one pipeline run, you can go through the entire
+    training dataset and be able to store that data on disk. Otherwise, a
+    snapshot will never get created.
+
+2.  In case there are multiple workers and the dataset is sharded across
+    workers, we assume that the number of workers remains the same from one run
+    to another. If the number changes, we’ll trigger another snapshot.

If the number of workers changes, then (at least in the case of sharded TPU training) the `num_shards` parameter to `Dataset.shard` will change too. This results in a different graph fingerprint for each worker, and we will go into WRITE mode automatically. If there is no sharding and all the workers use exactly the same input pipeline, then this won't happen. Will add something here to clarify.
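For illustration, a minimal sketch of such a sharded pipeline, using the `snapshot` transformation as proposed in this RFC (the `tf.data.snapshot` name follows the RFC's own example rather than a released API, and `my_preprocessing_fn` is a placeholder):

```python
import tensorflow as tf

def my_preprocessing_fn(record):
  return record  # placeholder for real parsing/augmentation logic

def make_pipeline(num_workers, worker_index):
  dataset = tf.data.Dataset.list_files("/raw/data/*")
  # num_workers is an argument to shard(), so it becomes part of the dataset
  # graph that feeds the snapshot fingerprint computation.
  dataset = dataset.shard(num_workers, worker_index)
  dataset = dataset.map(my_preprocessing_fn)
  # A different num_workers produces a different upstream graph, hence a
  # different fingerprint, and snapshot re-enters WRITE mode automatically.
  dataset = dataset.apply(tf.data.snapshot("/saved/data"))
  return dataset.repeat()
```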

frankchn

comment created time in 2 months

Pull request review commenttensorflow/community

RFC: tf.data Snapshot

+For the current implementation, we will store the data in chunked TFRecord
+files. Eventually we may move to other more higher performance data stores or

@katsiapis Sorry, can you clarify what sink you are referring to?

frankchn

comment created time in 2 months

Pull request review commenttensorflow/community

RFC: tf.data Snapshot

+Externally, we guarantee that snapshots written by a particular version of
+TensorFlow will be readable by that specific version of TensorFlow. Eventually,
+we can also guarantee that snapshots written will be readable by all future

Makes sense. Removed the guarantee that snapshots will be readable by all future versions of TensorFlow.

frankchn

comment created time in 2 months

Pull request review commenttensorflow/community

RFC: tf.data Snapshot

+The `fingerprint` is a hash of the input processing graph. The `run-id` is
+unique training run ID generated.

I don't think SavedModel supports comparing graphs (or, in this case, parts of a graph), which is our main use case here. In general, I am not sure TensorFlow can guarantee that it produces exactly the same graph (i.e., with the same node names, function names, etc.) from run to run, so we are utilizing the HashGraph function (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/data/dataset_utils.cc#L733), which computes a fingerprint while ignoring node names and other attributes that may vary between runs.
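To make the idea concrete, here is a toy, self-contained sketch of name-insensitive hashing; it is not the real `HashGraph` implementation, and the node and op names below are made up purely for illustration:

```python
import hashlib

def toy_fingerprint(nodes):
  # Toy stand-in for HashGraph: hash only op types and attributes, ignoring
  # the auto-generated node names, so renaming alone does not change the hash.
  h = hashlib.sha256()
  for _, op_type, attrs in nodes:
    h.update(op_type.encode())
    h.update(repr(sorted(attrs.items())).encode())
  return h.hexdigest()

# The same logical pipeline built in two runs, with different node names:
run_a = [("ShardDataset_3", "ShardDataset", {"num_shards": 4}),
         ("MapDataset_1", "MapDataset", {"f": "preprocess"})]
run_b = [("ShardDataset_9", "ShardDataset", {"num_shards": 4}),
         ("MapDataset_7", "MapDataset", {"f": "preprocess"})]
assert toy_fingerprint(run_a) == toy_fingerprint(run_b)
```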

frankchn

comment created time in 2 months

Pull request review commenttensorflow/community

RFC: tf.data Snapshot

+3.  Any `repeat`s in the dataset should be moved to after the `snapshot` op, to
+    avoid writing large (or infinite) amounts of data during a snapshot writing
+    run.

This is currently a best practice; a snapshot best-practices guide will be published later.
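A short sketch of that practice, again using the `snapshot` transformation as proposed in this RFC (illustrative only; `my_preprocessing_fn` stands in for real preprocessing):

```python
import tensorflow as tf

my_preprocessing_fn = lambda record: record  # placeholder preprocessing

dataset = tf.data.Dataset.list_files("/raw/data/*")
dataset = dataset.map(my_preprocessing_fn)
# Recommended: snapshot sees exactly one pass over the data, and repeat()
# then re-reads the materialized snapshot on subsequent epochs.
dataset = dataset.apply(tf.data.snapshot("/saved/data"))
dataset = dataset.repeat()

# Discouraged: repeat() before snapshot would ask the writer to materialize
# an unbounded stream, so the snapshot would never be marked complete.
# dataset = dataset.repeat().apply(tf.data.snapshot("/saved/data"))
```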

frankchn

comment created time in 2 months

Pull request review commenttensorflow/community

RFC: tf.data Snapshot

+1.  If the snapshot directory contains a `metadata` file, we will read the
+    metadata file.
+
+    1.  The metadata file contains the following fields:

An immutable metadata file definitely makes sense. We will introduce a metadata.final file to indicate when snapshot has finished writing.
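As a rough illustration of what that sentinel buys (the file name follows the description above; everything else here is made up), a reader could gate itself on the sentinel like this:

import os

def snapshot_run_is_readable(run_dir):
    # The writer only creates `metadata.final` after all data files and the
    # `snapshot.metadata` file have been fully written, so its presence marks
    # a complete, readable snapshot run.
    return os.path.exists(os.path.join(run_dir, "metadata.final"))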

frankchn

comment created time in 2 months

Pull request review commenttensorflow/community

RFC: tf.data Snapshot

[Quoted RFC diff trimmed; the comment below is on the `shuffle_on_read` parameter of the proposed snapshot() API.]

This emulates the behavior of Dataset.list_files(shuffle=True) and is orthogonal to using dataset.shuffle() after the snapshot op. Clarified the behavior of this flag.
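A rough sketch of the distinction, using the proposed (not yet final) API; the paths and the preprocess function are placeholders:

import tensorflow as tf

# File-level shuffling when reading raw files directly:
files = tf.data.Dataset.list_files("/raw/data/*", shuffle=True)

# shuffle_on_read gives the analogous file-level shuffling when the elements
# come from a snapshot instead of from the raw files:
ds = tf.data.Dataset.list_files("/raw/data/*")
ds = ds.interleave(tf.data.TFRecordDataset).map(preprocess)
ds = ds.apply(tf.data.snapshot("/saved/data", shuffle_on_read=True))

# Element-level shuffling remains a separate, complementary step:
ds = ds.shuffle(10000)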

frankchn

comment created time in 2 months

Pull request review commenttensorflow/community

RFC: tf.data Snapshot

[Quoted RFC diff trimmed; the comment below is on the `shuffle_on_read` parameter description.]

Clarified. The behavior emulates Dataset.list_files(shuffle=True) by randomizing the order in which the snapshot files are read back.

frankchn

comment created time in 2 months

Pull request review commenttensorflow/community

RFC: tf.data Snapshot

[Quoted RFC diff trimmed; the comment below is on the `num_reader_threads` parameter description.]

Changed.

frankchn

comment created time in 2 months

Pull request review commenttensorflow/community

RFC: tf.data Snapshot

[Quoted RFC diff trimmed; the comment below is on the `reader_buffer_size` parameter description.]

Changed

frankchn

comment created time in 2 months

Pull request review commenttensorflow/community

RFC: tf.data Snapshot

[Quoted RFC diff trimmed; the comment below is on the `shard_size_bytes` parameter description.]

Changed.

frankchn

comment created time in 2 months

Pull request review commenttensorflow/community

RFC: tf.data Snapshot

[Quoted RFC diff trimmed; the comment below is on the `compression` parameter description.]

Good point. We can have the default option be AUTO here and elsewhere, and gradually add tf.data autotuning to snapshots afterwards (autotuning right now starts up several versions of the pipeline, which is not ideal given that we are writing to the file system).

frankchn

comment created time in 2 months

Pull request review commenttensorflow/community

RFC: tf.data Snapshot

[Quoted RFC diff trimmed; the comment below is on the "User Benefit" section.]

Yup, this is a way of serializing tf.data datasets to disk and then reading them back. However, the API design is geared towards saving preprocessing time right now.

frankchn

comment created time in 2 months

push eventfrankchn/community

Frank Chen

commit sha 2bee40f645d49445efe548ab8c59d58586e10d24

Updated design doc after comments from various folks

view details

push time in 2 months

Pull request review commenttensorflow/community

RFC: tf.data Snapshot

[Quoted RFC diff trimmed; the comment below is on the implementation assumption that the number of workers stays the same from one run to another.]

@byronyi The current tf.data input pipeline design doesn't really allow us to coordinate between workers (e.g. there might be failure cases where K workers out of N fail and the snapshot slice is invalid).

That said, I totally understand where you are coming from, and we are planning a data pipeline service (where snapshot can be integrated) so that everyone would read elements off a single pipeline, and having different numbers of trainers would be less of an issue. There should be a design doc RFC on that in the next few weeks, so look out for it.

frankchn

comment created time in 2 months

pull request commenttensorflow/community

RFC: tf.data Snapshot

How is this different from processing the data, saving it in TFRecord format on disk, and reading it back later?

Thanks for your question! You can definitely manually replicate what we are doing here by just running an input pipeline and saving the processed data to other TFRecord files.
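For concreteness, a minimal sketch of that manual route (the paths are illustrative, my_preprocessing_fn is a placeholder, and this only handles single-tensor elements):

import tensorflow as tf

def write_processed(dataset, path):
    # Run the preprocessing pipeline once and persist each element.
    with tf.io.TFRecordWriter(path) as writer:
        for element in dataset:
            writer.write(tf.io.serialize_tensor(element).numpy())

def read_processed(path, dtype):
    # On later runs, read the serialized tensors back instead of re-processing.
    return tf.data.TFRecordDataset(path).map(
        lambda record: tf.io.parse_tensor(record, out_type=dtype))

processed = tf.data.TFRecordDataset("/raw/data/train.tfrecord").map(my_preprocessing_fn)
write_processed(processed, "/saved/data/processed.tfrecord")
reloaded = read_processed("/saved/data/processed.tfrecord", tf.float32)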

That said, there are a couple of advantages of having this op over a manual process:

First, we can write to and read from snapshots automatically with a single additional op at the end of the input pipeline, without any new code needed to write to or read from the snapshots.

Second, tf.data snapshot can detect changes in the pipeline (e.g. changing JPEG image decode sizes or pointing to a new set of files) through input pipeline fingerprinting and automatically create new snapshot files with no further intervention. This minimizes potential errors when you are experimenting and changing input pipelines.

Third, the proposed file format we have is somewhat more efficient (both in terms of the file format itself, and the ability to enable gzip/snappy compression on the file level). We also try to optimize the read/write paths as much as possible, so at the limit we believe we are faster or more space-efficient (or both) than plain TFRecord reading/writing.

frankchn

comment created time in 2 months

pull request commenttensorflow/community

RFC: tf.data Snapshot

Thanks Frank for this RFC. We observe similar performance benefits in caching preprocessed data in multi-epoch training.

Since we already have a ds.cache API that supports a file system interface, what is the unique value proposition of having another ds.snapshot API?

EDIT: saw that in the doc; it might be worth mentioning that in the PR summary.

Thanks for your comments. I've added the motivation and the differences from ds.cache to the PR summary.

frankchn

comment created time in 2 months

push eventfrankchn/community

Frank Chen

commit sha 958ae634eff74de3a71e9991e6fea892c1e3f4af

Minor formatting changes

view details

push time in 2 months

PR opened tensorflow/community

RFC: tf.data Snapshot

tf.data Snapshot

Status Proposed
RFC # 193
Author(s) Frank Chen (frankchn@google.com), Rohan Jain (rohanj@google.com)
Sponsor Jiri Simsa (jsimsa@google.com)
Updated 2020-01-07

Objective

With ever faster accelerators available in Cloud and hyperparameter tuning consuming larger chunks of accelerator time, TensorFlow users are increasingly finding that they don’t have enough CPU resources to keep up with these accelerators, leaving valuable accelerator resources idle.

To alleviate this problem, we are proposing a snapshot API within tf.data, to allow users to transparently persist the output of their preprocessing pipeline to disk, and materialize the pre-processed data on a different training run.

This API enables repeated preprocessing steps to be consolidated, allowing re-use of already processed data and trading off disk storage and network bandwidth to free up more valuable CPU resources and accelerator compute time.
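The design proposal illustrates the intended usage with a pipeline along the following lines (num_workers, worker_index, and my_preprocessing_fn are placeholders; tf.data.snapshot is the transformation being proposed, not an existing API):

import tensorflow as tf

dataset = tf.data.Dataset.list_files("/raw/data/*").shard(num_workers, worker_index)
dataset = dataset.interleave(tf.data.TFRecordDataset)
dataset = dataset.map(my_preprocessing_fn)
dataset = dataset.apply(tf.data.snapshot("/saved/data"))  # proposed transformation
dataset = dataset.repeat()

model = ...
model.fit(dataset)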

+368 -0

0 comment

1 changed file

pr created time in 2 months

create branchfrankchn/community

branch : rfc-snapshot

created branch time in 2 months

fork frankchn/community

Stores documents used by the TensorFlow developer community

fork in 2 months

issue closedtensorflow/tensorflow

“Cloud TPU” console spam on every TensorFlow import

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub.

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): gLinux (like Debian)
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): v1.12.1-21487-g2e8d5e5 2.1.0-dev20200102
  • Python version: Python 3.7.5rc1
  • Bazel version (if compiling from source): N/A
  • GCC/Compiler version (if compiling from source): N/A
  • CUDA/cuDNN version: N/A
  • GPU model and memory: N/A

Describe the current behavior

Importing TensorFlow prints an unnecessary and unhelpful warning:

WARNING:tensorflow:Falling back to tensorflow client, its recommended to install the cloud tpu client directly with pip install cloud-tpu-client .

Describe the expected behavior

Importing TensorFlow should not print any messages about Cloud TPUs. This is a normal desktop installation that doesn’t have anything to do with TPUs, and doesn’t need them.

Code to reproduce the issue

python -c 'import tensorflow' 2>&1 | diff -u /dev/null -

Other info / logs

Likely introduced by 5364121e858b.

closed time in 2 months

wchargin

issue commenttensorflow/tensorflow

“Cloud TPU” console spam on every TensorFlow import

Thanks for your report! It should be fixed starting with the nightly build tomorrow morning (2.1.0-dev20200104).

wchargin

comment created time in 2 months

issue commenttensorflow/tensorflow

ImageDataGenerator does not work with tpu

I believe we currently don't support fit_generator in distribution strategy or TPUs, but assigning to @guptapriya to confirm.
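For reference, the usual tf.data alternative to fit_generator looks roughly like the sketch below (the generator, shapes, and dtypes are assumptions, and a Python generator may still be a bottleneck or unsupported on TPU hosts):

import numpy as np
import tensorflow as tf

def image_generator():
    # Placeholder for the ImageDataGenerator-style Python generator,
    # yielding (image, label) pairs.
    while True:
        yield np.zeros((224, 224, 3), np.float32), np.int32(0)

dataset = tf.data.Dataset.from_generator(
    image_generator,
    output_types=(tf.float32, tf.int32),
    output_shapes=((224, 224, 3), ()))
dataset = dataset.batch(128, drop_remainder=True)
# model.fit(dataset) replaces model.fit_generator(generator, ...)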

Shiro-LK

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

Do not make invalid android config with only NDK/SDK configured

 def _android_autoconf_impl(repository_ctx):
     ndk_home = repository_ctx.os.environ.get(_ANDROID_NDK_HOME)
     ndk_api_level = repository_ctx.os.environ.get(_ANDROID_NDK_API_VERSION)

-    sdk_rule = "pass"
+    sdk_rule = ""
     if all([sdk_home, sdk_api_level, build_tools_version]):
         sdk_rule = _ANDROID_SDK_REPO_TEMPLATE % (
             sdk_home,
             sdk_api_level,
             build_tools_version,
         )

-    ndk_rule = "pass"
+    ndk_rule = ""
     if all([ndk_home, ndk_api_level]):
         ndk_rule = _ANDROID_NDK_REPO_TEMPLATE % (ndk_home, ndk_api_level)

+    if ndk_rule == "" and sdk_rule == "":
+        sdk_rule = "pass"

Actually why is this changed?

DoumanAsh

comment created time in 4 months

issue commenttensorflow/tensorflow

TPUStrategy breaks on subclassed keras models

@rxsang can you help take a look? I am OOO today

kazimuth

comment created time in 4 months

issue commenttensorflow/tensorflow

TPUStrategy breaks on subclassed keras models

TPUs are not officially supported in TF2.0. Does the code work on 1.15 or a nightly version?

kazimuth

comment created time in 4 months

issue commenttensorflow/tensorflow

tf.keras model.fit calls slow with TPU distribute strategy

I am not surprised that the notebook is slow as the data processing is all happening on the Colab rather than the TPU system (which has much more processing power than the Colab VMs).

With the notebook, the data variable contains 224 x 224 x 3 x 4 (bytes/float) x 1024 = 588 MB of data, which has to be transferred per step. Transferring this amount of data over the network to the TPU + encoding and decoding overhead would be non-trivial.
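A quick back-of-the-envelope check of that number (assuming float32 images at 224x224x3 and a global batch of 1024):

height, width, channels = 224, 224, 3
bytes_per_float32 = 4
batch_size = 1024

bytes_per_step = height * width * channels * bytes_per_float32 * batch_size
print(bytes_per_step / 2**20)  # ~588 MiB pushed from the Colab VM to the TPU each step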

For performance reasons especially on non-trivial image models, you need to use tf.data Datasets with TF supported ops, and load the raw data from GCS.
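A rough sketch of that pattern (the bucket path, feature names, and sizes are made up; the point is that decoding and resizing happen with TF ops inside the input pipeline rather than in Python on the Colab VM):

import tensorflow as tf

def parse_example(serialized):
    features = tf.io.parse_single_example(
        serialized,
        {"image_raw": tf.io.FixedLenFeature([], tf.string),
         "label": tf.io.FixedLenFeature([], tf.int64)})
    image = tf.io.decode_jpeg(features["image_raw"], channels=3)
    image = tf.image.resize(image, [224, 224]) / 255.0
    return image, features["label"]

filenames = tf.io.gfile.glob("gs://my-bucket/train-*.tfrecord")
dataset = (tf.data.TFRecordDataset(filenames)
           .map(parse_example, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .batch(1024, drop_remainder=True)
           .prefetch(tf.data.experimental.AUTOTUNE))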

capilano

comment created time in 4 months

issue commenttensorflow/tensorflow

Combo TPU/TFRecords for model.evaluate is not working

Yeah, in general you need to match the TF versions between the TPU and the Colab/VM instance that you are using. We don't support TF1.15 on Colab and 1.14 on the TPU (or any other combination).

Coming back to your original question, it looks like the XLA compiler doesn't support this feature right now: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/service/dynamic_dimension_inference.cc#L712

@jvishnuvardhan Can you assign this to someone on the XLA team to see if they know more?

anhmeow

comment created time in 5 months

issue commenttensorflow/tensorflow

Combo TPU/TFRecords for model.evaluate is not working

The TPU version must match the version of TensorFlow you are installing with pip install. Are you creating the TPU with --version=1.15?

anhmeow

comment created time in 5 months

issue closedtensorflow/tensorflow

TPUStrategy incompatibility with tf.io.read_file

System information

I am using Colaboratory and Google cloud.

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Debian 4.9.168-1+deb9u5
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: No
  • TensorFlow installed from (source or binary): unknown
  • TensorFlow version (use command below): 1.14
  • Python version: 3.5.3
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version:
  • GPU model and memory:

You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with:

1. TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
2. TF 2.0: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"

Describe the current behavior

The script ends with a segmentation fault or abort.

Describe the expected behavior

Just a clean run.

Code to reproduce the issue

# Importing all necessary libraries
import tensorflow as tf
import cv2
import random as rnd
import os

if int(tf.__version__.split('.')[0]) == 1:
    print(tf.__version__)
    tf.enable_eager_execution()

@tf.function
def _read_test(filename):
    img_raw = tf.io.read_file(tf.squeeze(filename))
    return img_raw

import numpy as np
image = np.array([[rnd.randint(0, 255) for _ in range(936)] for _ in range(1024)])
cv2.imwrite('0.png', image);

raw = _read_test(tf.constant('0.png'))

tf.keras.backend.clear_session()

if 'TPU_NAME' in os.environ:
    TPU_WORKER = 'grpc://' + os.environ['TPU_NAME']
    resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu=TPU_WORKER)
    tf.config.experimental_connect_to_host(resolver.master())
    tf.contrib.distribute.initialize_tpu_system(resolver)
    strategy = tf.contrib.distribute.TPUStrategy(resolver)
elif 'COLAB_TPU_ADDR' in os.environ:
    TPU_WORKER = 'grpc://' + os.environ['COLAB_TPU_ADDR']
    resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu=TPU_WORKER)
    tf.config.experimental_connect_to_host(resolver.master())
    tf.contrib.distribute.initialize_tpu_system(resolver)
    strategy = tf.contrib.distribute.TPUStrategy(resolver)
else:
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

#print(strategy)

with strategy.scope():
    pass

print('success')

Other info / logs

The previous code is part of an image-processing neural network. The image files are read while running the code to minimize disk usage. The code runs smoothly in a CPU or GPU environment; however, it crashes in a TPU one.

closed time in 5 months

jmgc

issue commenttensorflow/tensorflow

TPUStrategy incompatibility with tf.io.read_file

Closing issue for now, feel free to reopen.

jmgc

comment created time in 5 months

issue closedtensorflow/tensorflow

[TF 2.0.0rc0] Cannot connect to TPU device

Created a VM and a v3-8 TPU with the ctpu up command and updated the TF version to TF 2.0.0rc0 via pip3. When I try to connect to the TPU device, it returns the following error:

InvalidArgumentError: Unable to find a context_id matching the specified one (5613663074031560004). Perhaps the worker was restarted, or the context was GC'd?
Additional GRPC error information:
{"created":"@1566994715.938381293","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Unable to find a context_id matching the specified one (5613663074031560004). Perhaps the worker was restarted, or the context was GC'd?","grpc_status":3}
2019-08-28 12:18:36.196440: E tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_client.cc:72] Remote EagerContext with id 5613663074031560004 does not seem to exist.

I also tried the same thing in Colab with the rc0 version and got the same error. The code I used is the one given in the documentation:

tpu='test'
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=tpu)
tf.config.experimental_connect_to_host(resolver.master())
tf.tpu.experimental.initialize_tpu_system(resolver)
tpu_strategy = tf.distribute.experimental.TPUStrategy(resolver)

closed time in 5 months

nsantavas

issue commenttensorflow/tensorflow

[TF1.14][TPU]Can not use custom TFrecord dataset on Colab using TPU

@rishabhsahrawat Unfortunately you have to use GCS to store data going into and out of TPUs. This is a current design limitation for the platform that we are working to resolve, but we don't have anything to announce at the moment.

You can sign up for a Google Cloud Platform account with 5 GB of free storage and $300 in credits at https://cloud.google.com/free/, which should be enough to get started.
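
For reference, the only change on the tf.data side is pointing the pipeline at a gs:// path once the file is uploaded. A minimal sketch is below (the bucket name is a placeholder, not something from this thread):

# Upload the file first, e.g. with:
#   gsutil cp encodeddata_inGZIP.tfrecord gs://your-bucket/data/
import tensorflow as tf

# The TPU workers read directly from GCS instead of the local filesystem.
dataset = tf.data.TFRecordDataset(
    'gs://your-bucket/data/encodeddata_inGZIP.tfrecord',
    compression_type='GZIP')  # the file in this issue is GZIP-compressed
dataset = dataset.batch(128).prefetch(tf.data.experimental.AUTOTUNE)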

rishabhsahrawat

comment created time in 5 months

issue commenttensorflow/tensorflow

[TF1.14][TPU]Can not use custom TFrecord dataset on Colab using TPU

Hello! Yeah, we don't allow local filesystem access on TPUs right now, and we recommend that you read and write checkpoints and other files on GCS. See our MNIST on TPU colab for an example.

With future TensorFlow and TPU releases (2.1+), you will be able to use the ram:// file system to write temporary files that you don't need to keep.
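
Roughly, that split looks like the untested sketch below (the bucket name is a placeholder):

import tensorflow as tf

# Temporary file that never needs to leave the process (ram:// is in-memory, TF 2.1+).
tf.io.write_file('ram://scratch/notes.txt', 'temporary contents')
print(tf.io.read_file('ram://scratch/notes.txt').numpy())

# Anything you need to keep (checkpoints, exported models) should go to GCS instead,
# so that the TPU workers and later jobs can read it.
ckpt = tf.train.Checkpoint(step=tf.Variable(0))
ckpt.save('gs://your-bucket/checkpoints/ckpt')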

rishabhsahrawat

comment created time in 5 months

issue closedtensorflow/tensorflow

[TF1.14][TPU]Can not use custom TFrecord dataset on Colab using TPU

I have created a TFRecord dataset file consisting of elements and their corresponding labels. I want to use it for training a model on Colab using the free TPU. I can load the TFRecord file and even run an iterator just to see the contents; however, before the beginning of the epoch it throws the following error:

UnimplementedError: From /job:worker/replica:0/task:0:
File system scheme '[local]' not implemented (file: '/content/gdrive/My Drive/data/encodeddata_inGZIP.tfrecord')
	 [[{{node MultiDeviceIteratorGetNextFromShard}}]]
	 [[RemoteCall]]
	 [[IteratorGetNextAsOptional_1]]

In my understanding, it wants the TFRecord file in a bucket the TPU can access, but I don't know how to do that on Colab. How can one use a TFRecord file directly on a Colab TPU?

closed time in 5 months

rishabhsahrawat

pull request commenttensorflow/tensorflow

test streaming accuracy python implementation

Here are the errors from Ubuntu Sanity:

From pylint:

tensorflow/examples/speech_commands/recognize_commands.py:107: [C0330(bad-continuation), ] Wrong continued indentation (remove 1 space).
tensorflow/examples/speech_commands/recognize_commands.py:128: [C0330(bad-continuation), ] Wrong hanging indentation (remove 35 spaces).
tensorflow/examples/speech_commands/recognize_commands.py:129: [C0330(bad-continuation), ] Wrong hanging indentation (remove 28 spaces).
tensorflow/examples/speech_commands/recognize_commands.py:135: [C0330(bad-continuation), ] Wrong continued indentation (add 10 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:136: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:138: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:140: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:148: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:149: [C0330(bad-continuation), ] Wrong continued indentation (add 9 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:154: [C0330(bad-continuation), ] Wrong continued indentation (remove 1 space).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:169: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:172: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:173: [C0330(bad-continuation), ] Wrong continued indentation (remove 5 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:174: [C0330(bad-continuation), ] Wrong continued indentation (remove 5 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:175: [C0330(bad-continuation), ] Wrong continued indentation (remove 5 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:184: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:185: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:186: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:187: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:189: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:190: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:191: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:192: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:194: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:195: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:196: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:197: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:199: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:200: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:201: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:202: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:204: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:205: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:206: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:207: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:207: [C0326(bad-whitespace), ] Exactly one space required after comma
tensorflow/examples/speech_commands/test_streaming_accuracy.py:208: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:210: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:211: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:212: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:213: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:215: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:216: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:217: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:218: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:220: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:221: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:222: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:223: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:225: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:226: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:227: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:228: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:230: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:231: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:232: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:233: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:235: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:236: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:237: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:238: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:240: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:241: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:242: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:243: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:246: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:247: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:248: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/test_streaming_accuracy.py:249: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/accuracy_utils.py:64: [C0330(bad-continuation), ] Wrong continued indentation (remove 10 spaces).
tensorflow/examples/speech_commands/accuracy_utils.py:144: [C0330(bad-continuation), ] Wrong hanging indentation (add 2 spaces).
tensorflow/examples/speech_commands/accuracy_utils.py:151: [C0330(bad-continuation), ] Wrong continued indentation (add 3 spaces).

From Ubuntu Sanity:

FAIL: buildifier found errors and/or warnings in above BUILD files.
buildifier suggested the following changes:
tensorflow/examples/speech_commands/BUILD:
16,17c16,17
<     main="accuracy_utils.py",
<     srcs = ["accuracy_utils.py",],
---
>     srcs = ["accuracy_utils.py"],
>     main = "accuracy_utils.py",
23c23
<         ],
---
>     ],
28,29c28,29
<     main="recognize_commands.py",
<     srcs = ["recognize_commands.py",],
---
>     srcs = ["recognize_commands.py"],
>     main = "recognize_commands.py",
35c35
<         ],
---
>     ],
39a40
>     srcs = ["test_streaming_accuracy.py"],
41d41
<     srcs = ["test_streaming_accuracy.py",],
50c50
<         ],
---
>     ],
Please fix manually or run buildifier <file> to auto-fix.
zhizunbao-y

comment created time in 5 months

issue commenttensorflow/tensorflow

TPU has unsupported tensorflow op "LogUniformCandidateSampler" using XLA

I've filed a feature request for this internally, but I am not sure it will make it onto our roadmap in the short term.

elyasmehtabuddin

comment created time in 6 months

issue commenttensorflow/tensorflow

TPU has unsupported tensorflow op "LogUniformCandidateSampler" using XLA

Unfortunately this op is not supported on TPUs. The list of TPU-compatible TensorFlow ops is available at https://cloud.google.com/tpu/docs/tensorflow-ops.
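
If you need NCE loss specifically, one possible workaround (an untested sketch on my part, not an officially supported path) is to run the sampler on the host CPU outside the compiled computation and pass its output in through sampled_values, which tf.nn.nce_loss accepts:

import tensorflow as tf
from tensorflow.python.compiler.xla import xla

num_sampled, num_classes = 1, 10
features = tf.constant(1.2, shape=[1, 4], dtype=tf.float32)
labels = tf.constant(1, shape=[1, 1], dtype=tf.float32)

# The sampler only has a CPU kernel, so keep it outside the XLA cluster.
with tf.device('/cpu:0'):
  sampled_values = tf.random.log_uniform_candidate_sampler(
      true_classes=tf.cast(labels, tf.int64), num_true=1,
      num_sampled=num_sampled, unique=True, range_max=num_classes)

def my_model(features, labels):
  nce_weights = tf.get_variable(
      'nce_weights', shape=[num_classes, 4],
      initializer=tf.random_uniform_initializer(-1.0, 1.0))
  nce_biases = tf.get_variable(
      'nce_biases', shape=[num_classes], initializer=tf.zeros_initializer())
  # sampled_values is captured as an ordinary input to the compiled cluster,
  # so the unsupported sampler op never has to be compiled for TPU/XLA.
  return tf.reduce_mean(
      tf.nn.nce_loss(weights=nce_weights, biases=nce_biases, labels=labels,
                     inputs=features, num_sampled=num_sampled,
                     num_classes=num_classes, sampled_values=sampled_values))

[loss] = xla.compile(my_model, inputs=[features, labels])

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  print(sess.run(loss))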

elyasmehtabuddin

comment created time in 6 months

issue closedtensorflow/tensorflow

TPU has unsupported tensorflow op "LogUniformCandidateSampler" using XLA

I am using NCE loss for an NLP model and ran into an issue where XLA doesn't compile the graph when targeting TPU.

No registered 'LogUniformCandidateSampler' OpKernel for XLA_CPU_JIT devices compatible with node node LogUniformCandidateSampler

System information

  • Using Google's cloud TPU with Tensorflow 1.14 tf_env.txt

Describe the current behavior

I get an error when running the code on TPU because XLA doesn't support candidate samplers.


Describe the expected behavior

I expect the model to be able to run and calculate the loss on TPU using XLA.

Code to reproduce the issue

import os
import tensorflow as tf
from tensorflow.python.compiler.xla import xla

def my_model(features, labels):
  nce_weights = tf.get_variable(
    name="nce_weights",
    shape=[10, 4],
    initializer=tf.random_uniform_initializer(-1.0, 1.0)
  )
  nce_biases = tf.get_variable(
    name="nce_biases",
    shape=[10],
    initializer=tf.zeros_initializer()
  )

  num_sampled = 1
  num_classes = 10

  sampler_func = tf.random.log_uniform_candidate_sampler
  sampled_values = sampler_func(
    true_classes=tf.cast(labels, tf.dtypes.int64),
    num_sampled=num_sampled,
    range_max=num_classes,
    num_true=1,
    unique=True,
    seed=None,
  )

  loss = tf.reduce_mean(
    tf.nn.nce_loss(
      weights=nce_weights,
      biases=nce_biases,
      labels=labels,
      inputs=features,
      num_sampled=num_sampled,
      num_classes=num_classes,
      sampled_values=sampled_values,
    )
  )

  optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
  return loss, optimizer

sess = tf.Session()
features = tf.constant(1.2, shape=[1, 4], dtype=tf.float32)
labels = tf.constant(1, shape=[1, 1], dtype=tf.float32)

[y] = xla.compile(my_model, inputs=[features, labels])

sess.run(tf.global_variables_initializer())
sess.run(y)

Other info / logs

WARNING: Logging before flag parsing goes to stderr.
W0905 08:26:12.392647 140303472117184 deprecation_wrapper.py:119] From issue.py:45: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2019-09-05 08:26:12.529939: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-09-05 08:26:12.908472: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2200000000 Hz
2019-09-05 08:26:12.908636: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x557f84258f10 executing computations on platform Host. Devices:
2019-09-05 08:26:12.908650: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
W0905 08:26:13.021475 140303472117184 deprecation_wrapper.py:119] From issue.py:6: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.

W0905 08:26:13.398072 140303472117184 deprecation.py:323] From /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_impl.py:180: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0905 08:26:13.421444 140303472117184 deprecation_wrapper.py:119] From issue.py:42: The name tf.train.GradientDescentOptimizer is deprecated. Please use tf.compat.v1.train.GradientDescentOptimizer instead.

W0905 08:26:13.709804 140303472117184 deprecation_wrapper.py:119] From issue.py:51: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

2019-09-05 08:26:14.238822: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
2019-09-05 08:26:14.645355: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at xla_ops.cc:343 : Invalid argument: Detected unsupported operations when trying to compile graph cluster_18305510630026920806[] on XLA_CPU_JIT: LogUniformCandidateSampler (No registered 'LogUniformCandidateSampler' OpKernel for XLA_CPU_JIT devices compatible with node {{node LogUniformCandidateSampler}}
	.  Registered:  device='CPU'
){{node LogUniformCandidateSampler}}
	This error might be occurring with the use of xla.compile. If it is not necessary that every Op be compiled with XLA, an alternative is to use auto_jit with OptimizerOptions.global_jit_level = ON_2 or the environment variable TF_XLA_FLAGS="tf_xla_auto_jit=2" which will attempt to use xla to compile as much of the graph as the compiler is able to.
Traceback (most recent call last):
  File "issue.py", line 52, in <module>
    sess.run(y)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Detected unsupported operations when trying to compile graph cluster_18305510630026920806[] on XLA_CPU_JIT: LogUniformCandidateSampler (No registered 'LogUniformCandidateSampler' OpKernel for XLA_CPU_JIT devices compatible with node node LogUniformCandidateSampler (defined at issue.py:27) 
	.  Registered:  device='CPU'
)node LogUniformCandidateSampler (defined at issue.py:27) 
	This error might be occurring with the use of xla.compile. If it is not necessary that every Op be compiled with XLA, an alternative is to use auto_jit with OptimizerOptions.global_jit_level = ON_2 or the environment variable TF_XLA_FLAGS="tf_xla_auto_jit=2" which will attempt to use xla to compile as much of the graph as the compiler is able to.
	 [[cluster]]

Errors may have originated from an input operation.
Input Source operations connected to node LogUniformCandidateSampler:
 Cast (defined at issue.py:22)

Input Source operations connected to node LogUniformCandidateSampler:
 Cast (defined at issue.py:22)

closed time in 6 months

elyasmehtabuddin