Rohan Jain (rohan100jain), @google, Mountain View

rohan100jain/Beatameister 2

Music 256a beatboxing project

rohan100jain/Movie-Visualization 2

Creating a visualization for movies

rohan100jain/Cloud-Prize 0

Description and terms for the Netflix Cloud Prize, which runs from March-September 2013. Read the rules, fork to your GitHub account to create a Submission, then send us your email address.

rohan100jain/community 0

Stores documents used by the TensorFlow developer community

rohan100jain/Snowcleaning 0

Snow Cleaning, TopCoder Marathon Match 79

rohan100jain/tensorboard 0

TensorFlow's Visualization Toolkit

rohan100jain/tensorflow 0

An Open Source Machine Learning Framework for Everyone

Pull request review comment on tensorflow/community

RFC: tf.data Snapshot

# tf.data Snapshot

| Status        | Proposed                                                          |
| :------------ | :----------------------------------------------------------------- |
| **RFC #**     | [193](https://github.com/tensorflow/community/pull/193)            |
| **Author(s)** | Frank Chen (frankchn@google.com), Rohan Jain (rohanj@google.com)   |
| **Sponsor**   | Jiri Simsa (jsimsa@google.com)                                     |
| **Updated**   | 2020-01-07                                                         |

## Objective

With ever faster accelerators available in Cloud and hyperparameter tuning
consuming larger chunks of accelerator time, TensorFlow users are increasingly
finding that they don't have enough CPU resources to keep up with these
accelerators, leaving valuable accelerator resources idle.

To alleviate this problem, we are proposing a `snapshot` API within `tf.data`,
to allow users to transparently persist the output of their preprocessing
pipeline to disk, and materialize the pre-processed data on a different
training run.

This API enables repeated preprocessing steps to be consolidated, allowing
re-use of already processed data and trading off disk storage and network
bandwidth to free up more valuable CPU resources and accelerator compute time.

## Motivation

Large TensorFlow users have indicated that they have complicated input
processing pipelines which saturate their CPUs before saturating their
accelerators (TPUs in particular). Since they often experiment with
hyperparameter tuning or tweaks to an existing model without affecting their
input pipeline, they are asking for ways to avoid repeated preprocessing of
data by either saving a dataset or caching it to disk.

## User Benefit

Users will be able to transparently persist partially or fully processed data
from `tf.data` input pipelines to disk or Cloud storage systems, and
materialize the pre-processed data during subsequent runs of the same
pipeline. This will cut down on input pipeline processing overheads during
second and subsequent runs.

## Design Proposal

We propose adding a new `snapshot` transformation to tf.data. To illustrate
the usage of the transformation, we can start with some sample code:

```python
dataset = Dataset.list_files("/raw/data/*").shard(num_workers, i)
dataset = dataset.parallel_interleave(TFRecordDataset)
dataset = dataset.map(my_preprocessing_fn)
dataset = dataset.apply(tf.data.snapshot("/saved/data", options...))
dataset = dataset.repeat()

model = ...
model.fit(dataset)
```

As we can see, the end user simply has to add this transformation in order to
use this functionality. In essence, the transformation is similar to the
existing `tf.data.Dataset.cache`, with the key difference being that, unlike
`cache`, `snapshot` is intended to be re-used across different executions of
the same input pipeline.

### Proposed API

We are proposing the following API for the snapshot transformation.

```python
def snapshot(path,
             compression=None,
             reader_path_prefix=None,
             writer_path_prefix=None,
             shard_size_bytes=None,
             pending_snapshot_expiry_seconds=None,
             num_reader_threads=None,
             reader_buffer_size=None,
             num_writer_threads=None,
             writer_buffer_size=None,
             shuffle_on_read=None,
             shuffle_seed=None,
             mode=None,
             snapshot_name=None):
  pass  # Implementation goes here.
```

1.  `path`: Required. A directory where we want to save our snapshots and/or
    read from a previously saved snapshot.

2.  `compression`: Optional. The type of compression to apply to the snapshot
    written to disk. This will support `GZIP`, `SNAPPY` or None. Defaults to
    None.

3.  `reader_path_prefix`: Optional. A prefix to add to the path when reading
    from snapshots. This is useful for filesystems where configuration is
    passed in through the path. Defaults to None.

4.  `writer_path_prefix`: Optional. A prefix to add to the path when writing
    to snapshots. This is useful for filesystems where configuration is passed
    in through the path. Defaults to None.

5.  `shard_size_bytes`: Optional. The maximum size of each data file to be
    written by the snapshot dataset op. Defaults to 10 GiB.

6.  `pending_snapshot_expiry_seconds`: Optional. How long to wait (in seconds)
    before the snapshot op considers a previously unfinished snapshot to be
    stale and starts writing a snapshot from scratch again. Defaults to 86400
    seconds (1 day).

7.  `num_reader_threads`: Optional. Number of threads to parallelize reading
    from the snapshot. Especially useful if compression is turned on, since
    the decompression operation tends to be intensive. Defaults to 1. If > 1,
    this might introduce non-determinism, i.e., the order in which elements
    are read from the snapshot may differ from the order in which they were
    written.

8.  `reader_buffer_size`: Optional. Maximum number of elements we can prefetch
    when reading from the snapshot. Defaults to 1. Increasing this might
    improve performance but will increase memory consumption.

9.  `num_writer_threads`: Optional. Number of threads to parallelize writing
    to the snapshot. We'll open up `num_writer_threads` files and write to
    them in parallel. Especially useful if compression is turned on, since the
    compression operation tends to be intensive. Defaults to 1. If > 1, this
    might introduce non-determinism, i.e., the order in which elements are
    read from the upstream iterator may differ from the order in which they
    are written.

10. `writer_buffer_size`: Optional. Maximum number of pipeline elements to
    fill up the buffer before writing them out using `num_writer_threads`.

11. `shuffle_on_read`: Optional. If this is True, then the order in which
    examples are produced when reading from a snapshot will be random.
    Defaults to False.

12. `shuffle_seed`: Optional. If `shuffle_seed` is set, the random number
    generator used for shuffling (when `shuffle_on_read` is turned on) is
    seeded by the given seed. Otherwise, it is seeded by a random seed that
    differs for every run.

13. `mode`: Optional. The mode in which snapshot should operate. Valid options
    are `auto`, `read`, `write`, and `passthrough`. The default mode is
    `auto`, where the snapshot op will automatically determine what mode to
    operate in.

    1.  `write` mode forces the snapshot transformation to write a new
        materialization to disk, regardless of whether a complete and valid
        materialization currently exists. In other words, we enter the
        **WRITE** state immediately.

    2.  `read` mode forces the snapshot transformation to read from the latest
        version of the materialization on disk, regardless of whether the data
        stored on disk is complete and valid. In other words, we enter the
        **READ** state immediately.

    3.  `passthrough` mode turns the snapshot transformation into a no-op. In
        other words, we enter the **PASSTHROUGH** state immediately.

    4.  `auto` retains the default behavior of snapshot. See the "Standard
        Kernel Workflow" section for the default behavior.

14. `snapshot_name`: Optional. If set, use the supplied string as a named
    snapshot name instead of introspecting the data pipeline and automatically
    generating a unique identifier for the specific data pipeline.

    1.  Instead of generating a new fingerprint of the input processing graph
        and a `run_id` (see the _Detailed Design_ section for details), we
        will use the `snapshot_name` to uniquely identify the snapshot.

### External API Guarantees

Externally, we guarantee that snapshots written by a particular version of
TensorFlow will be readable by that specific version of TensorFlow.
Eventually, we can also guarantee that snapshots written will be readable by
all future versions of TensorFlow.

We are not currently handling the case where workers do not go through the
entire training set at least once.

### Alternatives Considered

An alternative proposal for the API would be `save()` and `load()`, where the
saving and loading of the input pipeline would be made more explicit, avoiding
some of the logic needed to determine whether to snapshot or to read from an
existing snapshot.

The downside is that the user would have to split the preprocessing and
training into potentially different files, and users would be forced to select
whether to train or preprocess on their own, which is not a good experience.

### Performance Implications

*   Do you expect any (speed / memory)? How will you confirm?
*   There should be microbenchmarks. Are there?
*   There should be end-to-end tests and benchmarks. If there are not (since
    this is still a design), how will you track that these will be created?

### Dependencies

No new dependencies will be introduced to TensorFlow as part of this project.
Dependent projects may be able to use this additional op, but there should be
no significant changes otherwise.

### Engineering Impact

Binary size increases slightly with the inclusion of this new op, and this
code will be maintained by the `tf.data` team.

### Platforms and Environments

This op will work on all TensorFlow-supported platforms. We do not anticipate
this working on embedded systems, as it is not useful in resource-constrained
environments.

### Best Practices, Tutorials and Examples

A user guide for snapshot will be published to guide new users in using this
feature.

### Compatibility

This introduces a new op, which will impact future backwards compatibility.

### User Impact

A new Python function and a new op are the only user-facing changes visible.

## Detailed Design

### Implementation Assumptions

The implementation is based on the following assumptions, which define the MVP
this is designed for:

1.  We assume that, at least for one pipeline run, you can go through the
    entire training dataset and store that data on disk. Otherwise, a snapshot
    will never get created.

2.  In case there are multiple workers and the dataset is sharded across
    workers, we assume that the number of workers remains the same from one
    run to another. If the number changes, we'll trigger another snapshot.

3.  Any `repeat`s in the dataset should be moved to after the `snapshot` op,
    to avoid writing large (or infinite) amounts of data during a snapshot
    writing run.

### New `SnapshotDatasetOp`

To implement the transformation, we are introducing a new `SnapshotDatasetOp`
dataset kernel that will implement all of the functionality in TensorFlow C++.
Python code is mostly glue code to pass the relevant parameters into the op
kernel.

### Internal Directory / File Structure

Given a user directory path (e.g. `/path/to/snapshot`), the directory will
look like:

*   /path/to/snapshot
    *   `fingerprint`/
        *   snapshot.metadata
        *   `run-id`/
            *   0000000.snapshot
            *   0000001.snapshot

The `fingerprint` is a hash of the input processing graph. The `run-id` is a
unique training run ID that is generated.

@karmel While I think unifying the Bundle Reader / Writer file format with what we're proposing is a worthy objective, I don't want to tie the two together this early. Because the workloads are quite different, it is still early days for knowing exactly what options and functionality we'll have to build in to get maximum throughput (e.g. we've needed compression so that we don't saturate network bandwidth too quickly, different threading implementations, etc.). So my proposal would be to keep them separate for the moment, gain some experience tuning this workload, and then pull some of the learnings back into Bundle Reader / Writer.

frankchn

comment created about a month ago
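To make the `mode` and `snapshot_name` options described in the RFC excerpt above concrete, here is a minimal sketch of how a preprocessing job and a training job might share a named snapshot. It assumes the proposed (not yet shipped) `tf.data.snapshot` API; `my_preprocessing_fn` and the snapshot name `"exp42"` are placeholders.

```python
import tensorflow as tf

def make_dataset(mode):
  dataset = tf.data.Dataset.list_files("/raw/data/*")
  dataset = dataset.interleave(tf.data.TFRecordDataset)
  dataset = dataset.map(my_preprocessing_fn)  # placeholder preprocessing fn
  # `mode="write"` forces a fresh materialization; `mode="read"` reuses the
  # named snapshot without re-running the map step.
  return dataset.apply(
      tf.data.snapshot("/saved/data", mode=mode, snapshot_name="exp42"))

write_ds = make_dataset(mode="write")          # run once to materialize
read_ds = make_dataset(mode="read").repeat()   # subsequent training runs
```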

Pull request review comment on tensorflow/community

RFC: tf.data Snapshot

### Proposed API (revised)

```python
def snapshot(path,
             compression=None,
             shard_size_bytes=None,
             pending_snapshot_expiry_seconds=None,
             num_writer_threads=None,
             reader_fn=None,
             mode=None,
             snapshot_name=None):
  pass  # Implementation goes here.
```

1.  `path`: Required. A directory where we want to save our snapshots and/or
    read from a previously saved snapshot.

2.  `compression`: Optional. The type of compression to apply to the snapshot
    written to disk. This will support `GZIP`, `SNAPPY` or None. Defaults to
    AUTO.

3.  `shard_size_bytes`: Optional. The maximum size of each data file to be
    written by the snapshot dataset op. Defaults to AUTO.

4.  `pending_snapshot_expiry_seconds`: Optional. How long to wait (in seconds)
    before the snapshot op considers a previously unfinished snapshot to be
    stale and starts writing a snapshot from scratch again. Defaults to 86400
    seconds (1 day).

5.  `num_writer_threads`: Optional. Number of threads to parallelize writing
    to the snapshot. We'll open up `num_writer_threads` files and write to
    them in parallel. Especially useful if compression is turned on, since the
    compression operation tends to be intensive. If > 1, this might introduce
    non-determinism, i.e., the order in which elements are read from the
    upstream iterator may differ from the order in which they are written.
    Defaults to AUTO.

6.  `reader_fn`: Optional. A user-provided reader function to use when reading
    the snapshot back. This allows the user to specify the concurrency and
    randomization required when reading from the snapshot.

    `reader_fn` should be a function that accepts two arguments: (1) a list of
    snapshot file paths, and (2) a reference to a `SnapshotDataset` class.
    The function should return a `Dataset` class.

    The `SnapshotDataset` class is a `Dataset` (similar to other source datasets

Should this be named SnapshotReaderDataset instead?

frankchn

comment created about a month ago
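For reference, a minimal sketch of what a user-supplied `reader_fn` could look like under the two-argument contract quoted above (a list of snapshot file paths plus the snapshot source-dataset class, whatever it ends up being named); the cycle length and shuffle buffer size are arbitrary illustrative values:

```python
import tensorflow as tf

def reader_fn(snapshot_files, snapshot_dataset_cls):
  # Shuffle shard order so each epoch visits the snapshot files differently.
  files = tf.data.Dataset.from_tensor_slices(snapshot_files).shuffle(16)
  # Open several shards concurrently and interleave their elements.
  return files.interleave(
      snapshot_dataset_cls,
      cycle_length=4,
      num_parallel_calls=tf.data.experimental.AUTOTUNE)
```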

Pull request review comment on tensorflow/community

RFC: tf.data Snapshot

### Alternatives Considered

An alternative proposal for the API would be `save()` and `load()`, where the
saving and loading of the input pipeline would be made more explicit, avoiding
some of the logic needed to determine whether to snapshot or to read from an
existing snapshot.

The downside is that the user would have to split the preprocessing and
training into potentially different files, and users would be forced to select
whether to train or preprocess on their own, which is not a good experience.

During earlier iterations of developing this, we considered save / load style APIs, but making that usable is quite challenging (i.e., figuring out whether to save a new snapshot or to load an existing one). The current API serves a very concrete use case (for some significant internal users), and we feel it makes sense to expose this API to serve that use case.

Your concerns about `shuffle_on_read` etc. are valid, and we'll address them by allowing users to specify a `reader_fn`.

frankchn

comment created about a month ago
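For comparison, the rejected `save()` / `load()` alternative would have looked roughly like the sketch below; `save_dataset`, `load_dataset`, and `make_preprocessing_pipeline` are hypothetical names, and the point is only that preprocessing and training end up as separate programs:

```python
# preprocess.py -- run once; writes the materialized dataset to disk.
dataset = make_preprocessing_pipeline()   # hypothetical helper
save_dataset(dataset, "/saved/data")      # hypothetical save() API

# train.py -- each training run loads the materialized data back.
dataset = load_dataset("/saved/data")     # hypothetical load() API
model.fit(dataset.repeat())
```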

Pull request review comment on tensorflow/community

RFC: tf.data Snapshot

### Internal Directory / File Structure

Given a user directory path (e.g. `/path/to/snapshot`), the directory will
look like:

*   /path/to/snapshot
    *   `fingerprint`/
        *   snapshot.metadata
        *   `run-id`/
            *   0000000.snapshot
            *   0000001.snapshot

The `fingerprint` is a hash of the input processing graph. The `run-id` is a
unique training run ID that is generated.

That's a good point, Karmel, and we considered this while first developing this feature. Here is why we ended up choosing something of our own:

The checkpointing API is for saving the state of the model, whereas here we're trying to store the entire output of an input pipeline, so API-wise there isn't much alignment.

On the implementation side, checkpointing uses BundleReader / BundleWriter, which are designed for key / value access to tensors, whereas here we want to read tensors back sequentially as quickly as possible. We tried BundleReader / BundleWriter at first but abandoned it for performance reasons.

frankchn

comment created about a month ago

Pull request review comment on tensorflow/community

RFC: tf.data Snapshot

### Standard Kernel Workflow

_Note: This is an implementation detail, and may change in the future. This
should not be relied upon except as a reference to the current implementation._

By default, the `snapshot` operation will, upon startup, make a determination
using the following algorithm as to whether the operation should be in the
WRITE, PASSTHROUGH, or READ state.

1.  We will compute a graph fingerprint containing all the information from
    the Dataset preprocessing graph before the `snapshot` op. We'll use the
    `AsGraphDefInternal` method on DatasetBase for this.

2.  We will attempt to enter the corresponding fingerprint directory. For
    instance, if the computed fingerprint is `f-abc123` and the base snapshot
    directory is `/saved/data`, then we will attempt to enter
    `/saved/data/f-abc123`.

3.  If the snapshot directory is non-existent, empty, or it doesn't contain a
    `metadata` file, we will enter the **WRITE** state.

4.  If the snapshot directory contains a `metadata` file, we will read the
    metadata file. The metadata file contains the following fields:
    1.  A training run ID
    2.  A boolean indicating if the snapshot is complete
    3.  A training run start-time.

5.  If the training run start-time is more than the (configurable) training
    run timeout (set with the `pending_snapshot_expiry_seconds` parameter), we
    will enter the **WRITE** state.

6.  If the training run start-time is less than the training run timeout, but
    the snapshot is not complete, then we will enter the **PASSTHROUGH**
    state.

7.  If the snapshot is complete, we will enter the **READ** state.

#### WRITE State

1.  We generate a random training run ID.

2.  We write (possibly overwriting) the `snapshot.metadata` file.

3.  We proceed to create a subdirectory containing the training run ID, and
    start writing data asynchronously in chunks.

4.  At the end of the dataset (when `end_of_sequence == true`), we will check
    the snapshot.metadata file to determine whether it contains the same
    training run ID.

    1.  If it does, we set the complete bit to true to finalize the directory.
    2.  If it does not, it means that someone else is concurrently writing the
        snapshot and we lost the race to them. We delete all data in the
        training run directory.

For the current implementation, we will store the data in chunked TFRecord
files. Eventually we may move to other higher-performance data stores or

I guess Gus is referring to allowing users to customize the written data format etc. via some kind of sink configuration. As of now, we don't plan to allow that level of customization, because it's useful to have our own reader / writer that is optimized for performance. It's definitely possible that in the future we'll let users customize this, but I don't think that will happen in the near future. Will keep it in mind, though.

frankchn

comment created about a month ago
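The mode-determination algorithm quoted above ("Standard Kernel Workflow") can be summarized with a small Python sketch; the real logic lives in the C++ kernel, and `read_metadata()` is a hypothetical helper that parses `snapshot.metadata`:

```python
import os
import time

def determine_mode(snapshot_dir, fingerprint, expiry_seconds=86400):
  metadata_file = os.path.join(snapshot_dir, fingerprint, "snapshot.metadata")
  if not os.path.exists(metadata_file):
    return "WRITE"        # no metadata file: write a new snapshot
  metadata = read_metadata(metadata_file)  # hypothetical metadata parser
  if time.time() - metadata.start_time > expiry_seconds:
    return "WRITE"        # previous attempt looks stale: start over
  if not metadata.complete:
    return "PASSTHROUGH"  # another run is still writing: don't read or write
  return "READ"           # a complete snapshot exists: read it back
```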

Pull request review comment on tensorflow/tensorflow

Avoid doing reset when position is still in buffer in BufferedInputStream.

 Status BufferedInputStream::Seek(int64 position) {
                                   position);
   }
 
-  // Position of the buffer within file.
-  const int64 bufpos = Tell();
-  if (position < bufpos) {
-    // Reset input stream and skip 'position' bytes.
+  // Position of the buffer's lower limit within file.
+  const int64 buf_lower_limit = input_stream_->Tell() - limit_;
+  const int64 buf_pos = Tell();
+
+  if (position < buf_lower_limit) {
+    // Seek before buffer, reset input stream and skip 'position' bytes.
     TF_RETURN_IF_ERROR(Reset());
     return SkipNBytes(position);
   }
-
-  return SkipNBytes(position - bufpos);
+  else if (position < buf_pos) {
+    // Seek within buffer before 'pos_'
+    pos_ -= buf_pos - position;
+    return Status::OK();
+  }
+  else {

else isn't needed.

burgerkingeater

comment created about a month ago

Pull request review comment on tensorflow/tensorflow

Avoid doing reset when position is still in buffer in BufferedInputStream.

 Status BufferedInputStream::Seek(int64 position) {
                                   position);
   }
 
-  // Position of the buffer within file.
-  const int64 bufpos = Tell();
-  if (position < bufpos) {
-    // Reset input stream and skip 'position' bytes.
+  // Position of the buffer's lower limit within file.
+  const int64 buf_lower_limit = input_stream_->Tell() - limit_;
+  const int64 buf_pos = Tell();
+
+  if (position < buf_lower_limit) {
+    // Seek before buffer, reset input stream and skip 'position' bytes.
     TF_RETURN_IF_ERROR(Reset());
     return SkipNBytes(position);
   }
-
-  return SkipNBytes(position - bufpos);
+  else if (position < buf_pos) {

Since there is a `return` above, you don't need the `else` here.

burgerkingeater

comment created about a month ago

Pull request review comment on tensorflow/tensorflow

Avoid doing reset when position is still in buffer in BufferedInputStream.

 Status BufferedInputStream::Seek(int64 position) {
                                   position);
   }
 
-  // Position of the buffer within file.
-  const int64 bufpos = Tell();
-  if (position < bufpos) {
-    // Reset input stream and skip 'position' bytes.
+  // Position of the buffer's lower limit within file.
+  const int64 buf_lower_limit = input_stream_->Tell() - limit_;

I don't think `Tell() - limit_` is the right way to get the lower bound. I think it should be `Tell() - pos_`.

burgerkingeater

comment created about a month ago

Pull request review comment on tensorflow/community

RFC: tf.data Snapshot

```python
def snapshot(path,
             compression=None,
             reader_path_prefix=None,
             writer_path_prefix=None,
             shard_size_bytes=None,
             pending_snapshot_expiry_seconds=None,
             num_reader_threads=None,
             reader_buffer_size=None,
             num_writer_threads=None,
             writer_buffer_size=None,
             shuffle_on_read=None,
             shuffle_seed=None,
             mode=None,
             snapshot_name=None):
```

Would AUTO as the default help here? Based on our experience, I think we can come up with reasonable defaults, reducing the cognitive overhead users would otherwise face in figuring out what to set.

We can also remove the buffer_size option and just set it to 2 * num_threads.
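For illustration, here is a minimal sketch of how the defaults could be resolved if we went that route; the `AUTO` sentinel and the helper name are hypothetical, not part of the proposal:

```python
AUTO = -1  # hypothetical sentinel: "let the implementation pick"

def resolve_snapshot_threads(num_reader_threads=AUTO, num_writer_threads=AUTO):
  """Sketch: pick thread counts, then derive buffer sizes instead of exposing them."""
  if num_reader_threads == AUTO:
    num_reader_threads = 1  # or whatever value experience suggests
  if num_writer_threads == AUTO:
    num_writer_threads = 1
  # Buffer sizes stop being user-facing knobs and are derived instead.
  reader_buffer_size = 2 * num_reader_threads
  writer_buffer_size = 2 * num_writer_threads
  return (num_reader_threads, reader_buffer_size,
          num_writer_threads, writer_buffer_size)
```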

frankchn

comment created time in a month

Pull request review commenttensorflow/community

RFC: tf.data Snapshot

[Quoted RFC diff context collapsed. The excerpt under discussion is step 1 of the "Standard Kernel Workflow": computing a graph fingerprint from the Dataset preprocessing graph before the `snapshot` op, using the `AsGraphDefInternal` method on `DatasetBase`.]

tf.data always builds a dataset graph for execution, whether it's eager or not, so it doesn't matter much whether we're in eager mode. [To be precise: the iterator gets created eagerly and takes in a dataset variant that contains the entire dataset graph to be executed, which is what we fingerprint.]
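As a rough illustration of the fingerprinting idea (not the kernel code), one could hash the serialized dataset graph from Python; `_as_serialized_graph()` is a private tf.data method whose behavior may differ across TF versions, so treat this purely as a sketch:

```python
import hashlib

import tensorflow as tf

def dataset_fingerprint(dataset):
  # The dataset variant carries the entire upstream graph, so this works the
  # same way whether the pipeline was built eagerly or inside a tf.function.
  graph_def_bytes = dataset._as_serialized_graph().numpy()  # private API, illustrative only
  return hashlib.sha256(graph_def_bytes).hexdigest()

ds = tf.data.Dataset.range(10).map(lambda x: x * 2)
print(dataset_fingerprint(ds))
```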

frankchn

comment created time in a month

Pull request review commenttensorflow/community

RFC: tf.data Snapshot

[Quoted RFC diff context collapsed. The excerpt under discussion is the `reader_path_prefix` / `writer_path_prefix` options, described as prefixes added to the snapshot path for filesystems where configuration is passed in through the path.]

Right now the use cases are primarily internal (readahead etc.), and I couldn't find any public filesystems that pass configuration through prefixes. Should we consider an internal version of the op instead, then?
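Mechanically, the option just prepends a string to the snapshot path when opening files; a toy example with a made-up `readahead://` scheme:

```python
path = "/saved/data"
reader_path_prefix = "readahead://"  # hypothetical configuration-carrying prefix
effective_read_path = reader_path_prefix + path
# effective_read_path == "readahead:///saved/data"; only reads are affected,
# writes would use writer_path_prefix (or plain `path`) the same way.
```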

frankchn

comment created time in a month

Pull request review commenttensorflow/community

RFC: tf.data Snapshot

[Quoted RFC diff context collapsed. The excerpt under discussion is the proposed `snapshot()` API signature and its set of keyword arguments.]

Agreed! How about a SnapshotConfig / Options object? Or did you have something else in mind?
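A minimal sketch of what such an options object could look like, just to make the shape of the idea concrete (all names and defaults here are illustrative, not part of the RFC):

```python
import dataclasses
from typing import Optional

@dataclasses.dataclass
class SnapshotConfig:
  """Hypothetical container for the less common snapshot tuning knobs."""
  compression: Optional[str] = None        # None, "GZIP", or "SNAPPY"
  shard_size_bytes: int = 10 * 1024 ** 3   # 10 GiB
  num_reader_threads: int = 1
  num_writer_threads: int = 1
  shuffle_on_read: bool = False
  shuffle_seed: Optional[int] = None
  mode: str = "auto"                       # "auto", "read", "write", or "passthrough"

# Usage sketch: the common case stays a one-liner, tuning moves into the config.
# dataset = dataset.apply(
#     tf.data.snapshot("/saved/data", config=SnapshotConfig(compression="GZIP")))
```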

frankchn

comment created time in a month

Pull request review commenttensorflow/community

RFC: tf.data Snapshot

[Quoted RFC diff context collapsed. The excerpt under discussion is the "Alternatives Considered" section, which describes an explicit `save()` / `load()` API and its downside of forcing users to split preprocessing and training apart and decide between them manually.]

The issue with that is that the logic determining whether we want to load the input pipeline or not is not as simple as checking whether a file exists (that's the control aspect of the op). Following your train of thought, I think the better alternative would be splitting it up into separate reader, writer, and control ops (which, as I indicated above, should be the implementation for the next iteration of the op).
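A hand-wavy sketch of that split from the Python side, with `snapshot_write` / `snapshot_read` as hypothetical placeholder ops and the control decision living outside them:

```python
import os

import tensorflow as tf

def snapshot_write(dataset, path):
  """Hypothetical writer op: persist `dataset` under `path` while yielding its elements."""
  raise NotImplementedError("illustrative placeholder")

def snapshot_read(path):
  """Hypothetical reader op: materialize a previously written snapshot from `path`."""
  raise NotImplementedError("illustrative placeholder")

def snapshot_auto(dataset, path):
  """The control piece: decide whether to read an existing snapshot or write a new one."""
  # Simplified check; the RFC's real layout nests metadata under a fingerprint directory.
  metadata_file = os.path.join(path, "snapshot.metadata")
  if tf.io.gfile.exists(metadata_file):
    return snapshot_read(path)
  return snapshot_write(dataset, path)
```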

frankchn

comment created time in a month

Pull request review commenttensorflow/community

RFC: tf.data Snapshot

[Quoted RFC diff context collapsed. The excerpt under discussion is the "User Benefit" section: users will be able to transparently persist partially or fully processed data from `tf.data` input pipelines.]

Right now this proposal is two things in one:

  1. A mechanism for serializing / deserializing tf.data datasets to disk [the reading and writing parts of the snapshot op].
  2. A control mechanism that decides when to write and when to read.

One of the first open questions we have is whether to separate these two out into different ops when it comes to implementation. My personal opinion is that the current API is quite useful as it is, and that in the next iteration of the op we should split it up into smaller ops, which would then let us serialize a tf.data pipeline and deserialize it elsewhere (i.e. implement our own control).
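To make the control half (2) concrete on its own, here is a toy version of the state decision, using the WRITE / READ / PASSTHROUGH states from the RFC; the real kernel considers more cases (e.g. expired pending snapshots), so this is illustrative only:

```python
import enum

class SnapshotState(enum.Enum):
  WRITE = "write"
  READ = "read"
  PASSTHROUGH = "passthrough"

def decide_state(has_complete_snapshot, another_writer_active):
  """Toy control decision, not the actual kernel logic."""
  if has_complete_snapshot:
    return SnapshotState.READ          # reuse the existing materialization
  if another_writer_active:
    return SnapshotState.PASSTHROUGH   # someone else is writing; don't duplicate work
  return SnapshotState.WRITE           # no snapshot yet, so produce one
```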

frankchn

comment created time in a month

Pull request review commenttensorflow/community

RFC: tf.data Snapshot

[Quoted RFC diff context collapsed. The excerpt under discussion is the sample pipeline line `dataset = dataset.apply(tf.data.snapshot("/saved/data", options...))`.]

I think right now .apply(...) is the only way to expose an experimental tf.data transformation. So the plan would be to expose this via apply() first, and then in a few months we can promote it to a dataset.snapshot() API.
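Concretely, the user-visible difference would be small; the promoted method form below is hypothetical, and neither spelling exists in a released TF at the time of writing, so both calls are shown commented out:

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)

# Now: attach the experimental transformation via Dataset.apply().
# dataset = dataset.apply(tf.data.snapshot("/saved/data"))

# Later (hypothetical promotion to a core method):
# dataset = dataset.snapshot("/saved/data")
```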

frankchn

comment created time in a month

issue closedtensorflow/tensorflow

(more) spurious deprecation warnings

Similar to #27897

System information

  • OS Platform and Distribution Linux-4.14.79+-x86_64-with-Ubuntu-18.04-bionic
  • TensorFlow installed from: pip install tensorflow==2.0.0-beta1
  • TensorFlow version: v2.0.0-beta0-16-g1d91213fe7 2.0.0-beta1
  • Python version: 3.6.8

(This is happening in colab.sandbox.google.com)

Describe the current behavior

When using new APIs that replaced old APIs, you get deprecation warnings as if you were still using the old API.

Describe the expected behavior

If I use the new APIs, I should not get deprecation warnings.

Code to reproduce the issue

from __future__ import absolute_import, division, print_function, unicode_literals
import functools

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds


TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"

train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
LABEL_COLUMN = 'survived'
LABELS = [0, 1]

def get_dataset(file_path):
  dataset = tf.data.experimental.make_csv_dataset(
      file_path,
      batch_size=12, # Artificially small to make examples easier to show.
      label_name=LABEL_COLUMN,
      na_value="?",
      num_epochs=1,
      ignore_errors=True)
  return dataset

raw_train_data = get_dataset(train_file_path)

"""
-------OUTPUT------------
WARNING: Logging before flag parsing goes to stderr.
W0711 17:34:31.453707 140627566475136 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/data/experimental/ops/readers.py:498: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_determinstic`.
------END OUTPUT-------
"""

CATEGORIES = {
    'sex': ['male', 'female'],
    'class' : ['First', 'Second', 'Third'],
    'deck' : ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
    'embark_town' : ['Cherbourg', 'Southhampton', 'Queenstown'],
    'alone' : ['y', 'n']
}

categorical_columns = []
for feature, vocab in CATEGORIES.items():
  cat_col = tf.feature_column.categorical_column_with_vocabulary_list(
        key=feature, vocabulary_list=vocab)
  categorical_columns.append(tf.feature_column.indicator_column(cat_col))

MEANS = {
    'age' : 29.631308,
    'n_siblings_spouses' : 0.545455,
    'parch' : 0.379585,
    'fare' : 34.385399
}

def process_continuous_data(mean, data):
  # Normalize data
  data = tf.cast(data, tf.float32) * 1/(2*mean)
  return tf.reshape(data, [-1, 1])

numerical_columns = []

for feature in MEANS.keys():
  num_col = tf.feature_column.numeric_column(feature, normalizer_fn=functools.partial(process_continuous_data, MEANS[feature]))
  numerical_columns.append(num_col)

preprocessing_layer = tf.keras.layers.DenseFeatures(categorical_columns+numerical_columns)


def get_model(hidden_units=[100, 100]):

  model = tf.keras.Sequential([preprocessing_layer])
  for units in hidden_units:
    model.add(tf.keras.layers.Dense(units, activation='relu'))
 
  return model


train_data = raw_train_data.shuffle(500)

model = get_model()
model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy'])

model.fit(train_data, epochs=20)

"""
-------OUTPUT------

Epoch 1/20
W0711 17:34:32.313002 140627566475136 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/feature_column/feature_column_v2.py:2655: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0711 17:34:32.347570 140627566475136 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/feature_column/feature_column_v2.py:4215: IndicatorColumn._variable_shape (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
W0711 17:34:32.350716 140627566475136 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/feature_column/feature_column_v2.py:4270: VocabularyListCategoricalColumn._num_buckets (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
"""

closed time in 2 months

adammichaelwood

issue commenttensorflow/tensorflow

(more) spurious deprecation warnings

https://github.com/tensorflow/tensorflow/commit/f70c46f8cd2a91b390455827cda65b5b5fe92ef0 should fix the spurious warnings. Closing bug now. Please re-open / open a new one if you still see warnings.

adammichaelwood

comment created time in 2 months

issue closedtensorflow/serving

TFServing Compatibility issue with the TF2.0 model using feature columns

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): MacOS Mojave 10.14.6
  • TensorFlow Serving installed from (source or binary): Use docker image: docker pull tensorflow/serving
  • TensorFlow Serving version: 1.14.0

Describe the problem

We use the same model definition, run it with TensorFlow 1.14.0 and TensorFlow 2.0.0, and get two models. We host these two models with the same tensorflow/serving:latest Docker image and send the same request. The expectation is that the responses from the two models should be the same. In practice, the model from TF 1.14 is fine, but the model from TF 2.0.0 returns an error.

Exact Steps to Reproduce

  1. docker pull tensorflow/serving
  2. Download adult.data from https://archive.ics.uci.edu/ml/machine-learning-databases/adult/
  3. Run the census_keras.py as follows using tf 1.14.0 and then get the SavedModel. Run docker run -t --rm -p 8501:8501 -v "$(pwd)/saved_models/census_keras/:/models/census_keras" -e MODEL_NAME=census_keras tensorflow/serving & to host this model using TF-Serving
  4. curl -d '{"instances": [{"age": 16, "education": "Masters"}]}' -X POST http://localhost:8501/v1/models/census_keras:predict. We can get the correct response.
  5. Run docker kill xxx to kill the launched image
  6. Run the census_keras.py using tf 2.0.0 and the same docker command to host the model from tf2.0
  7. Run the same curl command to test the service, and then returns the error message: { "error": "indices = 1 is not in [0, 1)\n\t [[{{node StatefulPartitionedCall/StatefulPartitionedCall/sequential/dense_features/education_embedding/education_embedding_weights/GatherV2}}]]" }

Source code / logs

census_data.py

import tensorflow as tf 
import pandas as pd 
from sklearn.model_selection import train_test_split

__COLUMN_NAMES = ['age', 'workclass','fnlwgt', 'education', 'education_num', 'marital_status',
                'occupation', 'relationship', 'race', 'gender', 'capital_gain',
                'capital_loss', 'hours_per_week', 'native_country', 'income_bracket']

__SELECT_COLUMN_NAMES = ['age', 'education', 'income_bracket']

def get_train_test_pandas_data():
    census = pd.read_csv("./data/adult.data", header=None, skipinitialspace=True)
    census.columns = __COLUMN_NAMES

    census['income_bracket'] = census['income_bracket'].apply(lambda label: 0 if label==' <=50K' else 1)
    census = census[__SELECT_COLUMN_NAMES]

    y_labels = census.pop('income_bracket')
    x_data = census

    x_train, x_test, y_train, y_test = train_test_split(x_data, y_labels, test_size=0.3)

    return x_train, x_test, y_train, y_test

def get_feature_columns():
    age = tf.feature_column.numeric_column("age", dtype=tf.int64)
    education = tf.feature_column.embedding_column(
        tf.feature_column.categorical_column_with_hash_bucket("education", hash_bucket_size=1000),
        dimension=100)

    feat_cols = [age, education]

    return feat_cols

census_keras.py

import tensorflow as tf 
import census_data
import pathlib
import time

if (tf.__version__ < '2.0'):
    tf.enable_eager_execution()

x_train, _, y_train, _ = census_data.get_train_test_pandas_data()

dataset = tf.data.Dataset.from_tensor_slices((dict(x_train), y_train))

dataset = dataset.shuffle(len(x_train)).batch(4)

feat_cols = census_data.get_feature_columns()

model = tf.keras.Sequential([
    tf.keras.layers.DenseFeatures(feature_columns=feat_cols),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
    ])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.fit(dataset, epochs=1)

__SAVED_MODEL_DIR = './saved_models/census_keras/{}'.format(int(time.time()))
pathlib.Path(__SAVED_MODEL_DIR).mkdir(parents=True, exist_ok=True)

tf.saved_model.save(model, export_dir=__SAVED_MODEL_DIR)

closed time in 2 months

brightcoder01

issue commenttensorflow/serving

TFServing Compatibility issue with the TF2.0 model using feature columns

I'm closing the issue now, then. Please reopen or file another issue if more problems appear with TF 2.0.

brightcoder01

comment created time in 2 months

issue commenttensorflow/serving

TFServing Compatibility issue with the TF2.0 model using feature columns

For the latest reproduction, the one with the table initialization issue: could you specify which version of TF you were using? I ran it with the latest version (2.0.0) and it seems to run fine.

brightcoder01

comment created time in 3 months
