profile
viewpoint
Jiri Simsa jsimsa Google, Inc. California, USA

feihugis/tensorflow 0

Computation using data flow graphs for scalable machine learning

jsimsa/alluxio 0

Memory-Centric Virtual Distributed Storage System

jsimsa/community 0

Stores documents used by the TensorFlow developer community

jsimsa/doomtrooper 0

Scraper and data for Czech Doomtrooper

jsimsa/flink 0

Mirror of Apache Flink

jsimsa/incubator-zeppelin 0

Mirror of Apache Zeppelin (Incubating)

jsimsa/mesos 0

Mirror of Apache Mesos

jsimsa/spark 0

Mirror of Apache Spark

jsimsa/tensorflow 0

Computation using data flow graphs for scalable machine learning

jsimsa/thrift 0

Mirror of Apache Thrift

issue commenttensorflow/tensorflow

Hang on out of memory error

@aaudiber could you please take a look?

smatzek

comment created time in 5 hours

pull request commenttensorflow/tensorflow

tensor_bundle: Supply size hint to input buffer

Also, the new API should have unit tests.

byronyi

comment created time in a day

Pull request review commenttensorflow/tensorflow

tensor_bundle: Supply size hint to input buffer

 Status InputBuffer::Seek(int64 position) {   return Status::OK(); } +Status InputBuffer::Hint(int64 bytes_to_read) {+  if (bytes_to_read < 0) {+    return errors::InvalidArgument("Can't read a negative number of bytes: ",+                                   bytes_to_read);+  }++  // There are enough data present in the buffer. Do nothing.+  if (bytes_to_read + pos_ <= limit_) {+    return Status::OK();+  }++  // The input buffer is too small to accommodate the hint size.+  // There is nothing we can do; a relloc may be too heavy.+  if (bytes_to_read > size_) {+    return status::OK();

Status:OK()

byronyi

comment created time in a day

Pull request review commenttensorflow/tensorflow

tensor_bundle: Supply size hint to input buffer

 class InputBuffer {   // read will trigger a File::Read().   Status Seek(int64 position); +  // Hint to the total read size within the file before next Seek.

Replace with "// Provides a hint about future reads, which may improve their performance."

byronyi

comment created time in a day

Pull request review commenttensorflow/tensorflow

tensor_bundle: Supply size hint to input buffer

 class InputBuffer {   // read will trigger a File::Read().   Status Seek(int64 position); +  // Hint to the total read size within the file before next Seek.+  //+  // If there are enough data present in the buffer, do nothing.

Remove the second paragraph. The documentation should not include implementation details.

byronyi

comment created time in a day

pull request commenttensorflow/tensorflow

improve API doc page: tf.data.Dataset

Thanks @yashk2810, if that's the case I agree that this PR makes sense.

dothinking

comment created time in 2 days

pull request commenttensorflow/tensorflow

improve API doc page: tf.data.Dataset

@dothinking thank you for attempting to improve tf.data API documentation. Having said that, I agree with @fuhailin that this change is not necessary.

I quickly searched through all of tensorflow/python/data and found 9 occurrences of "Note" and 16 occurrences of "NOTE". So there does not seem to be a precedent and I am also not aware of official style guide for TensorFlow that would state to use "Note". If we wanted to be consistent, I'd prefer "NOTE".

dothinking

comment created time in 2 days

issue commenttensorflow/tensorflow

tensorflow1.12 hangs at MapAndBatchDatasetOp::Dataset::Iterator::RunnerThread

@yanqingshang I am sorry to hear that this has been an issue for you. Unfortunately, the team does not have resources to investigate issues in dated version of TensorFlow. My suggestion would be to updated to more recent version of TensorFlow. The latest version is TF 2.1 and if you require TF 1 compatibility, it would be TF 1.15.

yanqingshang

comment created time in 5 days

issue commenttensorflow/tensorflow

Dataset padded_batch fails with InvalidArgumentError

@edloper could you please take a look? As far as I can tell, this is related to boxing ragged tensors in a variant so that they can be batched/unbatched. Thanks.

stathva

comment created time in 7 days

issue closedtensorflow/tensorflow

Allow datasets to provide the number of examples they contain

System information

  • TensorFlow version (you are using): 2.1.0
  • Are you willing to contribute it (Yes/No): No

Describe the feature and the current behavior/state.

Currently there is no good way to get to the number of samples or batches contained by a dataset although the information is usually available.

What you can do: sum(1 for _ in dataset) but this might not do what one wants: When the dataset is batched it will return the number of batches including the trailing one. MultiWorkerMirroredStrategy can't handle that.

Usually this information is already available, see e.g. https://github.com/tensorflow/datasets/issues/1403

Will this change the current api? How?

Add a member num_examples and/or an overload for __len__

Who will benefit with this feature?

  • Everyone using MultiWorkerMirroredStrategy
  • Everyone using steps_per_epoch
  • TF itself as the number of samples/batches is known before executing the training loop avoid status reports like 10/Unknown
  • This would help to provide correct behavior in https://github.com/tensorflow/tensorflow/commit/6be131d0860559954c42685a87c63f16cebb2185#diff-f8dd40712ac721c1b363e1a1ec44c1a3R741-R747

Any Other info.

There is an experimental op cardinality which might be very related. However it often (always?) returns "Unknown". Tested with MNIST from TFDS.

closed time in 8 days

Flamefire

issue commenttensorflow/tensorflow

Error occurred when finalizing GeneratorDataset iterator

This warning is spurious and should be removed by https://github.com/tensorflow/tensorflow/commit/b6edd34c5858ab0ab4380da774e7e2fd81a92da0

olk

comment created time in 13 days

issue commenttensorflow/tensorflow

tf.data.Dataset unusable with steps_per_epoch standard training loop

Karmel, could you please triage this to someone on the Keras team. Thank you.

Flamefire

comment created time in 15 days

issue commenttensorflow/tensorflow

tf.data.Dataset unusable with steps_per_epoch standard training loop

This is a Keras documentation issue, not tf.data issue.

Flamefire

comment created time in 15 days

issue commenttensorflow/tensorflow

TFRecord guide doesn't show how to serialize and parse tensors

@MarkDaoust I don't think that tf.data.Snapshot will provide what is @MikeOfZen asking for. IIUC, he would like to have a simple API to "save" and "load" arbitrary structure of tensors.

@MikeOfZen a challenge that Python pickling does not need to solve but TensorFlow ser/de does is how to statically infer shapes and types so that a traced Python function can be executed as a dataflow computation. A simple "load(file)" operation does not provide any information into what the shape and type of the data stored in the file is which is needed to be known statically (i.e. reading from a file is not good enough) at the time a Python function with this operation is traced. So even if we had "save" and "load" as primitives, the "load" primitive would still likely need to specify a structure of tf.TypeSpecs that describe the shape and type of the loaded data for the sake of static shape inference.

The building blocks of "snapshot" could be used for implementing "save" and "load" which is something I can imagine doing as a follow up to "snapshot" and will take it into consideration in tf.data 2020 plans.

bionicles

comment created time in 15 days

issue commenttensorflow/tensorflow

TF 2.1 Keras: Cached datasets use 40% more memory than in TF 2.0

Thank you for the follow up, I am going to ask the Keras team to investigate.

@karmel this looks like Keras regression between TF 2.0 and TF 2.1.

lgeiger

comment created time in 21 days

issue commenttensorflow/tensorflow

TF 2.1 Keras: Cached datasets use 40% more memory than in TF 2.0

Thank you @lgeiger. What I was suggesting is that you copy the data to your local file system and then run your program caching data in memory. I am trying to rule out that this is not due to reading from GCS (and it will also be easier to reproduce if it can be run locally).

lgeiger

comment created time in 21 days

Pull request review commenttensorflow/tensorflow

Add block cache for low level table library

+// Copyright (c) 2011 The LevelDB Authors. All rights reserved.

this should be using the TensorFlow copyright

byronyi

comment created time in 22 days

Pull request review commenttensorflow/tensorflow

Add block cache for low level table library

+// Copyright (c) 2011 The LevelDB Authors. All rights reserved.

this should be using the TensorFlow copyright

byronyi

comment created time in 22 days

Pull request review commenttensorflow/tensorflow

Add block cache for low level table library

+// Copyright (c) 2011 The LevelDB Authors. All rights reserved.

this should be using the TensorFlow copyright

byronyi

comment created time in 22 days

issue commenttensorflow/tensorflow

tf.data.Dataset.from_generator converts input argument types implicitly instead of just forwarding

@mxmlnkn would you be able to check whether the same issue can be reproduced using py_function without tf.data. tf.data.Dataset.from_generator relies on py_function internally and I believe that this issue is inherent to `py_function.

mxmlnkn

comment created time in a month

issue commenttensorflow/tensorflow

TPUStrategy.make_dataset_iterator does not work for a Dataset created by tf.data.Dataset.from_generator

@tsc2017 would you be able to try if your program work with a nightly release of TF which should include the above fix?

tsc2017

comment created time in a month

issue commenttensorflow/tensorflow

TF 2.1 Keras: Cached datasets use 40% more memory than in TF 2.0

@lgeiger can you reproduce the issue when you store a copy of ImageNet locally?

lgeiger

comment created time in a month

issue commenttensorflow/tensorflow

TPUStrategy.make_dataset_iterator does not work for a Dataset created by tf.data.Dataset.from_generator

@lindong28 was your CL to fix the issue merged?

tsc2017

comment created time in a month

Pull request review commenttensorflow/community

RFC: tf.data Snapshot

+# tf.data Snapshot++| Status        | Proposed                                                |+| :------------ | :------------------------------------------------------ |+| **RFC #**     | [193](https://github.com/tensorflow/community/pull/193) |+| **Author(s)** | Frank Chen (frankchn@google.com), Rohan Jain            |+|               | (rohanj@google.com)                                     |+| **Sponsor**   | Jiri Simsa (jsimsa@google.com)                          |+| **Updated**   | 2020-01-07                                              |++## Objective++With ever faster accelerators available in Cloud and hyperparameter tuning+consuming larger chunks of accelerator time, TensorFlow users are increasingly+finding that they don’t have enough CPU resources to keep up with these+accelerators, leaving valuable accelerator resources idle.++To alleviate this problem, we are proposing a `snapshot` API within `tf.data`,+to allow users to transparently persist the output of their preprocessing+pipeline to disk, and materialize the pre-processed data on a different training+run.++This API enables repeated preprocessing steps to be consolidated, and allowing+re-use of already processed data, trading off disk storage and network bandwidth+for freeing up more valuable CPU resources and accelerator compute time.++## Motivation++Large TensorFlow users have indicated that they have complicated input+processing pipelines which saturate their CPUs before saturating their+accelerators (TPUs in particular). Since they often experiment with+hyperparameter tuning or tweaks to existing model without affecting their input+pipeline, they are asking for ways to avoid similar repeated preprocessing of+data by either saving a dataset or caching it to disk.++## User Benefit++Users will be able to transparently persist partially or fully processed data+from `tf.data` input pipelines to disk or Cloud storage systems, and materialize+the pre-processed data during subsequent runs from the same pipeline. This will+cut down on the input pipeline processing overheads during second and subsequent+runs.++## Design Proposal++We propose that we add a new `snapshot` transformation to tf.data. To illustrate+the usage of the transformation, we can start with some sample code:++```python+dataset = Dataset.list_files("/raw/data/*").shard(num_workers, i)+dataset = dataset.parallel_interleave(TFRecordDataset)+dataset = dataset.map(my_preprocessing_fn)+dataset = dataset.apply(tf.data.snapshot("/saved/data", options...))+dataset = dataset.repeat()++model = ...+model.fit(dataset)+```++As we can see, the end user simply has to add this transformation in order to+use this functionality. In essence, the transformation is similar to the+existing `tf.data.Dataset.cache`, with the key difference is being that, unlike+`cache`, `snapshot` is intended to re-used across different executions of the+same input pipelines.++### Proposed API++We are proposing the following API for the snapshot transformation.++```python+def snapshot(path,+             compression=None,+             reader_fn=None,+             writer_fn=None,+             pending_snapshot_expiry_seconds=None):+  pass  # Implementation goes here.+```++1.  `path`: Required. A directory where we want to save our snapshots and/or+    read from a previously saved snapshot.++1.  `compression`: Optional. The type of compression to apply to the snapshot+    written to disk. This will support `GZIP`, `SNAPPY` or None. Defaults to+    AUTO.++1.  `reader_fn`: Optional. 
The input pipeline transformation specified by +    `reader_fn` is executed when the snapshot detects that there is an existing, +    valid snapshot available.++    `reader_fn` is a user specified function that accepts a single argument: +    (1) a Dataset of Datasets, each representing a "splits" of elements of the +    original dataset. The cardinality of the input dataset matches the +    cardinality of the output of `writer_fn` (see below). The function should +    return a Dataset of elements of the original dataset.++    A default `reader_fn` will look like the following:++    ```python+    def default_reader_fn(datasets):+      # shuffle the datasets splits+      datasets = datasets.shuffle(NUM_DATASETS)+      # read datasets in parallel and interleave their elements+      return dataset.interleave(lambda x: x, num_parallel_calls=AUTOTUNE)+    ```++1.  `writer_fn`: Optional. The input pipeline specified by `writer_fn` is +    executed when the snapshot op detects that there are no valid snapshots+    and no other threads are currently attempting to write a snapshot.++    `writer_fn` is a user specified function that accepts a single argument: +    (1) a Dataset of elements to be written out. The function should return+    a Dataset of Datasets, each representing "splits" of elements of the+    original dataset. The tf.data snapshot implementation will then persist+    splits in parallel.++    A default writer_fn will look like the following:++    ```python+    def default_writer_fn(dataset):+      # add a component with element index+      dataset = dataset.enumerate()+      # split input dataset in a round-robin fashion+      return dataset.split(num_splits=NUM_CORES, key_fn=lambda i, _: i % NUM_CORE+    ```++1.  `pending_snapshot_expiry_seconds`: Optional. How long to wait (in seconds)+    before the snapshot op considers a previously unfinished snapshot to be+    stale and starts writing a snapshot from scratch again. Defaults to 86400+    seconds (1 day).++#### Achieving Parallelism++`reader_fn` and `writer_fn` will default to passing the dataset through unchanged+by default. In other words, the default implementation will result in +single-threaded reads and writes on snapshots. Parallelism can be achieved in+`writer_fn` by splitting up the dataset into multiple datasets, and using+`num_parallel_calls` in the `interleave` function of the `reader_fn`.++#### Computing Graph Fingerprints++Snapshot attempts to determine whether a run of an input pipeline is the same+as a previous run by computing the fingerprint of the nodes within the pipeline.++However, some input pipelines might vary in insignificant ways from run to run+that causes the fingerprinting of them to differ. For instance, consider the+following preprocessing function:++```python+features_to_multiply = {"feature1", "feature2", "feature3", "feature4"}++def preprocessing_fn(value):+  keys_to_features = {+    "feature1": tf.FixedLenFeature([], tf.float32, 0.0),+    "feature2": tf.FixedLenFeature([], tf.float32, 0.0),+    "feature3": tf.FixedLenFeature([], tf.float32, 0.0),+    "feature4": tf.FixedLenFeature([], tf.float32, 0.0)+  }++  parsed = tf.parse_single_example(value, keys_to_features)+  combined_feature = 1.0+  for item in features_to_multiply:+    combined_feature *= parsed[item]++  return combined_feature++dataset = ...+dataset = dataset.map(preprocessing_fn)+```++In the above example, our `features_to_multiply` variable uses a `set`, which is +not guaranteed to be ordered in Python 2. 
When we iterate over the set in the +for loop within `preprocessing_fn`, we may get a different graph on each +run (i.e. one run could have us multiplying `feature2` first, then `feature4`, +etc..., while another run may have us multiplying `feature1`, then `feature3`, +and so on).++In cases like these, we can ask fingerprinting to use a fixed value for the+fingerprint of the map function with a new `set_fingerprint`+transformation, which asks the fingerprinting function to not compute the +fingerprint of the previous node but to use a user-specified value instead:++```python+dataset = ...+dataset = dataset.map(preprocessing_fn) +dataset = tf.data.set_fingerprint(dataset, fingerprint="my_fixed_fp")

I would rename the transformation to set_snapshot_fingerprint and make it experimental.

frankchn

comment created time in a month

issue commenttensorflow/tensorflow

tf.data.Dataset.map repeats random numbers in each epoch of a Keras training loop when a graph-level seed is set

tf.data does not maintain any state across iterations (with the exception of shuffle and cache). In other words, if your input pipeline is deterministic, does not contain shuffle and unseeded random operations, and does not depend on any external state, it will always produce the same sequence of numbers.

If you would like your input pipeline to produce deterministic behavior, that at the same time is different between different epochs, then you either need to introduce external state (e.g. an epoch counter that is used to seed the randomness) or piggyback on shuffle.

Here is an example of how you can create an input pipeline for which different iterations produce different but deterministic sequence of numbers:

seed = tf.Variable(0, dtype=tf.int64)

def get_seed(_):
  seed.assign_add(1)
  return seed

seeds = tf.data.Dataset.range(1).map(get_seed)
seeds = seeds.flat_map(lambda seed: tf.data.experimental.RandomDataset(seed=seed))

ds = tf.data.Dataset.zip((seeds.batch(2), tf.data.Dataset.range(5)))
ds = ds.map(lambda seed, _: tf.random.stateless_uniform([], seed=seed)).batch(5)

print("epoch 1:")
for elem in ds:
  print(elem.numpy())
print("epoch 2:")
for elem in ds:
  print(elem.numpy())  

In my colab, the above determinstically produces:

epoch 1:
[0.17277443 0.01138496 0.5387242  0.14688337 0.98976684]
epoch 2:
[0.12150574 0.7640343  0.28948808 0.09558952 0.08135116]

Last but not least, to answer your question about (?) both is intended behavior. For the case when the graph-level seed is not specified, the random ops in the input pipeline graph will be unseeded, which means non-deterministic behavior. For the case when the graph-level seed is specified, then the input pipeline will be deterministic and since it does not depend on any state outside of the input pipeline graph, each execution of the input pipeline will be identical.

bersbersbers

comment created time in a month

issue closedtensorflow/tensorflow

_ = dataset.cache() accelerates data pipeline

System information

  • Have I written custom code
  • Linux Ubuntu 18.04
  • TensorFlow ROCm installed from PyPI
  • TensorFlow version: v2.0.0-rocm-3-g0826c3a 2.0.2
  • Python version: Python 3.7.5
  • CUDA/cuDNN version: None
  • GPU model and memory: 2 x Radeon Vega 64

Describe the current behavior

TF.Data is quicker when one caches the whole dataset to an unused name.

  • With dataset_*.cache(...):
  • With _= dataset.cache(...): 10870 samples/s stdev 27 samples/s
  • Without _= dataset.cache(...): 10563 samples/s, stdev 50 samples/s
  • (10870 - 27) / (10563 + 50) = 1.02167153491
  • Without dataset_*.cache(...):
  • With _= dataset.cache(...): 2902 samples/s stdev 14 samples/s
  • Without _= dataset.cache(...): 2732 samples/s, stdev 9 samples/s
  • (10870 - 27) / (10563 + 50) = 1.05363006202

Describe the expected behavior

TF.Data is as quick as possible by default.

Code to reproduce the issue

My input data pipeline looks something like this:

data_frame_valid = pd.read_csv(...)
data_frame_invalid = pd.read_csv(...)
dataset_valid: tf.data.Dataset = tf.data.Dataset.from_tensor_slices(...)
dataset_invalid: tf.data.Dataset = tf.data.Dataset.from_tensor_slices(...)
dataset: tf.data.Dataset = dataset_valid.concatenate(dataset_invalid)
dataset = dataset.shuffle(...)
_ = dataset.cache()  # This speeds up iterating the data by 2 - 5 %
dataset_training = dataset.take(data_set_size_training)
dataset_testing = dataset.skip(data_set_size_training)
dataset_validation = dataset_testing.skip(data_set_size_validation)
dataset_testing = dataset_testing.take(data_set_size_testing)
# dataset_training = dataset_training.cache()  # "_ speed up" can be observed with and without
# dataset_validation = dataset_validation.cache()  # "_ speed up" can be observed with and without
# dataset_testing = dataset_testing.cache()  # "_ speed up" can be observed with and without

My benchmark:

import statistics

import time
import tensorflow as tf

from pfasdr.neural.ze_discriminate_pd_np.get_data_sets_module import \
    get_data_sets
from pfasdr.neural.ze_discriminate_pd_np.path_templates_module import \
    valid_file_path_template, invalid_file_path_template


def benchmark_dataset(dataset, num_epochs=2):
    tf.print('Iterating data set ...')

    throughput_history = []
    for index_epoch in tf.data.Dataset.range(num_epochs):
        tf.print(f'Iterating for epoch {index_epoch}')
        index = 0

        # The actual benchmark
        tine_start = time.perf_counter()
        for index, _ in enumerate(dataset):
            pass
            # Uncomment for progress reporting
            # if not index % 100:
            #     print('\r' + str(index), end='')
        time_end = time.perf_counter()

        print('\r' + str(index))
        duration = time_end - tine_start
        throughput = index / duration

        tf.print(
            f'Iterating data set took: '
            f'{round(duration, 2)} s in epoch number {index_epoch}. '
            f'This makes for a throughput of {round(throughput)} 1/s'
        )

        if not index_epoch:
            # First round uses cold caches.
            # So do not record it.
            continue
        throughput_history.append(throughput)

    throughput_average = statistics.mean(throughput_history)
    throughput_deviation = statistics.stdev(throughput_history)
    throughput_upper = throughput_average + throughput_deviation
    throughput_lower = throughput_average - throughput_deviation
    tf.print(f'Average dataset entry throughput {round(throughput_average)} '
             f'with a variation of +/- {round(throughput_deviation)}, '
             f'which means {round(throughput_upper)} (+), '
             f'or {round(throughput_lower)} (-).')


def main():
    batch_size_training = 32
    node_count = 15
    valid_file_path = valid_file_path_template.substitute(
        length=node_count,
    )
    invalid_file_path = invalid_file_path_template.substitute(
        length=node_count,
    )

    dataset_training, dataset_validation, dataset_testing, \
        batch_size_training, dataset_size_validation, dataset_size_evaluate,\
        dataset_size_training \
        = get_data_sets(
            batch_size=batch_size_training,
            names=list(range(node_count)),
            invalid_file_path=invalid_file_path,
            valid_file_path=valid_file_path,
        )

    benchmark_dataset(dataset=dataset_training, num_epochs=5)


if __name__ == '__main__':
    main()

closed time in a month

Bengt

issue commenttensorflow/tensorflow

_ = dataset.cache() accelerates data pipeline

Putting _ = dataset.cache() will have no effect on the input pipeline graph that ends up being created and executed, so your performance difference must be due to noise.

Bengt

comment created time in a month

issue commenttensorflow/tensorflow

Dataset padded_batch does not work as documented

The feature you are requesting seems reasonable and would make a good candidate for an external contribution. If you create a feature request, please cc me on it so that I can tag it as contributions welcome.

dhpollack

comment created time in a month

issue commenttensorflow/tensorflow

Dataset padded_batch does not work as documented

You can do the following:

ds = tf.data.Dataset.from_tensors((0.0, 1))
padding_values = tf.nest.map_structure(lambda _: -1, ds.element_spec)
print(padding_values) # prints (-1, -1)
dhpollack

comment created time in a month

issue commenttensorflow/tensorflow

Dataset padded_batch does not work as documented

The structure of padding_values needs to match the structure of your dataset elements (so that you can specify different padding values for different structure components).

The documentation calls this out, including an example of a dataset whose elements are tuples.

dhpollack

comment created time in a month

issue closedtensorflow/tensorflow

Dataset padded_batch does not work as documented

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): A little
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Google Colab (ubuntu-based)
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 2.1.0-rc1
  • Python version: 3.7

Describe the current behavior

calling Dataset.padded_batch([batch_size], [output_shape], padding_values=1) fails with the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-38-6fccac1ccecf> in <module>()
     20 ds_train = ds_train.padded_batch(BATCH_SIZE, padded_shapes)
     21 
---> 22 ds_test = ds_test.padded_batch(BATCH_SIZE, padded_shapes, padding_values=padded_values)

3 frames
/tensorflow-2.1.0/python3.6/tensorflow_core/python/data/util/nest.py in assert_shallow_structure(shallow_tree, input_tree, check_types)
    297       raise TypeError(
    298           "If shallow structure is a sequence, input must also be a sequence. "
--> 299           "Input has type: %s." % type(input_tree))
    300 
    301     if check_types and not isinstance(input_tree, type(shallow_tree)):

TypeError: If shallow structure is a sequence, input must also be a sequence. Input has type: <class 'int'>.

Note that this does not fail if one uses the default value of None

Describe the expected behavior

Should pad the data with the value in padding_values.

Also, the error message could be friendly by telling me what type it expects.

Code to reproduce the issue

import tensorflow as tf
import tensorflow_datasets as tfds

BATCH_SIZE = 64

ds, ds_info = tfds.load("imdb_reviews/subwords8k", with_info=True, as_supervised=True)
ds_train, ds_test = ds["train"], ds["test"]

output_shapes_train = tf.compat.v1.data.get_output_shapes(ds_train)
padded_shapes = output_shapes_train  # (TensorShape([None]), TensorShape([]))
padded_values = -1

ds_train = ds_train.padded_batch(BATCH_SIZE, padded_shapes)  # does not fail here
ds_test = ds_test.padded_batch(BATCH_SIZE, padded_shapes, padding_values=padded_values)  # but does fail here

Other info / logs https://www.tensorflow.org/api_docs/python/tf/data/Dataset?version=stable#padded_batch

Documentation seems pretty clear that the second part should work.

closed time in a month

dhpollack

issue commenttensorflow/tensorflow

Multiple output files for tf.data.experimental.TFRecordWriter

Yes, the group_by_window with "infinite" group size will have that effect.

The challenge with the existing API is that tf.data.experimental.TFRecordWriter.write() takes a dataset and in order to create a dataset for a shard, you need to iterate through the entire original dataset. An alternative to materializing all shards in memory would be to iterate through the original dataset multiple time.

I can imagine extending the API of TFRecordWriter to support num_shards argument which would avoid the above problems if that's something that you are interested in, please create a feature request for it and I will mark it as contributions welcome to see if anyone in the external community has interest in implementing it. Thanks.

harahu

comment created time in a month

issue commenttensorflow/autograph

WARNING:tensorflow:Entity <function _CopyToDeviceDataset.__init__.<locals>._init_func at 0x7fadaffa7d90> could not be transformed and will be executed as-is.

The GeneratorDataset warning can be safely ignored. Unfortunately, it looks like my fix (https://github.com/tensorflow/tensorflow/commit/b6edd34c5858ab0ab4380da774e7e2fd81a92da0) that silences the warning didn't make it into TF 2.1.

Bengt

comment created time in a month

issue commenttensorflow/autograph

WARNING:tensorflow:Entity <function _CopyToDeviceDataset.__init__.<locals>._init_func at 0x7fadaffa7d90> could not be transformed and will be executed as-is.

That bug should be fixed in TF 2.1 which was released on January 8th. Could you try with the latest version?

Bengt

comment created time in a month

issue commenttensorflow/tensorflow

tf.data.Dataset.map repeats random numbers in each epoch of a Keras training loop when a graph-level seed is set

My recommendation would be to avoid reliance on graph-level / default seeds and switch to using tf.random.stateless_uniform which takes the seed argument so that you can control the sequence of seeds.

Here is a fully reproducible example that deterministically reshuffles between epochs:

from __future__ import print_function

import tensorflow.compat.v2 as tf

tf.enable_v2_behavior()

def map_fn(seed):
  return tf.random.stateless_uniform([], [seed, seed])

ds = tf.data.Dataset.range(5)
ds = ds.map(map_fn).shuffle(5, seed=42, reshuffle_each_iteration=True).batch(5)

for elem in ds:
  print(elem.numpy())
for elem in ds:
  print(elem.numpy())

The above (consistently) produces the following output:

[0.21101546 0.61040807 0.09827709 0.7736759  0.9589814 ]
[0.7736759  0.61040807 0.09827709 0.21101546 0.9589814 ]
bersbersbers

comment created time in a month

issue commenttensorflow/tensorflow

[TF2.0] Can't use tf.data.Dataset::cache with distribution strategy

Can you try with TF 2.1? I believe this issue should have been fixed by https://github.com/tensorflow/tensorflow/commit/08f41c6216d177933ba8eb48cd171a1e004e6ca2

lgeiger

comment created time in a month

issue commenttensorflow/tensorflow

tf.data.Dataset.map repeats random numbers in each epoch of a Keras training loop when a graph-level seed is set

I am not sure I understand your question regarding map because map does not reorder elements. It simply applies the user-defined transformation to each element. Are you talking about the situation when there is a randomized op inside of the user-defined function? The answer to how randomness behaves in that case depends on the implementation of the op and is orthogonal to map.

In other words, if your input pipeline does not contain shuffle and your map's user-defined function does not contain random ops, then your input pipeline will produce the same sequence every time it is executed.

bersbersbers

comment created time in a month

issue commenttensorflow/tensorflow

tf.data.Dataset.map repeats random numbers in each epoch of a Keras training loop when a graph-level seed is set

@bersbersbers the TLDR is that legacy graph-mode (i.e. tf.compat.v1.disable_eager_execution()) does not support controlling shuffling behavior across different iterations of the same dataset.

If you would like to control the shuffle order and run in graph-mode with TF2, you should rely on tf.function (as oppposed to the legacy graph-mode). The following snippet illustrates idiomatic mechanism for controlling shuffling behavior in TF 2:

from __future__ import print_function

import tensorflow.compat.v2 as tf

tf.enable_v2_behavior()

def eager_execution(reshuffle_each_iteration):
  ds = tf.data.Dataset.range(5).shuffle(5, reshuffle_each_iteration=reshuffle_each_iteration).batch(5)

  def print_dataset(dataset):
    for elem in dataset:
      tf.print(elem)

  print_dataset(ds)
  print_dataset(ds)

print("Eager mode with reshuffling:")
eager_execution(True)
print("Eager mode without reshuffling:")
eager_execution(False)

@tf.function
def graph_execution(reshuffle_each_iteration):
  return eager_execution(reshuffle_each_iteration)

print("Graph mode with reshuffling:")
graph_execution(True)

print("Graph mode without reshuffling:")
graph_execution(False)

produces:

Eager mode with reshuffling:
[4 1 0 2 3]
[2 1 3 4 0]
Eager mode without reshuffling:
[2 0 3 1 4]
[2 0 3 1 4]
Graph mode with reshuffling:
[2 3 4 0 1]
[2 4 1 3 0]
Graph mode without reshuffling:
[4 2 3 0 1]
[4 2 3 0 1]
bersbersbers

comment created time in 2 months

pull request commenttensorflow/tensorflow

Add tests of LMDBDataset

The internal tests are indicating the following memory leak:

E0103 09:55:04.665212    7665 heap-profile-table.cc:445] RAW: Leak check _main_ detected leaks of 464 bytes in 2 objects
E0103 09:55:04.665305    7665 heap-profile-table.cc:465] RAW: The 2 largest leaks:
E0103 09:55:04.678486    7665 heap-profile-table.cc:486] RAW: Leak of 232 bytes in 1 objects allocated from:
	@ 0x7f5fd04e76d9 mdb_env_create
	@ 0x5639e5c1ef10 tensorflow::data::experimental::LMDBDatasetOp::Dataset::Iterator::SetupStreamsLocked()
	@ 0x5639e5c1e470 tensorflow::data::experimental::LMDBDatasetOp::Dataset::Iterator::GetNextInternal()
	@ 0x7f5fceb9d515 tensorflow::data::DatasetBaseIterator::GetNext()
	@ 0x7f5fd051053e tensorflow::data::experimental::(anonymous namespace)::LMDBDatasetOpTest_InvalidPathAtStart_Test::TestBody()
	@ 0x7f5fcca0c1ca testing::Test::Run()
	@ 0x7f5fcca0d330 testing::TestInfo::Run()
	@ 0x7f5fcca0dd27 testing::TestSuite::Run()
	@ 0x7f5fcca1d187 testing::internal::UnitTestImpl::RunAllTests()
	@ 0x7f5fcca1c795 testing::UnitTest::Run()
	@ 0x7f5fd04cdd2e main
	@ 0x7f5fc6d18bbd __libc_start_main
	@ 0x5639e5c1c769 ../sysdeps/x86_64/start.S:121 _start

E0103 09:55:04.678629    7665 heap-profile-table.cc:486] RAW: Leak of 232 bytes in 1 objects allocated from:
	@ 0x7f5fd04e76d9 mdb_env_create
	@ 0x5639e5c1ef10 tensorflow::data::experimental::LMDBDatasetOp::Dataset::Iterator::SetupStreamsLocked()
	@ 0x5639e5c1e470 tensorflow::data::experimental::LMDBDatasetOp::Dataset::Iterator::GetNextInternal()
	@ 0x7f5fceb9d515 tensorflow::data::DatasetBaseIterator::GetNext()
	@ 0x7f5fd051125b tensorflow::data::experimental::(anonymous namespace)::LMDBDatasetOpTest_InvalidPathInMiddle_Test::TestBody()
	@ 0x7f5fcca0c1ca testing::Test::Run()
	@ 0x7f5fcca0d330 testing::TestInfo::Run()
	@ 0x7f5fcca0dd27 testing::TestSuite::Run()
	@ 0x7f5fcca1d187 testing::internal::UnitTestImpl::RunAllTests()
	@ 0x7f5fcca1c795 testing::UnitTest::Run()
	@ 0x7f5fd04cdd2e main
	@ 0x7f5fc6d18bbd __libc_start_main
	@ 0x5639e5c1c769 ../sysdeps/x86_64/start.S:121 _start
frreiss

comment created time in 2 months

issue commenttensorflow/tensorflow

Memory leaks when using tf.strings.split in map_func for tf.data.Dataset.map with eager execution.

@kkimdev could you try running this through your tooling for checking leaks of Python objects? Thank you.

workingloong

comment created time in 2 months

Pull request review commenttensorflow/community

RFC: tf.data Snapshot

+# tf.data Snapshot++| Status        | Proposed                                                |+| :------------ | :------------------------------------------------------ |+| **RFC #**     | [193](https://github.com/tensorflow/community/pull/193) |+| **Author(s)** | Frank Chen (frankchn@google.com), Rohan Jain            |+|               | (rohanj@google.com)                                     |+| **Sponsor**   | Jiri Simsa (jsimsa@google.com)                          |+| **Updated**   | 2020-01-07                                              |++## Objective++With ever faster accelerators available in Cloud and hyperparameter tuning+consuming larger chunks of accelerator time, TensorFlow users are increasingly+finding that they don’t have enough CPU resources to keep up with these+accelerators, leaving valuable accelerator resources idle.++To alleviate this problem, we are proposing a `snapshot` API within `tf.data`,+to allow users to transparently persist the output of their preprocessing+pipeline to disk, and materialize the pre-processed data on a different training+run.++This API enables repeated preprocessing steps to be consolidated, and allowing+re-use of already processed data, trading off disk storage and network bandwidth+for freeing up more valuable CPU resources and accelerator compute time.++## Motivation++Large TensorFlow users have indicated that they have complicated input+processing pipelines which saturate their CPUs before saturating their+accelerators (TPUs in particular). Since they often experiment with+hyperparameter tuning or tweaks to existing model without affecting their input+pipeline, they are asking for ways to avoid similar repeated preprocessing of+data by either saving a dataset or caching it to disk.++## User Benefit++Users will be able to transparently persist partially or fully processed data+from `tf.data` input pipelines to disk or Cloud storage systems, and materialize+the pre-processed data during subsequent runs from the same pipeline. This will+cut down on the input pipeline processing overheads during second and subsequent+runs.++## Design Proposal++We propose that we add a new `snapshot` transformation to tf.data. To illustrate+the usage of the transformation, we can start with some sample code:++```python+dataset = Dataset.list_files("/raw/data/*").shard(num_workers, i)+dataset = dataset.parallel_interleave(TFRecordDataset)+dataset = dataset.map(my_preprocessing_fn)+dataset = dataset.apply(tf.data.snapshot("/saved/data", options...))

either dataset = dataset.apply(tf.data.experimental.snapshot(...)) or dataset = dataset.snapshot(...) (I would prefer the former before providing backwards compatibility guarantee for the API).

frankchn

comment created time in 2 months

issue commenttensorflow/autograph

WARNING:tensorflow:Entity <function _CopyToDeviceDataset.__init__.<locals>._init_func at 0x7fadaffa7d90> could not be transformed and will be executed as-is.

Regarding your tf.data usage, you program will create separate instances of the dataset to feed each GPU. Is that your intention?

If not, I recommend that you take a look at tf.distribute.MirroredStrategy, which can be used to create a single instance of the dataset and prefetch its data to the GPU.

Bengt

comment created time in 2 months

issue closedtensorflow/tensorflow

tf.dataset may out of memory

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04 in Docker
  • TensorFlow installed from (source or binary): pip install
  • TensorFlow version (use command below): v2.0.0-rc2-26-g64c3d38
  • Python version: 3.5
  • CUDA/cuDNN version: 10.0 / 7
  • GPU model and memory: GTX 1080Ti / 11175MiB

Describe the current behavior

Hi authors and developers,

I am developing our project in tf=2.0.0 and eager_mode is disable.

The main reason is tf=1.x will not be maintained but third party libraries have not been ready for tf=2.0 yet.

For some resaons, we have to re-generate trainX at the end of each epoch in our custom model.

In tf=1.x version, tensorflow provides placeholder API so we can feed new trainX to tf.data and it works very well.

However, placeholder API is deprecated in tf=2.0 or above.

I have to re-generate tf.data again and again at the end of each epoch.

Finally, our program will be killed eventually because it is out of memory.

Describe the expected behavior

It should work properly.

Code to reproduce the issue

#%%
import tensorflow as tf
tf.compat.v1.disable_eager_execution()
#tf.compat.v1.disable_v2_behavior()

import numpy as np

batch_size = 100

def download_data():

    # get raw data
    (trainX, trainY), (testX, testY) = tf.keras.datasets.cifar10.load_data()
    trainX = trainX.astype(np.float32)
    testX  = testX.astype(np.float32)

    # ont-hot
    trainY = tf.keras.utils.to_categorical(trainY, 10)
    testY  = tf.keras.utils.to_categorical(testY , 10)

    # get validation sets
    training_size = 45000
    validX = trainX[training_size:,:]
    validY = trainY[training_size:,:]

    trainX = trainX[:training_size,:]
    trainY = trainY[:training_size,:]

    return trainX, trainY, validX, validY, testX, testY

def data_pipeline(dataX, dataY):

    # create dataset API
    def preprocess_fn(dataX, dataY):
        
        dataX = tf.image.random_flip_left_right(dataX)
        return dataX, dataY

    dataset = tf.data.Dataset.from_tensor_slices( (dataX, dataY) )
    dataset = dataset.shuffle(batch_size * 8)
    dataset = dataset.repeat()
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(preprocess_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
    return dataset

if __name__ == '__main__':

    # set GPU
    import os
    if os.environ.get("CUDA_VISIBLE_DEVICES") is None:
        os.environ["CUDA_VISIBLE_DEVICES"] = "0"

    # reset tf session
    tf.compat.v1.keras.backend.clear_session()
    gpu_options = tf.compat.v1.GPUOptions(allow_growth=True)
    sess = tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(gpu_options=gpu_options))
    tf.compat.v1.keras.backend.set_session(sess) 

    # prepare data
    trainX, trainY, validX, validY, testX, testY = download_data()
    train_gen = data_pipeline(trainX, trainY)
    valid_gen = data_pipeline(validX, validY)
    test_gen = data_pipeline(testX, testY)

    # build targeted model
    model = tf.keras.applications.resnet_v2.ResNet50V2(include_top=True, weights=None, input_shape=(32,32,3), pooling='max', classes=10)
    model.compile(loss='categorical_crossentropy', optimizer='SGD', metrics=['accuracy'])

    # fit and evalutate
    num_epoch = 20
    for ii in range(num_epoch):
        model.fit(train_gen,
                steps_per_epoch = trainY.shape[0] // batch_size,
                validation_data = valid_gen,
                validation_steps= validY.shape[0] // batch_size,
                epochs=1,
                verbose=2)
        model.evaluate(testX, testY, verbose=2, batch_size=batch_size)

        # update trainX and re-generate train_gen
        trainX = trainX + 0
        train_gen = data_pipeline(trainX, trainY)

The following is the output:

450/450 - 37s - loss: 1.9472 - accuracy: 0.3077 - val_loss: 1.7661 - val_accuracy: 0.3764
10000/10000 - 3s - loss: 1.7696 - accuracy: 0.3729
Train on 450 steps, validate on 50 steps
450/450 - 37s - loss: 1.5704 - accuracy: 0.4347 - val_loss: 1.6101 - val_accuracy: 0.4224
10000/10000 - 3s - loss: 1.6036 - accuracy: 0.4274
Train on 450 steps, validate on 50 steps
450/450 - 37s - loss: 1.4119 - accuracy: 0.4903 - val_loss: 1.4621 - val_accuracy: 0.4728
10000/10000 - 3s - loss: 1.4667 - accuracy: 0.4759
Train on 450 steps, validate on 50 steps
450/450 - 38s - loss: 1.3042 - accuracy: 0.5313 - val_loss: 1.3688 - val_accuracy: 0.5060
10000/10000 - 3s - loss: 1.3773 - accuracy: 0.5024
Train on 450 steps, validate on 50 steps
450/450 - 36s - loss: 1.2168 - accuracy: 0.5671 - val_loss: 1.3069 - val_accuracy: 0.5330
10000/10000 - 3s - loss: 1.3197 - accuracy: 0.5284
Train on 450 steps, validate on 50 steps
450/450 - 36s - loss: 1.1384 - accuracy: 0.5935 - val_loss: 1.2692 - val_accuracy: 0.5462
10000/10000 - 3s - loss: 1.2831 - accuracy: 0.5437
Train on 450 steps, validate on 50 steps
450/450 - 36s - loss: 1.0762 - accuracy: 0.6156 - val_loss: 1.3297 - val_accuracy: 0.5320
10000/10000 - 3s - loss: 1.3435 - accuracy: 0.5324
Train on 450 steps, validate on 50 steps
450/450 - 38s - loss: 1.0080 - accuracy: 0.6396 - val_loss: 1.3039 - val_accuracy: 0.5404
10000/10000 - 3s - loss: 1.3260 - accuracy: 0.5351
Train on 450 steps, validate on 50 steps
450/450 - 37s - loss: 0.9562 - accuracy: 0.6609 - val_loss: 1.1603 - val_accuracy: 0.5926
10000/10000 - 3s - loss: 1.1833 - accuracy: 0.5848
Train on 450 steps, validate on 50 steps
450/450 - 38s - loss: 0.8957 - accuracy: 0.6823 - val_loss: 1.2314 - val_accuracy: 0.5728
10000/10000 - 3s - loss: 1.2559 - accuracy: 0.5720
Killed

Other info / logs

The following message is the result generated by tf_env_collect.sh

== check python ===================================================
python version: 3.5.2
python branch:
python build version: ('default', 'Oct  8 2019 13:06:37')
python compiler version: GCC 5.4.0 20160609
python implementation: CPython


== check os platform ===============================================
os: Linux
os kernel version: #40~18.04.1-Ubuntu SMP Thu Nov 14 12:06:39 UTC 2019
os release version: 5.0.0-37-generic
os platform: Linux-5.0.0-37-generic-x86_64-with-Ubuntu-16.04-xenial
linux distribution: ('Ubuntu', '16.04', 'xenial')
linux os distribution: ('Ubuntu', '16.04', 'xenial')
mac version: ('', ('', '', ''), '')
uname: uname_result(system='Linux', node='f7f509f1dacf', release='5.0.0-37-generic', version='#40~18.04.1-Ubuntu SMP Thu Nov 14 12:06:39 UTC 2019', machine='x86_64', processor='x86_64')
architecture: ('64bit', 'ELF')
machine: x86_64


== are we in docker =============================================
Yes

== compiler =====================================================
c++ (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


== check pips ===================================================
numpy                  1.17.4
protobuf               3.11.1
tensorflow-estimator   2.0.1
tensorflow-gpu         2.0.0
tensorflow-probability 0.8.0

== check for virtualenv =========================================
False

== tensorflow import ============================================
tf.version.VERSION = 2.0.0
tf.version.GIT_VERSION = v2.0.0-rc2-26-g64c3d38
tf.version.COMPILER_VERSION = 7.3.1 20180303
Sanity check: array([1], dtype=int32)
       443:     find library=libpthread.so.0 [0]; searching
       443:      search path=/usr/local/nvidia/lib/tls/x86_64:/usr/local/nvidia/lib/tls:/usr/local/nvidia/lib/x86_64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64/tls/x86_64:/usr/local/nvidia/lib64/tls:/usr/local/nvidia/lib64/x86_64:/usr/local/nvidia/lib64          (LD_LIBRARY_PATH)
       443:       trying file=/usr/local/nvidia/lib/tls/x86_64/libpthread.so.0
       443:       trying file=/usr/local/nvidia/lib/tls/libpthread.so.0
       443:       trying file=/usr/local/nvidia/lib/x86_64/libpthread.so.0
       443:       trying file=/usr/local/nvidia/lib/libpthread.so.0
       443:       trying file=/usr/local/nvidia/lib64/tls/x86_64/libpthread.so.0
       443:       trying file=/usr/local/nvidia/lib64/tls/libpthread.so.0
       443:       trying file=/usr/local/nvidia/lib64/x86_64/libpthread.so.0
       443:       trying file=/usr/local/nvidia/lib64/libpthread.so.0
       443:      search cache=/etc/ld.so.cache
       443:       trying file=/lib/x86_64-linux-gnu/libpthread.so.0
       443:
       443:     find library=libc.so.6 [0]; searching
       443:      search path=           (LD_LIBRARY_PATH)
       443:      search cache=/etc/ld.so.cache
       443:       trying file=/lib/x86_64-linux-gnu/libc.so.6
       443:
       443:     find library=libdl.so.2 [0]; searching
       443:      search path=           (LD_LIBRARY_PATH)
       443:      search cache=/etc/ld.so.cache
       443:       trying file=/lib/x86_64-linux-gnu/libdl.so.2
       443:
       443:     find library=libutil.so.1 [0]; searching
       443:      search path=           (LD_LIBRARY_PATH)
       443:      search cache=/etc/ld.so.cache
       443:       trying file=/lib/x86_64-linux-gnu/libutil.so.1
       443:
       443:     find library=libexpat.so.1 [0]; searching
       443:      search path=           (LD_LIBRARY_PATH)
       443:      search cache=/etc/ld.so.cache
       443:       trying file=/lib/x86_64-linux-gnu/libexpat.so.1
       443:
       443:     find library=libz.so.1 [0]; searching
       443:      search path=           (LD_LIBRARY_PATH)
       443:      search cache=/etc/ld.so.cache
       443:       trying file=/lib/x86_64-linux-gnu/libz.so.1
       443:
       443:     find library=libm.so.6 [0]; searching
       443:      search path=           (LD_LIBRARY_PATH)
       443:      search cache=/etc/ld.so.cache
       443:       trying file=/lib/x86_64-linux-gnu/libm.so.6
       443:
       443:
       443:     calling init: /lib/x86_64-linux-gnu/libpthread.so.0
       443:
       443:
       443:     calling init: /lib/x86_64-linux-gnu/libc.so.6
       443:
       443:
       443:     calling init: /lib/x86_64-linux-gnu/libm.so.6
       443:
       443:
       443:     calling init: /lib/x86_64-linux-gnu/libz.so.1
       443:
       443:
       443:     calling init: /lib/x86_64-linux-gnu/libexpat.so.1
       443:
       443:
       443:     calling init: /lib/x86_64-linux-gnu/libutil.so.1
       443:
       443:
       443:     calling init: /lib/x86_64-linux-gnu/libdl.so.2
       443:
       443:
       443:     initialize program: /usr/local/bin/python
       443:
       443:
       443:     transferring control: /usr/local/bin/python
       443:
       443:
       443:     calling init: /usr/lib/python3.5/lib-dynload/_opcode.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so
       443:
       443:     find library=libopenblasp-r0-34a18dc3.3.7.so [0]; searching
       443:      search path=/usr/local/lib/python3.5/dist-packages/numpy/core/../.libs/tls/x86_64:/usr/local/lib/python3.5/dist-packages/numpy/core/../.libs/tls:/usr/local/lib/python3.5/dist-packages/numpy/core/../.libs/x86_64:/usr/local/lib/python3.5/dist-packages/numpy/core/../.libs            (RPATH from file /usr/local/lib/python3.5/dist-packages/numpy/core/_multiarray_umath.cpython-35m-x86_64-linux-gnu.so)
       443:       trying file=/usr/local/lib/python3.5/dist-packages/numpy/core/../.libs/tls/x86_64/libopenblasp-r0-34a18dc3.3.7.so
       443:       trying file=/usr/local/lib/python3.5/dist-packages/numpy/core/../.libs/tls/libopenblasp-r0-34a18dc3.3.7.so
       443:       trying file=/usr/local/lib/python3.5/dist-packages/numpy/core/../.libs/x86_64/libopenblasp-r0-34a18dc3.3.7.so
       443:       trying file=/usr/local/lib/python3.5/dist-packages/numpy/core/../.libs/libopenblasp-r0-34a18dc3.3.7.so
       443:
       443:     find library=libgfortran-ed201abd.so.3.0.0 [0]; searching
       443:      search path=/usr/local/lib/python3.5/dist-packages/numpy/core/../.libs         (RPATH from file /usr/local/lib/python3.5/dist-packages/numpy/core/_multiarray_umath.cpython-35m-x86_64-linux-gnu.so)
       443:       trying file=/usr/local/lib/python3.5/dist-packages/numpy/core/../.libs/libgfortran-ed201abd.so.3.0.0
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/numpy/core/../.libs/libgfortran-ed201abd.so.3.0.0
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/numpy/core/../.libs/libopenblasp-r0-34a18dc3.3.7.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/numpy/core/_multiarray_umath.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/numpy/core/_multiarray_tests.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/numpy/linalg/lapack_lite.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/numpy/linalg/_umath_linalg.cpython-35m-x86_64-linux-gnu.so
       443:
       443:     find library=libbz2.so.1.0 [0]; searching
       443:      search path=           (LD_LIBRARY_PATH)
       443:      search cache=/etc/ld.so.cache
       443:       trying file=/lib/x86_64-linux-gnu/libbz2.so.1.0
       443:
       443:
       443:     calling init: /lib/x86_64-linux-gnu/libbz2.so.1.0
       443:
       443:
       443:     calling init: /usr/lib/python3.5/lib-dynload/_bz2.cpython-35m-x86_64-linux-gnu.so
       443:
       443:     find library=liblzma.so.5 [0]; searching
       443:      search path=           (LD_LIBRARY_PATH)
       443:      search cache=/etc/ld.so.cache
       443:       trying file=/lib/x86_64-linux-gnu/liblzma.so.5
       443:
       443:
       443:     calling init: /lib/x86_64-linux-gnu/liblzma.so.5
       443:
       443:
       443:     calling init: /usr/lib/python3.5/lib-dynload/_lzma.cpython-35m-x86_64-linux-gnu.so
       443:
       443:     find library=libmpdec.so.2 [0]; searching
       443:      search path=           (LD_LIBRARY_PATH)
       443:      search cache=/etc/ld.so.cache
       443:       trying file=/usr/lib/x86_64-linux-gnu/libmpdec.so.2
       443:
       443:
       443:     calling init: /usr/lib/x86_64-linux-gnu/libmpdec.so.2
       443:
       443:
       443:     calling init: /usr/lib/python3.5/lib-dynload/_decimal.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/numpy/fft/_pocketfft_internal.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/numpy/random/mtrand.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/numpy/random/common.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/numpy/random/bounded_integers.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/numpy/random/mt19937.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/numpy/random/bit_generator.cpython-35m-x86_64-linux-gnu.so
       443:
       443:     find library=libcrypto.so.1.0.0 [0]; searching
       443:      search path=           (LD_LIBRARY_PATH)
       443:      search cache=/etc/ld.so.cache
       443:       trying file=/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
       443:
       443:
       443:     calling init: /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
       443:
       443:
       443:     calling init: /usr/lib/python3.5/lib-dynload/_hashlib.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/numpy/random/philox.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/numpy/random/pcg64.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/numpy/random/sfc64.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/numpy/random/generator.cpython-35m-x86_64-linux-gnu.so
       443:
       443:     find library=libtensorflow_framework.so.2 [0]; searching
       443:      search path=/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/../../_solib_local/_U_S_Stensorflow_Spython_C_Upywrap_Utensorflow_Uinternal.so___Utensorflow/tls/x86_64:/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/../../_solib_local/_U_S_Stensorflow_Spython_C_Upywrap_Utensorflow_Uinternal.so___Utensorflow/tls:/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/../../_solib_local/_U_S_Stensorflow_Spython_C_Upywrap_Utensorflow_Uinternal.so___Utensorflow/x86_64:/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/../../_solib_local/_U_S_Stensorflow_Spython_C_Upywrap_Utensorflow_Uinternal.so___Utensorflow:/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/tls/x86_64:/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/tls:/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/x86_64:/usr/local/lib/python3.5/dist-packages/tensorflow_core/python:/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/../tls/x86_64:/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/../tls:/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/../x86_64:/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/..            (RPATH from file /usr/local/lib/python3.5/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so)
       443:       trying file=/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/../../_solib_local/_U_S_Stensorflow_Spython_C_Upywrap_Utensorflow_Uinternal.so___Utensorflow/tls/x86_64/libtensorflow_framework.so.2
       443:       trying file=/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/../../_solib_local/_U_S_Stensorflow_Spython_C_Upywrap_Utensorflow_Uinternal.so___Utensorflow/tls/libtensorflow_framework.so.2
       443:       trying file=/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/../../_solib_local/_U_S_Stensorflow_Spython_C_Upywrap_Utensorflow_Uinternal.so___Utensorflow/x86_64/libtensorflow_framework.so.2
       443:       trying file=/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/../../_solib_local/_U_S_Stensorflow_Spython_C_Upywrap_Utensorflow_Uinternal.so___Utensorflow/libtensorflow_framework.so.2
       443:       trying file=/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/tls/x86_64/libtensorflow_framework.so.2
       443:       trying file=/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/tls/libtensorflow_framework.so.2
       443:       trying file=/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/x86_64/libtensorflow_framework.so.2
       443:       trying file=/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/libtensorflow_framework.so.2
       443:       trying file=/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/../tls/x86_64/libtensorflow_framework.so.2
       443:       trying file=/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/../tls/libtensorflow_framework.so.2
       443:       trying file=/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/../x86_64/libtensorflow_framework.so.2
       443:       trying file=/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.2
       443:
       443:     find library=librt.so.1 [0]; searching
       443:      search path=/usr/local/lib/python3.5/dist-packages/tensorflow_core/python:/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/..             (RPATH from file /usr/local/lib/python3.5/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so)
       443:       trying file=/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/librt.so.1
       443:       trying file=/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/../librt.so.1
       443:      search path=           (LD_LIBRARY_PATH)
       443:      search cache=/etc/ld.so.cache
       443:       trying file=/lib/x86_64-linux-gnu/librt.so.1
       443:
       443:     find library=libstdc++.so.6 [0]; searching
       443:      search path=/usr/local/lib/python3.5/dist-packages/tensorflow_core/python:/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/..             (RPATH from file /usr/local/lib/python3.5/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so)
       443:       trying file=/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/libstdc++.so.6
       443:       trying file=/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/../libstdc++.so.6
       443:      search path=           (LD_LIBRARY_PATH)
       443:      search cache=/etc/ld.so.cache
       443:       trying file=/usr/lib/x86_64-linux-gnu/libstdc++.so.6
       443:
       443:     find library=libgcc_s.so.1 [0]; searching
       443:      search path=/usr/local/lib/python3.5/dist-packages/tensorflow_core/python:/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/..             (RPATH from file /usr/local/lib/python3.5/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so)
       443:       trying file=/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/libgcc_s.so.1
       443:       trying file=/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/../libgcc_s.so.1
       443:      search path=           (LD_LIBRARY_PATH)
       443:      search cache=/etc/ld.so.cache
       443:       trying file=/lib/x86_64-linux-gnu/libgcc_s.so.1
       443:
       443:
       443:     calling init: /lib/x86_64-linux-gnu/libgcc_s.so.1
       443:
       443:
       443:     calling init: /usr/lib/x86_64-linux-gnu/libstdc++.so.6
       443:
       443:
       443:     calling init: /lib/x86_64-linux-gnu/librt.so.1
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.2
       443:
       443:     find library=libhdfs.so [0]; searching
       443:      search path=/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/..           (RPATH from file /usr/local/lib/python3.5/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so)
       443:       trying file=/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/../libhdfs.so
       443:      search path=/usr/local/lib/python3.5/dist-packages/tensorflow_core/python:/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/..             (RPATH from file /usr/local/lib/python3.5/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so)
       443:       trying file=/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/libhdfs.so
       443:       trying file=/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/../libhdfs.so
       443:      search path=           (LD_LIBRARY_PATH)
       443:      search cache=/etc/ld.so.cache
       443:      search path=/lib/x86_64-linux-gnu/tls/x86_64:/lib/x86_64-linux-gnu/tls:/lib/x86_64-linux-gnu/x86_64:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu/tls/x86_64:/usr/lib/x86_64-linux-gnu/tls:/usr/lib/x86_64-linux-gnu/x86_64:/usr/lib/x86_64-linux-gnu:/lib/tls/x86_64:/lib/tls:/lib/x86_64:/lib:/usr/lib/tls/x86_64:/usr/lib/tls:/usr/lib/x86_64:/usr/lib              (system search path)
       443:       trying file=/lib/x86_64-linux-gnu/tls/x86_64/libhdfs.so
       443:       trying file=/lib/x86_64-linux-gnu/tls/libhdfs.so
       443:       trying file=/lib/x86_64-linux-gnu/x86_64/libhdfs.so
       443:       trying file=/lib/x86_64-linux-gnu/libhdfs.so
       443:       trying file=/usr/lib/x86_64-linux-gnu/tls/x86_64/libhdfs.so
       443:       trying file=/usr/lib/x86_64-linux-gnu/tls/libhdfs.so
       443:       trying file=/usr/lib/x86_64-linux-gnu/x86_64/libhdfs.so
       443:       trying file=/usr/lib/x86_64-linux-gnu/libhdfs.so
       443:       trying file=/lib/tls/x86_64/libhdfs.so
       443:       trying file=/lib/tls/libhdfs.so
       443:       trying file=/lib/x86_64/libhdfs.so
       443:       trying file=/lib/libhdfs.so
       443:       trying file=/usr/lib/tls/x86_64/libhdfs.so
       443:       trying file=/usr/lib/tls/libhdfs.so
       443:       trying file=/usr/lib/x86_64/libhdfs.so
       443:       trying file=/usr/lib/libhdfs.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/google/protobuf/internal/_api_implementation.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/google/protobuf/pyext/_message.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/lib/python3.5/lib-dynload/_csv.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/lib/python3.5/lib-dynload/termios.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/tensorflow_core/python/framework/fast_tensor_util.so
       443:
       443:     find library=libuuid.so.1 [0]; searching
       443:      search path=           (LD_LIBRARY_PATH)
       443:      search cache=/etc/ld.so.cache
       443:       trying file=/lib/x86_64-linux-gnu/libuuid.so.1
       443:
       443:
       443:     calling init: /lib/x86_64-linux-gnu/libuuid.so.1
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/wrapt/_wrappers.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/lib/python3.5/lib-dynload/_json.cpython-35m-x86_64-linux-gnu.so
       443:
       443:     find library=libssl.so.1.0.0 [0]; searching
       443:      search path=           (LD_LIBRARY_PATH)
       443:      search cache=/etc/ld.so.cache
       443:       trying file=/lib/x86_64-linux-gnu/libssl.so.1.0.0
       443:
       443:
       443:     calling init: /lib/x86_64-linux-gnu/libssl.so.1.0.0
       443:
       443:
       443:     calling init: /usr/lib/python3.5/lib-dynload/_ssl.cpython-35m-x86_64-linux-gnu.so
       443:
       443:     find library=libhdf5-49599f4e.so.103.0.0 [0]; searching
       443:      search path=/usr/local/lib/python3.5/dist-packages/h5py/.libs/tls/x86_64:/usr/local/lib/python3.5/dist-packages/h5py/.libs/tls:/usr/local/lib/python3.5/dist-packages/h5py/.libs/x86_64:/usr/local/lib/python3.5/dist-packages/h5py/.libs                (RPATH from file /usr/local/lib/python3.5/dist-packages/h5py/_errors.cpython-35m-x86_64-linux-gnu.so)
       443:       trying file=/usr/local/lib/python3.5/dist-packages/h5py/.libs/tls/x86_64/libhdf5-49599f4e.so.103.0.0
       443:       trying file=/usr/local/lib/python3.5/dist-packages/h5py/.libs/tls/libhdf5-49599f4e.so.103.0.0
       443:       trying file=/usr/local/lib/python3.5/dist-packages/h5py/.libs/x86_64/libhdf5-49599f4e.so.103.0.0
       443:       trying file=/usr/local/lib/python3.5/dist-packages/h5py/.libs/libhdf5-49599f4e.so.103.0.0
       443:
       443:     find library=libhdf5_hl-db841637.so.100.1.1 [0]; searching
       443:      search path=/usr/local/lib/python3.5/dist-packages/h5py/.libs          (RPATH from file /usr/local/lib/python3.5/dist-packages/h5py/_errors.cpython-35m-x86_64-linux-gnu.so)
       443:       trying file=/usr/local/lib/python3.5/dist-packages/h5py/.libs/libhdf5_hl-db841637.so.100.1.1
       443:
       443:     find library=libsz-1c7dd0cf.so.2.0.1 [0]; searching
       443:      search path=/usr/local/lib/python3.5/dist-packages/h5py/.libs/./tls/x86_64:/usr/local/lib/python3.5/dist-packages/h5py/.libs/./tls:/usr/local/lib/python3.5/dist-packages/h5py/.libs/./x86_64:/usr/local/lib/python3.5/dist-packages/h5py/.libs/.                (RPATH from file /usr/local/lib/python3.5/dist-packages/h5py/.libs/libhdf5-49599f4e.so.103.0.0)
       443:       trying file=/usr/local/lib/python3.5/dist-packages/h5py/.libs/./tls/x86_64/libsz-1c7dd0cf.so.2.0.1
       443:       trying file=/usr/local/lib/python3.5/dist-packages/h5py/.libs/./tls/libsz-1c7dd0cf.so.2.0.1
       443:       trying file=/usr/local/lib/python3.5/dist-packages/h5py/.libs/./x86_64/libsz-1c7dd0cf.so.2.0.1
       443:       trying file=/usr/local/lib/python3.5/dist-packages/h5py/.libs/./libsz-1c7dd0cf.so.2.0.1
       443:
       443:     find library=libaec-2147abcd.so.0.0.4 [0]; searching
       443:      search path=/usr/local/lib/python3.5/dist-packages/h5py/.libs/.                (RPATH from file /usr/local/lib/python3.5/dist-packages/h5py/.libs/libhdf5-49599f4e.so.103.0.0)
       443:       trying file=/usr/local/lib/python3.5/dist-packages/h5py/.libs/./libaec-2147abcd.so.0.0.4
       443:
       443:     find library=libz-a147dcb0.so.1.2.3 [0]; searching
       443:      search path=/usr/local/lib/python3.5/dist-packages/h5py/.libs/.                (RPATH from file /usr/local/lib/python3.5/dist-packages/h5py/.libs/libhdf5-49599f4e.so.103.0.0)
       443:       trying file=/usr/local/lib/python3.5/dist-packages/h5py/.libs/./libz-a147dcb0.so.1.2.3
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/h5py/.libs/./libz-a147dcb0.so.1.2.3
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/h5py/.libs/./libaec-2147abcd.so.0.0.4
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/h5py/.libs/./libsz-1c7dd0cf.so.2.0.1
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/h5py/.libs/libhdf5-49599f4e.so.103.0.0
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/h5py/.libs/libhdf5_hl-db841637.so.100.1.1
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/h5py/_errors.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/h5py/h5.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/h5py/defs.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/h5py/_objects.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/h5py/_conv.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/h5py/h5r.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/h5py/h5t.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/h5py/utils.cpython-35m-x86_64-linux-gnu.so
       443:
       452:     find library=libc.so.6 [0]; searching
       452:      search path=/usr/local/nvidia/lib/tls/x86_64:/usr/local/nvidia/lib/tls:/usr/local/nvidia/lib/x86_64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64/tls/x86_64:/usr/local/nvidia/lib64/tls:/usr/local/nvidia/lib64/x86_64:/usr/local/nvidia/lib64          (LD_LIBRARY_PATH)
       452:       trying file=/usr/local/nvidia/lib/tls/x86_64/libc.so.6
       452:       trying file=/usr/local/nvidia/lib/tls/libc.so.6
       452:       trying file=/usr/local/nvidia/lib/x86_64/libc.so.6
       452:       trying file=/usr/local/nvidia/lib/libc.so.6
       452:       trying file=/usr/local/nvidia/lib64/tls/x86_64/libc.so.6
       452:       trying file=/usr/local/nvidia/lib64/tls/libc.so.6
       452:       trying file=/usr/local/nvidia/lib64/x86_64/libc.so.6
       452:       trying file=/usr/local/nvidia/lib64/libc.so.6
       452:      search cache=/etc/ld.so.cache
       452:       trying file=/lib/x86_64-linux-gnu/libc.so.6
       452:
       452:
       452:     calling init: /lib/x86_64-linux-gnu/libc.so.6
       452:
       452:
       452:     initialize program: /bin/sh
       452:
       452:
       452:     transferring control: /bin/sh
       452:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/h5py/h5z.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/h5py/h5a.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/h5py/h5s.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/h5py/h5p.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/h5py/h5ac.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/h5py/_proxy.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/h5py/h5d.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/h5py/h5ds.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/h5py/h5f.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/h5py/h5g.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/h5py/h5i.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/h5py/h5fd.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/h5py/h5pl.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/h5py/h5o.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/h5py/h5l.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/scipy/_lib/_ccallback_c.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/scipy/sparse/_sparsetools.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/scipy/sparse/_csparsetools.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/scipy/sparse/csgraph/_shortest_path.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/scipy/sparse/csgraph/_tools.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/scipy/sparse/csgraph/_traversal.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/scipy/sparse/csgraph/_min_spanning_tree.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/scipy/sparse/csgraph/_reordering.cpython-35m-x86_64-linux-gnu.so
       443:
       443:     find library=libjpeg-3b10b538.so.9.3.0 [0]; searching
       443:      search path=/usr/local/lib/python3.5/dist-packages/PIL/.libs/tls/x86_64:/usr/local/lib/python3.5/dist-packages/PIL/.libs/tls:/usr/local/lib/python3.5/dist-packages/PIL/.libs/x86_64:/usr/local/lib/python3.5/dist-packages/PIL/.libs            (RPATH from file /usr/local/lib/python3.5/dist-packages/PIL/_imaging.cpython-35m-x86_64-linux-gnu.so)
       443:       trying file=/usr/local/lib/python3.5/dist-packages/PIL/.libs/tls/x86_64/libjpeg-3b10b538.so.9.3.0
       443:       trying file=/usr/local/lib/python3.5/dist-packages/PIL/.libs/tls/libjpeg-3b10b538.so.9.3.0
       443:       trying file=/usr/local/lib/python3.5/dist-packages/PIL/.libs/x86_64/libjpeg-3b10b538.so.9.3.0
       443:       trying file=/usr/local/lib/python3.5/dist-packages/PIL/.libs/libjpeg-3b10b538.so.9.3.0
       443:
       443:     find library=libopenjp2-b3d7668a.so.2.3.1 [0]; searching
       443:      search path=/usr/local/lib/python3.5/dist-packages/PIL/.libs           (RPATH from file /usr/local/lib/python3.5/dist-packages/PIL/_imaging.cpython-35m-x86_64-linux-gnu.so)
       443:       trying file=/usr/local/lib/python3.5/dist-packages/PIL/.libs/libopenjp2-b3d7668a.so.2.3.1
       443:
       443:     find library=libtiff-8267adfe.so.5.4.0 [0]; searching
       443:      search path=/usr/local/lib/python3.5/dist-packages/PIL/.libs           (RPATH from file /usr/local/lib/python3.5/dist-packages/PIL/_imaging.cpython-35m-x86_64-linux-gnu.so)
       443:       trying file=/usr/local/lib/python3.5/dist-packages/PIL/.libs/libtiff-8267adfe.so.5.4.0
       443:
       443:     find library=liblzma-6cd627ed.so.5.2.4 [0]; searching
       443:      search path=/usr/local/lib/python3.5/dist-packages/PIL/.libs/./tls/x86_64:/usr/local/lib/python3.5/dist-packages/PIL/.libs/./tls:/usr/local/lib/python3.5/dist-packages/PIL/.libs/./x86_64:/usr/local/lib/python3.5/dist-packages/PIL/.libs/.            (RPATH from file /usr/local/lib/python3.5/dist-packages/PIL/.libs/libtiff-8267adfe.so.5.4.0)
       443:       trying file=/usr/local/lib/python3.5/dist-packages/PIL/.libs/./tls/x86_64/liblzma-6cd627ed.so.5.2.4
       443:       trying file=/usr/local/lib/python3.5/dist-packages/PIL/.libs/./tls/liblzma-6cd627ed.so.5.2.4
       443:       trying file=/usr/local/lib/python3.5/dist-packages/PIL/.libs/./x86_64/liblzma-6cd627ed.so.5.2.4
       443:       trying file=/usr/local/lib/python3.5/dist-packages/PIL/.libs/./liblzma-6cd627ed.so.5.2.4
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/PIL/.libs/./liblzma-6cd627ed.so.5.2.4
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/PIL/.libs/libjpeg-3b10b538.so.9.3.0
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/PIL/.libs/libtiff-8267adfe.so.5.4.0
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/PIL/.libs/libopenjp2-b3d7668a.so.2.3.1
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/PIL/_imaging.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/scipy/ndimage/_nd_image.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/scipy/ndimage/_ni_label.cpython-35m-x86_64-linux-gnu.so
       443:
       443:     find library=libopenblasp-r0-2ecf47d5.3.7.dev.so [0]; searching
       443:      search path=/usr/local/lib/python3.5/dist-packages/scipy/linalg/../.libs/tls/x86_64:/usr/local/lib/python3.5/dist-packages/scipy/linalg/../.libs/tls:/usr/local/lib/python3.5/dist-packages/scipy/linalg/../.libs/x86_64:/usr/local/lib/python3.5/dist-packages/scipy/linalg/../.libs            (RPATH from file /usr/local/lib/python3.5/dist-packages/scipy/linalg/_fblas.cpython-35m-x86_64-linux-gnu.so)
       443:       trying file=/usr/local/lib/python3.5/dist-packages/scipy/linalg/../.libs/tls/x86_64/libopenblasp-r0-2ecf47d5.3.7.dev.so
       443:       trying file=/usr/local/lib/python3.5/dist-packages/scipy/linalg/../.libs/tls/libopenblasp-r0-2ecf47d5.3.7.dev.so
       443:       trying file=/usr/local/lib/python3.5/dist-packages/scipy/linalg/../.libs/x86_64/libopenblasp-r0-2ecf47d5.3.7.dev.so
       443:       trying file=/usr/local/lib/python3.5/dist-packages/scipy/linalg/../.libs/libopenblasp-r0-2ecf47d5.3.7.dev.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/scipy/linalg/../.libs/libopenblasp-r0-2ecf47d5.3.7.dev.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/scipy/linalg/_fblas.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/scipy/linalg/_flapack.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/scipy/linalg/_flinalg.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/scipy/linalg/_solve_toeplitz.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/scipy/linalg/_decomp_update.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/scipy/linalg/cython_blas.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/scipy/linalg/cython_lapack.cpython-35m-x86_64-linux-gnu.so
       443:
       443:
       443:     calling init: /usr/local/lib/python3.5/dist-packages/tensorflow_core/lite/experimental/microfrontend/python/ops/_audio_microfrontend_op.so
       443:
       443:
       443:     calling fini: /usr/local/bin/python [0]
       443:
       443:
       443:     calling fini: /lib/x86_64-linux-gnu/libutil.so.1 [0]
       443:
       443:
       443:     calling fini: /lib/x86_64-linux-gnu/libexpat.so.1 [0]
       443:
       443:
       443:     calling fini: /lib/x86_64-linux-gnu/libz.so.1 [0]
       443:
       443:
       443:     calling fini: /usr/lib/python3.5/lib-dynload/_opcode.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/numpy/core/_multiarray_umath.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/numpy/core/_multiarray_tests.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/numpy/linalg/lapack_lite.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/numpy/linalg/_umath_linalg.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/numpy/core/../.libs/libopenblasp-r0-34a18dc3.3.7.so [0]
       443:
       443:
       443:     calling fini: /usr/lib/python3.5/lib-dynload/_bz2.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /lib/x86_64-linux-gnu/libbz2.so.1.0 [0]
       443:
       443:
       443:     calling fini: /usr/lib/python3.5/lib-dynload/_lzma.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /lib/x86_64-linux-gnu/liblzma.so.5 [0]
       443:
       443:
       443:     calling fini: /usr/lib/python3.5/lib-dynload/_decimal.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/lib/x86_64-linux-gnu/libmpdec.so.2 [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/numpy/fft/_pocketfft_internal.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/numpy/random/mtrand.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/numpy/random/common.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/numpy/random/bounded_integers.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/numpy/random/mt19937.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/numpy/random/bit_generator.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/lib/python3.5/lib-dynload/_hashlib.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/numpy/random/philox.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/numpy/random/pcg64.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/numpy/random/sfc64.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/numpy/random/generator.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/google/protobuf/internal/_api_implementation.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/google/protobuf/pyext/_message.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/lib/python3.5/lib-dynload/_csv.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/lib/python3.5/lib-dynload/termios.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/tensorflow_core/python/framework/fast_tensor_util.so [0]
       443:
       443:
       443:     calling fini: /lib/x86_64-linux-gnu/libuuid.so.1 [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/wrapt/_wrappers.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/lib/python3.5/lib-dynload/_json.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/lib/python3.5/lib-dynload/_ssl.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /lib/x86_64-linux-gnu/libssl.so.1.0.0 [0]
       443:
       443:
       443:     calling fini: /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/h5py/_errors.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/h5py/h5.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/h5py/defs.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/h5py/_objects.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/h5py/_conv.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/h5py/h5r.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/h5py/h5t.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/h5py/utils.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/h5py/h5z.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/h5py/h5a.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/h5py/h5s.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/h5py/h5p.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/h5py/h5ac.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/h5py/_proxy.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/h5py/h5d.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/h5py/h5ds.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/h5py/h5f.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/h5py/h5g.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/h5py/h5i.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/h5py/h5fd.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/h5py/h5pl.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/h5py/h5o.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/h5py/h5l.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/h5py/.libs/libhdf5_hl-db841637.so.100.1.1 [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/h5py/.libs/libhdf5-49599f4e.so.103.0.0 [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/h5py/.libs/./libsz-1c7dd0cf.so.2.0.1 [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/h5py/.libs/./libaec-2147abcd.so.0.0.4 [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/scipy/_lib/_ccallback_c.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/scipy/sparse/_sparsetools.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/scipy/sparse/_csparsetools.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/scipy/sparse/csgraph/_shortest_path.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/scipy/sparse/csgraph/_tools.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/scipy/sparse/csgraph/_traversal.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/scipy/sparse/csgraph/_min_spanning_tree.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/scipy/sparse/csgraph/_reordering.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/PIL/_imaging.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/PIL/.libs/libopenjp2-b3d7668a.so.2.3.1 [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/PIL/.libs/libtiff-8267adfe.so.5.4.0 [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/PIL/.libs/libjpeg-3b10b538.so.9.3.0 [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/h5py/.libs/./libz-a147dcb0.so.1.2.3 [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/PIL/.libs/./liblzma-6cd627ed.so.5.2.4 [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/scipy/ndimage/_nd_image.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/scipy/ndimage/_ni_label.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/scipy/linalg/_fblas.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/scipy/linalg/_flapack.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/scipy/linalg/_flinalg.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/scipy/linalg/_solve_toeplitz.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/scipy/linalg/_decomp_update.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/scipy/linalg/cython_blas.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/scipy/linalg/cython_lapack.cpython-35m-x86_64-linux-gnu.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/scipy/linalg/../.libs/libopenblasp-r0-2ecf47d5.3.7.dev.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/numpy/core/../.libs/libgfortran-ed201abd.so.3.0.0 [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/tensorflow_core/lite/experimental/microfrontend/python/ops/_audio_microfrontend_op.so [0]
       443:
       443:
       443:     calling fini: /usr/local/lib/python3.5/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.2 [0]
       443:
       443:
       443:     calling fini: /usr/lib/x86_64-linux-gnu/libstdc++.so.6 [0]
       443:
       443:
       443:     calling fini: /lib/x86_64-linux-gnu/libgcc_s.so.1 [0]
       443:
       443:
       443:     calling fini: /lib/x86_64-linux-gnu/librt.so.1 [0]
       443:
       443:
       443:     calling fini: /lib/x86_64-linux-gnu/libm.so.6 [0]
       443:
       443:
       443:     calling fini: /lib/x86_64-linux-gnu/libdl.so.2 [0]
       443:
       443:
       443:     calling fini: /lib/x86_64-linux-gnu/libpthread.so.0 [0]
       443:

closed time in 2 months

CNOCycle

issue commenttensorflow/tensorflow

Memory leak when using py_function inside tf.data.Dataset

Thanks Kibeom. So the problem is that _py_funcs_used_in_graph is not garbage collected?

QiJune

comment created time in 2 months

issue commenttensorflow/tensorflow

Memory leak when using py_function inside tf.data.Dataset

@kkimdev this seems related to your work on properly garbage collecting traced functions that go out of scope ... could you please take a look? thank you

QiJune

comment created time in 2 months

issue commenttensorflow/tensorflow

tf.dataset may out of memory

You would still compute gradients at the end of the epoch and store the result in a tensor variable.

Here is a simple snippet that illustrates how to augment an input pipeline with an adversarial example computed based on a resource variable that can change between epochs:

import tensorflow.compat.v2 as tf

tf.enable_v2_behavior()

gradients = tf.Variable([0, 0, 0, 0, 0])

ds = tf.data.Dataset.range(5)

def map_fn(i):
  return tf.stack([i, i + tf.cast(gradients[i], tf.int64)])

ds = ds.map(map_fn).unbatch()

for elem in ds:
  print(elem.numpy())

gradients.assign([-1, 1, -1, 1, -1])

for elem in ds:
  print(elem.numpy())

When executed, the above program produces:

0
0
1
1
2
2
3
3
4
4
0
-1
1
2
2
1
3
4
4
3
CNOCycle

comment created time in 2 months

issue commenttensorflow/tensorflow

ImageDataGenerator does not work with tpu

Do you have a back-of-the-envelope analysis to back your claim? A thousand kernels should be representable in <1MB, so (in the large-scale case you seem to be using as a justification for not implementing sharing of the op registry) you are talking about sending less than 1GB to a single machine, which, given a 100Gb network, would take less than a second. In other words, I would expect the cost of this to be negligible compared to other costs at such a large scale.
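
For concreteness, here is a rough version of that arithmetic as a small Python sketch; the per-kernel size and the number of workers are assumptions, not measurements:

kernel_bytes = 1000                 # assume ~1 KB to describe one kernel
num_kernels = 1000                  # "a thousand kernels" -> ~1 MB per worker
num_workers = 1000                  # assumed size of the large-scale job
total_bytes = kernel_bytes * num_kernels * num_workers  # ~1 GB sent to one machine
seconds = total_bytes * 8 / 100e9   # over a 100 Gb/s network
print(seconds)                      # ~0.08 s, i.e. well under a second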

Shiro-LK

comment created time in 2 months

pull request commenttensorflow/tensorflow

improve_bfc_allocator

@gbaned I am not familiar with this code. Please add the author of these methods as a reviewer.

Xuxue1

comment created time in 2 months

issue commenttensorflow/tensorflow

Memory leak with tf.shuffle, doesn't release buffer memory

@gadagashwini the user program does not provide evidence of the issue, so I am not sure what you mean by "replicating" the issue.

When I modify the program to track memory use:

...
def view_used_mem():
  used_mem = psutil.virtual_memory().used
  print("used memory: {} Mb".format(used_mem / 1024 / 1024))


def main(argv):
  del argv

  test_tfASRDataSet = TFASRDataSet(248, 192)
  test_tfASRDataSet.setDataSetIterator()
  test_iter = test_tfASRDataSet.iterator
  test_next = test_iter.get_next()

  run_config = tf.ConfigProto()
  run_config.gpu_options.allow_growth = True
  run_config.allow_soft_placement = True

  with tf.Session(config=run_config) as sess:

    for i in range(100):

      sess.run(test_iter.initializer)

      while True:
        try:
          loss_list = sess.run([test_next])
        except tf.errors.OutOfRangeError:
          print('train epoch %d finish' % (i + 1))
          view_used_mem()
          break
...

and run it using an internal TF 1.15 build, the memory does not increase between epochs.

Furthermore, I have confirmed that the shuffle buffer is properly disposed of between epochs by enabling logging of tf.data iterator constructors / destructors:

...
I0102 13:32:09.063783   34490 dataset.h:887] Iterator::Model constructor
I0102 13:32:09.063821   34490 dataset.h:887] Iterator::Model::MapAndBatch constructor
I0102 13:32:09.063834   34490 dataset.h:887] Iterator::Model::MapAndBatch::Shuffle constructor
I0102 13:32:09.064474   34490 dataset.h:891] Iterator::Model::MapAndBatch::Shuffle destructor
I0102 13:32:09.064498   34490 dataset.h:891] Iterator::Model::MapAndBatch destructor
I0102 13:32:09.064547   34490 dataset.h:891] Iterator::Model destructor
I0102 13:32:09.100376   34570 dataset.h:887] Iterator::Model::MapAndBatch::Shuffle::FlatMap constructor
I0102 13:32:09.100427   34570 dataset.h:887] Iterator::Model::MapAndBatch::Shuffle::FlatMap::FromTensor constructor
I0102 13:32:09.101287   34570 dataset.h:887] Iterator::Model::MapAndBatch::Shuffle::FlatMap[0]::Generator constructor
I0102 13:32:19.100455   34570 shuffle_dataset_op.cc:185] Filling up shuffle buffer (this may take a while): 10188 of 30000
I0102 13:32:29.100570   34570 shuffle_dataset_op.cc:185] Filling up shuffle buffer (this may take a while): 20548 of 30000
I0102 13:32:34.942694   34570 dataset.h:891] Iterator::Model::MapAndBatch::Shuffle::FlatMap[0]::Generator destructor
I0102 13:32:34.944426   34570 dataset.h:891] Iterator::Model::MapAndBatch::Shuffle::FlatMap::FromTensor destructor
I0102 13:32:34.944455   34570 dataset.h:887] Iterator::Model::MapAndBatch::Shuffle::FlatMap constructor
I0102 13:32:34.944461   34570 dataset.h:891] Iterator::Model::MapAndBatch::Shuffle::FlatMap destructor
I0102 13:32:34.944477   34570 dataset.h:887] Iterator::Model::MapAndBatch::Shuffle::FlatMap::FromTensor constructor
I0102 13:32:34.946282   34570 dataset.h:891] Iterator::Model::MapAndBatch::Shuffle::FlatMap::FromTensor destructor
I0102 13:32:34.946300   34570 dataset.h:891] Iterator::Model::MapAndBatch::Shuffle::FlatMap destructor
I0102 13:32:34.946307   34570 shuffle_dataset_op.cc:234] Shuffle buffer filled.
train epoch 2 finish
used memory: 17154.9375 Mb
I0102 13:32:36.206103   34490 dataset.h:887] Iterator::Model constructor
I0102 13:32:36.206153   34490 dataset.h:887] Iterator::Model::MapAndBatch constructor
I0102 13:32:36.206179   34490 dataset.h:887] Iterator::Model::MapAndBatch::Shuffle constructor
I0102 13:32:36.206920   34490 dataset.h:891] Iterator::Model::MapAndBatch::Shuffle destructor
I0102 13:32:36.206948   34490 dataset.h:891] Iterator::Model::MapAndBatch destructor
I0102 13:32:36.206960   34490 dataset.h:891] Iterator::Model destructor
I0102 13:32:36.213317   34616 dataset.h:887] Iterator::Model::MapAndBatch::Shuffle::FlatMap constructor
I0102 13:32:36.213364   34616 dataset.h:887] Iterator::Model::MapAndBatch::Shuffle::FlatMap::FromTensor constructor
I0102 13:32:36.214274   34616 dataset.h:887] Iterator::Model::MapAndBatch::Shuffle::FlatMap[0]::Generator constructor
I0102 13:32:46.213390   34616 shuffle_dataset_op.cc:185] Filling up shuffle buffer (this may take a while): 10597 of 30000
I0102 13:32:56.214073   34616 shuffle_dataset_op.cc:185] Filling up shuffle buffer (this may take a while): 20791 of 30000
I0102 13:33:01.736376   34616 dataset.h:891] Iterator::Model::MapAndBatch::Shuffle::FlatMap[0]::Generator destructor
I0102 13:33:01.736460   34616 dataset.h:891] Iterator::Model::MapAndBatch::Shuffle::FlatMap::FromTensor destructor
I0102 13:33:01.736486   34616 dataset.h:887] Iterator::Model::MapAndBatch::Shuffle::FlatMap constructor
I0102 13:33:01.736495   34616 dataset.h:891] Iterator::Model::MapAndBatch::Shuffle::FlatMap destructor
I0102 13:33:01.736521   34616 dataset.h:887] Iterator::Model::MapAndBatch::Shuffle::FlatMap::FromTensor constructor
I0102 13:33:01.736606   34616 dataset.h:891] Iterator::Model::MapAndBatch::Shuffle::FlatMap::FromTensor destructor
I0102 13:33:01.736621   34616 dataset.h:891] Iterator::Model::MapAndBatch::Shuffle::FlatMap destructor
I0102 13:33:01.736632   34616 shuffle_dataset_op.cc:234] Shuffle buffer filled.
train epoch 3 finish
used memory: 17118.4609375 Mb
I0102 13:33:02.945749   34490 dataset.h:887] Iterator::Model constructor
I0102 13:33:02.945802   34490 dataset.h:887] Iterator::Model::MapAndBatch constructor
I0102 13:33:02.945827   34490 dataset.h:887] Iterator::Model::MapAndBatch::Shuffle constructor
I0102 13:33:02.946734   34490 dataset.h:891] Iterator::Model::MapAndBatch::Shuffle destructor
I0102 13:33:02.946763   34490 dataset.h:891] Iterator::Model::MapAndBatch destructor
I0102 13:33:02.946774   34490 dataset.h:891] Iterator::Model destructor
I0102 13:33:02.973794   34643 dataset.h:887] Iterator::Model::MapAndBatch::Shuffle::FlatMap constructor
I0102 13:33:02.973845   34643 dataset.h:887] Iterator::Model::MapAndBatch::Shuffle::FlatMap::FromTensor constructor
I0102 13:33:02.974667   34643 dataset.h:887] Iterator::Model::MapAndBatch::Shuffle::FlatMap[0]::Generator constructor
I0102 13:33:12.974602   34643 shuffle_dataset_op.cc:185] Filling up shuffle buffer (this may take a while): 10238 of 30000
I0102 13:33:22.973838   34643 shuffle_dataset_op.cc:185] Filling up shuffle buffer (this may take a while): 20354 of 30000
I0102 13:33:28.369214   34643 dataset.h:891] Iterator::Model::MapAndBatch::Shuffle::FlatMap[0]::Generator destructor
I0102 13:33:28.369300   34643 dataset.h:891] Iterator::Model::MapAndBatch::Shuffle::FlatMap::FromTensor destructor
I0102 13:33:28.369320   34643 dataset.h:887] Iterator::Model::MapAndBatch::Shuffle::FlatMap constructor
I0102 13:33:28.369330   34643 dataset.h:891] Iterator::Model::MapAndBatch::Shuffle::FlatMap destructor
I0102 13:33:28.369357   34643 dataset.h:887] Iterator::Model::MapAndBatch::Shuffle::FlatMap::FromTensor constructor
I0102 13:33:28.369455   34643 dataset.h:891] Iterator::Model::MapAndBatch::Shuffle::FlatMap::FromTensor destructor
I0102 13:33:28.369468   34643 dataset.h:891] Iterator::Model::MapAndBatch::Shuffle::FlatMap destructor
I0102 13:33:28.369478   34643 shuffle_dataset_op.cc:234] Shuffle buffer filled.
train epoch 4 finish
used memory: 17145.87109375 Mb
kindernerd

comment created time in 2 months

issue commenttensorflow/tensorflow

Support other types of Tensors in tf.data.Dataset.from_generator

This would indeed be useful. My preference for the API would be to add an optional output_spec argument which -- when set -- would be used in place of output_types and output_shapes to identify the output signature of the generator.

This is also a good candidate for external contribution as it is narrowly scoped, so I am marking it as "contributions welcome".
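
A minimal sketch of what that could look like; note that output_spec here is the proposed argument described above, not an existing parameter, and the ragged-tensor generator is purely illustrative:

import tensorflow as tf

def gen():
  yield tf.ragged.constant([[1, 2], [3]])

ds = tf.data.Dataset.from_generator(
    gen,
    output_spec=tf.RaggedTensorSpec(shape=[None, None], dtype=tf.int32))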

konopik

comment created time in 2 months

issue commenttensorflow/tensorflow

ImageDataGenerator does not work with tpu

I think the cleanest solution to this problem would be to make the placer aware of what kernels are available across processes and have it place the PyFunc op on the local host and execute it remotely.

@mrry @donglimm

Shiro-LK

comment created time in 2 months

issue commenttensorflow/tensorflow

Lack of dataset length or cardinality causes `BaseCollectiveExecutor::StartAbort Out of range` issues

@dubey @omalleyt12 why is the collective executor involved here?

@rxsang @tomerk what is the recommendation for Keras users? My expectation would be that Keras can handle this scenario (fit with dataset whose cardinality is not divisible by batch size) correctly.
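
For reference, a minimal sketch of the scenario being discussed (shapes, batch size and model are arbitrary): 10 examples batched by 3, so the final batch is partial:

import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices(
    (tf.random.uniform([10, 4]), tf.random.uniform([10, 1]))).batch(3)
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="sgd", loss="mse")
model.fit(ds, epochs=2)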

00krishna

comment created time in 2 months

issue commenttensorflow/tensorflow

tf.dataset may out of memory

As far as I know, TensorFlow does not provide APIs for garbage collecting parts of a graph.

If the random_noise could be represented by a variable, what you could do instead of recreating the dataset from scratch every epoch is to add a dataset = dataset.map(apply_noise) transformation to your input pipeline that accesses the random noise state (e.g. represented as a TF resource variable).

Last but not least, the on_epoch_end method in your example does not seem to be applying random noise. Instead, it seems to be permuting all elements of the training set. The equivalent can be achieved by adding a dataset = dataset.shuffle(num_elements) transformation to your input pipeline, avoiding the need to create a copy of the dataset.
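
A minimal sketch of that alternative, assuming the training data fits in memory (the random tensor below is a stand-in for the real data):

import tensorflow as tf

features = tf.random.uniform([1000, 28, 28])  # stand-in for the real training data
dataset = tf.data.Dataset.from_tensor_slices(features)
# Reshuffles all elements every epoch without creating a copy of the dataset.
dataset = dataset.shuffle(1000, reshuffle_each_iteration=True)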

CNOCycle

comment created time in 2 months

issue commenttensorflow/tensorflow

dataset's map operation performance is O(N) with data shape for O(1) operations, can be orders of magnitude slower than using a dataset from generator.

Thank you for the explanation. In the case where the data is small and already present in memory, you can use from_generator to avoid the from_tensor_slices data copy.
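
A minimal sketch of that approach; the array name, dtype and shapes are illustrative assumptions:

import numpy as np
import tensorflow as tf

data = np.ones((1000, 1000), dtype=np.float32)  # data already materialized in memory

def gen():
  for row in data:
    yield row  # yields rows of the existing array instead of copying it up front

ds = tf.data.Dataset.from_generator(
    gen, output_types=tf.float32, output_shapes=(1000,))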

I am going to close this issue as the original reason for opening (O(n) time complexity of map) has been refuted and the O(n) time complexity of from_tensor_slices is expected.

@aaudiber as per our offline discussion, it makes sense to comment on the from_tensor_slices and unbatch performance in either the API documentation or the tf.data performance guide.

@feature-engineer feel free to create a PR updating the performance guide with information that would help others with a similar use case. Thank you.

feature-engineer

comment created time in 2 months

issue closedtensorflow/tensorflow

dataset's map operation performance is O(N) with data shape for O(1) operations, can be orders of magnitude slower than using a dataset from generator.

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux 18.04, Windows 7
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 2.0.0
  • Python version: 3.7.4
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version:
  • GPU model and memory:

You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with: 1. TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)" 2. TF 2.0: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"

Describe the current behavior dataset.map takes 10 times longer when getting 10 times bigger tensors in the dataset, regardless of the function being passed. e.g. data.map(lambda x: x[0, 0]) would take 100 times longer, if data contained tensors of shape (1000, 1000) rather than (100, 100).

Describe the expected behavior I would expect that if the mapped function has an O(1) performance, mapping it on any dataset would still have O(1) performance not O(N).

Code to reproduce the issue Provide a reproducible test case that is the bare minimum necessary to generate the problem.

import timeit

timeit.timeit("""
tf.reduce_sum([x for x in data.map(lambda x: x[:10, :10]).repeat().take(1000)])
""",
setup="""
import tensorflow as tf
tf.autograph.set_verbosity(0, False)
x = tf.ones((1, 10000,1000))
data = tf.data.Dataset.from_tensor_slices([x])
""", number=10)

takes 10 times longer than

timeit.timeit("""
tf.reduce_sum([x for x in data.map(lambda x: x[:10, :10]).repeat().take(1000)])
""",
setup="""
import tensorflow as tf
tf.autograph.set_verbosity(0, False)
x = tf.ones((1, 1000,1000))
data = tf.data.Dataset.from_tensor_slices([x])
""", number=10)

Where the only difference is x having 10 times the values in the first example.

Other info / logs Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

closed time in 2 months

feature-engineer

issue commenttensorflow/tensorflow

DatasetVariantWrapper "No unary variant device copy function found"

In that case the answer is no. The tf.data implementation prevents such copies from happening silently (by not implementing / registering a method for copying the dataset variant) and instead relies on explicit APIs (such as the multi-device iterator API, https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/data/ops/multi_device_iterator_ops.py, used by tf.distribute) to copy data between devices.

mwalmsley

comment created time in 2 months

issue commenttensorflow/tensorflow

Dataset scan loses variable modifications

I believe that this is another instance of the "datasets do not propagate control dependencies" bug (b/142341957), which I plan to fix as soon as possible.

leandro-gracia-gil

comment created time in 2 months

issue commenttensorflow/tensorflow

dataset's map operation performance is O(N) with data shape for O(1) operations, can be orders of magnitude slower than using a dataset from generator.

@aaudiber from_tensor_slices is not that different from from_tensors followed by unbatch. For datasets materialized into memory (e.g. as a NumPy array), from_tensor_slices is commonly used in TF tutorials as an example of how to convert these into a dataset, and this will be suboptimal (cf. this issue).

@feature-engineer from_tensor_slices represents an "ease-of-use vs. performance" trade-off, and TF tutorials optimize for simplicity over performance. In many cases, the overhead of the data copy is negligible compared to the per-element pre-processing computation. For best performance, you want to not only avoid / minimize data copies but also vectorize and parallelize processing. An example of a performant and complex use of tf.data for handling NumPy objects can be found in Keras, which converts NumPy objects to a tf.data.Dataset that shuffles and batches the original data (in a way that outperforms doing the same in NumPy).

As a side note, is there a reason you are materializing your images in memory outside of tf.data as opposed to reading the images (presumably from files) using tf.data (example)?
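
For reference, a hedged sketch of the file-based alternative; the glob pattern, JPEG format and target image size are assumptions:

import tensorflow as tf

files = tf.data.Dataset.list_files("/path/to/images/*.jpg")

def load_image(path):
  image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
  return tf.image.resize(image, [224, 224])

ds = files.map(load_image, num_parallel_calls=tf.data.experimental.AUTOTUNE)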

feature-engineer

comment created time in 2 months

issue commenttensorflow/tensorflow

DatasetVariantWrapper "No unary variant device copy function found"

Yes, for instance https://github.com/tensorflow/tensorflow/issues/34519.

mwalmsley

comment created time in 2 months

issue commenttensorflow/tensorflow

tf.range + for x,y in dataset issue

The problem here is an assumption that the placer algorithm makes -- that any tensor can be copied between different devices -- which is generally not true (e.g. for dataset variant tensors). The proper way to fix this would be to make sure that the placer only allows placements that collocate a node with its non-copyable inputs.

SSSxCCC

comment created time in 2 months

issue closedtensorflow/tensorflow

tf.py_function is unusable in map function

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): 16.04
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 2.0.0
  • Python version: 3.7
  • CUDA/cuDNN version: 10.0/7
  • GPU model and memory: Tesla T4, 16gb

Describe the current behavior Attempting to use tf.py_function inside a tf.data.Dataset.map. Failure occurs because the tensor returned from py_function is accessed during build time of the graph, but the function is expected to be evaluated during runtime of the graph.

Attempt 1

import tensorflow as tf
import numpy as np
#make magical data that requires python
def make_data(i): 
    return np.cast[np.uint8](i) * np.ones([20,256,256,3], dtype=np.float32) / 10.

#clean up magical data
@tf.function 
def make_clean_data(i): 
    ones = tf.py_function(make_data,[i],tf.float32) 
    ones = tf.reshape(ones,ones.get_shape()) 
    ones =tf.image.resize(ones,[224,224]) 
    return ones

ds = tf.data.Dataset.range(10)
ds.map(make_clean_data)

Error:

ValueError: in converted code:

    <ipython-input-13-6dd8c09d760a>:8 make_clean_data  *
        ones = tf.reshape(ones,ones.get_shape())
    /home/hollowgalaxy/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow_core/python/ops/array_ops.py:131 reshape
        result = gen_array_ops.reshape(tensor, shape, name)
    /home/hollowgalaxy/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_array_ops.py:8117 reshape
        "Reshape", tensor=tensor, shape=shape, name=name)
    /home/hollowgalaxy/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow_core/python/framework/op_def_library.py:545 _apply_op_helper
        (input_name, err))

    ValueError: Tried to convert 'shape' to a tensor and failed. Error: Cannot convert a partially known TensorShape to a Tensor: <unknown>

Attempt 2

import tensorflow as tf
import numpy as np
#make magical data that requires python
def make_data(i): 
    return np.cast[np.uint8](i) * np.ones([20,256,256,3], dtype=np.float32) / 10.

#clean up magical data
@tf.function 
def make_clean_data(i): 
    ones = tf.py_function(make_data,[i],tf.float32) 
    ones.set_shape(ones.get_shape()) 
    ones =tf.image.resize(ones,[224,224]) 
    return ones

ds = tf.data.Dataset.range(10)
ds.map(make_clean_data)

Error

ValueError: in converted code:

    <ipython-input-14-311bd0ebc814>:9 make_clean_data  *
        ones =tf.image.resize(ones,[224,224])
    /home/hollowgalaxy/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow_core/python/ops/image_ops_impl.py:1319 resize_images_v2
        skip_resize_if_same=False)
    /home/hollowgalaxy/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow_core/python/ops/image_ops_impl.py:1034 _resize_images_common
        raise ValueError('\'images\' contains no shape.')

    ValueError: 'images' contains no shape.

Attempt 3

import tensorflow as tf
import numpy as np
#make magical data that requires python
def make_data(i): 
    data = np.cast[np.uint8](i) * np.zeros([20,256,256,3], dtype=np.float32) / 10.
    return  data, data.shape

#clean up magical data
@tf.function 
def make_clean_data(i): 
    ones, shape = tf.py_function(make_data,[i],(tf.float32, tf.int32)) 
    ones.set_shape(shape) 
    ones =tf.image.resize(ones,[224,224]) 
    return ones

ds = tf.data.Dataset.range(10)
ds.map(make_clean_data)

Error

OperatorNotAllowedInGraphError: in converted code:

    <ipython-input-11-a2af44684940>:9 make_clean_data  *
        ones.set_shape(shape)
    /home/hollowgalaxy/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:626 set_shape
        shape = tensor_shape.TensorShape(shape)
    /home/hollowgalaxy/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow_core/python/framework/tensor_shape.py:776 __init__
        self._dims = [as_dimension(d) for d in dims_iter]
    /home/hollowgalaxy/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow_core/python/framework/tensor_shape.py:776 <listcomp>
        self._dims = [as_dimension(d) for d in dims_iter]
    /home/hollowgalaxy/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:547 __iter__
        self._disallow_iteration()
    /home/hollowgalaxy/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:540 _disallow_iteration
        self._disallow_when_autograph_enabled("iterating over `tf.Tensor`")
    /home/hollowgalaxy/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:518 _disallow_when_autograph_enabled
        " decorating it directly with @tf.function.".format(task))

    OperatorNotAllowedInGraphError: iterating over `tf.Tensor` is not allowed: AutoGraph did not convert this function. Try decorating it directly with @tf.function.

Attempt 4

import tensorflow as tf
import numpy as np
#make magical data that requires python
def make_data(i): 
    data = np.cast[np.uint8](i) * np.zeros([20,256,256,3], dtype=np.float32) / 10.
    return  data, data.shape

#clean up magical data
@tf.function 
def make_clean_data(i): 
    ones, shape = tf.py_function(make_data,[i],(tf.float32, tf.int32)) 
    ones = tf.reshape(ones, shape) 
    ones =tf.image.resize(ones,[224,224]) 
    return ones

ds = tf.data.Dataset.range(10)
ds.map(make_clean_data)

Error

ValueError: in converted code:

    <ipython-input-16-e967bbec561a>:13 make_clean_data  *
        ones =tf.image.resize(ones,[224,224])
    /home/hollowgalaxy/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow_core/python/ops/image_ops_impl.py:1319 resize_images_v2
        skip_resize_if_same=False)
    /home/hollowgalaxy/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow_core/python/ops/image_ops_impl.py:1034 _resize_images_common
        raise ValueError('\'images\' contains no shape.')

    ValueError: 'images' contains no shape.

Attempt 5

import tensorflow as tf
import numpy as np
#make magical data that requires python
def make_data(i): 
    data = np.cast[np.uint8](i) * np.zeros([20,256,256,3], dtype=np.float32) / 10.
    return  data, data.shape

#clean up magical data
@tf.function 
def make_clean_data(ones, shape): 
    ones.set_shape(shape) 
    ones =tf.image.resize(ones,[224,224]) 
    return ones

ds = tf.data.Dataset.range(10)
ds = ds.map(lambda i:  tf.py_function(make_data,[i],(tf.float32, tf.int32)))
ds.map(make_clean_data)

Error

OperatorNotAllowedInGraphError: in converted code:

    <ipython-input-12-6bfeded17ffe>:8 make_clean_data  *
        ones.set_shape(shape)
    /home/hollowgalaxy/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:626 set_shape
        shape = tensor_shape.TensorShape(shape)
    /home/hollowgalaxy/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow_core/python/framework/tensor_shape.py:776 __init__
        self._dims = [as_dimension(d) for d in dims_iter]
    /home/hollowgalaxy/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow_core/python/framework/tensor_shape.py:776 <listcomp>
        self._dims = [as_dimension(d) for d in dims_iter]
    /home/hollowgalaxy/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:547 __iter__
        self._disallow_iteration()
    /home/hollowgalaxy/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:540 _disallow_iteration
        self._disallow_when_autograph_enabled("iterating over `tf.Tensor`")
    /home/hollowgalaxy/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:518 _disallow_when_autograph_enabled
        " decorating it directly with @tf.function.".format(task))

    OperatorNotAllowedInGraphError: iterating over `tf.Tensor` is not allowed: AutoGraph did not convert this function. Try decorating it directly with @tf.function.

closed time in 2 months

hollowgalaxy

issue commenttensorflow/tensorflow

tf.py_function is unusable in map function

py_function produces tensors of unknown shape and rank (as shape inference does not work on arbitrary Python code), so if you need the shape (e.g. for tf.image.resize) you need to set it based on statically available information (i.e. it cannot be something that py_function returns).

For example, you can use set_shape to set the rank of the py_function output to 4, which will be sufficient for tf.image.resize to operate correctly:

import tensorflow.compat.v2 as tf
import numpy as np

tf.enable_v2_behavior()

#make magical data that requires python
def make_data(i): 
    return np.cast[np.uint8](i) * np.ones([20,256,256,3], dtype=np.float32) / 10.
    

#clean up magical data
def make_clean_data(i): 
    ones = tf.py_function(make_data,[i],tf.float32) 
    ones.set_shape(tf.TensorShape([None, None, None, None]))
    ones = tf.image.resize(ones, [224,224])
    return ones

ds = tf.data.Dataset.range(10)
ds = ds.map(make_clean_data)
hollowgalaxy

comment created time in 2 months

issue commenttensorflow/tensorflow

Memory leaks when using tf.strings.split in map_func for tf.data.Dataset.map with eager execution.

What happens when you run the tf.strings.split experiment for 20 epochs? Does the memory use plateau?

Can you reproduce the issue without tf.data (simply wrapping your _parse_data, both with and without strings.split, in tf.function and applying it to the elements generated by gen_samples in your inner loop)?
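Something along these lines (a sketch; _parse_data and gen_samples refer to the functions from your repro, and the split here is just a stand-in):

import tensorflow as tf

@tf.function
def parse(line):
    # Stand-in for _parse_data, with and without tf.strings.split.
    return tf.strings.split(line, sep=",")

for epoch in range(20):
    for line in gen_samples():  # the generator from your repro
        parse(line)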

workingloong

comment created time in 2 months

issue commenttensorflow/tensorflow

tf.dataset may out of memory

You are (likely) running out of memory because every call to data_pipeline will embed the dataX and dataY inputs into the graph (as constant inputs to from_tensor_slices).

What does your "update" to trainX do? (In your repro, you are mocking this out using + 0). Can you provide an example that uses a placeholder that does not have the memory issue?

Taking a step back, you seem to try to avoid reliance on the placeholder API because it is deprecated in TF 2 but your repro program still uses Session which is also deprecated.

My recommendation would be to write your program in idiomatic TF 2 style using eager mode and to use tf.function when efficient graph execution is needed. For instance, creating the dataset in eager mode will result in proper garbage collection of the objects once they are no longer referenced (as opposed to them being accumulated in the default graph, which is what is going on in your program).
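For example, a minimal sketch of that structure (the data shapes and the + 0 update are placeholders mirroring your repro):

import numpy as np
import tensorflow as tf

def data_pipeline(data_x, data_y, batch_size=32):
    # Built eagerly, the dataset (and the constants it embeds) is garbage
    # collected once nothing references it anymore.
    return tf.data.Dataset.from_tensor_slices((data_x, data_y)).batch(batch_size)

train_x = np.random.rand(1024, 16).astype(np.float32)
train_y = np.random.rand(1024, 1).astype(np.float32)

for epoch in range(10):
    train_x = train_x + 0  # stand-in for the per-epoch update
    for x, y in data_pipeline(train_x, train_y):
        pass  # training step, ideally wrapped in a tf.function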

CNOCycle

comment created time in 2 months

issue commenttensorflow/tensorflow

Memory leaks when using tf.strings.split in map_func for tf.data.Dataset.map with eager execution.

When I change the order in which the tf.strings.split and tf.py_function experiments are executed, I see fast memory growth for the tf.py_function experiment and not for the tf.strings.split one. This suggests that the memory behavior you are observing is unrelated to tf.strings.split vs. tf.py_function -- as it occurs irrespective of which experiment you run first -- and is more likely related to caching of objects inside of TensorFlow's implementation and the deferred nature of Python's garbage collection.

workingloong

comment created time in 2 months

issue commenttensorflow/tensorflow

tf.range + for x,y in dataset issue

As far as I can tell, this is an op placement issue -- the tf.print op will be placed on the GPU because tf.range will be placed on the GPU (which is the default policy for ops that have a GPU kernel when a GPU is present), but the tf.data input will reside on the CPU. The right thing to do is for the tf.print op to be placed on the CPU.
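In the meantime, one possible workaround (a sketch, not taken from your repro) is to pin the print to the CPU explicitly:

import tensorflow as tf

dataset = tf.data.Dataset.range(10)

for x in dataset:
    # Pinning the op to the CPU avoids copying the dataset-produced
    # tensor to the GPU just to print it.
    with tf.device("/CPU:0"):
        tf.print(x)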

@mhong could you please triage this to someone on the runtime team to investigate. Thank you.

SSSxCCC

comment created time in 2 months

issue commenttensorflow/tensorflow

logging broken with Python-3.8: findCaller() takes from 1 to 2 positional arguments but 3 were given

@jvishnuvardhan please assign "tensorflow_datasets" / TFDS issue to @rsepassi for triage.

olk

comment created time in 2 months

Pull request review commenttensorflow/community

RFC: Keras categorical inputs

[Quoted RFC diff, excerpted to the lines under review:]

+`tf.to_sparse`
+def to_sparse(input, ignore_value):
+  """Convert dense/sparse tensor to sparse while dropping user specified values.

This filtering can be built out of existing operations. You can call tf.where on the values and pass the result to tf.sparse.retain, which is simple enough that I do not see the point of introducing syntactic sugar for that.
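A minimal sketch of such a composition (the helper name mirrors the proposed op and is only for illustration; tf.not_equal is used here to build the boolean mask that tf.sparse.retain expects):

import tensorflow as tf

def to_sparse(dense, ignore_value):
    # from_dense already drops zeros / empty strings; retain then drops
    # the user-specified ignore value as well.
    sp = tf.sparse.from_dense(dense)
    return tf.sparse.retain(sp, tf.not_equal(sp.values, ignore_value))

sp = to_sparse(tf.constant([["a", "Unknown"], ["Unknown", "b"]]), "Unknown")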

tanzhenyu

comment created time in 2 months

Pull request review commenttensorflow/community

RFC: Keras categorical inputs

[Quoted RFC diff, excerpted to the lines under review:]

+We also propose a `to_sparse` op to convert dense tensors to sparse tensors given user specified ignore values. This op can be used in both `tf.data` or [TF Transform](https://www.tensorflow.org/tfx/transform/get_started). In previous feature column world, "" is ignored for dense string input and -1 is ignored for dense int input.

I realized we already have from_dense, so perhaps you should just extend it with an option to specify the value to be ignored?
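That is, something like the following (hypothetical; ignore_value is not an existing argument of tf.sparse.from_dense):

# Hypothetical API extension, not current TF behavior.
sp = tf.sparse.from_dense(dense_strings, ignore_value="Unknown")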

tanzhenyu

comment created time in 2 months

Pull request review commenttensorflow/community

RFC: Keras categorical inputs

[Quoted RFC diff, excerpted to the lines under review:]

+`tf.keras.layers.CategoryCrossing`
+CategoryCrossing(PreprocessingLayer):
+"""This layer transforms multiple categorical inputs to categorical outputs
+   by Cartesian product. and hash the output if necessary.
+   If any input is sparse, then output is sparse, otherwise dense."""

Got it. Maybe you should say, "If any of the inputs is sparse, then all outputs will be sparse. Otherwise, all outputs will be dense."

tanzhenyu

comment created time in 2 months

Pull request review commenttensorflow/community

RFC: Keras categorical inputs

[Quoted RFC diff, excerpted to the lines under review:]

+      vocabulary: the vocabulary to lookup the input. If it is a file, it represents the
+              source vocab file; If it is a list/tuple, it represents the source vocab
+              list; If it is None, the vocabulary can later be set.

My point is that this should be documented. Stating that the vocabulary can be set later without showing how is not useful.
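For concreteness, the docstring's own example (["a", "c", "d", "a", "x"] with vocabulary ["a", "b", "c", "d"] and one OOV token mapping to [1, 3, 4, 1, 0]) can be reproduced with the existing `tf.lookup` ops. The sketch below is illustrative only and is not the proposed API; how `CategoryLookup` actually receives a late-bound vocabulary (via `adapt`, a setter, or something else) is exactly what should be spelled out:

```python
import tensorflow as tf

# Sketch only: reproduce the documented CategoryLookup mapping with existing ops.
# In-vocab tokens get ids 1..4; anything out of vocabulary falls back to id 0.
vocab = ["a", "b", "c", "d"]
init = tf.lookup.KeyValueTensorInitializer(
    keys=tf.constant(vocab),
    values=tf.constant([1, 2, 3, 4], dtype=tf.int64))
table = tf.lookup.StaticHashTable(init, default_value=tf.constant(0, tf.int64))

print(table.lookup(tf.constant(["a", "c", "d", "a", "x"])))  # [1 3 4 1 0]
```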

tanzhenyu

comment created time in 2 months

issue comment on tensorflow/tensorflow

dataset's map operation performance is O(N) with data shape for O(1) operations, can be orders of magnitude slower than using a dataset from generator.

So do you think we should update the documentation of all tf.data transformations that currently perform a data copy (batch, padded_batch, from_tensor_slices, unbatch, ...)?

feature-engineer

comment created time in 2 months

issue comment on tensorflow/tensorflow

dataset's map operation performance is O(N) with data shape for O(1) operations, can be orders of magnitude slower than using a dataset from generator.

I am generally opposed to documenting implementation details (e.g. which methods are implemented using a data copy and which are not).

@aaudiber what are your thoughts on this? This is related to our discussion of deprecating from_tensor_slices() in favor of from_tensors().unbatch() -- if we do that, should we document that unbatch() performs a data copy?
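For context, a small sketch of the two spellings in question; `from_tensors`, `unbatch`, and `from_tensor_slices` are all existing tf.data APIs, and whether either spelling copies the data is the implementation detail under discussion:

```python
import numpy as np
import tensorflow as tf

data = np.arange(12).reshape(4, 3)

# Current spelling: slice along the first dimension.
ds1 = tf.data.Dataset.from_tensor_slices(data)

# Alternative spelling discussed above: wrap the whole tensor, then unbatch it.
ds2 = tf.data.Dataset.from_tensors(data).unbatch()

# Both pipelines yield the same four elements of shape (3,).
for a, b in zip(ds1, ds2):
    assert np.array_equal(a.numpy(), b.numpy())
```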

feature-engineer

comment created time in 2 months

issue comment on tensorflow/tensorflow

Error occurred when finalizing GeneratorDataset iterator

@guptapriya I realized that the generator dataset is used in the multi-device iterator. This seems related to the newly added support for cancellation in tf.data.

The good news is that, as you pointed out, the warning is superfluous. The bad news is that, as far as I can tell, this warning will be present for all tf.distribute jobs in TF 2.1 (given how tf.data cancellation is implemented). I will look into having a fix for this cherrypicked into TF 2.1.

olk

comment created time in 2 months

Pull request review comment on tensorflow/community

RFC: Keras categorical inputs

(Quoted context: the proposed `tf.to_sparse(input, ignore_value)` op, documented as "Convert dense/sparse tensor to sparse while dropping user specified values.")

What is the benefit of calling this API with a SparseTensor input?

tanzhenyu

comment created time in 2 months

Pull request review comment on tensorflow/community

RFC: Keras categorical inputs

(Quoted context: the proposal of a `to_sparse` op that converts dense tensors to sparse tensors given user-specified ignore values; feature columns previously ignored "" for dense string input and -1 for dense int input.)

I would prefer if the SparseTensor class had a from_dense method.
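For reference, `tf.sparse.from_dense` already exists, but it treats zeros as the implicit entries; below is a rough sketch, using only existing ops and an illustrative helper name, of the ignore-value behaviour that the `to_sparse` proposal describes:

```python
import tensorflow as tf

def dense_to_sparse(dense, ignore_value):
    """Illustrative helper: drop `ignore_value` entries instead of zeros."""
    mask = tf.not_equal(dense, ignore_value)
    indices = tf.where(mask)
    values = tf.gather_nd(dense, indices)
    dense_shape = tf.shape(dense, out_type=tf.int64)
    return tf.SparseTensor(indices=indices, values=values, dense_shape=dense_shape)

dense = tf.constant([["a", "Unknown"], ["b", "c"]])
sp = dense_to_sparse(dense, ignore_value="Unknown")

# Existing API for comparison: zeros become the implicit values.
sp0 = tf.sparse.from_dense(tf.constant([[1, 0], [0, 3]]))
```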

tanzhenyu

comment created time in 2 months

Pull request review comment on tensorflow/community

RFC: Keras categorical inputs

(Quoted context: the proposed `tf.keras.layers.CategoryHashing` layer.)

Please add an example for this layer.
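For what it's worth, here is the kind of example I have in mind, written against the existing `tf.strings.to_hash_bucket_fast` op as a stand-in; the proposal does not say which hash function `CategoryHashing` uses, so the concrete bucket ids are illustrative:

```python
import tensorflow as tf

num_bins = 3
x = tf.constant([["kitchenware"], ["electronics"], ["sports"]])

# Stand-in for the proposed CategoryHashing layer: hash each string into one
# of `num_bins` buckets. The actual layer may use a different hash function.
hashed = tf.strings.to_hash_bucket_fast(x, num_buckets=num_bins)
print(hashed)  # shape [3, 1]; each entry is a bucket id in [0, num_bins)
```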

tanzhenyu

comment created time in 2 months

Pull request review comment on tensorflow/community

RFC: Keras categorical inputs

(Quoted context: the proposed `tf.keras.layers.CategoryEncoding` layer.)

Please add an example for this layer.
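Likewise, a sketch of what the docstring could show for `mode="sum"` (a count encoding), expressed with existing ops rather than the proposed layer:

```python
import tensorflow as tf

num_categories = 4
# One sample with multivalent categorical indices.
x = tf.constant([[1, 2, 2, 3]])

# Equivalent of CategoryEncoding(num_categories=4, mode="sum"): sum the
# one-hot encodings over the values axis.
encoded = tf.reduce_sum(tf.one_hot(x, depth=num_categories), axis=-2)
print(encoded)  # [[0. 1. 2. 1.]]
```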

tanzhenyu

comment created time in 2 months

Pull request review comment tensorflow/community

RFC: Keras categorical inputs

[diff context: `CategoryCrossing` docstring example crossing `a=[[1, 2]]` and `b=[[1, 3]]` into indices i, j, k, l]

I don't understand this example. What is a1 vs a2? What will the "single integer tensor [[i, j, k, l]]" actually look like for the given inputs?

tanzhenyu

comment created time in 2 months

Pull request review comment tensorflow/community

RFC: Keras categorical inputs

[diff context: `depth` argument docstring of `CategoryCrossing`, which embeds the inline example `[ab;ac;bc]`]

the example should be moved to the "Example" section below

tanzhenyu

comment created time in 2 months

Pull request review comment tensorflow/community

RFC: Keras categorical inputs

[diff context: `CategoryCrossing` class docstring describing its dense/sparse output behavior]

OOC, why is the wording here different than in the other API endpoints (it seems that the intended behavior is the same?)

tanzhenyu

comment created time in 2 months

Pull request review comment tensorflow/community

RFC: Keras categorical inputs

[diff context: `CategoryCrossing` class docstring ("...by Cartesian product. and hash the output if necessary.")]

nit: remove extra .

tanzhenyu

comment created time in 2 months

Pull request review comment tensorflow/community

RFC: Keras categorical inputs

[diff context: `vocabulary` argument docstring of `CategoryLookup`]

what is the format of the file? how do you set the vocabulary later? what is the expected use of the adapt method?

tanzhenyu

comment created time in 2 months

issue comment tensorflow/tensorflow

dataset's map operation performance is O(N) with data shape for O(1) operations, can be orders of magnitude slower than using a dataset from generator.

Your setup only builds the input pipeline graph. The from_tensor_slices data copy is performed in each call to get_next, which only happens inside your tf.reduce_sum. The following presentation explains in detail how tf.data input pipeline construction and execution works.

Your gist is still using from_tensor_slices. Can you reproduce the issue by using from_tensors with differently sized inputs?
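To make that suggestion concrete, here is a minimal sketch of such a reproduction; the sizes and the constant-time map function are arbitrary stand-ins, not taken from the gist:

```python
import tensorflow as tf

def constant_time_fn(x):
    # O(1) regardless of how large x is.
    return tf.shape(x)[0]

for n in (1_000, 1_000_000):  # two arbitrarily chosen input sizes
    x = tf.zeros([n])
    # from_tensors wraps the tensor as a single element, without the
    # per-element copy that from_tensor_slices performs.
    ds = tf.data.Dataset.from_tensors(x).map(constant_time_fn)
    for value in ds:
        print(n, value.numpy())
```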

feature-engineer

comment created time in 2 months

issue comment tensorflow/tensorflow

dataset's map operation performance is O(N) with data shape for O(1) operations, can be orders of magnitude slower than using a dataset from generator.

Your benchmark is not measuring the speed of map execution accurately. from_tensor_slices performs a data copy, so executing your input pipeline with inputs of different sizes will result in different amounts of data being copied (which would on its own explain the O(1) vs O(n) difference).

A better way to test whether map is O(1) (which it should be for constant-time operations) vs. O(n) is to use tf.data.Dataset.from_tensors(X).repeat() varying the size of X in your setup.

Now as for the from_generator comparison: Python-based from_generator is expected to perform better than tf.data map for trivial per-element work, because map incurs roughly 10us of overhead per element to execute the user-defined computation through the TensorFlow executor. The benefits of using map are realized when you perform non-trivial computation that can benefit from the graph optimizations TensorFlow performs and/or from parallel execution enabled through the num_parallel_calls argument.
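A rough sketch of such a measurement, assuming a trivial constant-time map function and arbitrarily chosen sizes for X:

```python
import time
import tensorflow as tf

def constant_time_fn(x):
    return x[0]  # constant-time, independent of the size of x

for n in (10, 10_000, 10_000_000):
    x = tf.zeros([n])
    ds = (tf.data.Dataset.from_tensors(x)
          .repeat()
          .map(constant_time_fn,
               num_parallel_calls=tf.data.experimental.AUTOTUNE)
          .take(1_000))
    start = time.time()
    for _ in ds:
        pass
    print(f"size={n}: {time.time() - start:.3f}s for 1000 elements")
```

If map really is O(1) in the element size for constant-time functions, the measured time should stay roughly flat as n grows.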

feature-engineer

comment created time in 2 months

issue comment tensorflow/tensorflow

Error occurred when finalizing GeneratorDataset iterator

@guptapriya @qlzh727 this seems to be an issue related to tf.distribute + tf.keras. In particular, as far as I can tell, the user code does not use tf.data.Dataset.from_generator but the error indicates that GeneratorDataset is used. Could you please triage? Thanks.

olk

comment created time in 2 months


issue comment tensorflow/tensorflow

Start background prefetching without calling next on dataset iterator

That's an interesting idea. prefetch is not the only asynchronous transformation, though (parallel map and parallel interleave are examples of others).

Having said that, I can imagine having a tf.data.Options setting that specifies whether an input pipeline should start its background threads eagerly (similar to how you can specify whether an input pipeline is allowed to execute in a non-deterministic fashion). This option would then trigger a graph rewrite at input pipeline instantiation time, which would set the execution mode for all asynchronous ops in the input pipeline.
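For illustration only, usage could mirror how existing options are set today; the eager-start option named below is hypothetical and does not exist in tf.data:

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(1_000_000).shuffle(100_000).prefetch(8)

options = tf.data.Options()
# Existing knob: allow elements to be produced in non-deterministic order.
options.experimental_deterministic = False
# Hypothetical knob (not a real API): start the background threads of all
# asynchronous transformations as soon as the iterator is created.
# options.experimental_start_background_threads_eagerly = True
dataset = dataset.with_options(options)
```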

In other words, this could work. I will re-open this issue and mark it as contributions welcome as the tf.data team will not have cycles to prioritize this in Q1.

Tilps

comment created time in 2 months

issue commenttensorflow/tensorflow

DatasetVariantWrapper "No unary variant device copy function found"

In general, it is possible for this to happen -- for instance, if the user program makes incorrect use of device scopes.

@mwalmsley you can create a minimal reproducible example by starting with your user program and repeatedly removing complexity from it (e.g. replacing a complex model with a trivial one, replacing a complex input pipeline with a trivial one) until nothing more can be removed without the issue going away. When the root cause is unknown, this process is helpful in zeroing in on the issue.

mwalmsley

comment created time in 2 months

issue commenttensorflow/tensorflow

Start background prefetching without calling next on dataset iterator

I agree that it would be generally beneficial to be able to overlap the "warming up" of the input pipeline with other computation (such as model compilation).

However, it is non-trivial to provide this functionality given the current tf.data implementation. Each tf.data transformation has its own C++ iterator implementation with a "GetNext" method which upon its first invocation starts the background activity (such as prefetching). To add support for "start prefetching" we would need to add a new method to all C++ iterator implementations which starts the background activity (if the iterator is asynchronous) or recursively invokes this method on the input iterators (if the iterator is synchronous).

As an additional concern, if the input pipeline uses custom tf.data transformations (e.g. from SIG I/O), those transformations would also need to implement this method in order for the "start prefetch" functionality to work correctly across all input pipelines.

All in all, I think that the current implementation prevents this functionality from being implemented in an effective and robust manner, and for that reason the tf.data team does not plan to support it.

PS: If you are writing your own custom training loop (and calling iter and next explicitly), then you should be able to use tf.group or tf.tuple + tf.py_function to wrap the model compilation so that it runs in parallel with fetching the first element.
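As an alternative sketch of the same idea in eager mode, one could warm up the iterator on a background Python thread while the model compiles; the dataset and model below are placeholders, and this is not the tf.group / tf.tuple approach described above:

```python
import threading

import tensorflow as tf

dataset = tf.data.Dataset.range(100000).shuffle(buffer_size=100000).prefetch(1)
iterator = iter(dataset)
first_element = []

def warm_up():
  # Pulling the first element forces the shuffle buffer to fill and starts
  # any background prefetching.
  first_element.append(next(iterator))

warmup_thread = threading.Thread(target=warm_up)
warmup_thread.start()

# Build and compile the model while the input pipeline warms up.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
model.compile(optimizer="sgd", loss="mse")

warmup_thread.join()
```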

Tilps

comment created time in 2 months

issue closedtensorflow/tensorflow

Start background prefetching without calling next on dataset iterator


System information

  • TensorFlow version (you are using): 2.0
  • Are you willing to contribute it (Yes/No): Yes

Describe the feature and the current behavior/state. Currently if you have a dataset with a large shuffle buffer, it doesn't start trying to populate that buffer until the first request to retrieve data from the data set. This is a lost opportunity for parallelism during training script startup - the shuffle buffer could be filling in the background while the model is compiling.
Since calling next is blocking, it's not possible to trigger this overlap.

Will this change the current api? How? Maybe - could add a start_prefetching() method. The other idea I had was to make it implicit in calling iter - but that is possibly more useful for my specific use case than more generally.

Who will benefit with this feature? Anyone who cares about the startup time of their training script and uses a large shuffle buffer.

Any Other info.

closed time in 2 months

Tilps

issue commenttensorflow/tensorflow

Tensorflow 2.0 does not iterate through entire dataset when tf.keras.model.fit is called

@rachellim could you please take a look? thanks

michaelarfreed

comment created time in 2 months

issue commenttensorflow/tensorflow

Dataset from TFRecord has unknown shapes.

@aaudiber could you please take a look?

zhangjh915

comment created time in 2 months

pull request commenttensorflow/tensorflow

add dtypes to range dataset op. fix #33414

@gbaned what is this blocked on?

ljwh

comment created time in 2 months

pull request commenttensorflow/tensorflow

add dtypes to range dataset op. fix #33414

@gbaned could you please unblock this PR? thank you

ljwh

comment created time in 2 months

issue commenttensorflow/tensorflow

TF dataset padded_batch: support for batching non-padable component

See https://github.com/tensorflow/community/pull/5/files?short_path=df3302a#diff-df3302ad45d4c94d919eefaefa059780.
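As a hedged sketch of the direction proposed in that RFC, recent TF releases expose tf.data.experimental.dense_to_ragged_batch, which batches variable-length components into a RaggedTensor instead of padding them; the toy dataset below is illustrative:

```python
import tensorflow as tf

# Variable-length vectors: [0], [0, 1], [0, 1, 2], [0, 1, 2, 3].
ds = tf.data.Dataset.range(1, 5).map(lambda x: tf.range(x))
# Each batch is a tf.RaggedTensor; no padding is involved.
ds = ds.apply(tf.data.experimental.dense_to_ragged_batch(batch_size=2))

for batch in ds:
  print(batch)
```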

twangnh

comment created time in 3 months

issue commenttensorflow/tensorflow

Dataset iterator hanging from within a dataset interleave

I can confirm that I can reproduce the issue. This seems to be a general TensorFlow runtime issue related to not properly cancelling TensorFlow computation which executes Python computation (your gen function), which in turn executes TensorFlow computation (the tf.data.TFRecordDataset call).

I do not foresee this being prioritized for a fix and would instead recommend not layering TensorFlow and Python computation this way. In this particular case, instead of doing from_generator, you could use tf.data.TFRecordDataset directly and follow it up with tf.data.Dataset.map that uses py_function, which performs the Python computation you previously did in from_generator.
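A minimal sketch of that restructuring; the file names, output dtype, and the Python-side processing are placeholders:

```python
import tensorflow as tf

filenames = ["data-0.tfrecord", "data-1.tfrecord"]  # placeholder paths

def python_processing(record):
  # Placeholder for the Python work previously done inside the generator.
  return record.numpy()

dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(
    lambda record: tf.py_function(python_processing, [record], tf.string))
```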

andrewstanfordjason

comment created time in 3 months

issue commenttensorflow/tensorflow

TF Dataset - group_by_window without iterating the dataset first (input_window_size)

Assuming elements with the same id appear in consecutive runs, you could do this with scan as follows:

from __future__ import print_function
import tensorflow.compat.v2 as tf

tf.enable_v2_behavior()

ds = tf.data.Dataset.from_tensor_slices(([0, 0, 0, 1, 1, 2, 2, 2, 2, -1], [1, 2, 3, 1, 2, 1, 2, 3, 4, 1]))

empty_batch = tf.constant([], tf.int32, shape=[0,])
initial_state = (-1, empty_batch)

def scan_func(old_state, input_element):
  current_id, accumulated_batch = old_state
  id, feature = input_element

  def _accumulate():
    # Same id as before (or the very first element): extend the current
    # batch and emit an empty batch (filtered out below).
    new_accumulated_batch = tf.concat([accumulated_batch, [feature]], 0)
    new_state = (id, new_accumulated_batch)
    return new_state, (current_id, empty_batch)

  def _accumulate_and_emit():
    # A new id: start a fresh batch and emit the accumulated one.
    new_state = (id, tf.concat([empty_batch, [feature]], 0))
    return new_state, (current_id, accumulated_batch)

  return tf.cond(tf.math.logical_or(current_id == id, current_id == -1), _accumulate, _accumulate_and_emit)

ds = ds.apply(tf.data.experimental.scan(initial_state, scan_func))
ds = ds.filter(lambda id, batch: tf.shape(batch)[0] > 0)

for elem in ds:
  id, batch = elem
  print(id.numpy(), batch.numpy())

which produces:

0 [1 2 3]
1 [1 2]
2 [1 2 3 4]

Note that you need to add a "sentinel" value to the end of your dataset to make sure the last batch is emitted.
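For example, one way to append such a sentinel to the raw (id, feature) dataset before applying scan; the sentinel values are illustrative, and any id that differs from the last real id works:

```python
# Appended before ds.apply(tf.data.experimental.scan(...)).
sentinel = tf.data.Dataset.from_tensors((tf.constant(-1), tf.constant(0)))
ds = ds.concatenate(sentinel)
```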

kpe

comment created time in 3 months

issue commenttensorflow/tensorflow

TF Dataset - group_by_window without iterating the dataset first

group_by_window does not iterate through the entire dataset before outputting something. It iterates through only enough of the dataset for one of the windows to reach window_size. The following example illustrates that:

import itertools
import tensorflow.compat.v2 as tf

tf.enable_v2_behavior()

def gen():
  for i in itertools.count(1):
    print("generate: ", i)
    yield i

ds = tf.data.Dataset.from_generator(gen, (tf.int64), (tf.TensorShape([])))

def key_fn(x):
  return x % 2

def reduce_fn(key, ds):
  return ds.batch(batch_size=2)

ds = ds.apply(tf.data.experimental.group_by_window(key_fn, reduce_fn, window_size=2))

for value in ds.take(4):
  print("output: ", value.numpy())

produces the following output:

('generate: ', 1)
('generate: ', 2)
('generate: ', 3)
('output: ', array([1, 3]))
('generate: ', 4)
('output: ', array([2, 4]))
('generate: ', 5)
('generate: ', 6)
('generate: ', 7)
('output: ', array([5, 7]))
('generate: ', 8)
('output: ', array([6, 8]))
kpe

comment created time in 3 months
