Using has big overhead

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): RHEL 7.5
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 2.1.0
  • Python version: 3.7.4
  • CUDA/cuDNN version: 10.1
  • GPU model and memory: GTX 1080

Describe the current behavior

Using a Dataset reduces performance by a small but significant amount, ~7% for ImageNet-like data.

Describe the expected behavior

Using a Dataset should have no, or only a marginal, performance impact.

Standalone code to reproduce the issue

import tensorflow as tf
from timeit import timeit

def train_step(x, y):
    model.train_on_batch(x, y)

for useData in (True, False):
    model = tf.keras.applications.ResNet50(weights=None, classes=1000)

    if useData:
        x = tf.random.uniform([1, 32, 224, 224, 3])
        y = tf.random.uniform([1, 32, 1], minval=0, maxval=999, dtype=tf.int64)
        dataset =, y)).repeat()

        def train(steps):
            for x, y in dataset.take(steps):
                train_step(x, y)
    else:
        x = tf.random.uniform([32, 224, 224, 3])
        y = tf.random.uniform([32, 1], minval=0, maxval=999, dtype=tf.int64)

        def train(steps):
            for _ in range(steps):
                train_step(x, y)

    train(1)  # warmup: exclude one-time graph tracing from the timing
    t = timeit(lambda: train(50), number=10)
    print('useData: %s -> %s' % (useData, t))

Sample output:

useData: True -> 89.92945478390902
useData: False -> 86.73652107780799

For more realistic training loops (e.g. including callbacks) the difference is even bigger. Some of my tests:

constant: total images/sec: 496.47 (calculation(497.53) + preprocessing(1.06)) 
dataset:  total images/sec: 465.09 (calculation(478.64) + preprocessing(13.55)) 

The first number is calculated from the execution time of the whole training loop (after warmup); the "calculation" number comes from the train step only. Their difference, which I called "preprocessing", is the time spent iterating over the dataset (the for loop calling next on the iterator), and would hence be dominated by preprocessing functions if any were present (none here); it also includes the repeat and take Dataset adapters.
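
For reference, here is a minimal sketch of how such a split can be measured (the helper name and the perf_counter-based timing are my own assumptions, not the code that produced the numbers above). Note that the printed total equals the calculation throughput minus the preprocessing share:

import time

def benchmark(dataset, train_step, steps, batch_size=32):
    # Time the whole loop and, separately, only the train steps;
    # the remainder is spent getting elements from the iterator.
    step_time = 0.0
    loop_start = time.perf_counter()
    for x, y in dataset.take(steps):
        t0 = time.perf_counter()
        train_step(x, y)
        step_time += time.perf_counter() - t0
    loop_time = time.perf_counter() - loop_start

    images = steps * batch_size
    total_ips = images / loop_time   # whole-loop throughput
    calc_ips = images / step_time    # train-step-only throughput
    prep_ips = calc_ips - total_ips  # throughput lost to iteration
    print('total images/sec: %.2f (calculation(%.2f) + preprocessing(%.2f))'
          % (total_ips, calc_ips, prep_ips))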

So there are two conclusions: getting elements from the iterator is quite costly (1 -> 13.6), and even the training loop itself gets slower (498 -> 479). A way to check the first point in isolation is sketched below.

This would be a reason to avoid the dataset API.
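
To check the first conclusion in isolation, the iterator can be timed without any model involved. This micro-benchmark is my own sketch, mirroring the dataset construction from the reproduction above; it is not part of the original report:

import tensorflow as tf
from timeit import timeit

x = tf.random.uniform([1, 32, 224, 224, 3])
y = tf.random.uniform([1, 32, 1], minval=0, maxval=999, dtype=tf.int64)
dataset =, y)).repeat()

def iterate(steps):
    for _ in dataset.take(steps):
        pass  # pay only the cost of next() on the iterator

# average time per step spent purely on iteration
print('%.3f ms/step' % (timeit(lambda: iterate(500), number=1) / 500 * 1e3))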


Answer from aaudiber

@Flamefire The example is slicing a 5D tensor into a 4D tensor (which requires copying the data every step), while the non-Dataset code starts with 4D tensors and therefore doesn't need to copy. To compare apples to apples here, you should define the Dataset data with

x = tf.random.uniform([32, 224, 224, 3])
y = tf.random.uniform([32, 1], minval=0, maxval=999, dtype=tf.int64)
dataset =, y)).repeat()
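
One way to sanity-check the copying claim is to time bare iteration for both constructions; the following comparison is my own sketch, not part of the answer:

import tensorflow as tf
from timeit import timeit

x = tf.random.uniform([1, 32, 224, 224, 3])
y = tf.random.uniform([1, 32, 1], minval=0, maxval=999, dtype=tf.int64)

sliced =, y)).repeat()        # copies the batch out every step
whole =[0], y[0])).repeat()  # wraps the batch, no slicing copy

for name, ds in (('from_tensor_slices', sliced), ('from_tensors', whole)):
    def drain(steps, ds=ds):
        for _ in ds.take(steps):
            pass
    print('%s: %.3f s / 500 steps' % (name, timeit(lambda: drain(500), number=1)))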
