Buggy behaviour of dataset API
Describe the current behavior: At a branching point in a Dataset graph, the node at the root of the branches is resampled for each branch during a single round of execution. With non-randomized inputs to the Dataset, this causes no problems. If the root node comes after a .shuffle() call, however, the branches receive different inputs in the same round of computation.
Describe the expected behavior: Downstream branches should receive the same data even if shuffle() is applied.
Standalone code to reproduce the issue: https://colab.research.google.com/drive/1AeVRilpcGp8zb0hZijTIxGWdL9GkzfN_
This behaviour is also present if the dataset is created from a generator, which handles the shuffling implicitly.
Edit: fixed the links here as well
Answer from aaudiber
This is working as intended. Datasets can be much larger than the memory of a single machine, so Dataset objects act like blueprints for how to produce data (instead of trying to hold the entire dataset at once). Datasets provide a streaming API for consuming data through an iterator, and each branch creates its own iterator over the shuffled dataset. If you want each iterator created on a shuffled dataset to produce elements in the same order, set the reshuffle_each_iteration argument of shuffle() to False:
index = index.shuffle(buffer_size=len(self.index), reshuffle_each_iteration=False)
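For illustration, here is a minimal sketch of that suggestion applied to a branching pipeline. It assumes TensorFlow 2.x in eager mode. The names root, branch_a and branch_b are hypothetical and do not come from the linked notebook, and the two map() calls stand in for whatever the real branches do.

```python
import tensorflow as tf

# Root of the branching: shuffled once, with the order fixed across iterators.
root = tf.data.Dataset.range(10)
root = root.shuffle(buffer_size=10, reshuffle_each_iteration=False)

# Two hypothetical downstream branches built on the same shuffled root.
branch_a = root.map(lambda x: x)
branch_b = root.map(lambda x: x * 10)

# Each branch gets its own iterator over root when the zipped dataset is consumed.
pairs = list(tf.data.Dataset.zip((branch_a, branch_b)).as_numpy_iterator())

# With reshuffle_each_iteration=False, both branches should see the same order,
# so every pair should satisfy b == a * 10.
aligned = all(int(b) == int(a) * 10 for a, b in pairs)
```

With the default reshuffle_each_iteration=True, the two iterators would generally shuffle independently and the pairs would no longer line up, which matches the behaviour reported in the question.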