
enthought/mayavi 772

3D visualization of scientific data in Python

enthought/traits 263

Optional type-checking, data dependencies, and event notifications for Python

enthought/chaco 234

Chaco is a Python package for building interactive and custom 2-D plots.

enthought/traitsui 226

TraitsUI: Traits-capable windowing framework

enthought/pyface 64

pyface: traits-capable windowing framework

enthought/enable 58

Enable: low-level drawing and interaction

enthought/envisage 45

Envisage is a Python-based framework for building applications whose functionalities can be extended by adding "plug-ins".

enthought/ets 32

ets.py is a utility to clone and manage various Enthought Tool Suite packages

issue comment scikit-learn/scikit-learn

Version 1.0 of scikit-learn

Not very happy. Actually, I'd rather have the breaking changes in a v0.99.

Also, I'd like to be convinced that there is no way to avoid a smooth deprecation

Sent from my phone. Please forgive typos and briefness.

On Sep 18, 2020, at 20:23, Nicolas Hug notifications@github.com wrote:

@GaelVaroquaux so would you be OK with introducing a few minor breaking changes in 1.0, provided that the next breaking changes would happen many years from now (if ever)?

I have the impression that during the last meeting, people understood my point of view as "let's make breaking changes often" but that's absolutely not what I'm advocating for.

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/scikit-learn/scikit-learn/issues/14386#issuecomment-695018042

MartinThoma

comment created time in 2 days

issue comment scikit-learn/scikit-learn

Version 1.0 of scikit-learn

I don't have a strong point of view here other than that we should limit breakage

Sent from my phone. Please forgive typos and briefness.

On Sep 18, 2020, at 20:01, Nicolas Hug notifications@github.com wrote:

Thanks @ogrisel, at least I understand your POV because it's consistent. But I still need to understand @amueller and @GaelVaroquaux 's reasoning then ;)

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/scikit-learn/scikit-learn/issues/14386#issuecomment-695007444

MartinThoma

comment created time in 2 days

pull request comment nilearn/nilearn

[RVW] Update Decoding Userguide (using Decoder)

It seems, however, that there is a legitimate code failure:

sphinx.errors.SphinxWarning: /home/circleci/project/examples/02_decoding/plot_haxby_glm_decoding.py failed to execute correctly: Traceback (most recent call last):
  File "/home/circleci/project/examples/02_decoding/plot_haxby_glm_decoding.py", line 106, in <module>
    conditions.append(condition_)
AttributeError: 'numpy.ndarray' object has no attribute 'append'
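
For context, this is the usual failure mode when a label container is a NumPy array rather than a Python list. A minimal sketch of one possible fix (my own illustration, not the actual patch to the example script):

import numpy as np

# Hypothetical reproduction: ndarrays have no `append` method.
conditions = np.array(['face', 'cat'])
condition_ = 'house'

# One possible fix: collect labels in a Python list, then convert once at the end.
conditions = list(conditions)
conditions.append(condition_)
conditions = np.array(conditions)
print(conditions)  # ['face' 'cat' 'house']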

thomasbazeille

comment created time in 2 days

Pull request review comment nilearn/nilearn

[RVW] Update Decoding Userguide (using Decoder)

+"""+Advanced decoding using scikit learn+==========================================++This tutorial opens the box of decoding pipelines to bridge integrated+functionalities provided by the :class:`nilearn.decoding.Decoder` object+with more advanced usecases. It reproduces basic examples functionalities with+direct calls to scikit-learn function and gives pointers to more advanced+objects. If some concepts seem unclear, please refer to the :ref:`documentation+on decoding <decoding_intro>` and in particular to the+:ref:`advanced section <going_further>`. As in many other examples,+we perform decoding of the visual category of a stimuli on Haxby+2001 dataset, focusing on distinguishing two categories : face and cat images.++    * J.V. Haxby et al. "Distributed and Overlapping Representations of Faces+      and Objects in Ventral Temporal Cortex", Science vol 293 (2001), p+      2425.-2430.++.. contents:: **Contents**+    :local:+    :depth: 1++"""++###########################################################################+# Retrieve and load the fMRI data from the Haxby study+# ------------------------------------------------------+#+# First download the data+# ........................+#++# The :func:`nilearn.datasets.fetch_haxby` function will download the+# Haxby dataset composed of fmri images in a Niimg, a spatial mask and a text+# document with label of each image+from nilearn import datasets+haxby_dataset = datasets.fetch_haxby()+mask_filename = haxby_dataset.mask_vt[0]+fmri_filename = haxby_dataset.func[0]+# Loading the behavioral labels+import pandas as pd+behavioral = pd.read_csv(haxby_dataset.session_target[0], delimiter=' ')

Maybe we should use sphinx-gallery a bit more here, and display the "behavioral" dataframe

behavioral = pd.read_csv(haxby_dataset.session_target[0], delimiter=' ')
behavioral

###########################################################################
thomasbazeille

comment created time in 2 days

Pull request review comment nilearn/nilearn

[RVW] Update Decoding Userguide (using Decoder)

+.. _going_further:
+
+==========================================================================
+Running scikit-learn functions for more control on the analysis
+==========================================================================
+
+This section gives pointers to design your own decoding pipelines with
+scikit-learn. This builds on the :ref:`didactic introduction to decoding <decoding_intro>`.
+
+.. note::
+  This documentation gives links and additional definitions needed to work
+  correctly with scikit-learn. For a full code example, please check out: :ref:`sphx_glr_auto_examples_07_advanced_plot_advanced_decoding_scikit.py`
+
+
+.. contents:: **Contents**
+    :local:
+    :depth: 1
+
+
+Performing decoding with scikit-learn
+=======================================
+
+Using scikit-learn estimators
+--------------------------------
+
+You can easily import estimators from the `scikit-learn <http://scikit-learn.org>`
+machine-learning library, those available in the `Decoder` object and many
+others. They all have the `fit` and `predict` functions. For example you can
+directly import the versatile `Support Vector Classifier <http://scikit-learn.org/stable/modules/svm.html>`_ (or SVC).
+
+To learn more about the variety of classifiers available in scikit-learn,
+see the `scikit-learn documentation on supervised learning
+<http://scikit-learn.org/stable/supervised_learning.html>`_).
+
+
+Cross-validation with scikit-learn
+-----------------------------------
+
+To perform cross-validation using a scikit-learn estimator, you should first
+mask the data using a :class:`nilearn.input_data.NiftiMasker`: to extract
+only the voxels inside the mask of interest, and transform 4D input fMRI
+data to 2D arrays (shape (n_timepoints, n_voxels)) that estimators can work on.

Maybe point to one of our examples here (in a topic box, to make it more visible)

thomasbazeille

comment created time in 2 days


pull request comment nilearn/nilearn

Addition of cortical surface-based searchlight, with example Surfsearchlight dev

Hurray!! Thanks!!

kingjr

comment created time in 2 days

issue comment scikit-learn/scikit-learn

Feature Request: Include p-values attribute for logistic regression

Even if the CI is not calculated, a strong assumption (identifiability) is necessary in order to uniquely obtain the regression coefficient.

Assumptions differ depending on whether they are about control on the coefficients or on the prediction. The control on the prediction is much more lax than the control on the coefficients.

Should sklearn not be able to do what statsmodels (sm) can do?

No. No need to duplicate functionality across packages.
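
For readers wondering what the statsmodels route looks like, here is a minimal sketch (assuming statsmodels is installed; purely illustrative, not an endorsement of adding this to scikit-learn):

import numpy as np
import statsmodels.api as sm

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

# statsmodels fits the logistic regression by maximum likelihood and reports
# Wald p-values, which rest on the identifiability/asymptotic assumptions
# discussed above.
result = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
print(result.pvalues)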

bhishanpdl

comment created time in 7 days

pull request comment scikit-learn/scikit-learn

[MRG] DOC Good practices and pitfalls for random_state

I think it lost its sarcastic connotation but I can replace with just "Recommendation summary" or something like that?

It's probably a better title. But I like the fact that you wrote such a paragraph.

NicolasHug

comment created time in 8 days

Pull request review comment scikit-learn/scikit-learn

[MRG] DOC Good practices and pitfalls for random_state

[... diff context elided; only the lines under review are kept ...]

+We see that the cross-validated scores of `rf_inst` and `rf_123` are
+different, as should be expected since we didn't pass the same `random_state`
+parameter. However, the difference between these scores is more subtle that
+it may look, and **the cross-validation procedures that were performed by**
+:func:`~sklearn.model_selection.cross_val_score` **significantly differ in
+each case**:
+
+- Since `rf_123` was passed an int, every call to `fit` uses the same RNG:
+  the same (random) subset of features will be used across all folds to fit
+  the random forest.
+- Since `rf_inst` was passed a `RandomState` instance, each call to `fit`
+  starts from a different RNG, and the randomly sampled subset of features
+  will be different for each of the 5 folds of the CV procedure.
+
+Here, neither procedure is inherently wrong, and one might prefer one over
+the other depending on the task at hand. It is however important to

Here, I would loop back to the comment that I made above:

  1. to evaluate a training procedure, seeds shouldn't be fixed (because what is a good seed in a given situation may be a bad one in another situation)

  2. in a train-test split, to evaluate a trained model, it can be useful to fix the seed.

NicolasHug

comment created time in 10 days

Pull request review comment scikit-learn/scikit-learn

[MRG] DOC Good practices and pitfalls for random_state

[... diff context elided; only the lines under review are kept ...]

+Common pitfalls and subtleties
+------------------------------
+
+While the rules that govern the `random_state` parameter are seemingly simple,
+they do however have some subtle implications. In some cases, this can even
+lead to wrong conclusions.
+
+Estimators
+..........
+
+**Differences in cross-validation procedures**

I wonder whether the right way of explaining this is not to explain what is measured by a given evaluation procedure.

  • A cross-validation procedure is used to measure the risk associated with the procedure of 1) training a model on given data and 2) applying it to new data.

  • A train-test split (or train-test-validation split) will measure the risk associated with a given trained model.

(I realize that we could add the distinction above in the model selection section).

In the second case, it makes sense to use an integer seed, as you may want to control fully how the trained model is built. In the first case, it makes little sense.
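
A rough sketch of the two evaluation targets described above (my own illustration, not text from the PR):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(random_state=0)

# 1) Risk of the *procedure* (train on data, then apply to new data): let the
#    estimator's internal randomness vary, and average over folds.
procedure_scores = cross_val_score(RandomForestClassifier(random_state=None), X, y)

# 2) Risk of one *trained model*: fix the seed so that exactly one model is
#    built, and evaluate that model on a held-out set.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
model_score = model.score(X_test, y_test)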

NicolasHug

comment created time in 10 days

Pull request review comment scikit-learn/scikit-learn

[MRG] DOC Good practices and pitfalls for random_state

[... diff context elided; only the lines under review are kept ...]

+- If an int is passed, calling `fit()` or `split()` multiple times yields the
+  same results.
+- If `None` or a `RandomState` instance is passed: `fit()` and `split()` will
+  yield different results each time they are called, producing an event chain
+  of maximal entropy. `None` is the default value for all `random_state`

I think that "entropy" is the terminology classically used to discussion chains of numbers coming from pseudo random number generators. But if people find "variability" more intuitive it's fine. Note that there is a subtle difference: if you use a seeded RandomState object, its not variable, but it's entropic.

NicolasHug

comment created time in 10 days

Pull request review comment scikit-learn/scikit-learn

[MRG] DOC Good practices and pitfalls for random_state

[... diff context elided; only the lines under review are kept ...]

+- If an int is passed, calling `fit()` or `split()` multiple times yields the
+  same results.
+- If `None` or a `RandomState` instance is passed: `fit()` and `split()` will
+  yield different results each time they are called, producing an event chain

I think that it's useful to discuss the difference between instances and execution chains, though probably not with these terms (which might not relate to the user's line of thought).

With these thoughts, I propose to keep the phrasing above (ie resolve the current comment), and add text / subsection titles elsewhere.

NicolasHug

comment created time in 10 days

Pull request review comment scikit-learn/scikit-learn

[MRG] DOC Good practices and pitfalls for random_state

[... diff context elided; only the lines under review are kept ...]

+CV splitters
+............
+
+When passed a `RandomState` instance, cv splitters yield different splits
+each time `split` is called. This can lead to dramatic mistakes when

Not dramatic mistakes, I disagree. The expectation of the performance of the estimator is correct.

each time `split` is called. This can lead to overestimating the variance of the difference in performance between estimators.
NicolasHug

comment created time in 10 days

Pull request review comment scikit-learn/scikit-learn

[MRG] DOC Good practices and pitfalls for random_state

[... diff context elided; only the lines under review are kept ...]

+    >>> for est in (rf, gbdt):
+    ...     print(cross_val_score(est, X, y, cv=cv))
+    [0.85 0.95 0.9  0.95 0.95]
+    [0.85 0.7  0.95 0.8  0.85]
+
+Directly comparing the performance of the random forest vs the gradient
+boosting estimator on each fold would be a methodological mistake: **the
+splits on which the estimators are evaluated are different**. Indeed,
+:func:`~sklearn.model_selection.cross_val_score` will internally call
+`cv.split` on the same :class:`~sklearn.model_selection.KFold` instance,
+but the splits will be different each time. This is also true for any tool
+that performs model selection via cross-validation, e.g.
+:class:`~sklearn.model_selection.GridSearchCV` and
+:class:`~sklearn.model_selection.RandomizedSearchCV`: scores are not
+comparable fold-to-fold across different calls to `search.fit`, since
+`cv.split` would have been called multiple times. Within a single call to
+`search.fit`, however, fold-to-fold comparison is possible since the search
+estimator only calls `cv.split` once.

This paragraph lacks nuance: the estimates of the performance of the two estimators are still unbiased; they just have an uncontrolled variance. The error would be to subtract one list from the other.
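
To make the fold-wise comparison implied above meaningful, a sketch of the alternative (my own illustration): seed the splitter with an integer so both estimators are evaluated on identical folds, after which the fold-by-fold difference is well defined.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(random_state=0)
cv = KFold(shuffle=True, random_state=0)  # int seed: same splits on every call

rf_scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)
gbdt_scores = cross_val_score(GradientBoostingClassifier(random_state=0), X, y, cv=cv)
paired_differences = rf_scores - gbdt_scores  # now a fold-by-fold comparison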

NicolasHug

comment created time in 10 days

Pull request review comment scikit-learn/scikit-learn

[MRG] DOC Good practices and pitfalls for random_state

[…]

CV splitters
............

When passed a `RandomState` instance, cv splitters yield different splits
each time `split` is called. This can lead to dramatic mistakes when
comparing the performance of different estimators::

    >>> from sklearn.ensemble import RandomForestClassifier
    >>> from sklearn.ensemble import GradientBoostingClassifier
    >>> from sklearn.datasets import make_classification
    >>> from sklearn.model_selection import KFold
    >>> from sklearn.model_selection import cross_val_score
    >>> import numpy as np

    >>> rng = np.random.RandomState(0)
    >>> X, y = make_classification(random_state=rng)
    >>> rf = RandomForestClassifier(random_state=rng)
    >>> gbdt = GradientBoostingClassifier(random_state=rng)

Maybe the most visible way of illustrating the problem would be to use the same estimator twice, together with a deterministic one (e.g. a naive Bayes), to make the difference more obvious.
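
For instance, a minimal sketch of what such an illustration could look like (the deterministic estimator, dataset and parameters below are only my assumptions, not the PR's code):

    # Evaluating the *same* deterministic estimator twice with a KFold that was
    # given a RandomState instance yields different per-fold scores, because each
    # call to cross_val_score re-splits the data with the shared, consumed RNG.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(random_state=0)
    rng = np.random.RandomState(0)
    cv = KFold(n_splits=5, shuffle=True, random_state=rng)

    nb = GaussianNB()  # deterministic: any score difference comes from the splits
    print(cross_val_score(nb, X, y, cv=cv))  # first call consumes the shared RNG
    print(cross_val_score(nb, X, y, cv=cv))  # different folds, different per-fold scores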

NicolasHug

comment created time in 10 days

Pull request review comment scikit-learn/scikit-learn

[MRG] DOC Good practices and pitfalls for random_state

[…]

.. note::
    When evaluating scores via cross-validation, it is important to
    understand that the variance of scores between candidates may come from
    different sources: the cross-validation variance, and the estimation
    variance. The principled way to limit the cross-validation variance is to
    use more folds, at the expense of running time. The more data you have,
    the less this actually is a problem as variance will automatically be
    reduced. Limiting estimation variance is estimator-dependent, but here
    well, more data helps reducing the variance.

+1

NicolasHug

comment created time in 10 days

Pull request review comment scikit-learn/scikit-learn

[MRG] DOC Good practices and pitfalls for random_state

[…]

We can see from the snippet above that `rf.fit` has produced different
models, even if the data was the same. This is because the RNG of the
estimator is consumed when `fit` is called, and this consumed (mutaded) RNG
will be used in the subsequent calls to `fit`. In addition, the `rng`
object is shared across any object that uses it, and as a consequence, these
object become somewhat inter-dependent. For example, two estimator that share
objects become somewhat inter-dependent. For example, two estimators that share
NicolasHug

comment created time in 10 days

Pull request review comment scikit-learn/scikit-learn

[MRG] DOC Good practices and pitfalls for random_state

[…]

.. warning:: TLDR

    Unless you know what you are doing, we recommend to use integers as the
    `random_state` parameter of estimators and cv splitters, as it is usually
    the safest option and it allows you to isolate different sources of
    variance in your analysis. Leaving the default (`None`) or using
    `RandomState` instances is sometimes useful to assess to statistical
    significance of your results, but can also have surprising effects.

I disagree with this recommendation for estimators (for splitters, I can be convinced that it is a good one, and I could be convinced to change the default).

Suppose you run cross-validation on a non-convex model to decide whether it is a good or a bad model that you will then fit on more data. If your seed gives you a good initialization, you will falsely conclude that the model is a good one, while the conclusion is conditional on the seed.

For a TLDR, I would recommend keeping the default (unless you know what you are doing), and potentially seeding the global RNG.

For the splitters, it could be different if you want to interpret the variability of the differences between two models (but only this). In that case, the simplest option is to set the random_state of the splitters to an integer. Given that this case is reasonably common, we could recommend setting the random_state to an integer, and even discuss changing the default.
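
To make the concern concrete, here is a small sketch (the model and hyper-parameters are chosen purely for illustration) of how much the cross-validated score of a non-convex model can depend on which integer seed was picked:

    # The spread of these mean scores reflects the sensitivity of the model to its
    # random initialization: a single "lucky" seed can make it look better than it
    # typically is.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=200, random_state=0)

    for seed in range(5):
        clf = MLPClassifier(hidden_layer_sizes=(5,), max_iter=50, random_state=seed)
        print(seed, cross_val_score(clf, X, y).mean())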

NicolasHug

comment created time in 10 days

Pull request review comment scikit-learn/scikit-learn

[MRG] DOC Good practices and pitfalls for random_state

[…]

We can see from the snippet above that `rf.fit` has produced different
models, even if the data was the same. This is because the RNG of the
estimator is consumed when `fit` is called, and this consumed (mutaded) RNG
estimator is consumed when `fit` is called, and this consumed (mutated) RNG
NicolasHug

comment created time in 10 days

Pull request review comment scikit-learn/scikit-learn

[MRG] DOC Good practices and pitfalls for random_state

[…]

We can see from the snippet above that `rf.fit()` has produced different
models, even if the data was the same. This is because the RNG of the
estimator is consumed when `fit()` is called, and this consumed (mutaded) RNG
will be used in the subsequent calls to `fit()`. In addition, the `rng`
object is shared across any object that uses it, and as a consequence, these
object become somewhat inter-dependent. For example, two estimator that share
the same `RandomState` instance will influence each-other, as we will see
later when we discuss cloning.
later when we discuss cloning. This point can be important to keep in mind when debugging.

From a statistical perspective, this dependence is seldom a problem. However, when debugging, it can be an issue, as it may make bugs hard to isolate.
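
A small sketch of that inter-dependence (illustrative only; the estimator and parameters are arbitrary):

    # Two estimators sharing one RandomState instance draw from the same stream,
    # so fitting one changes what the other will see: the model obtained for `b`
    # depends on whether `a` was fitted before it, which can make a bug hard to
    # reproduce in isolation.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(random_state=0)
    rng = np.random.RandomState(0)
    a = RandomForestClassifier(max_features=2, max_samples=10, random_state=rng)
    b = RandomForestClassifier(max_features=2, max_samples=10, random_state=rng)

    a.fit(X, y)  # consumes part of the shared stream
    b.fit(X, y)  # fitted from a different RNG state than if it had been fitted first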

NicolasHug

comment created time in 10 days

Pull request review comment scikit-learn/scikit-learn

[MRG] DOC Good practices and pitfalls for random_state

[…]

.. note::
    `score()` is not a random procedure. Only `fit` has randomness.

We can see from the snippet above that `rf.fit` has produced different
models, even if the data was the same. This is because the RNG of the
models, even if the data was the same. This is because the random number generator (RNG) of the
NicolasHug

comment created time in 10 days

Pull request review comment scikit-learn/scikit-learn

[MRG] DOC Good practices and pitfalls for random_state

[…]

The `random_state` parameter determines whether multiple calls to :term:`fit`
(for estimators) or to :term:`split` (for cv splitters) will produce the same
results, according to these rules:

- If an int is passed, calling `fit()` or `split()` multiple times yields the
  same results.
- If `None` or a `RandomState` instance is passed: `fit()` and `split()` will
  yield different results each time they are called, producing an event chain
  of maximal entropy. `None` is the default value for all `random_state`
  yield different results each time they are called, and the succession of fits
  explores all sources of variability. `None` is the default value for all `random_state`

Here is a suggestion for a phrasing closer to users' concerns.

NicolasHug

comment created time in 10 days

Pull request review comment scikit-learn/scikit-learn

[MRG] DOC Good practices and pitfalls for random_state

[…]

CV splitters
............

Randomized cv splitters have a similar behavior when a `RandomState`
instance is passed::

    >>> from sklearn.model_selection import KFold
    >>> import numpy as np
    >>> X = np.arange(10)
    >>> rng = np.random.RandomState(0)
    >>> cv = KFold(n_splits=2, shuffle=True, random_state=rng)
    >>> for train, test in cv.split(X):
    ...     print(train, test)
    [0 3 5 6 7] [1 2 4 8 9]
    [1 2 4 8 9] [0 3 5 6 7]
    >>> for train, test in cv.split(X):
    ...     print(train, test)
    [0 4 6 7 8] [1 2 3 5 9]
    [1 2 3 5 9] [0 4 6 7 8]

We can see that the splits are different from the second time `split()` is
called. This may lead to unexpected results if you compare the performance of
multiple estimators by calling `split()` many times: the estimators will not
be evaluated on the same folds, and the scores on each fold will not be
comparable. On average, if enough folds are used and with enough data, one

I agree with this formulation, and I think that it is important to stress it! :).

multiple estimators by calling `split()` many times: **the estimators will not
be evaluated on the same folds, and the scores on each fold will not be
comparable**. On average, if enough folds are used and with enough data, one
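
For contrast, a quick sketch (an assumed example, not taken from the PR) of the behaviour with an integer seed, where per-fold scores stay comparable because every call to `split` produces the same folds:

    import numpy as np
    from sklearn.model_selection import KFold

    X = np.arange(10)
    cv = KFold(n_splits=2, shuffle=True, random_state=0)  # int seed, not an instance

    first = [test.tolist() for _, test in cv.split(X)]
    second = [test.tolist() for _, test in cv.split(X)]
    print(first == second)  # True: both calls yield identical folds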
NicolasHug

comment created time in 10 days

Pull request review comment scikit-learn/scikit-learn

[MRG] DOC Good practices and pitfalls for random_state

.. Places parent toc into the sidebar

:parenttoc: True

.. _randomness:

Controlling randomness: good practices and common pitfalls
==========================================================

Some scikit-learn objects are inherently random. These are usually estimators
(e.g. :class:`~sklearn.ensemble.RandomForestClassifier`) and cross-validation
splitters (e.g. :class:`~sklearn.model_selection.KFold`). The randomness of
these objects is controlled via their `random_state` parameter, as described
in the :term:`Glossary <random_state>`. This section expands on the glossary
entry, and describes good practice and common pitfalls w.r.t. to this
subtle parameter.

We could add a subsection title here:


Reproducibility of fits or splits
-----------------------------------------
NicolasHug

comment created time in 10 days

Pull request review comment scikit-learn/scikit-learn

[MRG] DOC Good practices and pitfalls for random_state

[…]

Getting reproducible results across multiple executions
-------------------------------------------------------

In order to obtain reproducible (i.e. constant) results across multiple
program executions, the easiest way is to set a `rng` variable to an
integer value at the top of your program, and pass it down to any object that
accepts a `random_state` parameter::

    >>> from sklearn.ensemble import RandomForestClassifier
    >>> from sklearn.datasets import make_classification
    >>> from sklearn.model_selection import train_test_split

    >>> rng = 42  # or any other integer
    >>> rng = np.random.RandomState(42)  # seeded with any integer
NicolasHug

comment created time in 10 days

Pull request review commentscikit-learn/scikit-learn

[MRG] DOC Good practices and pitfalls for random_state

+.. Places parent toc into the sidebar++:parenttoc: True++.. _randomness:++Controlling randomness: good practices and common pitfalls+==========================================================++Some scikit-learn objects are inherently random. These are usually estimators+(e.g. :class:`~sklearn.ensemble.RandomForestClassifier`) and cross-validation+splitters (e.g. :class:`~sklearn.model_selection.KFold`). The randomness of+these objects is controlled via their `random_state` parameter, as described+in the :term:`Glossary <random_state>`. This section expands on the glossary+entry, and describes good practice and common pitfalls w.r.t. to this+subtle parameter.++The `random_state` parameter determines whether multiple calls to :term:`fit`+(for estimators) or to :term:`split` (for cv splitters) will produce the same+results, according to these rules:++- If an int is passed, calling `fit()` or `split()` multiple times yields the+  same results.+- If `None` or a `RandomState` instance is passed: `fit()` and `split()` will+  yield different results each time they are called, producing an event chain+  of maximal entropy. `None` is the default value for all `random_state`+  parameters.++  .. note::+      Since passing `random_state=None` is equivalent to passing the global+      `RandomState` instance from `numpy`+      (`random_state=np.random.mtrand._rand`), we will not explicitly mention+      `None` here, but everything that applies to instances also applies to+      using `None`.++.. warning:: TLDR++    Unless you know what you are doing, we recommend to use integers as the+    `random_state` parameter of estimators and cv splitters, as it is usually+    the safest option and it allows you to isolate different sources of+    variance in your analysis. Leaving the default (`None`) or using+    `RandomState` instances is sometimes useful to assess to statistical+    significance of your results, but can also have surprising effects.++Getting reproducible results across multiple executions+-------------------------------------------------------++In order to obtain reproducible (i.e. constant) results across multiple+program executions, the easiest way is to set a `rng` variable to an+integer value at the top of your program, and pass it down to any object that+accepts a `random_state` parameter::

This is most likely bad practice: you will be exploring only one random init, only one set of random projections, only one kernel expansion. If you're lucky, you'll falsely conclude that you have an awesome processing pipeline.

program executions, the easiest way is to create a dedicated random-number generator
at the top of your program, named `rng` for instance, and pass it down to any object
that accepts a `random_state` parameter::
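
A minimal sketch of the pattern this suggestion describes (standard scikit-learn calls; the dataset and estimator are only illustrations):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.RandomState(42)  # the dedicated random-number generator, created once
    X, y = make_classification(random_state=rng)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
    clf = RandomForestClassifier(random_state=rng).fit(X_train, y_train)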
NicolasHug

comment created time in 10 days


pull request commentsphinx-gallery/sphinx-gallery

Release 0.8.1

Thanks for pushing forward on this @lucyleeow

+10000

lucyleeow

comment created time in 11 days

Pull request review commentscikit-learn/enhancement_proposals

SLEP011: Fixing randomness handling in estimators and splitters

+.. _slep_011:++===================================+SLEP011: Fixing randomness handling+===================================++:Author: Nicolas Hug+:Status: Under review+:Type: Standards Track+:Created: 2019-11-27++Abstract+========++This SLEP aims at fixing the issues related to how scikit-learn handles+randomness of estimators and CV splitters.++The proposed solution is to make estimators and splitter stateless, by+storing the state of the `random_state` parameter that is passed in+`__init__`.++More than anything, this SLEP's goal is to *inform* the discussions related+to randomness handling: if we end up going with the status quo (i.e. keep+estimators and splitters stateful), then at least we are all aware of the+price we're paying.++Background: How we currently handle randomness+==============================================++`random_state` parameters are used commonly in estimators, and in CV+splitters. They can be either int, RandomState instances, or None. The+parameter is stored as an attribute in init and never changes, as per our+convention.++`fit` and `split` methods typically look like this::++    def fit(self, X, y):  # or split(self, X, y)+        rng = check_random_state(self.random_state)+        ...+        rng.some_method()  # directly use instance, e.g. for sampling+        some_function_or_class(random_state=rng)  # pass it down to other tools+                                                  # e.g. if self is a meta-estimator++`check_random_state` behaves as follows::++    def check_random_state(x):+        if x is an int, return np.RandomState(seed=x)+        if x is a RandomState instance, return x+        if x is None, return numpy's RandomState singleton instance++Accepting RandomState instances or None comes with different complications,+listed below. `None` is the current default for all estimators / splitters.++Problems with passing RandomState instances or None+===================================================++The main issue with RandomState instances and None is that it makes the+estimators and the splitters stateful accross calls to `fit` or `split`. As+a result, different calls to `fit` or `split` on the same instance yield+different results. Almost all the other issues are consequences of this+fact.++Statefulness and violation of fit idempotence+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~++Estimators should be stateless. That is, any call to `fit` should forget+whatever happened to previous calls to `fit`, provided that no warm-start is+hapenning. Whether CV splitters should be stateful or not is debatable, and+that point is discussed below.++Another related convention is that the `fit` method should be idempotent:+calling `est.fit(X, y)` and then `est.fit(X, y)` again should yield the same+model both times.++These properties are key for enforcing reproducibility. We have a common+checks for them.++If a `RandomState` instance or None are used, the idemptency property may be+violated::++    rng = RandomState(0)+    est = Est(..., random_state=rng)+    est.fit(...)  # consume rng+    est.fit(...)++The model inferred the second time isn't the same as the previous one since+the rng has been consumed during the first called. 
The statefulness property+isn't respected either since the first call to `fit` has an impact on the+second.++The same goes with passing `None`: you get different models every time you call+`fit`.++A related issue is that the `rng` may be consumed outside of the estimator.+The estimator isn't "self contained" anymore and its behaviour is now+dependent on some stuff that happen outside.++Countless bugs+~~~~~~~~~~~~~~++Quoting @amueller from `14042+<https://github.com/scikit-learn/scikit-learn/issues/14042>`_, many bugs+have happened over the years because of RandomState instances and None.++These bugs are often hard to find, and some of them are actual data leaks,+see e.g. `14034+<https://github.com/scikit-learn/scikit-learn/issues/14034>`_. Some of these+bugs have been around forever and we just haven't discovered them yet.++The bug in `14034+<https://github.com/scikit-learn/scikit-learn/issues/14034>`_ is that the+validation subset for early-stopping was computed using `self.random_state`:+that subset is different accross multiple calls to `fit`, since the RNG is+getting consumed. This is a problem when doing warm-start, because we want+the subset to be the same for all calls to `fit` in this case. The fix was+to store a private seed that was generated the first time `fit` is called.++This is a typical example of many other similar bugs that we need to+monkey-patch with potentially overly complex logic.++Cloning may return a different estimator+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~++`my_clone = clone(est)` returns an *unfitted* estimator whose parameters are+(supposed to be) the same as those that were passed when `est` was+instanciated. Whether+*clone* is a good name for describing this process is another issue, but the+semantic of cloning is scikit-learn is as described above. We can think of+*cloning* as *reset with initial parameters*.++That semantic is not respected if `est` was instanciated with an instance or+with None::++    rng = RandomState(0)+    est = Est(..., random_state=rng)+    est.fit(X, y)  # consume RNG here+    my_clone = clone(est)+    my_clone.fit(X, y)  # not the same model!++`my_clone` isn't the same as what `est` was, since the RNG has been consumed+in-between. Fitting `my_clone` on the same data will not give the same model+as `est`. While this is not desirable when an instance is passed, one might+argue that this is the desired behaviour when None is passed.++In addition, `est` and `my_clone` are now interdependent because they share the+same `rng` instance.++Incompatibility with third-party estimators+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~++From the snippet above in the introduction, most meta-estimators will+directly pass down `self.random_state` to their sub-estimators. Some+third-party libraries only support integers as `random_state`, not instances+or None. See e.g. `15611+<https://github.com/scikit-learn/scikit-learn/issues/15611>`_++CV-Splitters statefulness+~~~~~~~~~~~~~~~~~~~~~~~~~++CV-splitters are stateful::++    rng = np.random.RandomState(0)+    cv = KFolds(shuffle=True, random_state=rng)+    a = cv.split(X, y)+    b = cv.split(X, y)  # different from a++`a` and `b` are different splits, because of how `split` is implemented (see+introduction above).++This behaviour is inconsistent for two reasons.++The first one is that if `rng` were an int, then `a` and `b` would have been+equal. 
As a result, the behaviour of the CV splitter depends on the+**type** of the `random_state` parameter::++- int -> stateless, get the same splits each time you call split()+- None or instance -> stateful, get different splits each time you call split()++Concretely, we have a method (`split`) whose behaviour depends on the *type*+of a parameter that was passed to `init`. We can argue that this is a common+pattern in object-oriented design, but in the case of the `random_state`+parameter, this is potentially confusing.++The second inconsistency is that splitters are stateful by design, while we+want our estimators to be stateless. Granted, splitters aren't estimators.+But, quoting `@GaelVaroquaux+<https://github.com/scikit-learn/scikit-learn/pull/15177#issuecomment-548021786>`_,+consistency is one thing that we are really good at.+So it is important to have the splitters consistent with the estimators,+w.r.t. the statelessness property. The current behaviour is not necessarily+clear for users.++Note that fixing how random_state is handled in the splitters is one of the+entries in the `Roadmap <https://scikit-learn.org/dev/roadmap.html>`_.++Potential bugs in custom parameter searches+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~++This issue is a direct consequence of the splitters being stateful. It's also+more advanced than the rest, you may want to skip it.++We have a private API for subclassing BaseSearchCV and implementing custom+parameter search strategies. The contract is that the custom class should+override the `_run_search(evaluate_candidate, ...)` method which itself must+call the `evaluate_candidates()` closure, were `cv.split()` will be called.++Third-party developers may only *call* `evaluate_candidates()`, not change+its content. Now, since `cv.split()` is called in `evaluate_candadates()`,+that means that `evalute_candidates()` will evaluate the candidate+parameters **on different splits** each time it is called.++This is a quite subtle issue that third-party developers might easily+overlook.++Depending on the intended behaviour of the parameter search, this may or may+not be a good thing. This is typically a bug if we implement successive+halving + warm start (details ommitted here, you may refer to `this issue+<https://github.com/scikit-learn/scikit-learn/issues/15125>`_ for some more+details).++Proposed Solution+=================++We need a solution that fixes the statefulness of the estimators and the+splitters. Most of the remaining issues would be fixed as a consequence.++A toy example of the proposed solution is implemented in this `notebook+<https://gist.github.com/NicolasHug/1169ee253a4669ff993c947507ae2cb5>`_.+The bulk of the solution is to manipulate actual random *states*, as+returned by `get_state()+<https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.get_state.html#numpy.random.get_state>`_.++Specifically, we would store the *state* of a RandomState instance in+`__init__`::++    def __init__(self, ..., random_state=None):+        self.random_state = check_random_state(random_state).get_state()++That `random_state` attribute is a tuple with about 620 integers.+That state is then used in `fit` or in `split` as follows::++    def fit(self, X, y):  # or split()+        rng = np.random.RandomState()+        rng.set_state(self.random_state)+        # ... 
use rng as before++Since `self.random_state` is an immutable tuple that never changes, calling+`fit` or `split` on the same instance always gives the same results.++We want `__init__` and `set_params/get_params` to be consistent. To that end,+we will need to special-case these methods::++    def get_params(self):++        random_state = np.random.RandomState()+        random_state.set_state(self.random_state)+        return {'random_state': random_sate, ...}++    def set_params(self, ...):++        self.random_state = check_random_state(random_state).get_state()  # same as in init++`clone` does not need to be special-cased, because `get_params` does all the+work. Note that the following::++    est.set_params(random_state=est.get_params()['random_state'])++behaves as expected and does not change the `random_state` attribute of the+estimator. However, one should not use::++    est.set_params(random_state=est.random_state)++since `est.random_state` is neither an int, None or an instance: it is a tuple.+We can error with a decent message in that case.++Advantages:++- It fixes the statefullness issue. `fit` is now idempotent. Calling `split` on+  the same instance gives the same splits. In other words, it does what we+  want.++- The behaviour is clear and intuitive: the object is fully defined at init,+  and only at init. Things that happen between init or fit *do not* influence+  the state of the object.++- It is relatively simple to implement, and not too intrusive.++- Backward compatibility is preserved between scikit-learn versions. Let A+  be a version with the current behaviour (say 0.22) and let B be a version+  where the new behaviour is implemented. The models and the splits obtained+  will be the same in A and in B. That property may not be respected with+  other solutions, see below.++- Both RandomState instances and None are still supported. We don't need to+  deprecate the use of any of them.++- As a bonus, the `self.random_state` attribute is an *actual* random state:+  it is the state of some RNG. What we currently call `random_state` is not+  a state but a RNG (though this is numpy's fault.)++Drawbacks:++- We break our convention that `__init__` should only ever store attributes, as+  they are passed in. Note however that the reason we have this convention+  is that we want the semantic of `__init__` and `set_params` are the same,+  and we want to enable people to change public attributes without having+  surprising behaviour. **This is still respected here.** So this isn't+  really an issue.++- There is a subtelty that occurs when passing `None`. `check_random_state`+  will return the singleton `np.random.mtrand._rand`, and we will call+  `get_state()` on the singleton. The thing is, its state won't change+  unless the singleton is consumed. So if we do+  `a = Est(random_state=None); b = Est(random_state=None)`, a and b actually+  have exactly the same `random_state` attribute, since the state of the+  singleton wasn't changed. To circumvent this, the logic in `__init__` and+  `set_params` involves a private helper that makes sure the singleton's RNG is+  consumed. Please refer to the notebook.++- The `__repr__()` will need to special-case the `random_state` attribute to+  avoid printing a long tuple.++- We need to store about 620 integers. This is however negligible w.r.t. e.g.+  the size of a typical dataset++- It does not fix the issue about third-party estimators only accepting+  integers. 
This can however be fixed in each meta-estimator, independently.++Alternative solutions+=====================++Store a seed instead of a state+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~++Instead of storing a state from `get_state()`, we could store a randomly+generated seed::++    def __init__(self, ..., random_state=None):+        self.random_state = check_random_state(random_state).randint(0, BIG_INT)++Then instead of using `set_state` we could just use+`rng = RandomState(seed=self.random_state)` in `fit` or `split`.++Advantages:++- It also fixes the third-party estimators issue, since we would be passing+  self.random_state which is an int+- It's cheaper than storing 620 ints+- We don't need to artificially consume the singleton's RNG since it is+  de-facto consumed anyway.++Drawbacks:++- Since we draw a seed in init (and in `set_params()`), `clone` will not+  work as expected. In particular with `my_clone = clone(est)`, my_clone and+  est cannot have the same `random_state` attribute. This is the same for+  `my_clone.set_params(random_state=est.get_params()['random_state'])`. The+  seed will have to be drawn in `set_params`, thus leading to a different+  `random_state` attribute.++- It is not backward compatible between versions. For example if you passed+  an int in version A (say 0.22), then in version B (with the new+  behaviour), your estimator will not start with the same RNG when `fit` is+  called the first time. Same for splitters.++Store the state in fit/split instead of in init+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~++Instead of storing the output of `get_state()` in `__init__`, we could store it+the first time `fit()` is called. For example::++    def fit(self):  # or split()+        self._random_state = getattr(self, '_random_state', check_random_state(self.random_state).get_state())+        rng = np.random.RandomState()+        rng.set_state(self._random_state)+        # ...++The advantage is that we respect our convention with `__init__`.++However, `fit` idempotency isn't respected anymore: the first call to `fit`+clearly influences all the other ones.++This also introduces a private attribute, so we would need more intrusive+changes to `set_params`, `get_params`, and `clone`.

behavior of an instance now depends on its past in a non-trivial way. This is not desirable.

I'm genuinely confused that you think this is such an issue when it completely applies to the current design

The current design gives behaviors that are statistically identical, in particular with the same level of entropy, which is what matters most. A user does not need to know anything aside from an object's attributes to know how it behaves statistically.

With the above design, where random state gets fixed during fit, when a user is handed a model, he or she needs to know whether it has been fitted or not to know how it will behave.

For instance, if I am going to use cross_val_score on a RandomForest, if the RandomForest has not been fitted, cross_val_score will create different clones and in the different loop iterations, the RandomForest will then sample different splits. If the RandomForest has been fitted, in the different loop iterations the RandomForest has the same internal state, and there is less entropy between the folds.

Now, technically, what I just said is not true with the above pseudo-code, because _random_state is a private attribute that is not copied during clone and cross_val_score uses clone correctly. However, I believe the proposition is to modify clone so that this is no longer the case, thus breaking cross_val_score.

Such a design could lead to a genuinely dangerous situation: the code will behave statistically differently (different amount of entropy) depending on whether the user is given an object that has never been fit or an object that has already been fit once. A non-advanced user will fall into this trap. Even an advanced user will struggle to address the problem cleanly, as he or she will have to inspect the private attributes of the object for this purpose.

With regards to the design not copying the private attribute _random_state in the clone, I do not think that it runs into this problem, so it may be a suitable one, though we should give it a good thought, of course!
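
To make the fitted / unfitted asymmetry concrete, a hypothetical sketch; the comments describe the behaviour that the proposed design would produce, not what current scikit-learn does:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(random_state=0)

    rf_unfit = RandomForestClassifier()          # no stored state yet: each clone made by
    cross_val_score(rf_unfit, X, y)              # cross_val_score would draw its own randomness

    rf_fit = RandomForestClassifier().fit(X, y)  # state would be frozen at this first fit,
    cross_val_score(rf_fit, X, y)                # so every clone would reuse it: less entropy
                                                 # between the folds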

NicolasHug

comment created time in 13 days


pull request commentscikit-learn/scikit-learn

[WIP] Online implementation of non-negative matrix factorization

I would set r=0.7. It's theoretically motivated, and I seem to remember that it gave us good results.

cmarmo

comment created time in 13 days

pull request commentscikit-learn/scikit-learn

[WIP] Online implementation of non-negative matrix factorization

That's a super nice plot, @cmarmo !! It really validates the implementation.

It suggests taking batch sizes of 2000 and forgetting factors of .5. Is there any theory on which values are acceptable for the forgetting factor? When an algorithm converges in theory for a parameter taken in an interval, it is often useful to take a value for this parameter in the middle.

I'm a bit worried that for convergence we need to take huge values of the batch size (typical batch sizes are in the hundreds, not thousands). It seems to point to the variance reduction not happening correctly. Indeed, large batch sizes reduce the noise in the estimates per batch.

cmarmo

comment created time in 13 days

push eventdirty-cat/dirty_cat

Amanda Dsouza

commit sha b094fc4926fd4001fe0ef18fb4e7d9b7ac2f3a51

replace sklearn assert function with pytest and changes.rst

view details

Amanda Dsouza

commit sha 1ae36838a138248ab17614ff0e43af09bb3fed6d

release version

view details

Gael Varoquaux

commit sha 6898d7d94cf67f18e566dd32c0117ebfe0575ae2

Merge pull request #135 from amy12xx/minor_fixes replace sklearn assert function with pytest and changes.rst

view details

push time in 14 days

PR merged dirty-cat/dirty_cat

replace sklearn assert function with pytest and changes.rst

Changes:

  • Replaced sklearn's assert_raise_message() with pytest raises in target and similarity encoder.
  • update to changes.rst for minhash handle_missing functionality.
+23 -22

0 comment

3 changed files

amy12xx

pr closed time in 14 days


Pull request review commentdirty-cat/dirty_cat

replace sklearn assert function with pytest and changes.rst

+Release 0.0.8+=============++* **MinHashEncoder**: Added a "handle_missing" attribute to allow encoding+  with missing values.+

I think that I released 0.0.7 after you added this, hence it should be listed under 0.0.7.

amy12xx

comment created time in 15 days


created tagdirty-cat/dirty_cat

tag0.0.7

Encoding methods for dirty categorical variables

created time in 15 days

issue commentnilearn/nilearn

Add header optional argument to unmask function

At a quick glance, it seems right. What didn't work?

eurunuela

comment created time in 17 days

issue commentnilearn/nilearn

Add header optional argument to unmask function

You should be using the NiftiMasker object for this purpose. It will make sure that information is not lost.
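
A minimal sketch of that workflow (standard NiftiMasker calls; the file names are placeholders):

    from nilearn.input_data import NiftiMasker

    masker = NiftiMasker(mask_img='mask.nii.gz')        # placeholder mask image
    data_2d = masker.fit_transform('func_4d.nii.gz')    # 4D image -> (n_scans, n_voxels)
    # ... compute results on data_2d ...
    img_out = masker.inverse_transform(data_2d)         # back to an image in the original space
    img_out.to_filename('results.nii.gz')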

⁣Sent from my phone. Please forgive typos and briefness.​

On Sep 2, 2020, 16:18, at 16:18, "Eneko Uruñuela" notifications@github.com wrote:


I've been going crazy with the unmask function in nilearn lately. I can successfully mask my data from 4D into 2D using the apply_mask function. Now, I'd like to move my results back to the 4D space with the unmask function, but whatever I generate does not have the same header or the same TR as the input data. This is a problem for visualizing the results with AFNI.

I believe adding an optional header argument to the unmask function could potentially make saving unmasked data much easier and straightforward.

-- You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub: https://github.com/nilearn/nilearn/issues/2562

eurunuela

comment created time in 18 days

issue commentscikit-learn/scikit-learn

Feature Request: Include p-values attribute for logistic regression

Right. I think we're asking for p-values for regression models, which have ample theory supporting p-values.

Only when the model is well-conditioned, p << n, and it is estimated via maximum likelihood (OLS).

The point being: we can give p-values only in very narrow settings. In these settings, statsmodels works very well. Why not use it? I use statsmodels to answer this need.
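
A minimal sketch of that route (standard statsmodels API; the data here is only an illustration):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.RandomState(0)
    X = rng.normal(size=(100, 3))
    y = (X @ [1., 0., -1.] + rng.normal(size=100) > 0).astype(int)

    result = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
    print(result.pvalues)    # one p-value per coefficient
    print(result.summary())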

bhishanpdl

comment created time in 20 days

issue commentnilearn/nilearn

BUG: nilearn.plotting forces matplotlib backend change to Agg on macOS

Beyond that, I'm still not entirely clear on why there is even a check for darwin as the platform. Shouldn't it rather be a check that the platform is Linux but there is no display, so just sys.platform == 'linux' and 'DISPLAY' not in environ?

There are other unices than macosx and linux. Maybe these are not much used these days. I used to encounter them in high performance and academic environments.
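
A sketch of a check along those lines that would also cover other unices (hypothetical, not necessarily the actual guard in plotting/__init__.py):

    import os
    import sys
    import matplotlib

    # Switch to a non-interactive backend only on unix-like platforms without a display.
    if sys.platform not in ('darwin', 'win32') and 'DISPLAY' not in os.environ:
        matplotlib.use('Agg')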

larsoner

comment created time in 20 days

issue commentscikit-learn/scikit-learn

Version 1.0 of scikit-learn

Breaking changes that might be interesting in the future

• make pipeline clone
• allow .fit.transform != fit_transform (not technically an incompatible change)

I'm +1 with both of these :)

MartinThoma

comment created time in 20 days

issue commentscikit-learn/scikit-learn

Feature Request: Include p-values attribute for logistic regression

A bit of both in terms of misuse versus consequences.

With regards to experienced users not finding my comments compelling, I would like to stress that, as an expert on statistical control of high-dimensional models, I most often do not know how to give controlled p-values, and I claim that the solutions that exist work in very narrow settings. Basically: there is no good answer.

Libraries that give p-values tend to raise warnings or errors in these settings. Also, they implement only models for which there is a theory for p-values.

bhishanpdl

comment created time in 20 days

issue commentscikit-learn/scikit-learn

Feature Request: Include p-values attribute for logistic regression

Right, but p-values are something that is expected to be well checked (they are a guard in medical research). Implementing them is asking for trouble.

bhishanpdl

comment created time in 20 days

issue commentscikit-learn/scikit-learn

Feature Request: Include p-values attribute for logistic regression

Because in general computing correct p-values requires assumptions on the data that are not met by the data used in machine learning (no multicollinearity, enough data compared to the dimensionality...).

A lot of the literature in model interpretation is uncontrolled and unstable.

bhishanpdl

comment created time in 20 days

pull request commentscikit-learn/scikit-learn

[WIP] Online implementation of non-negative matrix factorization

@cmarmo : looking at the graphs, I believe that I understand that the online code goes much much faster, but optimizes less well. This is expected: the online optimizer is good for the "expected loss", which is out of sample, and not the empirical loss, which is on the samples seen.

It would be interesting to add to the benchmarks an out-of-sample loss: on new data, computing the loadings for a given set of components. This is a measure of the expected loss. I haven't looked at the code, so I do not know how easy it is to optimize.

Finally, in terms of graph, I think that I would suggest plotting the loss as a function of time, and displaying the sample size using circles of different sizes (plt.plot for the lines and plt.scatter for the circles).
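
A sketch of such a figure, with made-up numbers just to show the plt.plot / plt.scatter combination (circle size standing for the mini-batch size):

    import matplotlib.pyplot as plt

    # Hypothetical benchmark results: {batch_size: (times in s, losses)}
    results = {
        200:  ([1, 2, 4, 8],  [3.0, 2.5, 2.2, 2.1]),
        2000: ([2, 4, 8, 16], [2.8, 2.3, 2.0, 1.9]),
    }
    for batch_size, (times, losses) in results.items():
        plt.plot(times, losses, label='batch size %d' % batch_size)
        plt.scatter(times, losses, s=batch_size / 10)  # circle size encodes the batch size
    plt.xlabel('time (s)')
    plt.ylabel('loss')
    plt.legend()
    plt.show()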

cmarmo

comment created time in 20 days

issue commentnilearn/nilearn

BUG: nilearn.plotting forces matplotlib backend change to Agg on macOS

The motivation for doing this was that non-technical users ssh into a remote computer, run calculations, and then things crash after two hours of computation because there is no X server available.

We tried to put some guards so that the switch happens only in a case where it's likely to be useful. Maybe those guards aren't fine enough. I believe that they are in plotting/__init__.py

⁣Sent from my phone. Please forgive typos and briefness.​

On Aug 27, 2020, 23:55, at 23:55, jeromedockes notifications@github.com wrote:

My 2c is that in general libraries should not set the matplotlib backend at all but respect whatever the given user's configuration is (even if it's just "use the matplotlib default")

I agree. (if CI was the motivation for doing this we can set the MPLBACKEND env variable)

-- You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub: https://github.com/nilearn/nilearn/issues/2560#issuecomment-682210850

larsoner

comment created time in 24 days

issue commentINRIA/scikit-learn-mooc

Consistent British or American spelling

I understand the default on the internet seems to be American spelling

American!

(even though it is wrong 🙄 )

I don't know how to write the kiwi accent.

lucyleeow

comment created time in a month

push eventGaelVaroquaux/gaelvaroquaux.github.io

Gael Varoquaux

commit sha 1445822bf9e4c91d79a90b906fe5b8cb399ea243

Make install

view details

push time in a month

push eventdirty-cat/dirty_cat

Amanda Dsouza

commit sha af76377b573b8cc6efe08baeff3b427e523e9f39

handle missing values

view details

Amanda Dsouza

commit sha eb2510cfc0091003ba9fba457fbef318e6872fe9

docstring fix

view details

Amanda Dsouza

commit sha 6c2b6b8f465854a10dac897027c01282c5470247

minhash handle missing

view details

Amanda Dsouza

commit sha eb9100c46e88529d0f8fce11feb4e6b90cb99331

code review fixes

view details

Gael Varoquaux

commit sha dfab87a2157bffea7224c275e1ba22d7c15b946d

Merge branch 'master' into minhash_handle_missing

view details

Gael Varoquaux

commit sha 05e31e26e12980ea2f7c4fbbf0cb1f4d9b7a9ff9

Merge pull request #132 from amy12xx/minhash_handle_missing [MRG] minhash handle missing values

view details

push time in a month

PR merged dirty-cat/dirty_cat

[MRG] minhash handle missing values

References Issue #124 Add support for missing values in the encoders

Changes: Added missing values support for minhash encoder

+122 -42

2 comments

2 changed files

amy12xx

pr closed time in a month

pull request commentdirty-cat/dirty_cat

[MRG] minhash handle missing values

This is great! Merging! Thank you!!

amy12xx

comment created time in a month

push eventdirty-cat/dirty_cat

twsthomas

commit sha 0996bbafbc028275f1e5010f57527a556390d883

minor fix ref

view details

Gael Varoquaux

commit sha 466d347276c34c6bb09dbb612159018b78a7be30

Merge pull request #134 from TwsThomas/fix_ref5 fix reference in examples 05

view details

push time in a month

PR merged dirty-cat/dirty_cat

fix reference in examples 05

Fix inner :ref: in examples 05

+2 -2

0 comment

1 changed file

TwsThomas

pr closed time in a month

push eventamy12xx/dirty_cat

twsthomas

commit sha c176f492d980e1142b17487d118f54d8925c9756

init

view details

Patricio Cerda

commit sha 6b8662002b960d3f05d3b9b862f6be0fb3b8de5b

Update Journal information for publication "Encoding high-cardinality string categorical variables"

view details

twsthomas

commit sha 7724de5137a408578373a523769e7541ba7aa534

update docstring

view details

twsthomas

commit sha bdaa1ab5a28384c176acb341cfb84ad3dbba1807

minor

view details

twsthomas

commit sha 8de9993681ed3095cfe9f8fba360ee8ff2080564

rm deaultf pararms + set default to minhash

view details

twsthomas

commit sha 5a953532d8767897dc3bf2f003b93f2507b0ce3d

add ref

view details

Thomas Schmitt

commit sha c8cb91ca078392992bda83756167f83072b296cc

Update dirty_cat/minhash_encoder.py Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>

view details

Patricio Cerda

commit sha 9f87c01613f3b0cebc49ef6aad500881190bd856

Update README.rst

view details

twsthomas

commit sha b4e34e066c6adddaeeead5eb342635f0cada082e

wip, try fixing ref

view details

twsthomas

commit sha 3b8bf2013e6c8101344cbda76bf922514da98715

Merge branch 'master' of https://github.com/dirty-cat/dirty_cat into raise_ncomp

view details

twsthomas

commit sha f79048ede2a62b0d7e91fd5a0f0b374f58655194

Merge branch 'raise_ncomp' of https://github.com/TwsThomas/dirty_cat into raise_ncomp

view details

Gael Varoquaux

commit sha da9a1456c182e674acad7250bf4c3e6c0dbdaad1

Merge pull request #133 from TwsThomas/raise_ncomp Fix: Minhash encoder does not perform well on the dirty-cat examples

view details

Gael Varoquaux

commit sha dfab87a2157bffea7224c275e1ba22d7c15b946d

Merge branch 'master' into minhash_handle_missing

view details

push time in a month

push eventdirty-cat/dirty_cat

twsthomas

commit sha c176f492d980e1142b17487d118f54d8925c9756

init

view details

twsthomas

commit sha 7724de5137a408578373a523769e7541ba7aa534

update docstring

view details

twsthomas

commit sha bdaa1ab5a28384c176acb341cfb84ad3dbba1807

minor

view details

twsthomas

commit sha 8de9993681ed3095cfe9f8fba360ee8ff2080564

rm deaultf pararms + set default to minhash

view details

twsthomas

commit sha 5a953532d8767897dc3bf2f003b93f2507b0ce3d

add ref

view details

Thomas Schmitt

commit sha c8cb91ca078392992bda83756167f83072b296cc

Update dirty_cat/minhash_encoder.py Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>

view details

twsthomas

commit sha b4e34e066c6adddaeeead5eb342635f0cada082e

wip, try fixing ref

view details

twsthomas

commit sha 3b8bf2013e6c8101344cbda76bf922514da98715

Merge branch 'master' of https://github.com/dirty-cat/dirty_cat into raise_ncomp

view details

twsthomas

commit sha f79048ede2a62b0d7e91fd5a0f0b374f58655194

Merge branch 'raise_ncomp' of https://github.com/TwsThomas/dirty_cat into raise_ncomp

view details

Gael Varoquaux

commit sha da9a1456c182e674acad7250bf4c3e6c0dbdaad1

Merge pull request #133 from TwsThomas/raise_ncomp Fix: Minhash encoder does not perform well on the dirty-cat examples

view details

push time in a month

PR merged dirty-cat/dirty_cat

Fix: Minhash encoder does not perform well on the dirty-cat examples

Fix #127

In example 02, putting n_components = 300 gives results as good as one_hot_encoder (.85) but it takes more time. I put n_components = 100 to have decent results.

+17 -14

2 comments

3 changed files

TwsThomas

pr closed time in a month

issue closeddirty-cat/dirty_cat

Minhash encoder does not perform well on the dirty-cat examples

To solve this, the parameter n_components of the MinHashEncoder should be set to 30 at least.

closed time in a month

pcerda

pull request commentdirty-cat/dirty_cat

Fix: Minhash encoder does not perform well on the dirty-cat examples

Great! Merging. Thanks!!

TwsThomas

comment created time in a month

Pull request review commentdirty-cat/dirty_cat

Fix: Minhash encoder does not perform well on the dirty-cat examples

 def make_pipeline(encoding_method): # Plotting the results # -------------------- # Finally, we plot the scores on a boxplot:-# We notice that the MinHashEncoder performs poorly compared to other encoding-# methods. There are two reasons for that: the MinHashEncoder performs better-# with tree-based models than linear models (see example 03), and the-# low-dimensionality of encodings (increasing n_components improves-# performances.+# We notice that the MinHashEncoder does not performs as well compared to +# other encoding methods.+# There are two reasons for that: the MinHashEncoder performs better+# with tree-based models than linear models (+# :ref:`see example 03<sphx_glr_auto_examples_03_fit_predict_plot_midwest_survey.html#sphx-glr-auto-examples-03-fit-predict-plot-midwest-survey-py>`)

The pointer to the ref is actually not the name of the file. You need to look at the source code of the generated file to figure it out. I believe that it is sphx_glr_auto_examples_03_fit_predict_plot_midwest_survey.py

TwsThomas

comment created time in a month

Pull request review commentdirty-cat/dirty_cat

Fix: Minhash encoder does not perform well on the dirty-cat examples

 class MinHashEncoder(BaseEstimator, TransformerMixin):     Parameters     ----------     n_components : int
    n_components : int, default=30
TwsThomas

comment created time in a month

Pull request review commentdirty-cat/dirty_cat

Fix: Minhash encoder does not perform well on the dirty-cat examples

 encoder_dict = {     'one-hot': OneHotEncoder(handle_unknown='ignore', sparse=False),     'similarity': SimilarityEncoder(similarity='ngram'),-    'minhash': MinHashEncoder(n_components=10, ngram_range=(2, 4),+    'minhash': MinHashEncoder(n_components=30, ngram_range=(2, 4),

Same comment on the parameters that take their default values.

Also, in the light of this PR, I suggest putting a default value for n_components, and setting it to 30.

TwsThomas

comment created time in a month

Pull request review commentdirty-cat/dirty_cat

Fix: Minhash encoder does not perform well on the dirty-cat examples

 def make_pipeline(encoding_method): # Plotting the results # -------------------- # Finally, we plot the scores on a boxplot:-# We notice that the MinHashEncoder performs poorly compared to other encoding-# methods. There are two reasons for that: the MinHashEncoder performs better-# with tree-based models than linear models (see example 03), and the-# low-dimensionality of encodings (increasing n_components improves-# performances.+# We notice that the MinHashEncoder does not performs as well compared to +# other encoding methods.+# There are two reasons for that: the MinHashEncoder performs better+# with tree-based models than linear models (see example 03), and also

We could use the :ref: syntax to link to example 03.

TwsThomas

comment created time in a month

Pull request review commentdirty-cat/dirty_cat

Fix: Minhash encoder does not perform well on the dirty-cat examples

     'one-hot': OneHotEncoder(handle_unknown='ignore', sparse=False),     'similarity': SimilarityEncoder(similarity='ngram'),     'target': TargetEncoder(handle_unknown='ignore'),-    'minhash': MinHashEncoder(n_components=10, ngram_range=(2, 4),+    'minhash': MinHashEncoder(n_components=100, ngram_range=(2, 4),
    'minhash': MinHashEncoder(n_components=100,

If I am not wrong, this is the default parameter, so we can remove it (as well as the ones after), with the goal of making the example simpler.

TwsThomas

comment created time in a month

pull request commentdirty-cat/dirty_cat

Fix: Minhash encoder does not perform well on the dirty-cat examples

Sounds good!!

Maybe we should put a discussion somewhere that the number of components is a tradeoff: numbers around 300 tend to lead to good prediction performance, but with more computational cost.

This could go at least in the docstring for the corresponding parameter, but maybe also in the example where we use a large number of components.

TwsThomas

comment created time in a month

pull request commentscikit-learn/scikit-learn

[MRG+3] FEA Add PolynomialCountSketch to Kernel Approximation module

Woohoo! Great! Exciting!

lopeLH

comment created time in a month

issue commentscikit-learn/scikit-learn

Discuss Boostrap 632

Bootstrap 632 fails completely on models with massive overfit (such as 1-nearest-neighbor). We would need bootstrap 632+, but this one is computationally intractable.

See the following post for an explanation of the failure of 632: https://stats.stackexchange.com/questions/96739/what-is-the-632-rule-in-bootstrapping
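
For reference, the .632 rule combines the training (resubstitution) error with the leave-one-out bootstrap error; a tiny sketch of why a 1-nearest-neighbor breaks it:

    # .632 bootstrap estimate (Efron & Tibshirani):
    #   err_632 = 0.368 * training_error + 0.632 * out_of_bootstrap_error
    # A 1-nearest-neighbor classifier has training_error == 0, so the estimate
    # collapses to 0.632 * out_of_bootstrap_error and is overly optimistic.
    def err_632(training_error, out_of_bootstrap_error):
        return 0.368 * training_error + 0.632 * out_of_bootstrap_error

    print(err_632(0.0, 0.5))  # 0.316, even though the true error can be close to 0.5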

amueller

comment created time in a month

issue commentsphinx-gallery/sphinx-gallery

Reference :ref:`sphx_glr_auto_examples` goes to SciKit-Learn examples

revealed that I was indeed using the wrong reference.

That's a strange manifestation of such an error.

riedgar-ms

comment created time in a month

Pull request review commentdirty-cat/dirty_cat

[MRG] minhash handle missing values

 def test_get_unique_ngrams():     enc = MinHashEncoder(n_components=2)     ngrams = enc.get_unique_ngrams(string, ngram_range)     assert ngrams == true_ngrams-        ++ def profile_encoder(Encoder, hashing='fast', minmax_hash=False):     # not an unit test-    +     from dirty_cat import datasets     import pandas as pd     employee_salaries = datasets.fetch_employee_salaries()-    data = pd.read_csv(employee_salaries['path'])-    X = data['Employee Position Title'].tolist()+    data = employee_salaries['data']+    X = data['employee_position_title'].tolist()     X = X * 10     t0 = time.time()-    encoder = Encoder(n_components=50, hashing=hashing, minmax_hash=minmax_hash)+    encoder = Encoder(n_components=50, hashing=hashing,+                      minmax_hash=minmax_hash)     encoder.fit(X)     y = encoder.transform(X)     assert y.shape == (len(X), 50)     eta = time.time() - t0     return eta  +@pytest.mark.parametrize("input_type", ['list', 'numpy', 'pandas'])+@pytest.mark.parametrize("missing", ['aaa', 'error', ''])+@pytest.mark.parametrize("hashing", ["fast", "murmur"])

Sorry to nitpick, but the tensor product of options here leads to an expensive combinatorial explosion: this test is taking more time than it should. Could we downsample here? Something like:

@pytest.mark.parametrize("input_type,missing,hashing", [
    ['list', 'aaa', 'fast'],
    ...
])

for a few well-chosen values

amy12xx

comment created time in a month

Pull request review commentdirty-cat/dirty_cat

[MRG] minhash handle missing values

 import numpy as np import time+import pytest+from sklearn.utils._testing import assert_raise_message

I worry that we are importing from a private submodule of sklearn, and this may break without warning.

I think that I would prefer that we use the pattern listed here: https://docs.pytest.org/en/stable/assert.html#assertions-about-expected-exceptions
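
A sketch of that pattern for this test (assuming the MinHashEncoder API from this PR; 'aaa' is just an invalid handle_missing value, as in the parametrization above):

    import numpy as np
    import pytest
    from dirty_cat import MinHashEncoder

    X = np.array(['Red', 'green', 'blue'], dtype=object)
    encoder = MinHashEncoder(n_components=2, handle_missing='aaa')  # invalid value
    with pytest.raises(ValueError, match="handle_missing should be either"):
        encoder.fit_transform(X)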

amy12xx

comment created time in a month

Pull request review commentdirty-cat/dirty_cat

[MRG] minhash handle missing values

 def transform(self, X):             X = X[:, 0]         # Check if first item has str or np.str_ type         assert isinstance(X[0], str), "ERROR: Input data is not string."+        if self.minmax_hash:+            assert self.n_components % 2 == 0,\+                    "n_components should be even when minmax_hash=True"+        if self.hashing == 'murmur':+            assert not(self.minmax_hash),\+                   "minmax_hash not implemented with murmur"+        if self.handle_missing not in ['error', '']:+            template = ("handle_missing should be either 'error' or "+                        "'', got %s")+            raise ValueError(template % self.handle_missing)         X_out = np.zeros((len(X), self.n_components))+        X = X.reshape(-1)          # TODO Parallel run here-        for i, x in enumerate(X):-            if x not in self.hash_dict:-                self.hash_dict[x] = self.get_hash(x)+        nan_idx = []++        if self.hashing == 'fast':+            for i, x in enumerate(X):+                if isinstance(x, float):+                    nan_idx.append(i)+                elif x not in self.hash_dict:+                    self.hash_dict[x] = self.get_fast_hash(x)+                    X_out[i, :] = self.hash_dict[x]+                else:+                    X_out[i, :] = self.hash_dict[x]++        elif self.hashing == 'murmur':+            for i, x in enumerate(X):+                if isinstance(x, float):+                    nan_idx.append(i)+                elif x not in self.hash_dict:+                    self.hash_dict[x] = self.minhash(+                        x,+                        n_components=self.n_components,+                        ngram_range=self.ngram_range)+                    X_out[i, :] = self.hash_dict[x]

Same comment about avoiding a lookup in hash_dict

amy12xx

comment created time in a month

Pull request review commentdirty-cat/dirty_cat

[MRG] minhash handle missing values

 def transform(self, X):             X = X[:, 0]         # Check if first item has str or np.str_ type         assert isinstance(X[0], str), "ERROR: Input data is not string."+        if self.minmax_hash:+            assert self.n_components % 2 == 0,\+                    "n_components should be even when minmax_hash=True"+        if self.hashing == 'murmur':+            assert not(self.minmax_hash),\+                   "minmax_hash not implemented with murmur"+        if self.handle_missing not in ['error', '']:+            template = ("handle_missing should be either 'error' or "+                        "'', got %s")+            raise ValueError(template % self.handle_missing)         X_out = np.zeros((len(X), self.n_components))+        X = X.reshape(-1)          # TODO Parallel run here-        for i, x in enumerate(X):-            if x not in self.hash_dict:-                self.hash_dict[x] = self.get_hash(x)+        nan_idx = []++        if self.hashing == 'fast':+            for i, x in enumerate(X):+                if isinstance(x, float):+                    nan_idx.append(i)+                elif x not in self.hash_dict:+                    self.hash_dict[x] = self.get_fast_hash(x)+                    X_out[i, :] = self.hash_dict[x]

I think that the following would work:

                    X_out[i, :] = self.hash_dict[x] = self.get_fast_hash(x)

It should be faster, as it would avoid a value lookup in hash_dict, which is probably not completely cheap.

amy12xx

comment created time in a month

Pull request review commentdirty-cat/dirty_cat

[MRG] minhash handle missing values

 def transform(self, X):             X = X[:, 0]         # Check if first item has str or np.str_ type         assert isinstance(X[0], str), "ERROR: Input data is not string."+        if self.minmax_hash:+            assert self.n_components % 2 == 0,\+                    "n_components should be even when minmax_hash=True"+        if self.hashing == 'murmur':+            assert not(self.minmax_hash),\+                   "minmax_hash not implemented with murmur"+        if self.handle_missing not in ['error', '']:+            template = ("handle_missing should be either 'error' or "+                        "'', got %s")+            raise ValueError(template % self.handle_missing)         X_out = np.zeros((len(X), self.n_components))+        X = X.ravel()

OK. I worry about data being put in with a shape of (N, P), where the ravel could have "interesting" consequences. Maybe this cannot happen, and we can worry about it later, if we hit the problem.

amy12xx

comment created time in a month

issue commentdirty-cat/dirty_cat

AttributeError: 'tuple' object has no attribute 'shape'

Great! Thanks a lot!!

AC-Meira

comment created time in a month

pull request commentscikit-learn/scikit-learn

DOC on issue triaging process

Thank you @NicolasHug !!

GaelVaroquaux

comment created time in a month

pull request commentscikit-learn/scikit-learn

Try to streamline the issue triaging process

@cmarmo is right. I'm currently swamped and having a hard time getting back to this PR.

@thomasjpfan @NicolasHug , are you OK merging this, and @cmarmo does a PR to iterate on top?

GaelVaroquaux

comment created time in a month

pull request commentscikit-learn/scikit-learn

[MRG] Adding explained variances to sparse pca

You're really a good sport @Balatex ! Thanks a lot for pushing this. Sorry it took us so long to realize the fundamental issue.

Batalex

comment created time in a month

push eventGaelVaroquaux/test-split-prediction-models

Gael Varoquaux

commit sha ffa81a98e79cb3a84f4d94d1455f404a88240402

Figure

view details

push time in a month

push eventGaelVaroquaux/test-split-prediction-models

Gael Varoquaux

commit sha d13ae3f01e07997ea194e8b3916cc3efaea0592d

readme

view details

push time in a month

push eventGaelVaroquaux/test-split-prediction-models

Gael Varoquaux

commit sha bc1292d5f3acacdba1615ed5131591b6bab91b3b

sty

view details

push time in a month

push eventGaelVaroquaux/test-split-prediction-models

Gael Varoquaux

commit sha 4b8572ae14d69a14b8257a12f219f16013fd364d

sty

view details

push time in a month

push eventGaelVaroquaux/test-split-prediction-models

Gael Varoquaux

commit sha ce830cb5c4f176829fde2da070d53ab652ecb20d

sty

view details

push time in a month

push eventGaelVaroquaux/test-split-prediction-models

Gael Varoquaux

commit sha 981c36bdca458bb35a3abd421c4c72d55fab5ec1

sty

view details

push time in a month

push eventGaelVaroquaux/test-split-prediction-models

Gael Varoquaux

commit sha 18002486cad235af3cf9d3c074c36852c880b343

sty

view details

push time in a month

push eventGaelVaroquaux/test-split-prediction-models

Gael Varoquaux

commit sha d8f395d731fe491ee347cfb93987ca261b8e8b17

Update README

view details

push time in a month

push eventGaelVaroquaux/test-split-prediction-models

Gael Varoquaux

commit sha a8472ea0aa7c5995ed59d111048953b576f6cda8

Didactic notebook

view details

push time in a month

push eventGaelVaroquaux/test-split-prediction-models

Gael Varoquaux

commit sha 8dc4ee06ebd9ee3adf3e6b523046bc5e92520185

Add a simple example

view details

push time in a month

create barnchGaelVaroquaux/test-split-prediction-models

branch : master

created branch time in a month

created repositoryGaelVaroquaux/test-split-prediction-models

Evidence for prediction requires separating train and test data

created time in a month

Pull request review commentdirty-cat/dirty_cat

[MRG] minhash handle missing values

 def test_get_unique_ngrams():     enc = MinHashEncoder(n_components=2)     ngrams = enc.get_unique_ngrams(string, ngram_range)     assert ngrams == true_ngrams-        ++ def profile_encoder(Encoder, hashing='fast', minmax_hash=False):     # not an unit test-    +     from dirty_cat import datasets     import pandas as pd     employee_salaries = datasets.fetch_employee_salaries()-    data = pd.read_csv(employee_salaries['path'])-    X = data['Employee Position Title'].tolist()+    data = employee_salaries['data']+    X = data['employee_position_title'].tolist()     X = X * 10     t0 = time.time()-    encoder = Encoder(n_components=50, hashing=hashing, minmax_hash=minmax_hash)+    encoder = Encoder(n_components=50, hashing=hashing,+                      minmax_hash=minmax_hash)     encoder.fit(X)     y = encoder.transform(X)     assert y.shape == (len(X), 50)     eta = time.time() - t0     return eta  +def _test_missing_values(input_type, missing):+    X = ['Red',+         np.nan,+         'green',+         'blue',+         'green',+         'green',+         'blue',+         float('nan')]+    n = 3+    z = np.zeros(n)++    if input_type == 'numpy':+        X = np.array(X, dtype=object)+    elif input_type == 'pandas':+        pd = pytest.importorskip("pandas")+        X = pd.DataFrame(X)++    for hashing in ['fast', 'murmur']:+        encoder = MinHashEncoder(n_components=n, hashing=hashing,+                                 minmax_hash=False, handle_missing=missing)+        if missing == 'error':+            encoder.fit(X)+            if input_type in ['numpy', 'pandas']:+                msg = ("Found missing values in input data; set "+                       "handle_missing='' to encode with missing values")+                assert_raise_message(ValueError, msg, encoder.transform, X)+        elif missing == '':+            encoder.fit(X)+            y = encoder.transform(X)+            if input_type == 'list':+                assert np.allclose(y[1], y[-1])+            else:+                assert np.array_equal(y[1], z)+                assert np.array_equal(y[-1], z)+        else:+            msg = "handle_missing should be either 'error' or '', got %s" % missing+            assert_raise_message(ValueError, msg, encoder.fit_transform, X)+            return+++def test_missing_values():

Maybe here we should be using pytest.parametrize:

https://docs.pytest.org/en/stable/parametrize.html
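
For instance, a sketch reusing the helper from the diff (the parameter values mirror those already used in _test_missing_values):

    import pytest

    @pytest.mark.parametrize("input_type", ['list', 'numpy', 'pandas'])
    @pytest.mark.parametrize("missing", ['aaa', 'error', ''])
    def test_missing_values(input_type, missing):
        _test_missing_values(input_type, missing)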

amy12xx

comment created time in a month

Pull request review commentdirty-cat/dirty_cat

[MRG] minhash handle missing values

 def transform(self, X):             X = X[:, 0]         # Check if first item has str or np.str_ type         assert isinstance(X[0], str), "ERROR: Input data is not string."+        if self.minmax_hash:+            assert self.n_components % 2 == 0,\+                    "n_components should be even when minmax_hash=True"+        if self.hashing == 'murmur':+            assert not(self.minmax_hash),\+                   "minmax_hash not implemented with murmur"+        if self.handle_missing not in ['error', '']:+            template = ("handle_missing should be either 'error' or "+                        "'', got %s")+            raise ValueError(template % self.handle_missing)         X_out = np.zeros((len(X), self.n_components))+        X = X.ravel()          # TODO Parallel run here-        for i, x in enumerate(X):-            if x not in self.hash_dict:-                self.hash_dict[x] = self.get_hash(x)+        nan_idx = []++        if self.hashing == 'fast':+            for i, x in enumerate(X):+                if isinstance(x, float):+                    nan_idx.append(i)+                elif x not in self.hash_dict:+                    self.hash_dict[x] = self.get_fast_hash(x)

I think that we need to populate X_out in this loop and remove the one below. The reason being that we can exceed the cache of hash_dict (a bug in the current code).
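
A sketch of what that could look like (names taken from the diff above):

    for i, x in enumerate(X):
        if isinstance(x, float):  # missing value
            nan_idx.append(i)
        elif x not in self.hash_dict:
            X_out[i, :] = self.hash_dict[x] = self.get_fast_hash(x)
        else:
            X_out[i, :] = self.hash_dict[x]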

amy12xx

comment created time in a month

Pull request review commentdirty-cat/dirty_cat

[MRG] minhash handle missing values

 def transform(self, X):             X = X[:, 0]         # Check if first item has str or np.str_ type         assert isinstance(X[0], str), "ERROR: Input data is not string."+        if self.minmax_hash:+            assert self.n_components % 2 == 0,\+                    "n_components should be even when minmax_hash=True"+        if self.hashing == 'murmur':+            assert not(self.minmax_hash),\+                   "minmax_hash not implemented with murmur"+        if self.handle_missing not in ['error', '']:+            template = ("handle_missing should be either 'error' or "+                        "'', got %s")+            raise ValueError(template % self.handle_missing)         X_out = np.zeros((len(X), self.n_components))+        X = X.ravel()

This ravel is new, isn't it?

Why do we need it? I wonder if it won't have side effects.

amy12xx

comment created time in a month

Pull request review commentdirty-cat/dirty_cat

[MRG] minhash handle missing values

 def get_hash(self, string):         array, shape (n_components, )             The encoded string, using specified encoding scheme.         """-        if self.hashing == 'fast':-            if self.minmax_hash:-                assert self.n_components % 2 == 0,\-                       "n_components should be even when minmax_hash=True"-                return np.concatenate([ngram_min_hash(string, self.ngram_range,-                                                      seed, return_minmax=True)-                                      for seed in range(self.n_components//2)])-            else:-                return np.array([ngram_min_hash(string, self.ngram_range, seed)-                                for seed in range(self.n_components)])+        if self.minmax_hash:+            return np.concatenate([ngram_min_hash(string, self.ngram_range,+                                   seed, return_minmax=True)+                                   for seed in range(self.n_components//2)])+        else:+            return np.array([ngram_min_hash(string, self.ngram_range, seed)+                            for seed in range(self.n_components)]) -        elif self.hashing == 'murmur':-            assert not(self.minmax_hash),\-                   "minmax_hash not implemented with murmur"-            return self.minhash(+    def get_murmur_hash(self, string):+        """+        Encode a string with murmur hashing function.+        murmur only supports min_hash encoding.+        Parameters+        ----------+        string : str+            The string to encode.+        Returns+        -------+        array, shape (n_components, )+            The encoded string, using specified encoding scheme.+        """+        return self.minhash(

When I see a function that small, I find that we have one level of indirection too much. Can we just inline the "minhash" method in here and remove it?

amy12xx

comment created time in a month

push eventGaelVaroquaux/gaelvaroquaux.github.io

Gael Varoquaux

commit sha c3abaecfca19580278ac7e1c072548d00a13e15c

Make install

view details

push time in a month

push eventGaelVaroquaux/gaelvaroquaux.github.io

Gael Varoquaux

commit sha 8fdb61be624c56a580f0d538e0c4e4714ca3e902

Make install

view details

push time in a month

more