Andreas Mueller (amueller), Columbia University Data Science Institute, NYC. http://amueller.github.io
Scikit-learn core developer, Research Scientist at the Columbia Data Science Institute.

amueller/COMS4995-s19 240

COMS W4995 Applied Machine Learning - Spring 19

amueller/COMS4995-s18 152

COMS W4995 Applied Machine Learning - Spring 18

amueller/advanced_training 102

Advanced Scikit-learn training session

amueller/applied_ml_spring_2017 82

Website and material for the FIXME course on Practical Machine Learning

amueller/COMS4995-s20 53

COMS W4995 Applied Machine Learning - Spring 20

amueller/advanced_git_nyu_2016 40

Advanced git and github course material

amueller/dabl 18

Data Analysis Baseline Library

amueller/advanced-sklearn-boston-nlp-2016 17

Material and slides for Boston NLP meetup May 23rd 2016

amueller/damascene-python-and-matlab-bindings 11

Python and matlab bindings for the Damascene CUDA implementation of gPB

issue comment scikit-learn/scikit-learn

Include MIDAS in Sklearn

Yes indeed. Which of the existing algorithms would you say it's improving upon?

To me it looks to be an algorithm for an entirely new task, anomaly detection in edge streams, so I don't see how this would improve over existing tools in scikit-learn.

Also, scikit-learn doesn't have an API for streaming data or graph data (except as represented as sparse matrices, but I don't see how edge labeling could fit into the API there).

bhatiasiddharth

comment created time in 2 hours

push event dabl/dabl.github.io

dabl

commit sha 4d9b6654102f91bb5f474c5253ea35ae5607b280

Pushing the docs to dev/ for branch: master, commit c7f0d437540c35eabfa9470839adadb2cb613ce8

view details

push time in 2 hours

PR closed dabl/dabl

added seaborn jointplot

added seaborn jointplot as well as a function to remove rows containing outliers from a two-column dataset

+70 -9

6 comments

4 changed files

ADMoreau

pr closed time in 2 hours

push event dabl/dabl

Leonardo Gonzalez

commit sha c7f0d437540c35eabfa9470839adadb2cb613ce8

cache for float matching (#191)

* fixed typos and improved wording in docs
* shortened more ticks names and labels
* cache for float matching
  Reduces unnecessary multiple calls to _float_matching by storing results of previous function calls in a module-level cache. Eliminates multiple warnings during recursive use of _float_matching. Standardizes use of col for column name and X_col for the actual column.
* cleanup
* debugging output for failed test
* fixed models test
  used hash instead of id, reorganized imports, removed debug code
* add module-wide map of mixed-type warnings
  Multiple warnings would be issued during execution of _find_string_floats through use of X.apply(_float_matching). Added a module-wide map to look up and limit number of mixed-type warnings to one per column name.
* added test_dirty_float_single_warning
  Regression check to prevent multiple mixed-type warnings issued per column. After adding module-wide map, ONE warning is shown in notebook, but NO warnings are shown for regression test. Therefore, test checks for # of warnings < 2
* remove unnecessary already_warned parameter
  now using module-wide map
* fixed flake8 errors
* reverse test order for single-warning test

Co-authored-by: Leonardo Gonzalez <50969409+paa-leonardo@users.noreply.github.com>

view details

push time in 2 hours

PR merged dabl/dabl

cache for float matching

Fixes #164

Reduces unnecessary multiple calls to _float_matching by storing results of previous function calls in a module-level cache.

Eliminates multiple warnings during recursive use of _float_matching.

Standardizes use of col for column name and X_col for the actual column.
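For illustration, a minimal sketch of the module-level caching pattern this PR describes (the wrapper name is made up; this is not dabl's actual code):

_FLOAT_MATCHING_CACHE = {}  # module-level cache, keyed by column identity and name

def _float_matching_cached(X_col, col):
    # Hypothetical wrapper: reuse the previous result for the same column
    # instead of recomputing (and re-warning) on recursive calls.
    key = (id(X_col), col)
    if key not in _FLOAT_MATCHING_CACHE:
        _FLOAT_MATCHING_CACHE[key] = _float_matching(X_col)  # dabl's internal helper
    return _FLOAT_MATCHING_CACHE[key]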

+72 -29

10 comments

2 changed files

encryptedcommerce

pr closed time in 2 hours

issue closed dabl/dabl

calling clean on mixed type dirty float gives three warnings

import numpy as np
import pandas as pd
from dabl import clean

rng = np.random.RandomState(0)
# 100 floats formatted as strings, i.e. a "dirty float" column
cont_clean = ["{:2.2f}".format(x) for x in rng.uniform(size=100)]

dirty3 = pd.Series(cont_clean)
# sprinkle non-numeric entries into every 20th row
dirty3[::20] = [("missing", "but weird")] * 5

X = pd.DataFrame({'dirty3': dirty3})
clean(X)  # emits three mixed-type warnings

So probably we're doing too much work. Also a bit noisy.

closed time in 2 hours

amueller

pull request comment dabl/dabl

cache for float matching

Looks good, thanks. Sorry for the slow reply!

encryptedcommerce

comment created time in 2 hours

pull request comment dabl/dabl

fix one-hot encoding in EasyPreprocessor

Thanks for investigating. Sorry for the slow reply! Can you make sure that all of your changes are explicitly tested please?

sanjaradylov

comment created time in 2 hours

issue comment scikit-learn/enhancement_proposals

SLEP 007 feature-name generation: adding a constructor argument for verbosity

Would you be happy rephrasing it as in #38?

amueller

comment created time in 2 hours

PR opened scikit-learn/enhancement_proposals

make more explicit that verbose_feature_names is not required

Addressing my concern in #32, this rephrases verbose_feature_names as an optional extension.

+23 -14

0 comment

1 changed file

pr created time in 2 hours

create branch amueller/enhancement_proposals

branch : verbose_feature_names_edits

created branch time in 2 hours

push event dabl/dabl.github.io

dabl

commit sha 5316f3767904599698e760638b77add91d085c59

Pushing the docs to dev/ for branch: master, commit 35c7f873c40b57c5b2bb04b99ecc8aaec3fdef0f

view details

push time in 2 hours

issue comment dabl/dabl

drop outliers in regression target for plotting

Great!

amueller

comment created time in 2 hours

issue comment dabl/dabl

Add 2d density plots as alternative to scatterplot

That looks cool. Though maybe it would be even better to also color the contour plot?

amueller

comment created time in 2 hours

issue closed dabl/dabl

Suggestion - link to pandas_profiling in README

The Pandas Profiling package is great for initial univariate analysis (distributions, missing data, outliers) and for generating a data report. I teach it alongside dabl; maybe link to it from the README as it is complementary?

https://github.com/pandas-profiling/pandas-profiling
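For context, typical usage of the package looks roughly like this (a sketch; the file names are made up):

import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("data.csv")    # any DataFrame
report = ProfileReport(df)      # univariate stats, missing data, outliers
report.to_file("report.html")   # generates the data report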

closed time in 2 hours

ianozsvald

issue comment dabl/dabl

Suggestion - link to pandas_profiling in README

Fixed in #213.

ianozsvald

comment created time in 2 hours

push event dabl/dabl

kathrynle20

commit sha 35c7f873c40b57c5b2bb04b99ecc8aaec3fdef0f

Update README.md (#213)

* Update README.md
  I added the complementary Pandas Profiling link, installation line, and a small description.
* Update README.md

Co-authored-by: Andreas Mueller <t3kcit@gmail.com>

view details

push time in 2 hours

PR merged dabl/dabl

Update README.md

I added the complementary Pandas Profiling link, installation line, and a small description.

+11 -0

3 comments

1 changed file

kathrynle20

pr closed time in 2 hours

push event kathrynle20/dabl

Andreas Mueller

commit sha a446e0a1d092f715d977149711ff7698c2e1caf8

Update README.md

view details

push time in 2 hours

issue closed dabl/dabl

very imbalanced datasets make bar plots useless

we should adjust scales by 10x if that helps maybe?

Or rather: have different axes for each one and plot the correct axes. Maybe we should see how much slower that makes it?
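One way to realize the per-axes idea (a hedged matplotlib sketch, not the actual fix that landed in #173):

import matplotlib.pyplot as plt

def class_barplots(feature, target):
    # feature, target: pandas Series of equal length.
    # One subplot per class, each with its own y-scale, so counts for the
    # minority class are not squashed by the majority class.
    classes = target.unique()
    fig, axes = plt.subplots(1, len(classes), squeeze=False)
    for ax, cls in zip(axes[0], classes):
        feature[target == cls].value_counts().plot.bar(ax=ax)
        ax.set_title(str(cls))
    return fig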

closed time in 2 hours

amueller

issue comment dabl/dabl

very imbalanced datasets make bar plots useless

fixed in #173

amueller

comment created time in 2 hours

pull request comment amueller/word_cloud

Add extra arguments to CLI

Thanks for the PR. Can you please add tests and make sure CI passes? I think there is some issue with Python 3.4, which should probably just be dropped, so don't worry about that one too much.

dm-logv

comment created time in 4 hours

pull request comment scikit-learn/enhancement_proposals

Accept SLEP013

+1 on accepting the SLEP. I agree with @jnothman, but I'm unsure about the process of changing the SLEP during a vote.

adrinjalali

comment created time in 4 hours

pull request comment scikit-learn/enhancement_proposals

SLEP 014 Pandas in Pandas out

I think the scope is only transform. If that's not clear from the SLEP, it needs to be clarified. Or are you saying it should also be predict etc.? I'm not sure how I feel about that ;)

thomasjpfan

comment created time in 5 hours

push event amueller/columbia-website

Andreas Mueller

commit sha 29e27b529ad9d07371cb7ddf0950530615676e44

add links for gradient boosting

view details

push time in 5 hours

push event amueller/COMS4995-s20

Andreas Mueller

commit sha 43a93c63b9e5d3ec046cacfd00b5494bce6036eb

fix typo in matplotlib slide

view details

push time in 5 hours

push event amueller/COMS4995-s20

Andreas Mueller

commit sha f3bd66574d2b52df75024621391ab9f43fc5d25e

minor polish for gradient boosting lecture

view details

push time in 6 hours

push event amueller/COMS4995-s20

Andreas Mueller

commit sha 315c8bbb95eb999e1333c6e9da08c7466e8ef9e8

add some stuff on catboost

view details

push time in 8 hours

push event amueller/COMS4995-s20

Andreas Mueller

commit sha 4f5f6971abd4d8ca6f9e3cfb6661f6db542a8884

polish on boosting slides, add more on binning

view details

push time in 8 hours

push event amueller/COMS4995-s20

Andreas Mueller

commit sha 12869c7a0dbaf064a4fc2c5ea778f2f1812c5c39

rename lecture 08 to gradient boosting

view details

push time in 9 hours

push event amueller/COMS4995-s20

Andreas Mueller

commit sha 8949ae171734ad24066b2bfe1be00d7f3eefe7f4

remove calibration from gradient boosting lecture

view details

push time in 9 hours

push event amueller/COMS4995-s20

Andreas Mueller

commit sha 7c3f3ea4db32e93aabb9c275bcbad59713a5e90a

add gradient descent and SGD to gradient boosting

view details

Andreas Mueller

commit sha 0f97ef5385729423da9bfe56f092e3e70b8db8e5

fix squared loss formula in linear regression models

view details

push time in 10 hours

push event scikit-learn/enhancement_proposals

adrinjalali

commit sha c991dce789c773a6f619ee4e3c7fe760dbe6d3bf

initial writeup of the slep

view details

adrinjalali

commit sha 0b25e2f5e78ba0b7df6f53a2e0c77d2876134965

clarify on xarray

view details

adrinjalali

commit sha ef37bab9c3e7c9bf95b087f2250a8d2db98b33ce

address more comments

view details

adrinjalali

commit sha 85513f70b4bf7dedd90adfa0a686c3bf680a7ee2

Merge remote-tracking branch 'upstream/master' into slep012/DataArray

view details

adrinjalali

commit sha 48ce7f40175c37fbaa95ed36aec1cd640787bcea

add headers

view details

Andreas Mueller

commit sha 37ab0c104adeeea681fc507a50f46ae89c25b42c

Merge pull request #25 from adrinjalali/slep012/DataArray SLEP012 - DataArray

view details

push time in a day

PR merged scikit-learn/enhancement_proposals

SLEP012 - DataArray

This SLEP discusses a new data structure which would carry some meta-data along with the data itself. The feature names are the first set of meta-data we would implement, but it could grow from that point.

Things we need to decide to continue the write-up of this SLEP:

  • scope 1: do we want to have other meta-data here in the future? A NamedArray with only feature names may be another SLEP?
  • scope 2: should this SLEP stick to feature names, or should it discuss other meta-data?
  • implementation details: do we want to include implementation details here?
  • the name: it started as NamedArray, but we seem to lean towards including other meta-data than feature names, so we probably should use another name. DataArray is a placeholder; happy to change.
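For intuition only, not the SLEP's reference implementation: such a structure could be sketched as a NumPy subclass carrying feature names.

import numpy as np

class DataArray(np.ndarray):
    # Hypothetical sketch: an ndarray that carries feature names as meta-data.
    def __new__(cls, data, feature_names=None):
        obj = np.asarray(data).view(cls)
        obj.feature_names = list(feature_names) if feature_names is not None else None
        return obj

    def __array_finalize__(self, obj):
        # propagate the meta-data through views and slices
        if obj is None:
            return
        self.feature_names = getattr(obj, 'feature_names', None)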
+136 -0

36 comments

2 changed files

adrinjalali

pr closed time in a day

pull request comment scikit-learn/enhancement_proposals

SLEP012 - DataArray

Sure. I think the proposal is reasonably complete, even though I don't expect a vote on it in the current form very soon.

adrinjalali

comment created time in a day

issue comment scikit-learn/scikit-learn

RFC: prefixing output feature names in ColumnTransformer with passthrough

I don't have a strong opinion. I thought the name of the transformer was remainder (say if you want to set a hyper-parameter), so I expected the feature to be prefixed by remainder in this case. But I'm happy to go with what the majority wants.

adrinjalali

comment created time in a day

pull request comment scikit-learn/enhancement_proposals

SLEP012 - DataArray

Maybe coordinate with @thomasjpfan, who I think is working on a new SLEP.

adrinjalali

comment created time in a day

pull request comment scikit-learn/enhancement_proposals

SLEP012 - DataArray

I'm happy to merge it as it is. And doing pandas or xarray is substantially more work I think. Though I guess if we could compartmentalize it into some helper functions for wrapping and unwrapping it might not be so bad.

adrinjalali

comment created time in a day

pull request comment scikit-learn/scikit-learn

RFC Implement Pipeline get feature names

This PR is very incomplete and doesn't work. You can check out #13307 for all the issues that happen if you actually try to do this. You're assuming a pipeline; what if the outermost layer is any other meta-estimator?

amueller

comment created time in a day

issue comment scikit-learn/scikit-learn

Deprecate _estimator_type, replace by estimator tag

That was my intention. Right now, both _pairwise and _estimator_type are already used in the code for cross-validation purposes.

amueller

comment created time in a day

pull request comment scikit-learn/enhancement_proposals

SLEP012 - DataArray

I agree with @adrinjalali, only I would actually use pandas, not xarray. I think having fewer dependencies and types is better. The main reason I had for xarray vs pandas was the zero-copy issue, and I've come to give that less weight (given that the future of pandas is unclear and that we copy a lot anyway).

Sorry for going back and forth on this. Main question: should we have a separate SLEP for using an existing "array" type (pandas is not really an array, I know)? That will make voting and discussion harder and harder. I wonder if maybe discussing this on a call would be easier so we can sync up? I would love to hear @GaelVaroquaux's, @jnothman's and @jorisvandenbossche's take.

adrinjalali

comment created time in a day

push event dabl/dabl.github.io

dabl

commit sha 8efc086365fec05844a73d0ca3d191cab4f600e5

Pushing the docs to dev/ for branch: master, commit fdcf83098f1a2055aebadf1314c45b8bcb5145f3

view details

push time in a day

push event dabl/dabl.github.io

dabl

commit sha e88f331f372cc474b25ea46a20aa970a9d77edaf

Pushing the docs to dev/ for branch: master, commit dc51e5d445cd774d318eaedbf4255c1ff0a2cf85

view details

push time in a day

push event dabl/dabl

Selçuk Gülcan

commit sha fdcf83098f1a2055aebadf1314c45b8bcb5145f3

fix EasyProcessor get_feature_names (#216)

* test case added for issue 211
* fix EasyPreprocessor get_feature_names function gives error when new columns are added by SimpleImputer
* added assert statement to simple preprocessor imputed features test

view details

push time in a day

PR merged dabl/dabl

fix EasyProcessor get_feature_names

Fix #211

Reason: SimpleImputer adds extra columns for features that have NaN values. This behavior breaks the OneHotEncoder get_feature_names call.

+21 -2

4 comments

2 changed files

Shathra

pr closed time in a day

issue closed dabl/dabl

EasyPreprocessor get_feature_names seems broken

from dabl.datasets import load_ames
from dabl import SimpleRegressor

ames = load_ames()
sr = SimpleRegressor().fit(ames, target_col='SalePrice')
sr.est_[0].get_feature_names()
# raises:
# ValueError: input_features should have length equal to number of features (61), got 46

closed time in a day

amueller

pull request comment dabl/dabl

fix EasyProcessor get_feature_names

I'm definitely a bit inconsistent with the granularity of testing. Generally I think it's OK to test things that are implementation details if they help you to ensure correctness. If we deliberately change the implementation, the test will break, but we will expect it to break and we can change the test. But if the test breaks when we didn't intend to break it, something went wrong.

Shathra

comment created time in a day

push event dabl/dabl

Selçuk Gülcan

commit sha dc51e5d445cd774d318eaedbf4255c1ff0a2cf85

add legend parameter to mosaic_plot function, default to True (#209)

* add legend parameter to mosaic_plot function, default to True
* mosaic_plot call in supervised plot changed so that only the first feature plot has a legend

view details

push time in a day

PR merged dabl/dabl

add legend parameter to mosaic_plot function, default to True

Related discussion #207

Set True as the default value. I'm not sure, but it feels like the default should be True since these are util functions.

+13 -5

6 comments

2 changed files

Shathra

pr closed time in a day

pull request comment dabl/dabl

add legend parameter to mosaic_plot function, default to True

Sweet, thanks!

Shathra

comment created time in a day

issue comment scikit-learn/scikit-learn

TfidfVectorizer handles multiple text columns

Of the cases that @rth listed, we have 1 covered with ColumnTransformer (slightly awkward but not too bad), 2 is string concatenation, but 3 is not trivial right now, right? I'm not sure if 3 is that common; I would probably hold off on doing something special for it.

I'm not entirely sure if having a wrapper for 1 is useful, but I think adding something to the user guide at CountVectorizer would be great.
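For reference, case 1 with ColumnTransformer might look like this (the column names "title" and "body" are made up):

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# One vectorizer per text column; a scalar column name (not a list)
# passes the column through as 1d text, which TfidfVectorizer expects.
vectorize = ColumnTransformer([
    ("title", TfidfVectorizer(), "title"),
    ("body", TfidfVectorizer(), "body"),
])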

zachmayer

comment created time in a day

push event amueller/COMS4995-s20

Andreas Mueller

commit sha cec11cb99f6d426234e13ca59bb7eca41f822a53

add fixme on feature names for regression

view details

push time in a day

issue comment scikit-learn/scikit-learn

Deprecate _estimator_type, replace by estimator tag

I think so.

amueller

comment created time in 2 days

issue opened scikit-learn/enhancement_proposals

SLEP 001: why do we need trans_modify?

cc @GaelVaroquaux

Coming back to SLEP 1, I don't see / remember the need for trans_modify, and I'm now not sure why we need it. The motivation the SLEP gives is:

- Creating y in a pipeline makes error measurement harder
- For some usecases, test time needs to modify the number of samples (for instance data loading from a file)

I think that makes it much harder, and I don't think it's as necessary as the training-time version.

Similarly I'm not sure I understand the motivation for partial_fit_modify.

My main motivation in this would be to distinguish training time and test time, and that only requires a new method that basically replaces fit_transform within a pipeline or other meta-estimator.

Not sure I like fit_modify for that. My thoughts right now would be forward or maybe fit_forward (though that sounds too much like feed-forward; how about feed lol). modify sounds like an in-place operation to me. D3M uses produce, which is quite generic but might work (probably fit_produce; produce is their version of both predict and transform).
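To make that concrete, a heavily hypothetical sketch of what such a verb could look like (fit_forward is not an sklearn method; IsolationForest is just a stand-in detector):

from sklearn.ensemble import IsolationForest

class OutlierRejector:
    # "forward" happens only at training time and may change n_samples;
    # predict/transform semantics at test time are untouched.
    def fit_forward(self, X, y):
        self.detector_ = IsolationForest().fit(X)
        mask = self.detector_.predict(X) == 1  # keep inliers only
        return X[mask], y[mask]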

created time in 2 days

pull request comment scikit-learn/enhancement_proposals

SLEP005: Resampler API

I can't believe I'm saying this, but I think I'm coming back to something more like SLEP 1 (#2). I think there are two core parts of this SLEP, SLEP 1, and #12: one is to distinguish training-time transformations from test-time transformations, the other is to return more complex objects. SLEP 1 does both, which I think is why I found it confusing at the time.

SLEP 1 solves a broader range of issues than this SLEP, I don't remember if this was meant as a replacement?

glemaitre

comment created time in 2 days

issue comment scikit-learn/enhancement_proposals

SLEP needed: fit_transform does something other than fit(X).transform(X)

Actually SLEP 001 (#2) basically implements something like this, only SLEP 001 also has a test-time version of it, which I think is a bit weird; or rather, it conflates two issues: distinguishing training vs test time, and returning a complex object / tuple.

amueller

comment created time in 2 days

Pull request review comment scikit-learn/enhancement_proposals

SLEP005: Resampler API

[review diff context omitted: the SLEP005 "Resampler API" proposal text; the comment below responds to its closing line: "That is, consider the signature ``fit_resample(X, y=None, props=None, sample_weight=None)``."]

Btw, my argument is that it should be resampled and returned. But that means the signature of fit_resample is unclear. Also, adding props would be a backward-incompatible change. We could return a named tuple or dict, though that's not very idiomatic sklearn.
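A rough sketch of the "resample and return" option argued for here (the signature and the _select_samples helper are hypothetical):

import numpy as np

class RandomUnderSampler:
    def fit_resample(self, X, y, props=None):
        # Resample X and y, and apply the same row selection to any
        # sample-aligned props so downstream steps stay consistent.
        mask = self._select_samples(X, y)  # hypothetical helper choosing rows
        props_out = {k: np.asarray(v)[mask] for k, v in (props or {}).items()}
        return X[mask], y[mask], props_out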

glemaitre

comment created time in 2 days

issue comment scikit-learn/enhancement_proposals

SLEP needed: fit_transform does something other than fit(X).transform(X)

I just realized that this is the place where I should have put https://github.com/scikit-learn/scikit-learn/issues/15553

Basically what MLR does is have fit return the training-time version of transform, which solves the problem without introducing a new verb/method.

However, method chaining is so idiomatic in sklearn that I don't really see how to get around this. We could also "just" introduce a different verb/method, say forward, that does the training-time transformation and returns the X and y that are used in meta-estimators (mostly Pipeline).

That would also solve #15.
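For contrast, the MLR-style behavior mentioned above would look roughly like this (a sketch; _resample is a hypothetical helper, and this is not mlr's or sklearn's actual API):

class Resampler:
    def fit(self, X, y):
        # mlr-style: fit returns the training-time transformation of the
        # training data directly, instead of returning self for chaining.
        Xt, yt = self._resample(X, y)  # hypothetical helper
        return Xt, yt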

amueller

comment created time in 2 days

Pull request review comment scikit-learn/enhancement_proposals

SLEP012: n_features_out_

[review diff context omitted: the SLEP012 ``n_features_out_`` proposal; the comment below responds to the sentence beginning "In most cases the value of ``n_features_out_`` is the ..."]

Seems nitpicky ;) How about "in many cases"? But any of the three variants seems fine to me. Usually it corresponds to some attribute (the scalers often have scale, for example), and you're saying "e.g." not "i.e.", so it's only an example...

adrinjalali

comment created time in 2 days

pull request comment scikit-learn/scikit-learn

use unequal group sizes in example

I agree with @jnothman, I think some usecases have groups that are only one class. @ogrisel I guess you're thinking of having different experiment conditions that are applied to all subjects? But if the study is observational you don't have that necessarily.

amueller

comment created time in 2 days

issue opened scikit-learn/scikit-learn

Deprecate _estimator_type, replace by estimator tag

I think we should deprecate _estimator_type, which predates estimator tags. It's private, but we should probably still deprecate it, as it can lead to weird errors downstream: if we change it and downstream doesn't update, classifiers would end up using KFold cross-validation, which can be really bad if the target is sorted.
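Concretely, the change might look something like this (_estimator_type and _more_tags exist today; the "estimator_type" tag is an assumption, not current sklearn API):

from sklearn.base import BaseEstimator

class MyClassifier(BaseEstimator):
    _estimator_type = "classifier"  # current private mechanism

    def _more_tags(self):
        # hypothetical tag-based replacement proposed in this issue
        return {"estimator_type": "classifier"}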

created time in 2 days

Pull request review comment scikit-learn/scikit-learn

[MRG] MNT n_features_in_ attribute with _validate_data method

[review diff context omitted: new _check_n_features and _validate_data methods on the estimator base class; per the _validate_data docstring, check_array is called on X alone when y is None, and check_X_y otherwise.]

> It could be weird to start using estimator tags for this use-case. Up to now, we used estimator tags to run some specific common tests.

> Inheritance then?

I think it would be good if there was a programmatic way to figure out whether an estimator requires y (cc @thomasjpfan). I think it's a valid question whether we want to use estimator tags for that. While it would be the first time we use estimator tags in an estimator, I think it's a good use. Right now, we actually use _estimator_type when we call check_cv. While _estimator_type is not an estimator tag, I think it actually should be.

Whether we want to make this decision a prereq for this PR is another question. So maybe just always returning X, y is the simpler solution? I would be OK with either.

I opened #16468 for tracking adding a tag on requiring y for fitting.

NicolasHug

comment created time in 2 days

issue opened scikit-learn/scikit-learn

Allow determining whether a model is supervised programmatically

I think it would be good if we could determine whether a model requires y in fitting in a programmatic way. This came up in scikit-learn/scikit-learn/pull/16112/files#r378281885, but it is also relevant for AutoML systems, or generally for interacting with many estimators programmatically.

This is somewhat related to _estimator_type. I think we should have a new tag is_supervised, or maybe supervised, for this.
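A minimal sketch of the proposed tag (the name is undecided in the issue, and the tag does not exist in sklearn):

from sklearn.base import BaseEstimator

class MyRegressor(BaseEstimator):
    def _more_tags(self):
        # proposed tag; "supervised" is the alternative name floated above
        return {"is_supervised": True}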

created time in 2 days

pull request comment scikit-learn/enhancement_proposals

slep000, slep workflow

Looks good. I like the PEP wording and would maybe include a shortened version of it.

adrinjalali

comment created time in 2 days

Pull request review comment scikit-learn/enhancement_proposals

slep000, slep workflow

[review diff context omitted: the SLEP000 workflow proposal; the comment below responds to "At the earliest convenience, the PR should be merged (regardless of whether it is accepted during discussion)."]

Maybe just add one sentence here like "It is expected the author and reviewers ensure the SLEP is complete and sound before merging."?

adrinjalali

comment created time in 2 days

Pull request review comment scikit-learn/enhancement_proposals

slep000, slep workflow

[review diff context omitted: the SLEP000 workflow proposal; the comment below responds to the line "This is short for“Provisionally Accepted”, and" (note the missing space after "for").]

Is the formatting here correct?

adrinjalali

comment created time in 2 days

Pull request review comment scikit-learn/enhancement_proposals

slep000, slep workflow

[review diff context omitted: the SLEP000 "How a SLEP becomes Accepted" section, describing the mailing-list vote and final comment period.]

I agree with @adrinjalali; I think the mechanisms described here should be enough.

adrinjalali

comment created time in 2 days

issue comment amueller/word_cloud

Is there a way to put a border to the shape?

It might mean you have an old version. What is wordcloud.__version__?
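That is, something like:

import wordcloud
print(wordcloud.__version__)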

jsmedmar

comment created time in 2 days

issue comment scikit-learn/enhancement_proposals

SLEP 007 feature-name generation: adding a constructor argument for verbosity

I'm happy to do a simple solution that we can easily implement. I think my initial thought was not to change the feature names in any of the one-to-one transformers, and that's what I had implemented, so it would be verbose_feature_names=False. If we want to implement that and delay further additions to the future, I'm fine with that. Right now I read the SLEP as adding a constructor argument to a large number of estimators, and I was skeptical about that.
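To illustrate the two behaviors under discussion (the argument and the resulting names are hypothetical, per the SLEP draft, not released API):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler())
# Hypothetical, for input features ['age', 'height']:
# verbose_feature_names=True  -> ['standardscaler__age', 'standardscaler__height']
# verbose_feature_names=False -> ['age', 'height']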

amueller

comment created time in 2 days

issue comment scikit-learn/enhancement_proposals

SLEP 007 feature-name generation: adding a constructor argument for verbosity

Sorry my phone seems to have given me grief and cut off a sentence. What I meant to say was:

It seems orthogonal to the working of the estimator.

In other words, most of the parameters (all but verbose, really) influence what the model does. This one would not. That's why I'm hesitant to add it to the constructor. And yes, my concern is that in my understanding of the SLEP we would have to add it in many, many places.

> not really. My intention was to suggest the possibility of having such a parameter to transformers, wherever we think there's demand for more than one way of feature generation.

From the SLEP it sounded to me like that was an intrinsic aspect of the feature-name API. If I misunderstood that, then maybe it should be clarified. So you're suggesting that we figure out whether to add this parameter on a case-by-case basis? How do we decide where to add it and what the default is?

amueller

comment created time in 2 days

Pull request review comment dabl/dabl

fix EasyProcessor get_feature_names

 def test_easy_preprocessor_transform():
     pipe.fit(X_train, y_train)
     pipe.predict(X_train)
     pipe.predict(X_val)
+
+
+def test_simple_preprocessor_imputed_features():
+    # Issue: 211
+
+    data = pd.DataFrame({'A': [0, 1, 2, 1, np.NaN]}, dtype=int)
+    types = detect_types(data, type_hints={'A': 'categorical'})
+
+    ep = EasyPreprocessor(types=types)
+    ep.fit(data)
+    ep.get_feature_names()

can you maybe add an assert that checks what we expect? That's not a perfect test but it will at least let us know if we change something accidentally.
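
A sketch of what such an assert could look like (the expected name pattern and the dabl import path are assumptions, not checked against dabl's actual output):

import numpy as np
import pandas as pd
from dabl.preprocessing import EasyPreprocessor, detect_types

def test_simple_preprocessor_imputed_features():
    # Issue: 211
    data = pd.DataFrame({'A': [0, 1, 2, 1, np.NaN]}, dtype=float)
    types = detect_types(data, type_hints={'A': 'categorical'})
    ep = EasyPreprocessor(types=types)
    ep.fit(data)
    feature_names = ep.get_feature_names()
    # guard against silent changes: the imputed categorical column should
    # expand into a non-empty set of one-hot feature names derived from 'A'
    assert len(feature_names) > 0
    assert all(name.startswith('A') for name in feature_names)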

Shathra

comment created time in 2 days

pull request commentdabl/dabl

add legend parameter to mosaic_plot function, default to True

Yes exactly, I think that's most useful. What do you think?

Shathra

comment created time in 2 days

push eventamueller/columbia-website

Andreas Mueller

commit sha 6e57d69fc31e530b5ca5b9f0a71ad1ccb7fa1233

add links for 6, 7, add office hours

view details

push time in 2 days

issue commentamueller/word_cloud

Tikz output

@ChristofKaufmann is there an issue with using the SVG? If there's a big benefit of LaTeX over SVG we can include it, but it wasn't obvious to me. SVG has been merged in the meantime.

Chimrod

comment created time in 2 days

push eventamueller/COMS4995-s20

Andreas Mueller

commit sha b841f467bbc89b20a0941e7c389b267f0df279fd

update and fix tree slides a bit.

view details

push time in 2 days

push eventamueller/COMS4995-s20

Andreas Mueller

commit sha 6baafecc827145f2267610ef69892e6c0dfa66a1

tree slides

view details

push time in 2 days

push eventamueller/COMS4995-s20

Andreas Mueller

commit sha d27be39a5cd309f00c32a64647ab2ada9abd7662

minor polish slides 5 and 7

view details

Andreas Mueller

commit sha 6402108034c2e8ffb0fae621fe9a044a507723a3

Merge branch 'master' of github.com:amueller/COMS4995-s20

view details

push time in 2 days

PR opened amueller/word_cloud

Doc build style
+11 -1

0 comment

2 changed files

pr created time in 5 days

push eventamueller/word_cloud

Andreas Mueller

commit sha 23c936a5baf3b94c84eb85a04e1fbacdef3cd35a

fix the css for docs

view details

push time in 5 days

create barnchamueller/word_cloud

branch : doc_build_style

created branch time in 5 days

pull request commentdabl/dabl

add legend parameter to mosaic_plot function, default to True

Sorry for the slow reply, I've been traveling.

Shathra

comment created time in 5 days

pull request commentdabl/dabl

add legend parameter to mosaic_plot function, default to True

Thanks! Right now it looks like it adds a legend to each mosaic plot in plot, so the call in plot probably needs to be adjusted: https://1096-148831482-gh.circle-artifacts.com/0/doc/auto_examples/plot/plot_splice.html

Shathra

comment created time in 5 days

push eventdabl/dabl.github.io

dabl

commit sha 1fed48b462bbd64c69c09d6599b3cc1634b5b7b2

Pushing the docs to dev/ for branch: master, commit c4bfc23ba2be11763a2600c7d2a7a0059cb2251c

view details

push time in 5 days

Pull request review commentdabl/dabl

discard regression target values that cannot be converted to float

 def clean(X, type_hints=None, return_types=False,
        # we should know what these are but maybe running this again is fine?
        types_df = detect_types(X_df)
        types = pd.concat([types[~types.dirty_float], types_df])

        # discard dirty float targets that cant be converted to float
        if target_col is not None and types_p['dirty_float'][target_col] and\
                np.isnan(X["{}_dabl_continuous".format(target_col)]).any():

Is the last "and" clause necessary? Wouldn't it also work without it?
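
For illustration, a minimal standalone sketch (not dabl's actual code) of why the NaN check adds information beyond the dirty-float flag:

import numpy as np
import pandas as pd

target = pd.Series(['1.0', '2.5', 'N/A', '4.0'])  # a "dirty float" column
target_continuous = pd.to_numeric(target, errors='coerce')  # 'N/A' becomes NaN
print(np.isnan(target_continuous).any())  # True, so those rows would be discarded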

praths007

comment created time in 5 days

push eventdabl/dabl

Prathmesh Savale

commit sha c4bfc23ba2be11763a2600c7d2a7a0059cb2251c

Treat nans in reg targets (#215)

* drop nan from regression plot target values
* flake8 edit
* add call to plot in test

view details

push time in 5 days

PR merged dabl/dabl

Treat nans in reg targets

This PR should fix #210.

An alternative to this could be a private helper method that checks the target column and drops NaN values. This method would be private to plot in utils.py and called from supervised.py wherever required.

+36 -2

2 comments

2 changed files

praths007

pr closed time in 5 days

issue closeddabl/dabl

missing values in regression targets crash plot

import numpy as np
from dabl import plot
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)
y[::50] = np.NaN  # inject a missing value into every 50th regression target
plot(X, y)

On some matplotlib versions it crashes in the histogram, otherwise it crashes in the feature importance plot. Maybe discarding them at the very beginning and throwing a warning is fine.
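
Continuing from the snippet above, a minimal sketch of the discard-and-warn approach (variable names assumed, not dabl's implementation):

import warnings

mask = np.isfinite(y)
if not mask.all():
    warnings.warn("Discarding %d rows with non-finite targets." % (~mask).sum())
    X, y = X[mask], y[mask]
plot(X, y)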

closed time in 5 days

amueller

pull request commentdabl/dabl

Treat nans in reg targets

Thanks!

praths007

comment created time in 5 days

issue commentdabl/dabl

drop outliers in regression target for plotting

Sorry for being cryptic, I meant the topo_2_1 dataset, which you can get with fetch_mldata('topo_2_1'). If you plot that with plot right now you'll see the outliers mess it up, and that would be great to fix.

amueller

comment created time in 5 days

issue commentamueller/word_cloud

Is there a way to put a border to the shape?

Well, you can just set contour_width to a number bigger than 0 and that should work.
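
A minimal sketch of that (per the wordcloud docs, white mask pixels are masked out, so words fill the zero-valued square; the text is just filler):

import numpy as np
from wordcloud import WordCloud

mask = np.full((200, 200), 255, dtype=np.uint8)  # white = masked out
mask[50:150, 50:150] = 0  # words are drawn inside this square

wc = WordCloud(mask=mask, background_color='white',
               contour_width=3, contour_color='steelblue')
wc.generate('border contour shape mask wordcloud example ' * 20)
wc.to_file('cloud_with_border.png')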

jsmedmar

comment created time in 5 days

pull request commentscikit-learn/enhancement_proposals

SLEP011 Do not make mandatory TC vote during SLEP vote

So should we merge this? Or 000?

glemaitre

comment created time in 5 days

Pull request review commentscikit-learn/enhancement_proposals

slep000, slep workflow

.. _slep_000:

==============================
SLEP000: SLEP and its wofkflow
==============================

:Author: Adrin Jalali
:Status: Draft
:Type: Process
:Created: 2020-02-13

Abstract
########

This SLEP specifies details related to SLEP submission, review, and acceptance process.

Motivation
##########

Without a predefined workflow, the discussions around a SLEP can be long and frustrating for both the author and the reviewers. The lack of a known workflow also results in the SLEPs to take months (if not years) before it is accepted as ``Draft``. The purpose of this SLEP is to lubricate and ease the process of working on a SLEP, and make it a more enjoyable and productive experience.

What is a SLEP?
###############

SLEP stands for Scikit-Learn Enhancement Proposal. A SLEP is a design document providing information to the scikit-learn community, or describing a new feature for scikit-learn or its processes or environment. The SLEP should provide a concise technical specification of the feature and a rationale for the feature.

We intend SLEPs to be the primary mechanisms for proposing major new features, for collecting community input on an issue, and for documenting the design decisions that have gone into scikit-learn. The SLEP author is responsible for building consensus within the community and documenting dissenting opinions.

Because the SLEPs are maintained as text files in a versioned repository, their revision history is the historical record of the feature proposal.

SLEP Audience
#############

The typical primary audience for SLEPs are the core developers of ``scikit-learn`` and technical committee, as well as contributors to the project. However, these documents also serve the purpose of documenting the changes and decisions to help users understand the changes and why they are made.

The proposal should be submitted as a draft SLEP via a GitHub pull request to a ``slepXXX`` directory with the name ``proposal.rst`` where ``XXX`` is an appropriately assigned three-digit number (e.g., ``slep000/proposal.rst``). The draft must use the SLEP — Template and Instructions file.

Once the PR for the SLEP is in place, a post should be made to the mailing list containing the sections up to “Backward compatibility”, with the purpose of limiting discussion there to usage and impact. Discussion on the pull request

Given that this is also the NEP process, that seems fine, I think?

adrinjalali

comment created time in 5 days

Pull request review commentscikit-learn/enhancement_proposals

slep000, slep workflow

[Same SLEP 000 draft text as in the previous review hunk, continuing:]

Once the PR for the SLEP is in place, a post should be made to the mailing list containing the sections up to “Backward compatibility”, with the purpose of limiting discussion there to usage and impact. Discussion on the pull request will have a broader scope, also including details of implementation.

At the earliest convenience, the PR should be merged (regardless of whether it

So the NEP says " Additional PRs may be made by the Author to update or expand the NEP, or by maintainers to set its status, discussion URL, etc.", meaning that the SLEP author is the only one who is supposed to change it, right?

I think it's fine to merge drafts, and it's up to the author to ensure the draft is mature enough to serve as a basis for discussion.

I'm not entirely sure how the process for NEPs and PEPs works in terms of the back and forth between the author and the community. I feel all the important points should be part of the SLEP so losing the detailed discussion on a PR might be ok?

Maybe we can ping the numpy people to tell us about their experience? If I look at " Discussion on the pull request will have a broader scope, also including details of implementation." it sounds like the main discussion does happen on the pull request.

adrinjalali

comment created time in 5 days

Pull request review commentscikit-learn/enhancement_proposals

slep000, slep workflow

.. _slep_000:

==============================
SLEP000: SLEP and its workflow
==============================

:Author: Adrin Jalali
:Status: Draft
:Type: Process
:Created: 2020-02-13

Abstract
########

This SLEP specifies details related to SLEP submission, review, and acceptance process.

Motivation
##########

Without a predefined workflow, the discussions around a SLEP can be long and frustrating for both the author(s) and the reviewers. The lack of a known workflow also results in the SLEPs to take months (if not years) before it is merged as ``Under Review``. The purpose of this SLEP is to lubricate and ease the process of working on a SLEP, and make it a more enjoyable and productive experience. It also used the process used in PEPs and NEPs which means there

"uses"? I'm not sure I understand the sentence as it is written right now.

adrinjalali

comment created time in 5 days

Pull request review commentscikit-learn/enhancement_proposals

Slep007 - feature names, their generation and the API

.. _slep_007:

===========================================
Feature names, their generation and the API
===========================================

:Author: Adrin Jalali
:Status: Under Review
:Type: Standards Track
:Created: 2019-04

Abstract
########

This SLEP proposes the introduction of the ``feature_names_in_`` attribute for all estimators, and the ``feature_names_out_`` attribute for all transformers. We here discuss the generation of such attributes and their propagation through pipelines. Since for most estimators there are multiple ways to generate feature names, this SLEP does not intend to define how exactly feature names are generated for all of them.

Motivation
##########

``scikit-learn`` has been making it easier to build complex workflows with the ``ColumnTransformer`` and it has been seeing widespread adoption. However, using it results in pipelines where it's not clear what the input features to the final predictor are, even more so than before. For example, after fitting the following pipeline, users should ideally be able to inspect the features going into the final predictor::

    X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

    # We will train our classifier with the following features:
    # Numeric Features:
    # - age: float.
    # - fare: float.
    # Categorical Features:
    # - embarked: categories encoded as strings {'C', 'S', 'Q'}.
    # - sex: categories encoded as strings {'female', 'male'}.
    # - pclass: ordinal integers {1, 2, 3}.

    # We create the preprocessing pipelines for both numeric and categorical data.
    numeric_features = ['age', 'fare']
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())])

    categorical_features = ['embarked', 'sex', 'pclass']
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))])

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)])

    # Append classifier to preprocessing pipeline.
    # Now we have a full prediction pipeline.
    clf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', LogisticRegression())])

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    clf.fit(X_train, y_train)

However, it's impossible to interpret or even sanity-check the ``LogisticRegression`` instance that's produced in the example, because the correspondence of the coefficients to the input features is basically impossible to figure out.

This proposal suggests adding two attributes to fitted estimators: ``feature_names_in_`` and ``feature_names_out_``, such that in the abovementioned example ``clf[-1].feature_names_in_`` and ``clf[-2].feature_names_out_`` will be::

    ['num__age',
     'num__fare',
     'cat__embarked_C',
     'cat__embarked_Q',
     'cat__embarked_S',
     'cat__embarked_missing',
     'cat__sex_female',
     'cat__sex_male',
     'cat__pclass_1',
     'cat__pclass_2',
     'cat__pclass_3']

Ideally the generated feature names describe how a feature is generated at each stage of a pipeline. For instance, ``cat__sex_female`` shows that the feature has been through a categorical preprocessing pipeline, was originally the column ``sex``, and has been one hot encoded and is one if it was originally ``female``. However, this is not always possible or desirable especially when a generated column is based on many columns, since the generated feature names will be too long, for example in ``PCA``. As a rule of thumb, the following types of transformers may generate feature names which corresponds to the original features:

- Leave columns unchanged, *e.g.* ``StandardScaler``
- Select a subset of columns, *e.g.* ``SelectKBest``
- create new columns where each column depends on at most one input column, *e.g* ``OneHotEncoder``
- Algorithms that create combinations of a fixed number of features, *e.g.* ``PolynomialFeatures``, as opposed to all of them where there are many. Note that verbosity considerations and ``verbose_feature_names`` as explained later can apply here.

This proposal talks about how feature names are generated and not how they are propagated.

verbose_feature_names
*********************

``verbose_feature_names`` controls the verbosity of the generated feature names and it can be ``True`` or ``False``. Alternative solutions could imagine:

I'm pretty sure the previous version was not grammatical, I thought it was a typo.

adrinjalali

comment created time in 5 days

issue commentscikit-learn/enhancement_proposals

SLEP 007 feature-name generation: adding a constructor argument for verbosity

cc @glemaitre @NicolasHug I guess?

amueller

comment created time in 5 days

issue openedscikit-learn/enhancement_proposals

SLEP 007 feature-name generation: adding a constructor argument for verbosity

Basically right now SLEP 007 suggests adding a constructor parameter to all transformers that are not feature selectors, right?

I see the motivation and I think it's actually hard to come up with a better solution (I don't have one right now), but I'm also concerned with adding feature name specific things to the constructor. It seems orthogonal to the workh

I think the alternative that @adrinjalali suggested was having callbacks for formatting feature names (I don't remember the details tbh), but that was pretty complex.

Maybe we could have a method that people could override or something like that? I'm not sure. To me this is the one contentious part of the SLEP.

created time in 5 days

issue closedscikit-learn/enhancement_proposals

SLEP007: Meta-estimators section expansion

@amueller has the following concern on the meta-estimators section of the SLEP:

Shouldn't we list all meta-estimators that are transformers? What about FeatureUnion and RFECV? I guess maybe we're talking about meta-estimators that are not feature selectors, because those are easy.

I'm personally not sure. I kind of prefer to have a guideline that can be applied to the meta-estimators, rather than listing them in the SLEP.

closed time in 5 days

adrinjalali

issue commentscikit-learn/enhancement_proposals

SLEP007: Meta-estimators section expansion

I'm ok with the current phrasing of the SLEP in that regard.

adrinjalali

comment created time in 5 days

issue commentnumpy/numpy

Feature request: change hist bins default to 'auto'

My impression, and my argument, was that people don't realize that they are missing something, so I think it's similar to yours. I think the counter-argument was that no default always works and people should experiment. I don't think I'm a good person to summarize that conversation though ;)
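
For context, the opt-in spelling of what the request would make the default:

import numpy as np

data = np.random.randn(1000)
counts, edges = np.histogram(data, bins='auto')  # today bins defaults to 10
print(len(edges) - 1)  # number of automatically selected bins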

amueller

comment created time in 5 days
