Andreas Mueller (amueller) — Microsoft, NYC — http://amueller.github.io — Scikit-learn core developer, Principal Research SDE @ Microsoft

amueller/COMS4995-s19 288

COMS W4995 Applied Machine Learning - Spring 19

amueller/COMS4995-s18 155

COMS W4995 Applied Machine Learning - Spring 18

amueller/COMS4995-s20 151

COMS W4995 Applied Machine Learning - Spring 20

amueller/advanced_training 109

Advanced Scikit-learn training session

amueller/applied_ml_spring_2017 84

Website and material for the FIXME course on Practical Machine Learning

amueller/aml 55

Applied Machine Learning with Python

amueller/dabl 48

Data Analysis Baseline Library

amueller/advanced_git_nyu_2016 41

Advanced git and github course material

amueller/advanced-sklearn-boston-nlp-2016 18

Material and slides for Boston NLP meetup May 23rd 2016

issue closed dabl/dabl

Error 'TypeError: 'function' object is not iterable' in function 'plot'

Hello,

For almost all functions in the 'dabl' package (such as 'clean' and 'plot', for example) I keep getting the following error:

TypeError: 'function' object is not iterable

This does not seem to be a problem with my specific data, because the same thing happens when I run the following example code (from the website):

import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from dabl import plot
from dabl.utils import data_df_from_bunch

wine_bunch = load_wine()
wine_df = data_df_from_bunch(wine_bunch)

plot(wine_df, 'target')

I then get the following output on my machine:


  File "<ipython-input-61-92b3b7736897>", line 1, in <module>
    plot(wine_df, 'target')

  File "C:\Daten\Anaconda3\lib\site-packages\dabl\plot\supervised.py", line 485, in plot
    X, types = clean(X, type_hints=type_hints, return_types=True)

  File "C:\Daten\Anaconda3\lib\site-packages\dabl\preprocessing.py", line 380, in clean
    lambda x: str(x))

  File "C:\Daten\Anaconda3\lib\site-packages\pandas\core\accessor.py", line 115, in f
    return self._delegate_method(name, *args, **kwargs)

  File "C:\Daten\Anaconda3\lib\site-packages\pandas\core\categorical.py", line 2204, in _delegate_method
    res = method(*args, **kwargs)

  File "C:\Daten\Anaconda3\lib\site-packages\pandas\core\categorical.py", line 940, in rename_categories
    cat.categories = new_categories

  File "C:\Daten\Anaconda3\lib\site-packages\pandas\core\categorical.py", line 408, in categories
    new_dtype = CategoricalDtype(categories, ordered=self.ordered)

  File "C:\Daten\Anaconda3\lib\site-packages\pandas\core\dtypes\dtypes.py", line 154, in __init__
    self._finalize(categories, ordered, fastpath=False)

  File "C:\Daten\Anaconda3\lib\site-packages\pandas\core\dtypes\dtypes.py", line 181, in _finalize
    fastpath=fastpath)

  File "C:\Daten\Anaconda3\lib\site-packages\pandas\core\dtypes\dtypes.py", line 319, in _validate_categories
    categories = Index(categories, tupleize_cols=False)

  File "C:\Daten\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 374, in __new__
    subarr = _asarray_tuplesafe(data, dtype=object)

  File "C:\Daten\Anaconda3\lib\site-packages\pandas\core\common.py", line 379, in _asarray_tuplesafe
    values = list(values)

TypeError: 'function' object is not iterable

However, when I installed dabl, the installation did not complain (it said: 'Successfully built dabl'). Is this a bug? Is there maybe an obvious workaround so that I can still use the package?

Thank you for your time

Hannes

closed time in 19 hours

chanes76
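An aside on the traceback above: it shows dabl passing a callable (`lambda x: str(x)`) to `rename_categories`, a pandas feature that only exists in newer pandas releases, so an outdated pandas install is one plausible cause of the error. A minimal sketch of the callable form, assuming a reasonably recent pandas:

```python
import pandas as pd

# The traceback shows dabl calling Series.cat.rename_categories with a
# callable; older pandas versions treat the callable as a list of new
# categories and try to iterate over it, producing the TypeError above.
s = pd.Series([1, 2, 1], dtype="category")
renamed = s.cat.rename_categories(str)  # maps each category through str()
print(list(renamed.cat.categories))  # ['1', '2']
```

If this snippet fails with the same TypeError, upgrading pandas would be the first thing to try.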

issue comment dabl/dabl

Error 'TypeError: 'function' object is not iterable' in function 'plot'

closing as no reply.

chanes76

comment created time in 19 hours

push event dabl/dabl.github.io

dabl

commit sha 62bb7d3bc9abbf184a8edc492b0d6ca561d5e18a

Pushing the docs to dev/ for branch: master, commit 9a780fd9c49a8538f7e50dd6a548344bde644bbb

view details

push time in 19 hours

issue closed dabl/dabl

drop outliers in regression target for plotting

A nice example of the plotting having issues with outliers is topo_2_1.

closed time in 19 hours

amueller

issue comment dabl/dabl

drop outliers in regression target for plotting

Fixed in #233, thanks @SvoONs !

amueller

comment created time in 19 hours

push event dabl/dabl

Andreas Mueller

commit sha 9a780fd9c49a8538f7e50dd6a548344bde644bbb

bump master to 0.2.0-dev (#264)

* prepare for 0.1.10 release
* fix regex for deploy of docs (#260)
* fix regex for deploy of docs
* add sphinx_issues to doc build
* Fix doc push path (#262)
* fix regex for deploy of docs
* add sphinx_issues to doc build
* push to correct path
* try to fix doc folder name
* actually next version is 0.1.9 lol
* bump master to 0.2.0-dev

view details

push time in 19 hours

PR merged dabl/dabl

bump master to 0.2.0-dev
+26 -11

0 comment

8 changed files

amueller

pr closed time in 19 hours

PR opened dabl/dabl

OneHotEncoder drop if_binary

Closes #259.

+28 -13

0 comment

9 changed files

pr created time in 21 hours

create branch amueller/dabl

branch : ohe_drop_binary

created branch time in 21 hours

PR opened dabl/dabl

bump master to 0.2.0-dev
+26 -11

0 comment

8 changed files

pr created time in 21 hours

create branch amueller/dabl

branch : bump_02

created branch time in 21 hours

push event dabl/dabl.github.io

Andreas C Mueller

commit sha be5e178344e18e61a19575f8f17abce6b0d5dff0

fix version to 0.1.9

view details

push time in 21 hours

push event dabl/dabl.github.io

dabl

commit sha ce7b62644bce49972f5ca0937dd4e771cf7b10b1

Pushing the docs to 0.1.9/ for branch: 0.1.9, commit 95c1c1cb5d16d4b083c548fa2d3aa0328e873493

view details

push time in 21 hours

delete branch dabl/dabl

delete branch : 0.1.10

delete time in 21 hours

create branch dabl/dabl

branch : 0.1.9

created branch time in 21 hours

issue closed dabl/dabl

Render stable documentation for release

We should add a stable documentation rendering that uses the latest release and make that the default website.

closed time in 21 hours

amueller

issue comment dabl/dabl

Render stable documentation for release

done in #263 and related.

amueller

comment created time in 21 hours

push event dabl/dabl.github.io

Andreas C Mueller

commit sha 69827269ee4320bf64404abac5939d4b78f65c3d

delete badly named folder, make default url reroute to sable

view details

push time in 21 hours

push event dabl/dabl.github.io

dabl

commit sha bdc259452df557d8d0b5452b185f8dc1b7bd1c25

Pushing the docs to 0.1.10/ for branch: 0.1.10, commit e6b97babf6047320163b55412d6186959e5509a6

view details

push time in 21 hours

push event dabl/dabl

Andreas Mueller

commit sha e6b97babf6047320163b55412d6186959e5509a6

try to fix doc folder name (#263)

view details

push time in 21 hours

PR merged dabl/dabl

try to fix doc folder name
+8 -0

0 comment

1 changed file

amueller

pr closed time in 21 hours

PR opened dabl/dabl

try to fix doc folder name
+8 -0

0 comment

1 changed file

pr created time in 21 hours

create branch amueller/dabl

branch : circle_folder_name

created branch time in 21 hours

push event amueller/dabl

Andreas C Mueller

commit sha 0f1b415755bca0a0956c9f1441160eea52a80a10

gah

view details

push time in 21 hours

push event dabl/dabl

Andreas Mueller

commit sha 3e5ba0e51022576c91817a39d2d84e5340dd3e13

Fix doc push path (#262)

* fix regex for deploy of docs
* add sphinx_issues to doc build
* push to correct path

view details

push time in a day

PR merged dabl/dabl

Fix doc push path
+7 -13

0 comment

5 changed files

amueller

pr closed time in a day

PR opened dabl/dabl

Fix doc push path
+7 -13

0 comment

5 changed files

pr created time in a day

create branch amueller/dabl

branch : fix_doc_push_path

created branch time in a day

push event dabl/dabl.github.io

dabl

commit sha 26cf9b8bb25797abbb31a32bf63190afab6ba465

Pushing the docs to 0.1./ for branch: 0.1.10, commit 556f72a5922ce16193d4ea673bef6a011313d9c8

view details

push time in a day

push event dabl/dabl

Andreas Mueller

commit sha 556f72a5922ce16193d4ea673bef6a011313d9c8

fix regex for deploy of docs (#260)

* fix regex for deploy of docs
* add sphinx_issues to doc build

view details

push time in a day

PR merged dabl/dabl

fix regex for deploy of docs
+7 -5

0 comment

4 changed files

amueller

pr closed time in a day

push event dabl/dabl.github.io

dabl

commit sha a9e7759be0eecf38e07166acf7c68bd2056b3983

Pushing the docs to dev/ for branch: master, commit a5cd0e9dab28cfb0e30c973e7723920f22979f23

view details

push time in a day

push event dabl/dabl

SvoONs

commit sha a5cd0e9dab28cfb0e30c973e7723920f22979f23

Drop outliers in target column in plot_regression_continuous (#233)

view details

push time in a day

PR merged dabl/dabl

Drop outliers in target column in plot_regression_continuous

This pull request adds a step to drop outliers in the target column for regression plots, in order to improve the usability of such plots when the target column contains outliers. This is done by simply calling _find_inliers on the target column, as is already done by default for the features.

The plot for the dataset topo_2_1 mentioned in #214 looks as follows with the mentioned fix: example_plot_fix

+27 -0

7 comments

2 changed files

SvoONs

pr closed time in a day
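The general idea of the outlier-dropping step described above can be sketched with a simple boxplot-style IQR rule; this is an illustration only, not dabl's actual `_find_inliers` implementation:

```python
import numpy as np

def inlier_mask(y, whis=1.5):
    # Boxplot-style rule: keep points inside [q1 - whis*iqr, q3 + whis*iqr].
    # This is only an illustration of the idea; dabl's _find_inliers may
    # use a different criterion.
    q1, q3 = np.percentile(y, [25, 75])
    iqr = q3 - q1
    return (y >= q1 - whis * iqr) & (y <= q3 + whis * iqr)

y = np.array([1.0, 2.0, 2.5, 3.0, 100.0])
print(y[inlier_mask(y)])  # the extreme target value 100.0 is dropped
```

Dropping such points only for plotting keeps the bulk of the regression target visible without distorting the model fit.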

pull request comment dabl/dabl

Drop outliers in target column in plot_regression_continuous

Thanks, this looks good. Sorry for the slow reply!

SvoONs

comment created time in a day

issue opened dabl/dabl

really long axes names are cut off, even if there's space

If you have a really long xlabel, it will be cut off at a fixed point, regardless of how large the axis actually is.

created time in a day

push event amueller/dabl

Andreas C Mueller

commit sha 50f1e9b6e75f563e8b581a67b6c1762c18364507

add sphinx_issues to doc build

view details

push time in a day

PR opened dabl/dabl

fix regex for deploy of docs
+1 -1

0 comment

1 changed file

pr created time in a day

push event amueller/dabl

Andreas C Mueller

commit sha cc35a6372f2850b21268531d4be2c718bb2e2190

fix regex for deploy of docs

view details

push time in a day

pull request comment microsoft/MLOS

Make Mlos.Server.Agent use CommandLineParser for configurable endpoints

Is there a reasonably simple way to test this?

bpkroth

comment created time in a day

delete branch dabl/dabl

delete branch : halving_fix

delete time in a day

create branch dabl/dabl

branch : 0.1.10

created branch time in a day

create branch amueller/dabl

branch : 0.1.10

created branch time in a day

push event dabl/dabl.github.io

dabl

commit sha 59c02673b86b3fdbefcafada7de6bccee3f6f5b0

Pushing the docs to dev/ for branch: master, commit bf48fd00199d0fe437ce23e14284c058bf78d865

view details

push time in a day

issue comment dabl/dabl

AnyClassifier preprocessing and OpenML automlbenchmark integration

Can you check if this is fixed in master now? Thanks!

hp2500

comment created time in a day

push event dabl/dabl

Andreas Mueller

commit sha bf48fd00199d0fe437ce23e14284c058bf78d865

MRG start using detect types (#256)

* start using detect types
* restore previous behavior on mixed integers
* add test for detecting / preprocessing mix of bool and nan

view details

push time in a day

PR merged dabl/dabl

MRG start using detect types

Hopefully fixes #250.

+28 -8

0 comment

2 changed files

amueller

pr closed time in a day

issue closed dabl/dabl

AnyClassifier preprocessing and OpenML automlbenchmark integration

I have been trying to integrate AnyClassifier as an automl framework to https://github.com/openml/automlbenchmark in order to run experiments in a controlled environment. I feel like I have an OK grasp of the automlbenchmark repo now and have successfully integrated standalone sklearn algorithms as frameworks and ran the test benchmark. While I did find a way to make dabl work as well, automlbenchmark doesn't seem to play well with the dabl preprocessing functions. Here is an example traceback:

Traceback (most recent call last):
  File "/Users/hp2500/Google Drive/STUDY/Columbia/Research/dabl_develop/automlbenchmark/automlbenchmark/amlb/benchmark.py", line 420, in run
    meta_result = framework.run(self._dataset, task_config)
  File "/Users/hp2500/Google Drive/STUDY/Columbia/Research/dabl_develop/automlbenchmark/automlbenchmark/frameworks/dabl/__init__.py", line 4, in run
    return run(*args, **kwargs)
  File "/Users/hp2500/Google Drive/STUDY/Columbia/Research/dabl_develop/automlbenchmark/automlbenchmark/frameworks/dabl/exec.py", line 30, in run
    predictor.fit(X_train, y_train)
  File "/Users/hp2500/Google Drive/STUDY/Columbia/Research/dabl_develop/dabl/dabl/models.py", line 438, in fit
    types = detect_types(X, type_hints=self.type_hints)
  File "/Users/hp2500/Google Drive/STUDY/Columbia/Research/dabl_develop/dabl/dabl/preprocessing.py", line 321, in detect_types
    clean_float_string, dirty_float = _find_string_floats(
  File "/Users/hp2500/Google Drive/STUDY/Columbia/Research/dabl_develop/dabl/dabl/preprocessing.py", line 158, in _find_string_floats
    is_float = X.apply(_float_matching)
  File "/miniconda3/envs/dabl_test/lib/python3.8/site-packages/pandas/core/frame.py", line 6878, in apply
    return op.get_result()
  File "/miniconda3/envs/dabl_test/lib/python3.8/site-packages/pandas/core/apply.py", line 186, in get_result
    return self.apply_standard()
  File "/miniconda3/envs/dabl_test/lib/python3.8/site-packages/pandas/core/apply.py", line 295, in apply_standard
    result = libreduction.compute_reduction(
  File "pandas/_libs/reduction.pyx", line 620, in pandas._libs.reduction.compute_reduction
  File "pandas/_libs/reduction.pyx", line 128, in pandas._libs.reduction.Reducer.get_result
  File "/Users/hp2500/Google Drive/STUDY/Columbia/Research/dabl_develop/dabl/dabl/preprocessing.py", line 20, in _float_matching
    is_floaty = X_col.str.match(_FLOAT_REGEX)
  File "/miniconda3/envs/dabl_test/lib/python3.8/site-packages/pandas/core/generic.py", line 5270, in __getattr__
    return object.__getattribute__(self, name)
  File "/miniconda3/envs/dabl_test/lib/python3.8/site-packages/pandas/core/accessor.py", line 187, in __get__
    accessor_obj = self._accessor(obj)
  File "/miniconda3/envs/dabl_test/lib/python3.8/site-packages/pandas/core/strings.py", line 2041, in __init__
    self._inferred_dtype = self._validate(data)
  File "/miniconda3/envs/dabl_test/lib/python3.8/site-packages/pandas/core/strings.py", line 2098, in _validate
    raise AttributeError("Can only use .str accessor with string values!")
AttributeError: Can only use .str accessor with string values!

My workaround consists of formatting all variables as strings before passing them to AnyClassifier() in the automlbenchmark exec.py file. However, I was surprised by the errors, since I have tested AnyClassifier() on a few datasets outside of automlbenchmark and never had similar issues. I just wanted to flag this in case you think it needs attention. I also thought about the option of adding an argument to AnyClassifier() so that users can indicate whether they want to use it with or without EasyPreprocessor(), as some users might only be interested in the portfolios and not the preprocessing part.

Best, Heinrich

closed time in a day

hp2500
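The AttributeError at the bottom of the traceback above is easy to reproduce in isolation: pandas' `.str` accessor only works on string-typed data, which is presumably what dabl's `_float_matching` hit on this dataset. A minimal sketch:

```python
import pandas as pd

s = pd.Series([1, 2, 3])  # numeric dtype, not strings
try:
    s.str.match(r"^\d+$")
    failed = False
except AttributeError as e:
    failed = True
    print(e)  # Can only use .str accessor with string values!
```

This is consistent with the reporter's workaround of casting everything to strings before handing the data to AnyClassifier.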

push event amueller/dabl

Andreas C Mueller

commit sha c4290309837ebd8163abf78349d79b99cb395331

more lint

view details

push time in a day

push event amueller/dabl

Andreas C Mueller

commit sha 0d9740818a5e32478315d109b10737511b7fa1f1

lint

view details

push time in a day

push event amueller/dabl

Andreas C Mueller

commit sha 759b9dbfcce7aace97d335764124dcc69d71cd16

intermediate broken stuff

view details

Andreas C Mueller

commit sha 4c4f915d200c612b0971f71bc1907570ffb58f58

undo weird stuff

view details

Andreas C Mueller

commit sha 27e5135adfb133ffc9a2d13aad1a617aedd0723d

add test for detecting / preprocessing mix of bool and nan

view details

push time in a day

issue opened scikit-learn/scikit-learn

Error on using None for missing values in SimpleImputer with boolean arrays

from sklearn.impute import SimpleImputer
import pandas as pd
X2 = pd.DataFrame({'a': [True, False, True, False, None]})
SimpleImputer(strategy='most_frequent').fit_transform(X2)

TypeError: '<' not supported between instances of 'NoneType' and 'bool'

Interestingly this works when the remaining types are float, or when using np.NaN for the missing value.

Potentially related to #17625, where I suggested treating None as np.NaN.

FYI: X2.a.unique() gives the expected result, while np.unique(X2.a) errors. I'm not sure if using the __array_function__ protocol might help us here; we might be casting to numpy arrays before hitting unique.

created time in a day
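One possible workaround (an illustration of the idea, not a fix for the underlying scikit-learn issue) is to cast the object column to float first, so that `None` becomes `np.nan` before it reaches `SimpleImputer`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

X2 = pd.DataFrame({'a': [True, False, True, False, None]})
# astype(float) maps True -> 1.0, False -> 0.0, None -> NaN,
# giving SimpleImputer a homogeneous float column to work with.
out = SimpleImputer(strategy='most_frequent').fit_transform(X2.astype(float))
print(out.ravel())  # the NaN entry is filled with the most frequent value
```

This sidesteps the `'<' not supported between instances of 'NoneType' and 'bool'` comparison entirely, matching the observation in the issue that float columns (or np.NaN) work fine.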

issue opened dabl/dabl

Use drop=binary in OneHotEncoder

created time in a day

issue comment dabl/dabl

AnyClassifier preprocessing and OpenML automlbenchmark integration

Do you know which dataset caused this error? I'm trying to reproduce.

hp2500

comment created time in a day

issue comment dabl/dabl

Bug with dabl.explain()

Thanks for reporting! Indeed the explain function is not very robust yet. Scikit-learn makes mapping input to output columns a bit hard, which will hopefully be improved by https://github.com/scikit-learn/scikit-learn/pull/16772

I'll see what I can do in the meantime; dabl also needs some updates for the current version of sklearn, which I'll probably try to make work first.

Luerken

comment created time in a day


pull request comment microsoft/MLOS

added feature imputation to RERF predict()

I'm not entirely sure I understand the issue, but would it be possible to reuse a sklearn ColumnTransformer to preserve the mapping? Or maybe alternatively write an Adaptor that does something similar using the dimension information?

edcthayer

comment created time in a day

push event amueller/word_cloud

jamesgagg

commit sha 5f23ed47727eb97c45883fd34f98cd62a1c8a55b

regex change to address issue #495 (#585)

* Update wordcloud.py
* Update wordcloud.py changed to reflect the feedback from amueller in https://github.com/amueller/word_cloud/pull/585
* Update test_wordcloud.py default patterns test added re single character word cloud support

view details

push time in a day

PR merged amueller/word_cloud

regex change to address issue #495

Changed r"\w[\w']+" to r"\w[\w']*" so that single-character words are matched too, as per issue #495.
Note: the user will need to be mindful that single-letter words such as 'a' and 'i' will not appear if the default stopword list is used.

+16 -1

7 comments

2 changed files

jamesgagg

pr closed time in a day
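The effect of the one-character regex change above (`+` to `*`) can be demonstrated directly with Python's `re` module:

```python
import re

text = "I am a WordCloud user"
old = re.findall(r"\w[\w']+", text)  # requires at least two characters
new = re.findall(r"\w[\w']*", text)  # also matches single-character words
print(old)  # ['am', 'WordCloud', 'user']
print(new)  # ['I', 'am', 'a', 'WordCloud', 'user']
```

With the new pattern, whether single-letter words actually appear in the cloud then depends on the stopword list, as the PR description notes.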

pull request comment amueller/word_cloud

regex change to address issue #495

great, thank you!

jamesgagg

comment created time in a day

Pull request review comment microsoft/MLOS

Adding Config Stores

 #
 from mlos.Examples.SmartCache.CacheImplementations.XruCache import XruCache
 from mlos.Spaces import DiscreteDimension, Point, SimpleHypergrid
-from mlos.Spaces.Configs.DefaultConfigMeta import DefaultConfigMeta
+from mlos.Spaces.Configs.ComponentConfigStore import ComponentConfigStore
 
-class LruCacheConfig(metaclass=DefaultConfigMeta):
-    CONFIG_SPACE = SimpleHypergrid(
+LruCacheConfigStore = ComponentConfigStore(

the Python way would be underscores, I think: https://www.python.org/dev/peps/pep-0008/#global-variable-names But if consistency with C# style is a concern I think CamelCase wouldn't be too disruptive? Usually that's reserved for classes.

byte-sculptor

comment created time in 2 days


PR opened microsoft/MLOS

DRAFT DEBUGGING Cpp notebook
+230 -6

0 comment

7 changed files

pr created time in 2 days

push event amueller/MLOS

Andreas C Mueller

commit sha ec5caaa631268aa153576277480572506d09c81c

debuggggging

view details

push time in 2 days


pull request comment microsoft/MLOS

add support for zsh

😍

bpkroth

comment created time in 2 days

pull request comment microsoft/MLOS

added feature imputation to RERF predict()

If it still tests the thing it was originally designed to test then sure. Otherwise maybe add a separate test with N=1?

edcthayer

comment created time in 2 days

Pull request review comment microsoft/MLOS

Document SmartCache C++ example code

+# Troubleshooting Tips
+
+Here are some common tips for troubleshooting various issues.
+
+## Contents
+
+- [Troubleshooting Tips](#troubleshooting-tips)
+  - [Contents](#contents)
+  - [Editor Integrations](#editor-integrations)
+    - [VSCode in WSL](#vscode-in-wsl)
+      - ["Missing .Net SDK" message when executing `code .` in WSL](#missing-net-sdk-message-when-executing-code--in-wsl)
+
+## Editor Integrations
+
+### VSCode in WSL
+
+#### "Missing .Net SDK" message when executing `code .` in WSL
+
+The [Omnisharp](https://github.com/OmniSharp/omnisharp-vscode/wiki/Troubleshooting:-'The-.NET-Core-SDK-cannot-be-located.'-errors) plugin for VSCode may have trouble finding the `dotnet` setup locally for the MLOS repo in `tools/`, even if you source the `scripts/init.linux.sh` script to setup your `PATH` environment.
+
+To workaround this issue, you can [install `dotnet` system wide for your WSL2 distro](https://docs.microsoft.com/en-us/dotnet/core/install/linux).
+
+Here are the instructions for Ubuntu 20.04:\
+<https://docs.microsoft.com/en-us/dotnet/core/install/linux-ubuntu#2004->

Ok actually works now :)

bpkroth

comment created time in 2 days

PullRequestReviewEvent
PullRequestReviewEvent

Pull request review comment microsoft/MLOS

Document SmartCache C++ example code

+# Troubleshooting Tips
+
+Here are some common tips for troubleshooting various issues.
+
+## Contents
+
+- [Troubleshooting Tips](#troubleshooting-tips)
+  - [Contents](#contents)
+  - [Editor Integrations](#editor-integrations)
+    - [VSCode in WSL](#vscode-in-wsl)
+      - ["Missing .Net SDK" message when executing `code .` in WSL](#missing-net-sdk-message-when-executing-code--in-wsl)
+
+## Editor Integrations
+
+### VSCode in WSL
+
+#### "Missing .Net SDK" message when executing `code .` in WSL
+
+The [Omnisharp](https://github.com/OmniSharp/omnisharp-vscode/wiki/Troubleshooting:-'The-.NET-Core-SDK-cannot-be-located.'-errors) plugin for VSCode may have trouble finding the `dotnet` setup locally for the MLOS repo in `tools/`, even if you source the `scripts/init.linux.sh` script to setup your `PATH` environment.
+
+To workaround this issue, you can [install `dotnet` system wide for your WSL2 distro](https://docs.microsoft.com/en-us/dotnet/core/install/linux).
+
+Here are the instructions for Ubuntu 20.04:\
+<https://docs.microsoft.com/en-us/dotnet/core/install/linux-ubuntu#2004->

Could still not make this work after successfully running these instructions.

bpkroth

comment created time in 2 days

create branch amueller/MLOS

branch : cpp_notebook

created branch time in 2 days

Pull request review comment microsoft/MLOS

Document SmartCache C++ example code

 main(
             CyclicalWorkload(2048, smartCache);
         }
 
+        // After having run a workload for a while, we want to check for a new
+        // config suggestion from an optimizer.
+        // In this case we make it a blocking call.
+
+        // First, create some condition variables to help signal when the new
+        // config is ready to be consumed.
+        //
         bool isConfigReady = false;
         std::mutex waitForConfigMutex;
         std::condition_variable waitForConfigCondVar;
 
-        // Setup a callback.
+        // Also, setup a callback lambda function for handling the
+        // SharedConfigUpdatedFeedbackeMessage we expect to receive from the
+        // agent after we request a config update with a RequestNewConfigurationMessage.
+        //
+        // Note: this lambda will be invoked by the background task setup above
+        // for processing messages on the feedback channel.
         //
         ObjectDeserializationCallback::Mlos::Core::SharedConfigUpdatedFeedbackMessage_Callback =
             [&waitForConfigMutex, &waitForConfigCondVar,
              &isConfigReady](Proxy::Mlos::Core::SharedConfigUpdatedFeedbackMessage&& msg)
             {
-                // Ignore the message.
+                // The contents of the message are irrelevant in this case.
+                // It's just a signal that the RequestNewConfigurationMessage
+                // has been processed by the agent.
                 UNUSED(msg);
 
+                // So, we will just notify the waiting loop (below) that the
+                // message has been processed now and is ready to be read.
+                //
                 std::unique_lock<std::mutex> lock(waitForConfigMutex);
                 isConfigReady = true;
                 waitForConfigCondVar.notify_all();
             };
 
-        // Send a request to obtain a new configuration.
+        // Send a request to obtain a new configuration from the optimizer.
+        //
+        // Note: the message (as defined in
+        // SmartCache.SettingsRegistry/CodeGen/SmartCache.cs) has no members, so
+        // there's no details to fill in here (and it is just zero-initialized).
+        // It's simply a signal to send to the external agent to request a new
+        // config be populated in the shared memory region.
+        //
         SmartCache::RequestNewConfigurationMessage msg = { 0 };
         mlosContext.SendTelemetryMessage(msg);
 
+        // Now, we wait for the external agent to respond to our request.

If no optimizer is registered then... ? it'll send a message without touching the shared memory?

bpkroth

comment created time in 2 days

Pull request review comment microsoft/MLOS

Document SmartCache C++ example code

 public static void Main(string[] args)
                 //
                 AppContext.SetSwitch("System.Net.Http.SocketsHttpHandler.Http2UnencryptedSupport", true);
 
+                // This populates a variable for the various settings registry
+                // callback handlers to use (by means of their individual
+                // AssemblyInitializers) to know how they can connect with the
+                // optimizer.
+                //
+                // See Also: AssemblyInitializer.cs within the SettingsRegistry
+                // assembly project in question.
+                //
                 Uri optimizerAddressUri = new Uri("http://localhost:50051");
                 MlosContext.OptimizerFactory = new MlosOptimizer.BayesianOptimizerFactory(optimizerAddressUri);
             }
 
-            // Create circular buffer shared memory before running the target process.
+            // Create (or open) the circular buffer shared memory before running the target process.
             //
             MainAgent.InitializeSharedChannel();
 
-            // Active learning, almost done. In active learning the MlosAgentServer controls the workload against the target component.
+            // Active learning mode.
+            // In active learning the MlosAgentServer controls the workload against the target component.

I don't see how/where the workload is controlled. It just launches an executable, right?

bpkroth

comment created time in 2 days

Pull request review comment microsoft/MLOS

Document SmartCache C++ example code

 // for license information.
 // </copyright>
 // -----------------------------------------------------------------------
+//
+// This file contains the data structures that represent both the component
+// settings as well as the component specific messages it can send to and
+// receive from the agent.
+//
+// Note: they are annotated with C# attributes that allow the MLOS code
+// generation process to walk the structures using reflection.
+//
+// The .csproj file for this SettingsRegistry assembly specially declares this
+// file as relevant (SettingsRegistryDef) to for Mlos.SettingsSystem.CodeGen to
+// invoke that process.
+//
+// Then according to those rules the code generation output by default is
+// sent to out/Mlos.CodeGen.out/SmartCache/SettingsProvider_gen_*.*
+//
+// Different settings registries are separated to different folders (and
+// optionally namespaces - see SmartCache/CodeGen/AssemblyInfo.cs) to allow
+// multiple settings registries to be included in the same project.
+//
+// Mlos.SettingsSystem.CodeGen outputs for multiple languages (e.g. C++, C#,
+// etc.)
+//
+// For C++ the codegen are expected to be used with #include statements.
// For C++ the codegen is expected to be used with #include statements.

?

bpkroth

comment created time in 2 days

Pull request review comment microsoft/MLOS

Document SmartCache C++ example code

 // @File: Main.cpp
 //
 // Purpose:
-//      <description>
+//  The main entrypoint to the SmartCache example.
+//
+//  It's meant as an end-to-end microbenchmark example for the C++ version of the
+//  SmartCache Python Notebook example.
+//
+//  It provides different cache replacement policies and cache sizes as tunables
+//  for an optimizer to tune for different workloads.
 //
 // Notes:
 //      <special-instructions>
 //
 //*********************************************************************
 
+// Include all the common headers for the application
+// (including Mlos.Core and component settings registry code generation output)
+//
 #include "stdafx.h"
 
+// Include platform specific implementations of some Mlos functions.
+// Only needed in one compilation unit for a given target.
+//
 #include "MlosPlatform.Std.inl"
 
 using namespace Mlos::Core;
 using namespace SmartCache;
 
-// Include codegen files.
-//
-#include "GlobalDispatchTable.h"
-
 #include "SmartCacheImpl.h"
 #include "Workloads.h"
 
+// A basic Windows result code handler.

The curious reader might ask: aren't we running this on linux? ;)

bpkroth

comment created time in 2 days

Pull request review comment microsoft/MLOS

Document SmartCache C++ example code

 main(
     //
     ISharedChannel& feedbackChannel = mlosContext.FeedbackChannel();
 
+    // This background thread uses a lambda to monitor the feedback channel for
+    // new messages and process them using the callbacks registered for each
+    // message type in the global dispatch table.
+    //
     std::future<bool> feedbackChannelReader = std::async(

how is the return value used? just to indicate that the thread finished?

bpkroth

comment created time in 2 days

Pull request review comment microsoft/MLOS

Document SmartCache C++ example code

 constexpr uint32_t SmartCache::ObjectDeserializationHandler::DispatchTableBaseIn
     return static_cast<uint32_t>(Mlos::Core::ObjectDeserializationHandler::DispatchTableElementCount());
 }
 
+// Note: As additional settings registries are included in the same project the
+// DispatchTableBaseIndex()s should be added together for each prior component
+// corresponding to the DispatchTables that are concatenated below.
+// See SmartSharedChannel/GlobalDispatchTable.h for an example.
+
+// Registers each of the code generated messages for the channel message handler

Maybe this sentence should go first? My first question looking at this file is "what is a dispatch table"? Adding some more context might also help.

bpkroth

comment created time in 2 days

Pull request review comment microsoft/MLOS

Document SmartCache C++ example code

 public static void Main(string[] args) 
             Task grpcServerTask = CreateHostBuilder(Array.Empty<string>()).Build().RunAsync(cancellationTokenSource.Token);
 
+            // Start the MainAgent message processing loop as a background thread.
+            //
+            // In MainAgent.RunAgent we loop on the shared memory control and
+            // telemetry channels looking for messages and dispatching they to
            // telemetry channels looking for messages and dispatching them to

bpkroth

comment created time in 2 days

Pull request review comment microsoft/MLOS

Document SmartCache C++ example code

 private static void RequestNewConfigurationMessageHandler(SmartCacheProxy.Reques 
                     string currentConfigJsonString = JsonSerializer.Serialize(currentConfigDictionary, JsonOptions);
 
-                    // Register an observation.
+                    // Send (register) the observation with the optimizer.

"send with the optimizer" doesn't really work? with/to?

bpkroth

comment created time in 2 days

Pull request review comment microsoft/MLOS

Document SmartCache C++ example code

 public static void Main(string[] args) 
             Task grpcServerTask = CreateHostBuilder(Array.Empty<string>()).Build().RunAsync(cancellationTokenSource.Token);
 
+            // Start the MainAgent message processing loop as a background thread.
+            //
+            // In MainAgent.RunAgent we loop on the shared memory control and
+            // telemetry channels looking for messages and dispatching they to
+            // their registered callback handlers.
+            //
+            // The set of recognized messages are dynamically registered using
+            // the RegisterSettingsAssembly method which is called through the
+            // handler for the RegisterAssemblyRequestMessage.
+            //
+            // Once registered, the SettingsAssemblyManager uses reflect to
            // Once registered, the SettingsAssemblyManager uses reflection to

not sure?

bpkroth

comment created time in 2 days

Pull request review comment microsoft/MLOS

Document SmartCache C++ example code

 public static void Main(string[] args) 
             Task grpcServerTask = CreateHostBuilder(Array.Empty<string>()).Build().RunAsync(cancellationTokenSource.Token);
 
+            // Start the MainAgent message processing loop as a background thread.
+            //
+            // In MainAgent.RunAgent we loop on the shared memory control and
+            // telemetry channels looking for messages and dispatching they to
+            // their registered callback handlers.
+            //
+            // The set of recognized messages are dynamically registered using
            // The set of recognized messages is dynamically registered using

bpkroth

comment created time in 2 days

Pull request review comment microsoft/MLOS

Document SmartCache C++ example code

 main(
     //
     ISharedChannel& feedbackChannel = mlosContext.FeedbackChannel();
 
+    // This background thread uses a lambda to monitor the feedback channel for
+    // new messages and process them using the callbacks registered for each
+    // message type in the global dispatch table.
+    //
     std::future<bool> feedbackChannelReader = std::async(
         std::launch::async,
         [&feedbackChannel]
     {
+        // GlobalDispatchTable defines the set of recognized messages by this
+        // application.
+        // See GlobalDispatchTable.h for details.
+        //
         auto globalDispatchTable = GlobalDispatchTable();
+
+        // This starts a loop to handle reading messages from the feedback
+        // channel, looking them up in the dispatch table, and calling the
+        // callbacks associated with them.
+        //
         feedbackChannel.ProcessMessages(globalDispatchTable.data(), globalDispatchTable.size());
 
         return true;
     });
 
-    // Register Mlos.SmartCache Settings assembly.
+    // Register the SmartCache.SettingsRegistry assembly with the external agent.
+    //
+    // This prepares the external agent to begin handling messages from our
+    // smart component on new telemetry and feedback channels.
+    //
+    // To do that it sends a RegisterAssemblyRequestMessage to the agent on the
+    // control channel that includes the name of settings registry (annotated C#
+    // data structures used for code generation) assembly (dll) for this smart
+    // component.
+    //
+    // When the (C#) agent receives that message it dynamically loads the
+    // specified dll into its address space and calls an AssemblyInitializer
+    // static class within that dll to setup the message handlers (callbacks).
+    //
+    // See Also: SmartCache.SettingsRegistry/AssemblyInitializer.cs
     //
     hr = mlosContext.RegisterSettingsAssembly(
         "SmartCache.SettingsRegistry.dll",
         SmartCache::ObjectDeserializationHandler::DispatchTableBaseIndex());
     CheckHR(hr);
 
-    // Create shared component configuration.
+    // Create a component configuration object.
+    // This will be stored in a shared memory region below for use by both the
+    // component and the external agent.
     //
     Mlos::Core::ComponentConfig<SmartCache::SmartCacheConfig> config(mlosContext);
 
     // Initialize config with default values.
     //
+    // TODO: Eventually we expect these default values to be initialized from
+    // the SettingsRegistry code generation process themselves.
+    //
     config.ConfigId = 1;
     config.EvictionPolicy = SmartCache::CacheEvictionPolicy::LeastRecentlyUsed;
     config.CacheSize = 100;
 
+    // Checks to see if there's already a shared memory region for storing the
+    // config for this component and if not creates it.
+    //
     hr = mlosContext.RegisterComponentConfig(config);
     CheckHR(hr);
 
-    // Create an intelligent component.
+    // Create an instance of our SmartCache component to tune.
+    //
+    // Note that we pass it a ComponentConfig instance, which also includes our
+    // MlosContext instance, so that the component can internally send telemtry
    // MlosContext instance, so that the component can internally send telemetry
bpkroth

comment created time in 2 days

Pull request review comment microsoft/MLOS

Document SmartCache C++ example code

 // @File: Main.cpp
 //
 // Purpose:
-//      <description>
+//  The main entrypoint to the SmartCache example.
+//
+//  It's meant as an end-to-end microbenchmark example for the C++ version of the
+//  SmartCache Python Notebook example.
+//
+//  It provides different cache replacement policies and cache sizes as tunables
+//  for an optimizer to tune for different workloads.
 //
 // Notes:
 //      <special-instructions>
 //
 //*********************************************************************
 
+// Include all the common headers for the application
+// (including Mlos.Core and component settings registry code generation output)
+//
 #include "stdafx.h"
 
+// Include platform specific implementations of some Mlos functions.
    // Include platform specific implementations of some MLOS functions.
bpkroth

comment created time in 2 days

PullRequestReviewEvent

issue comment amueller/word_cloud

Words do not form within mask

Thanks for your thoughts. Did that fix the error?

I'm happy to accept PRs for 1 and 3.

For 2, this is really not the suggested input and I'm not sure why someone would sort their text. Can you maybe give some input on that? We could try to detect this special case and adjust the behavior. I haven't done that so far because I really don't understand why people sort their text.

re 4) indeed, clearly NP-hard, so it's just greedy. Actually not even optimally greedy but just pick a random place where it would fit. Basically the stupidest way possible.

forthrin

comment created time in 2 days
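The greedy strategy described in that last point — pick a random spot and keep it if the word's box fits — can be sketched in a few lines of pure Python. This is a toy illustration of the idea on a boolean grid, not the actual word_cloud implementation:

```python
import random

def place_words(mask, sizes, tries=100, seed=0):
    """Greedy random placement: for each word (given as an (h, w) box),
    try random positions until one fits entirely inside the True region
    of `mask`; give up on a word after `tries` attempts."""
    rng = random.Random(seed)
    rows, cols = len(mask), len(mask[0])
    placed = []
    for h, w in sizes:
        for _ in range(tries):
            r = rng.randrange(rows - h + 1)
            c = rng.randrange(cols - w + 1)
            # Accept the first random spot where the whole box fits.
            if all(mask[r + i][c + j] for i in range(h) for j in range(w)):
                placed.append((r, c))
                # Mark cells occupied so later words cannot overlap.
                for i in range(h):
                    for j in range(w):
                        mask[r + i][c + j] = False
                break
    return placed

mask = [[True] * 10 for _ in range(10)]
boxes = place_words(mask, [(2, 4), (3, 3), (1, 5)])
print(boxes)
```

No optimality is attempted: the first fitting random position wins, which matches the "not even optimally greedy" description above.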

pull request comment microsoft/MLOS

added feature imputation to RERF predict()

Thanks! Can you maybe add a test that explicitly tests this?

edcthayer

comment created time in 2 days

pull request comment scikit-learn/scikit-learn

CLN Deprecations position args in PartialDependenceDisplay.plot

I vote 2 I think? that seems most consistent and doesn't cost us much?

thomasjpfan

comment created time in 3 days

pull request comment scikit-learn/scikit-learn

ENH Deprecates _pairwise attribute and adds pairwise to estimator tags

do you wanna fix conflicts?

thomasjpfan

comment created time in 3 days

pull request comment scikit-learn/scikit-learn

CLN Only check for n_features_in_ only when it exists

I think it should be fine to have people use Normalizer without calling fit though it makes the logic a bit more tricky. If you call transform twice on differently shaped data, should it error? I guess there's no reason to?

thomasjpfan

comment created time in 3 days
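The "only check when it exists" logic under discussion might look roughly like this (a simplified pure-Python sketch, not the scikit-learn implementation):

```python
class StatelessNormalizer:
    """Toy stateless transformer: fit records n_features_in_, but transform
    also works without fit, in which case no shape consistency is enforced.
    Illustrative only; not scikit-learn's Normalizer."""

    def fit(self, X):
        self.n_features_in_ = len(X[0])
        return self

    def _check_n_features(self, X):
        # Only check for n_features_in_ when it exists (i.e. fit was called).
        expected = getattr(self, "n_features_in_", None)
        if expected is not None and len(X[0]) != expected:
            raise ValueError(
                f"X has {len(X[0])} features, but expected {expected}")

    def transform(self, X):
        self._check_n_features(X)
        # Scale each row to unit L1 norm (a stand-in for Normalizer's job).
        return [[v / sum(map(abs, row)) for v in row] for row in X]

t = StatelessNormalizer()
t.transform([[1.0, 1.0]])
t.transform([[1.0, 1.0, 2.0]])  # no error: nothing was recorded yet
t.fit([[1.0, 2.0]])             # from here on, shape mismatches raise
```

Under this sketch, calling transform twice on differently shaped data without fit simply succeeds, which matches the "no reason to error" reading above.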

Pull request review comment scikit-learn/scikit-learn

ENH Adds Column name consistency

 def check_requires_y_none(name, estimator_orig, strict_mode=True):
             warnings.warn(warning_msg, FutureWarning)
 
+def check_dataframe_column_names_consistency(name, estimator_orig):
+    try:
+        import pandas as pd
+    except ImportError:
+        raise SkipTest("pandas is not installed: not checking "
+                       "column name consistency for pandas")
+
+    def _construct_dataframe(X, columns):
+        return pd.DataFrame(X, columns=columns)
+    _check_column_name_consistency(name, estimator_orig, _construct_dataframe,
+                                   "dataframe")
+
+
+def check_dataarray_column_names_consistency(name, estimator_orig):
+    try:
+        import xarray as xr
+    except ImportError:
+        raise SkipTest("xarray is not installed: not checking "
+                       "column name consistency for xarray")
+
+    def _construct_xarray(X, columns):
+        return xr.DataArray(X, dims=('index', 'columns'),
+                            coords={'columns': columns})
+    _check_column_name_consistency(name, estimator_orig, _construct_xarray,
+                                   "xarray")
+
+
+def _check_column_name_consistency(name, estimator_orig, construct_X,
+                                   array_name):
+    estimator = clone(estimator_orig)
+
+    X_orig, _ = make_regression(random_state=0, n_features=10)
+    X_orig = _enforce_estimator_tags_x(estimator, X_orig)
+    X_orig = _pairwise_estimator_convert_X(X_orig, estimator)
+
+    n_samples, n_features = X_orig.shape
+    names = np.array([f"col_{i}" for i in range(n_features)])
+    X = construct_X(X_orig, names)
+
+    rng = np.random.RandomState(0)
+    if is_regressor(estimator):
+        y = rng.normal(size=n_samples)
+    else:
+        y = rng.randint(low=0, high=2, size=n_samples)
+    y = _enforce_estimator_tags_y(estimator, y)

Doesn't this do the int casting already?

thomasjpfan

comment created time in 3 days

Pull request review comment scikit-learn/scikit-learn

ENH Adds Column name consistency

 def transform(self, X):
         # Need not validate X again as it would have already been validated
         # in the Imputer calling MissingIndicator
         if not self._precomputed:
-            X = self._validate_input(X, in_fit=True)

Was that a bug? Can you add a regression test?

thomasjpfan

comment created time in 3 days

Pull request review comment scikit-learn/scikit-learn

ENH Adds Column name consistency

 def check_requires_y_none(name, estimator_orig, strict_mode=True):
             warnings.warn(warning_msg, FutureWarning)
 
+def check_dataframe_column_names_consistency(name, estimator_orig):
+    try:
+        import pandas as pd
+    except ImportError:
+        raise SkipTest("pandas is not installed: not checking "
+                       "column name consistency for pandas")
+
+    def _construct_dataframe(X, columns):
+        return pd.DataFrame(X, columns=columns)
+    _check_column_name_consistency(name, estimator_orig, _construct_dataframe,
+                                   "dataframe")
+
+
+def check_dataarray_column_names_consistency(name, estimator_orig):
+    try:
+        import xarray as xr
+    except ImportError:
+        raise SkipTest("xarray is not installed: not checking "
+                       "column name consistency for xarray")
+
+    def _construct_xarray(X, columns):
+        return xr.DataArray(X, dims=('index', 'columns'),
+                            coords={'columns': columns})
+    _check_column_name_consistency(name, estimator_orig, _construct_xarray,
+                                   "xarray")
+
+
+def _check_column_name_consistency(name, estimator_orig, construct_X,
+                                   array_name):
+    estimator = clone(estimator_orig)
+
+    X_orig, _ = make_regression(random_state=0, n_features=10)
+    X_orig = _enforce_estimator_tags_x(estimator, X_orig)
+    X_orig = _pairwise_estimator_convert_X(X_orig, estimator)
+
+    n_samples, n_features = X_orig.shape
+    names = np.array([f"col_{i}" for i in range(n_features)])
+    X = construct_X(X_orig, names)
+
+    rng = np.random.RandomState(0)
+    if is_regressor(estimator):

Surely there's a helper for that?

thomasjpfan

comment created time in 3 days

Pull request review comment scikit-learn/scikit-learn

ENH Adds Column name consistency

 def test_repr_html_wraps():
     with config_context(display='diagram'):
         output = tree._repr_html_()
         assert "<style>" in output
+
+
+def test_feature_names_in():
+    # Simple checks for feature_names_in
+    pd = pytest.importorskip("pandas")
+    iris = datasets.load_iris()
+    df = pd.DataFrame(iris.data, columns=iris.feature_names)
+    y = iris.target
+
+    class NoOpTransformer(TransformerMixin, BaseEstimator):
+        def fit(self, X, y):
+            self._validate_data(X)
+            return self
+
+        def transform(self, X):
+            self._validate_data(X, reset=False)
+            return X
+
+    trans = NoOpTransformer().fit(df, y)
+    assert_array_equal(trans.feature_names_in_, df.columns)
+
+    msg = "The column names should match those that were passed"

test fitting on numpy and transforming pandas?

thomasjpfan

comment created time in 3 days

Pull request review comment scikit-learn/scikit-learn

ENH Adds Column name consistency

+import pytest

The name of this file is weird, probably because this has been backported from the other PR?

thomasjpfan

comment created time in 3 days

Pull request review comment scikit-learn/scikit-learn

ENH Adds Column name consistency

+import pytest
+
+import numpy as np
+from numpy.testing import assert_array_equal
+
+from sklearn.utils._array_out import _get_feature_names
+
+
+def _construct_array(array_type, column_names):
+    X = np.array([[1, 2, 3], [4, 5, 6]], dtype=float)
+
+    if array_type == "dataframe":
+        pd = pytest.importorskip("pandas")
+        return pd.DataFrame(X, columns=column_names)
+    elif array_type == "dataarray":
+        xr = pytest.importorskip("xarray")
+        return xr.DataArray(X, dims=('index', 'columns'),
+                            coords={'columns': column_names})
+
+
+@pytest.mark.parametrize("array_type", ["dataframe", "dataarray"])
+def test_pandas_get_feature_names(array_type):

can you test with integer column names as well? and maybe a mix of integer, string and object column names? you know, for fun?

thomasjpfan

comment created time in 3 days
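One plausible policy for the integer and mixed-type cases the comment asks about — treat only all-string columns as real feature names, and return None otherwise — can be sketched in pure Python. This is an assumption about the desired behavior, not what scikit-learn's `_get_feature_names` actually does:

```python
def get_feature_names(columns):
    """Toy feature-name extractor: return the names only when they are all
    strings; for integer or mixed-type columns return None, since positional
    labels carry no name information. Illustrative policy only."""
    if all(isinstance(c, str) for c in columns):
        return list(columns)
    return None

print(get_feature_names(["a", "b", "c"]))  # ['a', 'b', 'c']
print(get_feature_names([0, 1, 2]))        # None (integer columns)
print(get_feature_names([0, "b", 2.5]))    # None (mixed columns)
```

Parametrizing the existing test over these three column sets would cover the cases mentioned above.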

Pull request review comment scikit-learn/scikit-learn

ENH Adds Column name consistency

 def _check_n_features(self, X, reset):
                                        self.n_features_in_)
                 )
 
+    def _check_feature_names(self, X, reset=True):
+        """Validate feature names and set or check the `feature_names_in_`
+        attribute.
+
+        Parameters
+        ----------
+        X : {dataframe-like} of shape (n_samples, n_features)
+            The input samples.
+        reset : bool, default=True
+            Whether to reset the `feature_names_in_` attribute.
+            If False, the Input will be checked for consistency with data
+            provided when reset was last True.
+        """
+
+        if reset:
+            self.feature_names_in_ = _get_feature_names(X)
+            return
+
+        fitted_feature_names = getattr(self, "feature_names_in_", None)
+        if fitted_feature_names is None:
+            # no feature names to check
+            return
+
+        feature_names_in = _get_feature_names(X)

nitpick: maybe name this variable something else? like feature_names_new?

thomasjpfan

comment created time in 3 days

Pull request review comment scikit-learn/scikit-learn

ENH Adds Column name consistency

 def _check_n_features(self, X, reset):
                                        self.n_features_in_)
                 )
 
+    def _check_feature_names(self, X, reset=True):
+        """Validate feature names and set or check the `feature_names_in_`
+        attribute.
+
+        Parameters
+        ----------
+        X : {dataframe-like} of shape (n_samples, n_features)
+            The input samples.
+        reset : bool, default=True
+            Whether to reset the `feature_names_in_` attribute.
+            If False, the Input will be checked for consistency with data
            If False, the input will be checked for consistency with feature_names of data
thomasjpfan

comment created time in 3 days
