If you are wondering where this site's data comes from, please visit https://api.github.com/users/ajschumacher/events. GitMemory does not store any data itself; it only uses NGINX to cache data for a period of time. The idea behind GitMemory is simply to give users a better reading experience.
Aaron Schumacher (ajschumacher) · Washington, DC · http://planspace.org/ · data, machines, science, learning

ajschumacher/.emacs.d (7 stars): "dot files" aka configuration aka customization

ajschumacher/cles (6 stars): Common-Language Effect Size

ajschumacher/clean_data_with_R (4 stars): resources for a talk about cleaning data with R

ajschumacher/askreduce (3 stars): distributed questions

ajschumacher/cnfg (3 stars): config without the vowels, or, simple Python config in your home directory

ajschumacher/annbot (1 star): because the world needs bots

ajschumacher/candidates (1 star): getting some data out of a PDF

ajschumacher/cjcc_to_data (1 star): extract data from CJCC Resource Locator

issue comment: ajschumacher/ajschumacher.github.io

treeees

XGBoost

  • handles sparsity
  • weighted quantile sketch
  • regularization
  • out-of-core
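
A minimal sketch of where a couple of these surface in xgboost's scikit-learn wrapper (toy data made up here; `reg_lambda`/`reg_alpha` are the regularization knobs, and NaN marks the missing values the sparsity handling routes):

```python
import numpy as np
import xgboost as xgb

# Toy data with NaNs: sparsity-aware split finding learns a
# "default direction" for missing values at each node.
X = np.array([[1.0, np.nan], [2.0, 0.5], [np.nan, 1.5], [4.0, 2.0]] * 25)
y = np.array([0, 1, 0, 1] * 25)

model = xgb.XGBClassifier(
    n_estimators=20,
    reg_lambda=1.0,  # L2 penalty on leaf weights
    reg_alpha=0.0,   # L1 penalty
    missing=np.nan,  # which value counts as "missing"
)
model.fit(X, y)
print(model.predict(X[:4]))
```
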
ajschumacher

comment created 26 minutes ago

issue comment: ajschumacher/ajschumacher.github.io

xgboost

as part of #290?

ajschumacher

comment created 36 minutes ago

issue comment: ajschumacher/ajschumacher.github.io

treeees

CatBoost

  • oblivious trees
  • category interactions
  • permutation for category score transform and boosting
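
A minimal sketch of how these surface in catboost's Python API (toy data; `cat_features` marks the columns that get the permutation-based target statistics, and the trees are oblivious by default):

```python
from catboost import CatBoostClassifier, Pool

# Toy rows: one numeric column, one categorical column.
X = [[1.0, "red"], [2.0, "blue"], [3.0, "red"], [4.0, "green"]] * 25
y = [0, 1, 0, 1] * 25

train = Pool(X, y, cat_features=[1])  # column 1 is categorical
model = CatBoostClassifier(iterations=20, verbose=False)
model.fit(train)
print(model.predict(X[:4]))
```
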
ajschumacher

comment created 42 minutes ago

starred pympler/pympler

starred an hour ago

starred thoppe/NansAreNumbers

starred an hour ago

issue comment: ajschumacher/ajschumacher.github.io

de-categorizing categorical data

https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931

There seems to be no reason to use One-Hot Encoding over Numeric Encoding.

(one-hot not good for trees)
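A toy comparison (made-up data; note the integer codes only happen to line up with the signal here because the categories sort alphabetically, which is the favorable case for numeric encoding):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
cats = rng.choice(list("abcdefgh"), size=(500, 1))
y = (cats[:, 0] > "d").astype(int)  # the target depends on the category

X_int = OrdinalEncoder().fit_transform(cats)           # one numeric column
X_hot = OneHotEncoder().fit_transform(cats).toarray()  # eight 0/1 columns

# A shallow tree separates {a..d} from {e..h} with one numeric split,
# but needs one split per one-hot column otherwise.
for name, X in [("numeric", X_int), ("one-hot", X_hot)]:
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
    print(name, tree.score(X, y))
```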

ajschumacher

comment created 19 hours ago

push event: ajschumacher/ajschumacher.github.io

Aaron Schumacher

commit sha 641e8ab8c284b617162a6194fd0aac99c224f8e6

WoE (closes #288)

pushed 19 hours ago

issue closed: ajschumacher/ajschumacher.github.io

weight of evidence (WOE)

https://contrib.scikit-learn.org/category_encoders/woe.html

https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html?m=1#Weight-of-Evidence-and-Information-Value-in-Python-SAS-and-R

http://www.m-hikari.com/ams/ams-2014/ams-65-68-2014/zengAMS65-68-2014.pdf
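
For reference, a sketch of the usual calculation (toy data; the ratio is sometimes written the other way up, so the sign convention varies across these references):

```python
import numpy as np
import pandas as pd

# Toy data: a category and a binary target (1 = "event").
df = pd.DataFrame({
    "grade": list("AAABBBCCCC"),
    "bad":   [0, 0, 1, 0, 1, 1, 0, 1, 1, 1],
})

events = df.groupby("grade")["bad"].sum()
non_events = df.groupby("grade")["bad"].count() - events

# WOE per category: log of (share of events) over (share of non-events).
woe = np.log((events / events.sum()) / (non_events / non_events.sum()))
print(woe)
```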

closed 19 hours ago

ajschumacher

issue opened: ajschumacher/ajschumacher.github.io

feature selection

"Information Value" (IV)

Shapley values

permutation tests etc.

Xgboost Feature Importance Computed in 3 Ways with Python https://mljar.com/blog/feature-importance-xgboost/
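
Of these, permutation importance is the easiest to sketch with scikit-learn (toy data): shuffle one feature at a time on held-out data and measure the drop in score.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=6,
                           n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
# importances_mean: average score drop per permuted feature
result = permutation_importance(model, X_te, y_te, n_repeats=10,
                                random_state=0)
print(result.importances_mean.round(3))
```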

created 20 hours ago

starred evidence-dev/evidence

starred a day ago

issue opened: ajschumacher/ajschumacher.github.io

treeees

LightGBM

  • smart categorical splits
  • leaf-wise tree growth
  • focuses on high-gradient training examples
  • uses exclusive feature bundling
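
A minimal sketch in lightgbm's Python API (toy data): `num_leaves` governs the leaf-wise growth, a pandas `category` column gets the smart categorical splits, and the gradient-based sampling and feature bundling (GOSS and EFB) happen internally.

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.normal(size=200),
    "c": pd.Categorical(rng.choice(list("abcd"), 200)),
})
y = (df["x"] > 0).astype(int)

model = lgb.LGBMClassifier(n_estimators=20, num_leaves=15)
model.fit(df, y, categorical_feature=["c"])  # or rely on the dtype
print(model.predict(df.head()))
```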

created a day ago

issue comment: ajschumacher/ajschumacher.github.io

de-categorizing categorical data

For a categorical feature with high cardinality (#category is large), it often works best to treat the feature as numeric, either by simply ignoring the categorical interpretation of the integers or by embedding the categories in a low-dimensional numeric space.

https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html

ajschumacher

comment created a day ago

push event: ajschumacher/mljar-supervised

Aaron Schumacher

commit sha 84c30520ad9c6c0e56d3b99a6f41305de0fb4853

typo: "Atomatic" -> "Automatic"

pushed a day ago

forked ajschumacher/mljar-supervised

Automated Machine Learning Pipeline with Feature Engineering and Hyper-Parameters Tuning :rocket:

https://mljar.com

forked a day ago

starred mljar/mljar-supervised

starred a day ago

starred catboost/catboost

starred a day ago

starred dmitru/pines

starred a day ago

issue comment: ajschumacher/ajschumacher.github.io

de-categorizing categorical data

in catboost paper https://arxiv.org/pdf/1706.09516.pdf

Further, there is a similar issue in standard algorithms of preprocessing categorical features. One of the most effective ways [6, 25] to use them in gradient boosting is converting categories to their target statistics. A target statistic is a simple statistical model itself, and it can also cause target leakage and a prediction shift.

[6] B. Cestnik et al. Estimating probabilities: a crucial task in machine learning. In ECAI, volume 90, pages 147–149, 1990.

[25] D. Micci-Barreca. A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. ACM SIGKDD Explorations Newsletter, 3(1):27–32, 2001. http://helios.mm.di.uoa.gr/~rouvas/ssi/sigkdd/sigkdd.vol3.1/barreca.pdf
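
A sketch of the leak and one common mitigation, out-of-fold target statistics (made-up data; CatBoost's own fix is its ordered, permutation-based scheme, which this is not):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
df = pd.DataFrame({"cat": rng.choice(list("abc"), 300)})
p = df["cat"].map({"a": 0.2, "b": 0.5, "c": 0.8})
df["y"] = (rng.random(300) < p).astype(int)

# Naive target statistic: each row's own label leaks into its encoding.
df["ts_naive"] = df.groupby("cat")["y"].transform("mean")

# Out-of-fold: encode each fold using only the other folds' labels.
df["ts_oof"] = np.nan
for tr, te in KFold(n_splits=5, shuffle=True, random_state=0).split(df):
    means = df.iloc[tr].groupby("cat")["y"].mean()
    df.loc[te, "ts_oof"] = df.iloc[te]["cat"].map(means).values

print(df.head())
```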

ajschumacher

comment created 2 days ago

issue opened: ajschumacher/ajschumacher.github.io

missing data

in xgboost https://arxiv.org/abs/1603.02754 they treat "missing" as a special value that doesn't order with all others, and they try all splits with it on either side

Sparsity-aware Split Finding In many real-world problems, it is quite common for the input x to be sparse. There are multiple possible causes for sparsity: 1) presence of missing values in the data; 2) frequent zero entries in the statistics; and, 3) artifacts of feature engineering such as one-hot encoding. It is important to make the algorithm aware of the sparsity pattern in the data. In order to do so, we propose to add a default direction in each tree node, which is shown in Fig. 4. When a value is missing in the sparse matrix x, the instance is classified into the default direction.
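
A minimal sketch with the xgboost Python API (toy data): NaN plays the role of the special value, and prediction-time NaNs follow whichever default direction each split learned.

```python
import numpy as np
import xgboost as xgb

X = np.array([[0.0], [1.0], [np.nan], [3.0]] * 50)
y = np.array([0, 0, 1, 1] * 50)

dtrain = xgb.DMatrix(X, label=y, missing=np.nan)
booster = xgb.train({"objective": "binary:logistic", "max_depth": 2},
                    dtrain, num_boost_round=10)

# A missing value at prediction time goes down the learned default branch.
print(booster.predict(xgb.DMatrix(np.array([[np.nan]]), missing=np.nan)))
```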

created 2 days ago

issue comment: ajschumacher/ajschumacher.github.io

de-categorizing categorical data

weight of evidence (WOE) is one of these too; see #288

ajschumacher

comment created 2 days ago

issue opened: ajschumacher/ajschumacher.github.io

weight of evidence (WOE)

https://contrib.scikit-learn.org/category_encoders/woe.html

https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html?m=1#Weight-of-Evidence-and-Information-Value-in-Python-SAS-and-R

http://www.m-hikari.com/ams/ams-2014/ams-65-68-2014/zengAMS65-68-2014.pdf

created 2 days ago

issue opened: ajschumacher/ajschumacher.github.io

de-categorizing categorical data

Here's a method they use in the xgboost paper https://arxiv.org/abs/1603.02754:

Since a tree based model is better at handling continuous features, we preprocess the data by calculating the statistics of average CTR and count of ID features on the first ten days, replacing the ID features by the corresponding count statistics during the next ten days for training.
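
A tiny pandas sketch of that idea (column names made up): statistics computed on an early window stand in for the raw ID in a later one.

```python
import pandas as pd

early = pd.DataFrame({"ad_id": [1, 1, 1, 2, 2, 3], "click": [1, 0, 1, 0, 0, 1]})
late = pd.DataFrame({"ad_id": [1, 2, 3, 4]})

counts = early.groupby("ad_id").size().rename("ad_count")
ctr = early.groupby("ad_id")["click"].mean().rename("ad_ctr")

# Replace the high-cardinality ID with its count and average CTR.
late = late.join(counts, on="ad_id").join(ctr, on="ad_id")
print(late)  # ad 4 was unseen in the early window, so it gets NaN
```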

created 2 days ago

issue comment: ajschumacher/ajschumacher.github.io

combining percentiles

But you can approximately combine quantile info! From the xgboost paper https://arxiv.org/abs/1603.02754:

Quantile summary (without weights) is a classical problem in the database community [14, 24]. However, the approximate tree boosting algorithm reveals a more general problem – finding quantiles on weighted data. To the best of our knowledge, the weighted quantile sketch proposed in this paper is the first method to solve this problem. The weighted quantile summary is also not specific to the tree learning and can benefit other applications in data science and machine learning in the future.

[14] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, pages 58–66, 2001.

[24] Q. Zhang and W. Wang. A fast algorithm for approximate quantiles in high speed data streams. In Proceedings of the 19th International Conference on Scientific and Statistical Database Management, 2007.
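
For intuition, the exact in-memory version of a weighted quantile is straightforward; the paper's contribution is doing this approximately with bounded memory in a streaming pass, with the loss's Hessians as the weights. A direct sketch:

```python
import numpy as np

def weighted_quantile(values, weights, q):
    """Quantile of values where point i counts with mass weights[i]."""
    order = np.argsort(values)
    v = np.asarray(values, dtype=float)[order]
    w = np.asarray(weights, dtype=float)[order]
    cum = np.cumsum(w)
    return v[np.searchsorted(cum, q * cum[-1])]

print(weighted_quantile([1, 2, 3, 4], [1, 1, 1, 1], 0.5))   # plain median: 2.0
print(weighted_quantile([1, 2, 3, 4], [10, 1, 1, 1], 0.5))  # mass piled on 1: 1.0
```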

ajschumacher

comment created 2 days ago

issue opened: ajschumacher/ajschumacher.github.io

xgboost

https://arxiv.org/abs/1603.02754

created 2 days ago

issue comment: ajschumacher/ajschumacher.github.io

modernizing books

I like how the https://staffeng.com/ book uses QR codes with its refs. Would be cool to move them into the margins for easier access. I think it doesn't have notes, just links; supporting both is also probably good.

ajschumacher

comment created 2 days ago

issue opened: ajschumacher/ajschumacher.github.io

sum of squares

As in computational def…

Diminishing returns, as in CLT… understanding shape of curve


The Herfindahl–Hirschman Index turns out to just be a sum of squares! And it's the same as the Simpson diversity index, etc.! https://en.wikipedia.org/wiki/Herfindahl%E2%80%93Hirschman_Index

https://en.wikipedia.org/wiki/Concentration_ratio "CR4" (for example) is just the sum of the shares of the top 4!
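
A quick worked example with made-up shares (HHI is often quoted with percentage shares, i.e. scaled up by 10,000):

```python
shares = [0.40, 0.25, 0.15, 0.10, 0.05, 0.05]  # shares summing to 1

hhi = sum(s**2 for s in shares)        # Herfindahl-Hirschman Index
gini_simpson = 1 - hhi                 # Simpson's index is the same sum;
                                       # 1 minus it is the diversity form
cr4 = sum(sorted(shares, reverse=True)[:4])  # four-firm concentration ratio

print(hhi, gini_simpson, cr4)  # approximately 0.26, 0.74, 0.90
```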

created 2 days ago

push event: ajschumacher/ajschumacher.github.io

Aaron Schumacher

commit sha fa1d863a9375c6a995f25499dd7f0e86657f1cca

add solution (thanks Erica!)

pushed 2 days ago

issue comment: ajschumacher/ajschumacher.github.io

linear algebra

oh also: c8948588ad3d4f354d49b6faf10fb9db8ed18160

ajschumacher

comment created 3 days ago

push event: ajschumacher/ajschumacher.github.io

Aaron Schumacher

commit sha c8948588ad3d4f354d49b6faf10fb9db8ed18160

determinant

pushed 3 days ago