If you are wondering where the data of this site comes from, please visit https://api.github.com/users/piskvorky/events. GitMemory does not store any data, but only uses NGINX to cache data for a period of time. The idea behind GitMemory is simply to give users a better reading experience.
Radim Řehůřek piskvorky @RaRe-Technologies Geoje-do https://twitter.com/radimrehurek Creator of gensim, founder at @RaRe-Technologies. Lover of history and beginnings in general.

piskvorky/data_science_python 57

Source code for the "Practical Data Science in Python" tutorial

piskvorky/flann 13

Fast Library for Approximate Nearest Neighbors

piskvorky/pattern 6

Web mining module for Python

MacPython/gensim-wheels 1

Repository to build and test gensim wheels on OSX

piskvorky/annoy 1

Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk

piskvorky/askbot-devel 1

ASKBOT is a StackOverflow-like Q&A forum, based on CNPROG.

piskvorky/dexter 1

Dexter is a framework that implements some popular algorithms and provides all the tools needed to develop any entity linking technique.

piskvorky/fuzzysearch-demo 1

Demo of fuzzy (nearest neighbor) search with dense data

issue comment RaRe-Technologies/gensim

save_facebook_model() - AssertionError

Here is a snippet that can be reproduced with the attached data: train_voc.zip

import json
import gensim

with open("train_voc.json", "r") as file:
    sents = json.load(file)
model = gensim.models.fasttext.load_facebook_model("cc.de.300.bin")
model.build_vocab(sents, update=True)
model.train(sentences=sents, total_examples=len(sents), epochs=2)
gensim.models.fasttext.save_facebook_model(model, "cc.de.300.tuned.bin")

Link to the German model: https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.de.300.bin.gz

What I get in my debugging state: 2021-03-01_19-16-07

imendibo

comment created time in 43 minutes

pull request comment RaRe-Technologies/gensim

Build/test on Python 3.9

Is this something that I would be able to take a look at? It would be great to get the 3.9 wheels on pypi.org.

gojomo

comment created time in an hour

issue comment RaRe-Technologies/gensim

save_facebook_model() - AssertionError

Does the problem reproduce if you try in the 4.0.0-beta, as can be installed locally for you per instructions at https://groups.google.com/g/gensim/c/5mAeWrQN7lg/m/eMhYpktCAAAJ?

If it does not, then it's likely an already-fixed bug (after all, this issue was closed-as-fixed after a test case was created that showed the problem in 3.8.x, and didn't in develop), and the recommendation would just be: use the 4.0.0 prerelease(s) as a workaround.

imendibo

comment created time in an hour

issue comment RaRe-Technologies/gensim

save_facebook_model() - AssertionError

I understand that you cannot share commercial data. Can you reproduce the problem using other data that you can share?

imendibo

comment created time in 6 hours

delete branch RaRe-Technologies/smart_open

delete branch : client

delete time in 6 hours

push event RaRe-Technologies/smart_open

Michael Penkov

commit sha d8f1602f22fe72cbc2a4318bc85eb1215720a0dc

Refactor S3, replace high-level resource/session API with low-level client API (#583)

* wip
* wip
* all tests passing
* update documentation
* fix test
* fixup
* update documentation
* fixup in doc
* Update smart_open/s3.py Co-authored-by: Radim Řehůřek <radimrehurek@seznam.cz>
* update migration guide
* fixup
* fixup
* fixup
* fixup
* fixup
* fixup
* Update MIGRATING_FROM_OLDER_VERSIONS.rst Co-authored-by: Radim Řehůřek <radimrehurek@seznam.cz>
* more examples
* Update MIGRATING_FROM_OLDER_VERSIONS.rst Co-authored-by: Radim Řehůřek <radimrehurek@seznam.cz>

view details

push time in 6 hours

PR merged RaRe-Technologies/smart_open

Refactor S3, replace high-level resource/session API with low-level client API breaks-backward-compatibility

While functionally they are the same, the session/resource stuff is not safe to use across multiple threads and subprocesses, and the client stuff is. The former is a bit easier to use directly than the latter, but this does not impact the user, because smart_open is doing all the work.

I also refactored the way we pass keyword parameters to boto3. Previously, we had a separate parameter for each boto3 function. This was a pain, because this made parameter lists longer for each new function, e.g.

  • resource_kwargs
  • multipart_upload_kwargs
  • singlepart_upload_kwargs
  • object_kwargs

This PR moves all of the above parameters into a single nested dict and introduces a wrapper that transparently injects the parameters into the required function call.

This does break backwards compatibility, so it will need a major version bump when releasing.
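The "single nested dict plus injecting wrapper" idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the PR: the helper name `inject_kwargs`, the `"S3.Client.<method>"` key format, and the `FakeClient` stand-in (used so the example runs without boto3 or AWS credentials) are all hypothetical.

```python
def inject_kwargs(nested_kwargs, prefix="S3.Client"):
    """Return a caller that adds per-function keyword arguments.

    nested_kwargs maps "<prefix>.<function name>" to a dict of extra
    keyword arguments. The key format is an assumption for this sketch.
    """
    def call(func, **explicit):
        extra = nested_kwargs.get(f"{prefix}.{func.__name__}", {})
        # Explicit arguments win over injected ones on a name clash.
        return func(**{**extra, **explicit})
    return call


class FakeClient:
    """Hypothetical stand-in for a low-level S3 client."""

    def get_object(self, Bucket, Key, **kwargs):
        # Echo the arguments back so the injection is visible.
        return {"Bucket": Bucket, "Key": Key, **kwargs}


client = FakeClient()
client_kwargs = {"S3.Client.get_object": {"RequestPayer": "requester"}}
call = inject_kwargs(client_kwargs)
result = call(client.get_object, Bucket="bucket", Key="key")
```

With one dict keyed by function name, supporting a new client function needs no new parameter in any signature; the caller just adds another key.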

+487 -382

6 comments

7 changed files

mpenkov

pr closed time in 6 hours

issue comment RaRe-Technologies/gensim

save_facebook_model() - AssertionError

I can't share the data to make the code reproducible because it is commercial data. I removed the paths because they have no influence on the code but added a name of the fasttext model that I am using. A column "text" in a csv file contains strings with tokens that are separated with ", ".

import pandas as pd
import gensim
from gensim.models.fasttext import load_facebook_model

# Each row of the "text" column holds tokens separated by ", ".
train_text_df = pd.read_csv("text_train_data.csv", index_col=[0])
sents = []
for text_val in train_text_df["text"].values:
    sents.append(text_val.split(", "))
# sents = [['lord', 'of', 'the', 'rings'], ['lord', 'of', 'the', 'flies']]
model = load_facebook_model("cc.de.300.bin")
model.build_vocab(sents, update=True)
model.train(sentences=sents, total_examples=len(sents), epochs=10)
gensim.models.fasttext.save_facebook_model(model, "cc.de.300.tuned.bin")

imendibo

comment created time in 6 hours

pull request comment RaRe-Technologies/smart_open

Refactor S3, replace high-level resource/session API with low-level client API

Updated https://github.com/RaRe-Technologies/smart_open/blob/client/MIGRATING_FROM_OLDER_VERSIONS.rst, please have a look.

mpenkov

comment created time in 6 hours

issue comment RaRe-Technologies/gensim

save_facebook_model() - AssertionError

Thank you. What are the values of config.RESOURCE_PATH and config.MODEL_PATH?

Ideally, your reproducible example should be something that I (or another developer) can run here locally on my machine and get the same result as you do. We can then try to do something about the bug.
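One way to get a locally runnable example without sharing commercial data is to feed the same parsing step a small synthetic file with the same layout. The sketch below is only a suggestion: the helper name `load_sentences` and the synthetic rows are made up, and it assumes the layout described in this thread (an index column plus a "text" column whose tokens are separated by ", ").

```python
import csv
import io

# Synthetic stand-in for the private CSV: index column plus a "text" column.
SYNTHETIC_CSV = """,text
0,"lord, of, the, rings"
1,"lord, of, the, flies"
"""


def load_sentences(handle, column="text", sep=", "):
    """Split the chosen column of each CSV row into a list of tokens."""
    reader = csv.DictReader(handle)
    return [row[column].split(sep) for row in reader]


sents = load_sentences(io.StringIO(SYNTHETIC_CSV))
```

If the error still occurs with `sents` built this way, the script and the synthetic file can be attached to the issue as they are.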

imendibo

comment created time in 6 hours

push event RaRe-Technologies/smart_open

Michael Penkov

commit sha eaf8b385fa489a3ecd1c3e21017a2ce05e73e32a

more examples

view details

push time in 6 hours

issue comment RaRe-Technologies/gensim

save_facebook_model() - AssertionError

Sure:

train_text_df = pd.read_csv(os.path.join("..", config.RESOURCE_PATH, "text_train_data.csv"), index_col=[0])
sents = []
for text_val in train_text_df["text"].values:
    sents.append(text_val.split(", "))
# sents = [['lord', 'of', 'the', 'rings'], ['lord', 'of', 'the', 'semi-groups']]
model = load_facebook_model(os.path.join("..", config.MODEL_PATH))
model.build_vocab(sents, update=True)
model.train(sentences=sents, total_examples=len(sents), epochs=10)
gensim.models.fasttext.save_facebook_model(model, os.path.join("..", config.RESOURCE_PATH, "cc.de.300.tuned.bin"))

Followed the tutorial from https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.load_facebook_model

imendibo

comment created time in 8 hours

pull request comment RaRe-Technologies/gensim

Remove Poincare and references to deleted WordRank model

Yup, that's why I closed this PR without merging it.

mpenkov

comment created time in 9 hours

Pull request review comment RaRe-Technologies/smart_open

Refactor S3, replace high-level resource/session API with low-level client API

+Migrating to the new client-based S3 API
+========================================
+
+Version of smart_open prior to 5.0.0 used the boto3 `resource API_` for communicating with S3.
+This API was easy to integrate for smart_open developers, but this came at a cost: it was not thread- or multiprocess-safe.
+Furthermore, as smart_open supported more and more options, the transport parameter list grew, making it less maintainable.
+Starting with version 5.0.0, smart_open uses the `client API`_ instead of the resource API.
+Functionally, the little changes for the smart_open user. 
+The only difference is in passing transport parameters to the S3 backend.
+
+More specifically, the following S3 transport parameters are no longer supported:
+
+- `multipart_upload_kwargs`
+- `object_kwargs`
+- `resource`
+- `resource_kwargs`
+- `session`
+- `singlepart_upload_kwargs`
+
+If you weren't using the above parameters, nothing changes for you.
+However, if you were using any of the above, then you need to adjust your code.
+Here are some quick recipes below.
+
+If you were previously passing `session`, then construct an S3 client from the session and pass that instead.
+For example, before::
+
+    smart_open.open('s3://bucket/key', transport_params={'session': session})
+
+After::
+
+    smart_open.open('s3://bucket/key', transport_params={'client': session.client('s3')})
+
+
+If you were passing `resource`, then replace the resource with a client, and pass that instead.
+For example, before::
+
+    resource = session.resource('s3', **resource_kwargs)
+    smart_open.open('s3://bucket/key', transport_params={'resource': resource})
+
+After::
+
+    client = session.client('s3')
+    smart_open.open('s3://bucket/key', transport_params={'client': client})
+
+If you were passing any of the `*_kwargs` parameters, you will need to include them in `client_kwargs`, keeping in mind the following transformations.
+
+========================== ====================================== ==========================
+Parameter name             Resource API method                    Client API function
+========================== ====================================== ==========================
+`multipart_upload_kwargs`  `s3.Object.initiate_multipart_upload`_ `create_multipart_upload`_
+`object_kwargs`            `s3.Object.get`_                       `get_object`_
+`resource_kwargs`          ???                                    ???

That was still a work in progress; it has been fixed since.

mpenkov

comment created time in 9 hours

push event RaRe-Technologies/smart_open

Michael Penkov

commit sha 0dcfe1827cc9cf19a3e8f99ed28edb23cf4e0b90

Update MIGRATING_FROM_OLDER_VERSIONS.rst Co-authored-by: Radim Řehůřek <radimrehurek@seznam.cz>

view details

push time in 9 hours

issue comment RaRe-Technologies/gensim

save_facebook_model() - AssertionError

Are you able to share code that reproduces your problem?

imendibo

comment created time in 9 hours

started gethue/compose

started time in 9 hours

issue comment RaRe-Technologies/gensim

save_facebook_model() - AssertionError

@mpenkov 3.8.3

imendibo

comment created time in 9 hours

started xeroxzen/genuine-fake

started time in 9 hours

PR opened RaRe-Technologies/gensim

modify comment in LdaModel.inference() to be more explicit

One comment in LdaModel.inference() confused me: # The optimal phi_{dwk} is proportional to expElogthetad_k * expElogbetad_w. I think it should be modified as follows so that the comment is more accurate: # The optimal phi_{dwk} is proportional to expElogthetad_k * expElogbetad_kw.

Reasons:
1. According to Algorithm 1 in the paper "Online Learning for Latent Dirichlet Allocation" (NIPS 2010, http://www.cs.princeton.edu/~mdhoffma), phi_dwk is proportional to exp(Elogtheta_dk + Elogbeta_kw).
2. phi_{dwk} is a scalar, expElogthetad_k is a scalar, and expElogbetad_w is a vector with K elements, so expElogthetad_k * expElogbetad_w returns a vector with K elements, which does not match the type of phi_{dwk}. If we use expElogthetad_k * expElogbetad_kw instead, in which expElogbetad_kw is a scalar, it matches.
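The scalar-versus-vector argument can be checked with a small sketch in plain Python. This is not gensim's implementation; the function name `phi_for_word` and the toy numbers are made up for illustration. For a fixed document d and word w, each phi_{dwk} is the scalar product expElogthetad_k * expElogbetad_kw, normalized over the K topics.

```python
def phi_for_word(exp_elog_theta_d, exp_elog_beta, w):
    """Return [phi_dw0, ..., phi_dw(K-1)] for one document and one word.

    exp_elog_theta_d: K scalars, one per topic, for document d.
    exp_elog_beta: K rows (topics), each a list with one scalar per word.
    """
    num_topics = len(exp_elog_theta_d)
    # Each entry is scalar * scalar: topic weight k times word weight (k, w).
    unnormalized = [exp_elog_theta_d[k] * exp_elog_beta[k][w] for k in range(num_topics)]
    total = sum(unnormalized)
    return [value / total for value in unnormalized]


exp_elog_theta_d = [0.5, 0.25]            # K = 2 topics
exp_elog_beta = [[0.2, 0.8], [0.6, 0.4]]  # 2 topics x 2 words
phi = phi_for_word(exp_elog_theta_d, exp_elog_beta, w=0)
```

Indexing `exp_elog_beta[k][w]` corresponds to the `_kw` subscript in the proposed comment: for a given topic k, a single number is multiplied, not a K-element vector.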

+1 -1

0 comments

1 changed file

pr created time in 11 hours

pull request comment RaRe-Technologies/gensim

Point WordEmbeddingSimilarityIndex documentation to gensim.similarities

Merged, thank you @Witiko !

Witiko

comment created time in 13 hours

push event RaRe-Technologies/gensim

Vít Novotný

commit sha ddeeb1274c996b6992e11f7cbe7ca44c776f6daf

Point WordEmbeddingSimilarityIndex documentation to gensim.similarities (#3003)

* Remove deprecated init_sims from run_wmd.py
* Add missing cross-links between soft cosine measure classes
* Add normalized parameter to SoftCosineSimilarity
* Point WordEmbeddingSimilarityIndex documentation to gensim.similarities
* Add run_scm.py
* Redirect documentation of SCM and WMD to Gallery
* Run run_scm.py
* Add sphx_glr_run_scm_thumb.png
* Add sphx_glr_run_scm_001.png
* Use anonymous references in the documentation

view details

push time in 13 hours

PR merged RaRe-Technologies/gensim

Point WordEmbeddingSimilarityIndex documentation to gensim.similarities bug documentation

Currently, all documentation, Jupyter notebooks and code examples assume that WordEmbeddingSimilarityIndex is located in gensim.models, although it has lived in gensim.similarities since 4.0.0. This pull request fixes the documentation.

+970 -308

11 comments

26 changed files

Witiko

pr closed time in 13 hours

pull request comment RaRe-Technologies/gensim

Point WordEmbeddingSimilarityIndex documentation to gensim.similarities

I re-ran the tests, they all passed.

Witiko

comment created time in 13 hours

issue comment RaRe-Technologies/gensim

Provide Python3.9 Wheels

Related PR: https://github.com/RaRe-Technologies/gensim/pull/2966

FredHappyface

comment created time in 14 hours

delete branch RaRe-Technologies/gensim

delete branch : 2852

delete time in 14 hours

push event RaRe-Technologies/smart_open

Michael Penkov

commit sha 3b2fa86bb3e85600396b7478b0ef32fbd0edb4d9

fixup

view details

push time in 14 hours

push event RaRe-Technologies/smart_open

Michael Penkov

commit sha bc1876ee1ca9bd9e5121f744a762a203732d792e

fixup

view details

push time in 14 hours

push event RaRe-Technologies/smart_open

Michael Penkov

commit sha 9cc714393386e5a2e83c3619c2ccb4c4e1b58ea1

fixup

view details

push time in 14 hours