piskvorky/data_science_python 57
Source code for the "Practical Data Science in Python" tutorial
Fast Library for Approximate Nearest Neighbors
Web mining module for Python
Repository to build and test gensim wheels on OSX
Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk
ASKBOT is a StackOverflow-like Q&A forum, based on CNPROG.
Dexter is a framework that implements some popular algorithms and provides all the tools needed to develop any entity linking technique.
Demo of fuzzy (nearest neighbor) search with dense data
issue comment RaRe-Technologies/gensim
save_facebook_model() - AssertionError
Here is a snippet that can be reproduced with the attached data: train_voc.zip
import json
import gensim

# Load the tokenized sentences (a list of lists of tokens).
with open("train_voc.json", "r") as file:
    sents = json.load(file)
# Load the pretrained German model, extend its vocabulary, and continue training.
model = gensim.models.fasttext.load_facebook_model("cc.de.300.bin")
model.build_vocab(sents, update=True)
model.train(sentences=sents, total_examples=len(sents), epochs=2)
gensim.models.fasttext.save_facebook_model(model, "cc.de.300.tuned.bin")
Link to the German model: https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.de.300.bin.gz
What I get in my debugging state:
comment created time in 43 minutes
pull request comment RaRe-Technologies/gensim
Is this something I could take a look at? It would be great to get the 3.9 wheels on pypi.org.
comment created time in an hour
issue comment RaRe-Technologies/gensim
save_facebook_model() - AssertionError
Does the problem reproduce if you try the 4.0.0-beta, which can be installed locally per the instructions at https://groups.google.com/g/gensim/c/5mAeWrQN7lg/m/eMhYpktCAAAJ?
If it does not, then it's likely an already-fixed bug (after all, this issue was closed as fixed after a test case was created that showed the problem in 3.8.x and didn't in develop), and the recommendation would just be: use the 4.0.0 prerelease(s) as a workaround.
comment created time in an hour
issue comment RaRe-Technologies/gensim
save_facebook_model() - AssertionError
I understand that you cannot share commercial data. Can you reproduce the problem using other data that you can share?
comment created time in 6 hours
push event RaRe-Technologies/smart_open
commit sha d8f1602f22fe72cbc2a4318bc85eb1215720a0dc
Refactor S3, replace high-level resource/session API with low-level client API (#583)
* wip
* wip
* all tests passing
* update documentation
* fix test
* fixup
* update documentation
* fixup in doc
* Update smart_open/s3.py
  Co-authored-by: Radim Řehůřek <radimrehurek@seznam.cz>
* update migration guide
* fixup
* fixup
* fixup
* fixup
* fixup
* fixup
* Update MIGRATING_FROM_OLDER_VERSIONS.rst
  Co-authored-by: Radim Řehůřek <radimrehurek@seznam.cz>
* more examples
* Update MIGRATING_FROM_OLDER_VERSIONS.rst
  Co-authored-by: Radim Řehůřek <radimrehurek@seznam.cz>
push time in 6 hours
PR merged RaRe-Technologies/smart_open
While functionally they are the same, the session/resource stuff is not safe to use across multiple threads and subprocesses, and the client stuff is. The former is a bit easier to use directly than the latter, but this does not impact the user, because smart_open is doing all the work.
I also refactored the way we pass keyword parameters to boto3. Previously, we had a separate parameter for each boto3 function. This was a pain, because this made parameter lists longer for each new function, e.g.
- resource_kwargs
- multipart_upload_kwargs
- singlepart_upload_kwargs
- object_kwargs
This PR moves all of the above parameters into a single nested dict and introduces a wrapper that transparently injects the parameters into the required function call.
This does break backwards compatibility, so will need a major version bump when releasing.
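A minimal sketch of the "single nested dict plus injecting wrapper" idea, assuming the dict is keyed by the name of the client function. `FakeClient`, `call_with_kwargs` and the key format are hypothetical names chosen for illustration; this is not smart_open's actual implementation.

```python
class FakeClient:
    """Hypothetical stand-in for an S3 client; it only records its arguments."""

    def get_object(self, **kwargs):
        return ("get_object", kwargs)

    def create_multipart_upload(self, **kwargs):
        return ("create_multipart_upload", kwargs)


def call_with_kwargs(client, method_name, client_kwargs, **required):
    """Call client.<method_name>, injecting user-supplied extras for that method.

    client_kwargs is one nested dict: {method_name: {parameter: value}}.
    Arguments the caller must control (here Bucket and Key) take precedence.
    """
    merged = dict(client_kwargs.get(method_name, {}))
    merged.update(required)
    return getattr(client, method_name)(**merged)


# One dict replaces the separate *_kwargs parameters.
client_kwargs = {
    "create_multipart_upload": {"ServerSideEncryption": "aws:kms"},
}

client = FakeClient()
upload = call_with_kwargs(
    client, "create_multipart_upload", client_kwargs, Bucket="bucket", Key="key"
)
download = call_with_kwargs(
    client, "get_object", client_kwargs, Bucket="bucket", Key="key"
)
```

With this shape, supporting extra parameters for another client function means adding a key to the dict rather than a new parameter to every signature.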
pr closed time in 6 hours
issue comment RaRe-Technologies/gensim
save_facebook_model() - AssertionError
I can't share the data to make the code reproducible because it is commercial data. I removed the paths because they have no influence on the code but added a name of the fasttext model that I am using. A column "text" in a csv file contains strings with tokens that are separated with ", ".
import gensim
import pandas as pd
from gensim.models.fasttext import load_facebook_model

train_text_df = pd.read_csv("text_train_data.csv", index_col=[0])
# Each "text" cell holds tokens separated by ", "; split them into lists.
sents = []
for text_val in train_text_df["text"].values:
    sents.append(text_val.split(", "))
# sents = [['lord', 'of', 'the', 'rings'], ['lord', 'of', 'the', 'flies']]
model = load_facebook_model("cc.de.300.bin")
model.build_vocab(sents, update=True)
model.train(sentences=sents, total_examples=len(sents), epochs=10)
gensim.models.fasttext.save_facebook_model(model, "cc.de.300.tuned.bin")
comment created time in 6 hours
pull request comment RaRe-Technologies/smart_open
Refactor S3, replace high-level resource/session API with low-level client API
Updated https://github.com/RaRe-Technologies/smart_open/blob/client/MIGRATING_FROM_OLDER_VERSIONS.rst, please have a look.
comment created time in 6 hours
issue comment RaRe-Technologies/gensim
save_facebook_model() - AssertionError
Thank you. What are the values of config.RESOURCE_PATH and config.MODEL_PATH?
Ideally, your reproducible example should be something that I (or another developer) can run here locally on my machine and get the same result as you do. We can then try to do something about the bug.
comment created time in 6 hours
push event RaRe-Technologies/smart_open
commit sha eaf8b385fa489a3ecd1c3e21017a2ce05e73e32a
more examples
push time in 6 hours
issue comment RaRe-Technologies/gensim
save_facebook_model() - AssertionError
Sure.
train_text_df = pd.read_csv(os.path.join("..", config.RESOURCE_PATH, "text_train_data.csv"), index_col=[0])
sents = []
for text_val in train_text_df["text"].values:
    sents.append(text_val.split(", "))
sent = [['lord', 'of', 'the', 'rings'], ['lord', 'of', 'the', 'semi-groups']]
model = load_facebook_model(os.path.join("..", config.MODEL_PATH))
model.build_vocab(sents, update=True)
model.train(sentences=sents, total_examples = len(sents), epochs=10)
gensim.models.fasttext.save_facebook_model(model, os.path.join("..", config.RESOURCE_PATH, "cc.de.300.tuned.bin"))
Followed the tutorial from https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.load_facebook_model
comment created time in 8 hours
pull request comment RaRe-Technologies/gensim
Remove Poincare and references to deleted WordRank model
Yup, that's why I closed this PR without merging it.
comment created time in 9 hours
Pull request review comment RaRe-Technologies/smart_open
Refactor S3, replace high-level resource/session API with low-level client API
+Migrating to the new client-based S3 API
+========================================
+
+Versions of smart_open prior to 5.0.0 used the boto3 `resource API`_ for communicating with S3.
+This API was easy to integrate for smart_open developers, but this came at a cost: it was not thread- or multiprocess-safe.
+Furthermore, as smart_open supported more and more options, the transport parameter list grew, making it less maintainable.
+Starting with version 5.0.0, smart_open uses the `client API`_ instead of the resource API.
+Functionally, very little changes for the smart_open user.
+The only difference is in passing transport parameters to the S3 backend.
+
+More specifically, the following S3 transport parameters are no longer supported:
+
+- `multipart_upload_kwargs`
+- `object_kwargs`
+- `resource`
+- `resource_kwargs`
+- `session`
+- `singlepart_upload_kwargs`
+
+If you weren't using the above parameters, nothing changes for you.
+However, if you were using any of the above, then you need to adjust your code.
+Here are some quick recipes below.
+
+If you were previously passing `session`, then construct an S3 client from the session and pass that instead.
+For example, before::
+
+    smart_open.open('s3://bucket/key', transport_params={'session': session})
+
+After::
+
+    smart_open.open('s3://bucket/key', transport_params={'client': session.client('s3')})
+
+
+If you were passing `resource`, then replace the resource with a client, and pass that instead.
+For example, before::
+
+    resource = session.resource('s3', **resource_kwargs)
+    smart_open.open('s3://bucket/key', transport_params={'resource': resource})
+
+After::
+
+    client = session.client('s3')
+    smart_open.open('s3://bucket/key', transport_params={'client': client})
+
+If you were passing any of the `*_kwargs` parameters, you will need to include them in `client_kwargs`, keeping in mind the following transformations.
+
+========================== ====================================== ==========================
+Parameter name             Resource API method                    Client API function
+========================== ====================================== ==========================
+`multipart_upload_kwargs`  `s3.Object.initiate_multipart_upload`_ `create_multipart_upload`_
+`object_kwargs`            `s3.Object.get`_                       `get_object`_
+`resource_kwargs`          ???                                    ???
This was still a work in progress; it has been fixed since.
comment created time in 9 hours
push event RaRe-Technologies/smart_open
commit sha 0dcfe1827cc9cf19a3e8f99ed28edb23cf4e0b90
Update MIGRATING_FROM_OLDER_VERSIONS.rst Co-authored-by: Radim Řehůřek <radimrehurek@seznam.cz>
push time in 9 hours
issue comment RaRe-Technologies/gensim
save_facebook_model() - AssertionError
Are you able to share code that reproduces your problem?
comment created time in 9 hours
started gethue/compose
started time in 9 hours
issue comment RaRe-Technologies/gensim
save_facebook_model() - AssertionError
@mpenkov 3.8.3
comment created time in 9 hours
started xeroxzen/genuine-fake
started time in 9 hours
PR opened RaRe-Technologies/gensim
One comment in LdaModel.inference() confused me:
# The optimal phi_{dwk} is proportional to expElogthetad_k * expElogbetad_w.
I think it should be modified as follows to be more accurate:
# The optimal phi_{dwk} is proportional to expElogthetad_k * expElogbetad_kw.
Reasons:
1. According to Algorithm 1 in the paper "Online Learning for Latent Dirichlet Allocation" (NIPS 2010, http://www.cs.princeton.edu/~mdhoffma), phi_dwk is proportional to exp(Elogtheta_dk + Elogbeta_kw).
2. phi_{dwk} is a scalar, expElogthetad_k is a scalar, and expElogbetad_w is a vector with K elements, so expElogthetad_k * expElogbetad_w will return a vector with K elements, which does not match the type of phi_{dwk}. If we use expElogthetad_k * expElogbetad_kw instead, in which expElogbetad_kw is a scalar, it matches.
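To make the shape argument concrete, here is a small pure-Python sketch of the update for one document d and one word w. The function name `phi_for_word` and the toy numbers are assumptions for illustration; this is not gensim's implementation, which operates on whole arrays at once.

```python
def phi_for_word(exp_elog_theta_d, exp_elog_beta_w):
    """Return [phi_dw0, ..., phi_dw(K-1)] for a fixed document d and word w.

    exp_elog_theta_d[k] plays the role of expElogthetad_k (one scalar per topic k).
    exp_elog_beta_w[k] plays the role of expElogbetad_kw (one scalar per topic k).
    """
    # Each unnormalized phi_dwk is a product of two scalars.
    unnormalized = [
        theta_k * beta_kw
        for theta_k, beta_kw in zip(exp_elog_theta_d, exp_elog_beta_w)
    ]
    # Normalize over topics so the K values sum to 1.
    total = sum(unnormalized)
    return [value / total for value in unnormalized]


# Toy values for K = 2 topics (hypothetical numbers).
phi = phi_for_word([0.5, 0.25], [0.2, 0.4])
```

Indexing by k on both factors is what keeps every phi_dwk a scalar; the vector of K values only appears once all topics are collected for the same d and w.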
pr created time in 11 hours
pull request comment RaRe-Technologies/gensim
Point WordEmbeddingSimilarityIndex documentation to gensim.similarities
Merged, thank you @Witiko !
comment created time in 13 hours
push event RaRe-Technologies/gensim
commit sha ddeeb1274c996b6992e11f7cbe7ca44c776f6daf
Point WordEmbeddingSimilarityIndex documentation to gensim.similarities (#3003)
* Remove deprecated init_sims from run_wmd.py
* Add missing cross-links between soft cosine measure classes
* Add normalized parameter to SoftCosineSimilarity
* Point WordEmbeddingSimilarityIndex documentation to gensim.similarities
* Add run_scm.py
* Redirect documentation of SCM and WMD to Gallery
* Run run_scm.py
* Add sphx_glr_run_scm_thumb.png
* Add sphx_glr_run_scm_001.png
* Use anonymous references in the documentation
push time in 13 hours
PR merged RaRe-Technologies/gensim
Currently, all documentation, Jupyter notebooks and code examples assume that WordEmbeddingSimilarityIndex is located in gensim.models, although it has lived in gensim.similarities since 4.0.0. This pull request fixes the documentation.
pr closed time in 13 hours
pull request comment RaRe-Technologies/gensim
Point WordEmbeddingSimilarityIndex documentation to gensim.similarities
I re-ran the tests; they all passed.
comment created time in 13 hours
issue comment RaRe-Technologies/gensim
Related PR: https://github.com/RaRe-Technologies/gensim/pull/2966
comment created time in 14 hours
push event RaRe-Technologies/smart_open
commit sha 3b2fa86bb3e85600396b7478b0ef32fbd0edb4d9
fixup
push time in 14 hours
push event RaRe-Technologies/smart_open
commit sha bc1876ee1ca9bd9e5121f744a762a203732d792e
fixup
push time in 14 hours
push event RaRe-Technologies/smart_open
commit sha 9cc714393386e5a2e83c3619c2ccb4c4e1b58ea1
fixup
push time in 14 hours