Radim Řehůřek piskvorky @RaRe-Technologies Geoje-do https://twitter.com/radimrehurek Creator of gensim, founder at @RaRe-Technologies. Lover of history and beginnings in general.

piskvorky/data_science_python 56

Source code for the "Practical Data Science in Python" tutorial

piskvorky/flann 13

Fast Library for Approximate Nearest Neighbors

piskvorky/PredatorPrey 8

Boid flocking model in OpenGL

piskvorky/pattern 6

Web mining module for Python

MacPython/gensim-wheels 1

Repository to build and test gensim wheels on OSX

piskvorky/annoy 1

Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk

piskvorky/askbot-devel 1

ASKBOT is a StackOverflow-like Q&A forum, based on CNPROG.

piskvorky/dexter 1

Dexter is a framework that implements some popular algorithms and provides all the tools needed to develop any entity linking technique.

issue closedRaRe-Technologies/sqlitedict

JSON encoder still using pickle/binary encoding

I'm using SqliteDict to have persistent storage of a dict. For debugging purposes, I want to save the dict as JSON/ASCII, so I can edit it manually.

import json
from sqlitedict import SqliteDict

mydict = SqliteDict('./my_db.sqlite', encode=json.dumps, decode=json.loads, autocommit=True)
mydict['key'] = 'value'

However, the resulting file is a binary (I assume pickle) file. Am I doing something wrong?

closed time in a day

skjerns

issue commentRaRe-Technologies/sqlitedict

JSON encoder still using pickle/binary encoding

Yeah that's what sqlitedict does. The file it syncs to is a SQLite database (binary).
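For illustration only (not part of the original thread): with encode=json.dumps, the individual values inside that binary SQLite file are stored as JSON text, so they can still be inspected with Python's standard sqlite3 module. The table name 'unnamed' below is sqlitedict's default and is an assumption here.

import sqlite3

con = sqlite3.connect('./my_db.sqlite')
# 'unnamed' is sqlitedict's default table name (an assumption; adjust if a custom tablename was used)
for key, value in con.execute('SELECT key, value FROM unnamed'):
    print(key, value)  # value is the JSON-encoded text, e.g. '"value"'
con.close()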

skjerns

comment created time in a day

pull request commentRaRe-Technologies/gensim

[MRG] *2Vec SaveLoad improvements

I think we should document expandos in code for the benefit of future developers and maintainers. So, add a multi-line code comment explaining what it is, what it does, and why it does things this way.

@mpenkov I improved the comments in 782f7ff9e0cdd53832f93300b712a19b95782608 . I didn't want to start a new PR for that, so it's a part of #2954 , where I'm fixing docs anyway.

piskvorky

comment created time in a day

push eventRaRe-Technologies/gensim

Radim Řehůřek

commit sha 782f7ff9e0cdd53832f93300b712a19b95782608

fix docs in KeyedVectors

view details

push time in a day

issue commentRaRe-Technologies/sqlitedict

JSON encoder still using pickle/binary encoding

The resulting file is a SQLite database: https://www.sqlite.org/fileformat.html

You can export all DB rows into plain JSON/ASCII:

with open("db_export.json", "w") as fout:
    json.dump(my_sqlite_dict.items(), fout)

… but then why do you need the SQLite database at all?

skjerns

comment created time in a day

pull request commentRaRe-Technologies/gensim

[WIP] New docs theme

@dvorakvaclav I committed fixes & changes to the page copy into https://github.com/RaRe-Technologies/gensim/tree/new_docs_theme. Please merge that into this PR first, before any further changes, to avoid git conflicts.

dvorakvaclav

comment created time in 2 days

create branch RaRe-Technologies/gensim

branch : new_docs_theme

created branch time in 2 days

issue commentRaRe-Technologies/gensim

Support streaming models split into multiple files from S3 / GCS

Reading directly would be nice – although mmap is impossible, and that's the main reason we use np.load().

JensMadsen

comment created time in 2 days

issue commentRaRe-Technologies/gensim

Support streaming models split into multiple files from S3 / GCS

No need for a new ticket at this point – we can always create one if it turns out to be a separate (pun intended) problem.

JensMadsen

comment created time in 3 days

issue commentRaRe-Technologies/gensim

Move BrownCorpus from word2vec to gensim.corpora

Yes – plus add aliases to the old locations, so code that relies on the original location continues to work. Make sure the files have proper headers and documentation.

Do we have unit tests for those classes? If not, could you add them too please? Thanks.
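As a rough sketch of the alias idea (hypothetical module path; a gensim.corpora.browncorpus module does not exist yet):

# gensim/models/word2vec.py -- after the class itself moves under gensim.corpora,
# keep an alias at the old import path so existing code keeps working
from gensim.corpora.browncorpus import BrownCorpus  # noqa: F401  (hypothetical target module)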

piskvorky

comment created time in 3 days

pull request commentRaRe-Technologies/gensim

[WIP] New docs theme

My review from Trello (sorry guys, Czech only):

  1. The social icons at the bottom are broken: the Facebook icon leads to GitHub, the GitHub icon is missing, and I don't know what the third icon is. Overall this section looks bad, the icons are placed oddly, with no relation to the page logic. Please "socialize" it better somehow.
  2. For the favicon, better to use the positive version (without the dark background)?
  3. In the "Scalable statistical semantics / Analyze plain-text documents for semantic structure / Retrieve semantically similar documents" section, as a visitor I don't know what I'm looking at. The columns aren't separated in any way, the items don't relate to anything, and they have no icon or image next to them. Looking at the original version, this was a "checklist" of Gensim's main benefits. Please move these points above the sample code (or next to it?), in place of the "Explore all features" button, which we don't need, and visualize them better somehow.
  4. Clicking the download links (e.g. Download Python source code: run_core_concepts.py or Download Jupyter notebook: run_core_concepts.ipynb in docs/auto_examples/core/run_core_concepts.html, and probably elsewhere too) should download the file, not open it in the browser as text.
  5. Chrome shows me an error when downloading one file, transport.png: Access to image from origin 'null' has been blocked by CORS policy: Cross origin requests are only supported for protocol schemes: http, data, chrome, chrome-extension, https.
  6. The page jumping around during the (automatic) scrolling of the "Who is using Gensim?" section is very unpleasant. How will we change (regenerate) the syntax-highlighted sample code at the top in the future? What is the process here?
  7. Downloading 3 MB for a simple static page is quite a lot. Any ideas how to optimize / speed up the page load? (apart from a CDN; we use CloudFlare)

Otherwise build generation works, tested locally 👍 I have to remember to re-apply 0b12d9dbfab21bb32d5be401a37f4a0380bb62ba (new commit in develop missing from this PR).

dvorakvaclav

comment created time in 4 days

pull request commentRaRe-Technologies/gensim

[WIP] Implement position-dependent weighting to fastText

I saw that graph and the difference between "0" and "50" on the X-axis seems to be 1h on the Y-axis.

Which doesn't match "at the same training time" – that's why I ask for a TL;DR clarification. What am I missing?

Witiko

comment created time in 4 days

pull request commentRaRe-Technologies/gensim

[WIP] Implement position-dependent weighting to fastText

5% extra accuracy for no extra training time sounds great. Any gotchas? Why would some app want to not use this?

Witiko

comment created time in 4 days

issue commentRaRe-Technologies/gensim

Move BrownCorpus from word2vec to gensim.corpora

Yes :)

We'll want to retain an alias in word2vec, so that people's code that relies on the current location continues to work. But the code itself should live under gensim.corpora. Same with other such corpus classes that are currently under gensim.models – I believe there were several.

Thanks!

piskvorky

comment created time in 4 days

issue commentRaRe-Technologies/gensim

Measure performance of gensim 4.0.0 vs previous versions

Some Word2vec measurements here: https://github.com/RaRe-Technologies/gensim/pull/2939#issuecomment-691729995

mpenkov

comment created time in 4 days

push eventRaRe-Technologies/gensim

Gordon Mohr

commit sha 30af57340a0ecfe79b41fd423ac4e30fb9eab6b7

Delete .gitattributes

view details

Gordon Mohr

commit sha 5c08d3efcbb2a76fb90c382c0897d43875f0660f

Merge remote-tracking branch 'upstream/develop' into develop

view details

Gordon Mohr

commit sha 3f7047f2d11c9caa99665e8e14c90de1fd7813aa

test showing FT failure as W2V

view details

Gordon Mohr

commit sha ac9126de4d5e848f80fdff03926a705ec340aad4

set .vectors even when ngrams off

view details

Gordon Mohr

commit sha 0316084be31975301f192142076708d0cf754026

use _save_specials/_load_specials per type

view details

Radim Řehůřek

commit sha d524fa41fd5a304f258874ffdf465eb245b2af69

Merge branch 'develop' into 2vec_saveload_fixes

view details

Radim Řehůřek

commit sha 49b35b718bff53d45ffab6884a38c7103eff6a13

docstirng fixes

view details

Radim Řehůřek

commit sha 3f972a6d4c51af725f71a95e3eea861a3fa7a343

get rid of python2 constructs

view details

Radim Řehůřek

commit sha 4331ccf0f0c8288003763806142d2f3711da7ff3

code style fixes while debugging pickle model sizes

view details

Radim Řehůřek

commit sha 34e77dc7771666b6cf085ca2c3bfc19087cb72ae

Merge branch 'pickle_perambulations' into 2vec_saveload_fixes

view details

Radim Řehůřek

commit sha 012d59813341997e71930dfd1a2e314a02a45775

py2 to 3: get rid of forgotten range

view details

Radim Řehůřek

commit sha eefe9ab87ae1b0206f0c7ff7adc296961dada6bc

fix docs

view details

Radim Řehůřek

commit sha 1a9b6466d69d8cd532209e3be3af0fad3f9ec172

get rid of numpy.str_

view details

Radim Řehůřek

commit sha c14456d2ca51f895e3202b6add531bce98952f1e

Merge remote-tracking branch 'origin/develop' into 2vec_saveload_fixes

view details

Radim Řehůřek

commit sha 06aef7510f8208b32f183c1c95597ef55865826d

fix index2entity, fix docs, hard-fail deprecated properties

view details

Radim Řehůřek

commit sha 5e215607636e868e27c867a140be3ba4f71a3890

fix typos + more doc fixes + fix failing tests

view details

Radim Řehůřek

commit sha 51cae68d188fabf3ba19d3dbcde32569104d8a96

more index2word => index_to_key fixes

view details

Radim Řehůřek

commit sha 17da21e4e3a5d73275ac32e501bf152500a49ee5

finish method renaming - add() => add_vectors() - add_one() => add_vector()

view details

Radim Řehůřek

commit sha f0cade1af9dc9177cd2b6d47cac9b19cf295965d

Update gensim/models/word2vec.py Co-authored-by: Michael Penkov <m@penkov.dev>

view details

Radim Řehůřek

commit sha 6fa5a1b464b22bbebea8389fffa3581a04a28ae7

a few more style fixes

view details

push time in 4 days

delete branch RaRe-Technologies/gensim

delete branch : 2vec_saveload_fixes

delete time in 4 days

PR merged RaRe-Technologies/gensim

[MRG] *2Vec SaveLoad improvements

Continued from #2892 (subsumes #2892):

  1. Move model serialization fixups in save() / load() into _save_specials()/_load_specials(), to better handle cases when one SaveLoad contains another.
  2. Code style & py3 migration fixes.
  3. Resolved mysterious bloat in the pickle file of pre-4.0-dev Word2Vec models.

This PR does not include additional changes around serialization:

  • ensure there's at least one full test that loads gensim-3.8.3 FastText/Word2Vec/Doc2Vec models, in case they aren't covered yet
  • double-check that mmap options are functionally tested across the *2Vec models, and fix anything that turns up
  • delete all tests that load no-longer-supported older models – ideally everything earlier than gensim-3.8.3 – and then potentially discard the support code that is no longer needed

These will come in separate PRs later. In particular, we have to decide what to do with compatibility for models trained and stored pre-3.8.3.
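For readers unfamiliar with the SaveLoad protocol, point 1 follows roughly this pattern (an illustrative sketch, not the actual diff; MyModel and some_new_attribute are made-up names):

from gensim import utils

class MyModel(utils.SaveLoad):
    def _load_specials(self, *args, **kwargs):
        # let SaveLoad (and any nested SaveLoad attributes) restore their own state first
        super(MyModel, self)._load_specials(*args, **kwargs)
        # then up-convert attributes that older saved versions of this class may lack
        if not hasattr(self, 'some_new_attribute'):
            self.some_new_attribute = None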

+515 -363

20 comments

21 changed files

piskvorky

pr closed time in 4 days

pull request commentRaRe-Technologies/gensim

remove gensim.summarization subpackage, docs and test data

Can you update https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4 with the motivation please?

That doc will be my starting point for further write-ups, clean ups, guides etc. Thanks!

Also, what about other modules? We talked about axing a bunch: HDP, wrappers etc.

mpenkov

comment created time in 4 days

Pull request review commentRaRe-Technologies/gensim

[MRG] *2Vec SaveLoad improvements

 def load(cls, *args, **kwargs):
             Save :class:`~gensim.models.fasttext.FastText` model.

         """
-        model = super(FastText, cls).load(*args, rethrow=True, **kwargs)
-
-        if not hasattr(model.wv, 'vectors_vocab_lockf') and hasattr(model.wv, 'vectors_vocab'):
-            # TODO: try trainables-location
-            model.wv.vectors_vocab_lockf = ones(1, dtype=REAL)
-        if not hasattr(model, 'vectors_ngrams_lockf') and hasattr(model.wv, 'vectors_ngrams'):
-            # TODO: try trainables-location
-            model.wv.vectors_ngrams_lockf = ones(1, dtype=REAL)
-        # fixup mistakenly overdimensioned gensim-3.x lockf arrays
-        if len(model.wv.vectors_vocab_lockf.shape) > 1:
-            model.wv.vectors_vocab_lockf = ones(1, dtype=REAL)
-        if len(model.wv.vectors_ngrams_lockf.shape) > 1:
-            model.wv.vectors_ngrams_lockf = ones(1, dtype=REAL)
-        if hasattr(model, 'bucket'):
-            del model.bucket  # should only exist in one place: the wv subcomponent
-        if not hasattr(model.wv, 'buckets_word') or not model.wv.buckets_word:
-            model.wv.recalc_char_ngram_buckets()
+        return super(FastText, cls).load(*args, rethrow=True, **kwargs)

Moved to a separate ticket, so this doesn't get lost: #2957.

piskvorky

comment created time in 4 days

PullRequestReviewEvent

issue openedRaRe-Technologies/gensim

Clean up OOP / stub methods

Do we really need such stub methods that only call the same superclass method with the same arguments? That's already the default which occurs if no method is present. By my understanding, doc-comment tools like Sphinx will, in their current versions, already propagate superclass API docs down to subclasses.

The only thing that's varying is the comment, and while it expresses a different expected-type from the superclass, in practice that doc may be misleading: I think (but have not recently checked) that these SaveLoad .load() methods can return objects that may not be what the caller expects. They return the class that's in the file, not the class-that-.load()-was-called-on.

If so, it might be a worthwhile short-term step as soon as 4.0.0 – to limit the risk of confusion & the need for redundant/caveat-filled docs – to deprecate the practice of calling SpecificClass.load(filename) entirely, despite its common appearance in previously-idiomatic gensim example code. Instead, either:

  1. call it only on class SaveLoad itself, to express that the only expectation for the returned type is that it's a SaveLoad subclass;
  2. promote load functionality to model-specific top-level functions in each relevant model – a bit more like the load_facebook_model() function for loading Facebook-FastText-native-format models – which might themselves do some type-checking, so any docs which imply they return a certain type are true;
  3. just make one generic load() or load_model() in utils.py, perhaps with an optional class-enforcement parameter, and encourage its use.

(For explicitness, I think I like this third option. In practice, it might appear in example code as:

from gensim.utils import load_model
from gensim.models import Word2Vec

w2v_model_we_hope = load_model('w2v.bin')
w2v_model_or_error = load_model('w2v.bin', expected_class=Word2Vec)

Plenty of code where the file is saved/loaded in the same example block, or under strong expectations & naming conventions, might skip the enforced-type-checking – but it'd be an option & true/explicit, rather than something that's implied-but-not-really-enforced in the current idiom Word2Vec.load('kv.bin'))
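A minimal sketch of what that hypothetical load_model() helper could look like (load_model and expected_class are not existing gensim APIs; only gensim.utils.SaveLoad.load() is real):

from gensim.utils import SaveLoad

def load_model(fname, expected_class=None):
    obj = SaveLoad.load(fname)  # returns whatever SaveLoad subclass was stored in the file
    if expected_class is not None and not isinstance(obj, expected_class):
        raise TypeError("expected a %s, but %r contains a %s"
                        % (expected_class.__name__, fname, type(obj).__name__))
    return obj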

Despite the effort involved in making such changes, they could minimize duplicated code/comments & avoid some unintuitive gotchas in the current SaveLoad approach. They might also help make a future migration to some more standard big-model-serialization convention (as proposed by #2848) cleaner.

Originally posted by @gojomo in https://github.com/RaRe-Technologies/gensim/pull/2939#discussion_r493807649

created time in 4 days

pull request commentRaRe-Technologies/gensim

[MRG] *2Vec SaveLoad improvements

My notes from our call with @mpenkov; FYI @gojomo:

Decisions

  1. Support loading of pre-3.8.3 models?
  • Misha: better: a stand-alone script to upgrade models to the latest version… rather than class methods doing the work on each load
  • Misha: also in favour of limited backward-compatibility loading (just 3.8.3)
  • Radim: in favour of keeping existing compatibility code; why remove it? pros/cons unclear => decision: talk to Gordon, make the costs (pros/cons) more explicit.
  2. itertools vs it?
  • prefer the full name unless critical to Gordon, but w/e.
  3. Remove subpar packages, contributed modules, wrappers?
    • yes; Misha to review & axe
    • what about NMF? semi-ready, but not much confidence, and its author no longer interested in finishing/maintaining.
      • options: 1. axe it; 2. invest effort & finish & fix; 3. do nothing, keep as-is => decision: keep as-is in 4.0.0, Radim looks at option (2) later, after 4.0.0

Actions

  • G: Finish the model serialization tickets: mmap, save_facebook bug
  • R: Finish all docs:
    • migration guide (wiki)
    • release notes
    • docstrings
    • migrate & re-run tutorials and guides
    • proof-read generated API docs for readability: we don't want stubs or obsolete/compatibility classes and methods like word2vec.Word2VecVocab, word2vec.Word2VecTrainables etc.
      • esp. with the new website style: correct syntax highlighting, headings, sections
    • internal code style: hanging indents, trailing commas, FIXME vs TODO, etc
  • M: Finish structure cleanup: remove subpackages and modules we don't want, incl. their docs

@mpenkov @gojomo did I forget something?

piskvorky

comment created time in 4 days

Pull request review commentRaRe-Technologies/gensim

[MRG] *2Vec SaveLoad improvements

 from collections import defaultdict, namedtuple
 from types import GeneratorType
 import threading
-import itertools
+import itertools as it

OK. Renamed to itertools for now, in da8847a0.

piskvorky

comment created time in 4 days

PullRequestReviewEvent

push eventRaRe-Technologies/gensim

Radim Řehůřek

commit sha da8847a04f9ee56702cb81a0218cd5a57e1f24e6

`it` => `itertools`, + code style fixes

view details

push time in 4 days

Pull request review commentRaRe-Technologies/gensim

[MRG] *2Vec SaveLoad improvements

 def load(cls, *args, rethrow=False, **kwargs):             if not isinstance(model, Word2Vec):                 rethrow = True                 raise AttributeError("Model of type %s can't be loaded by %s" % (type(model), str(cls)))-            # for backward compatibility-            if not hasattr(model, 'ns_exponent'):-                model.ns_exponent = 0.75-            if model.negative and hasattr(model.wv, 'index2word'):-                model.make_cum_table()  # rebuild cum_table from vocabulary  ## TODO: ???-            if not hasattr(model, 'corpus_count'):-                model.corpus_count = None-            if not hasattr(model, 'corpus_total_words'):-                model.corpus_total_words = None-            if not hasattr(model.wv, 'vectors_lockf') and hasattr(model.wv, 'vectors'):-                model.wv.vectors_lockf = getattr(model, 'vectors_lockf', np.ones(1, dtype=REAL))-            if not hasattr(model, 'random'):-                model.random = np.random.RandomState(model.seed)-            if not hasattr(model, 'train_count'):-                model.train_count = 0-                model.total_train_time = 0-            if not hasattr(model, 'epochs'):-                model.epochs = model.iter-                del model.iter-            if not hasattr(model, 'max_final_vocab'):-                model.max_final_vocab = None-            if hasattr(model, 'vocabulary'):  # re-integrate state that had been moved-                for a in ('max_vocab_size', 'min_count', 'sample', 'sorted_vocab', 'null_word', 'raw_vocab'):-                    setattr(model, a, getattr(model.vocabulary, a))-                del model.vocabulary-            if hasattr(model, 'trainables'):  # re-integrate state that had been moved-                for a in ('hashfxn', 'layer1_size', 'seed', 'syn1neg', 'syn1'):-                    if hasattr(model.trainables, a):-                        setattr(model, a, getattr(model.trainables, a))-                if hasattr(model, 'syn1'):-                    model.syn1 = model.syn1-                    del model.syn1-                del model.trainables             return model         except AttributeError as ae:             if rethrow:                 raise ae             logger.error(                 "Model load error. Was model saved using code from an older Gensim Version? 
"-                "Try loading older model using gensim-3.8.1, then re-saving, to restore "+                "Try loading older model using gensim-3.8.3, then re-saving, to restore "                 "compatibility with current code.")             raise ae +    def _load_specials(self, *args, **kwargs):+        """Handle special requirements of `.load()` protocol, usually up-converting older versions."""+        super(Word2Vec, self)._load_specials(*args, **kwargs)+        # for backward compatibility, add/rearrange properties from prior versions+        if not hasattr(self, 'ns_exponent'):+            self.ns_exponent = 0.75+        if self.negative and hasattr(self.wv, 'index_to_key'):+            self.make_cum_table()  # rebuild cum_table from vocabulary+        if not hasattr(self, 'corpus_count'):+            self.corpus_count = None+        if not hasattr(self, 'corpus_total_words'):+            self.corpus_total_words = None+        if not hasattr(self.wv, 'vectors_lockf') and hasattr(self.wv, 'vectors'):+            self.wv.vectors_lockf = np.ones(1, dtype=REAL)+        if not hasattr(self, 'random'):+            # use new instance of numpy's recommended generator/algorithm+            self.random = np.random.default_rng(seed=self.seed)+        if not hasattr(self, 'train_count'):+            self.train_count = 0+            self.total_train_time = 0+        if not hasattr(self, 'epochs'):+            self.epochs = self.iter+            del self.iter+        if not hasattr(self, 'max_final_vocab'):+            self.max_final_vocab = None+        if hasattr(self, 'vocabulary'):  # re-integrate state that had been moved+            for a in ('max_vocab_size', 'min_count', 'sample', 'sorted_vocab', 'null_word', 'raw_vocab'):+                setattr(self, a, getattr(self.vocabulary, a))+            del self.vocabulary+        if hasattr(self, 'trainables'):  # re-integrate state that had been moved+            for a in ('hashfxn', 'layer1_size', 'seed', 'syn1neg', 'syn1'):+                if hasattr(self.trainables, a):+                    setattr(self, a, getattr(self.trainables, a))+            if hasattr(self, 'syn1'):+                self.syn1 = self.syn1+                del self.syn1+            del self.trainables++    def get_latest_training_loss(self):+        """Get current value of the training loss.++        Returns+        -------+        float+            Current training loss.++        """+        return self.running_training_loss+  class BrownCorpus(object):

Created a ticket for this: https://github.com/RaRe-Technologies/gensim/issues/2956

piskvorky

comment created time in 4 days

PullRequestReviewEvent

issue openedRaRe-Technologies/gensim

Move BrownCorpus from word2vec to gensim.corpora

Not a high-priority at all, but it'd be more sensible for such a tutorial/testing utility corpus to be implemented elsewhere - maybe under /test/ or some other data- or doc- related module – rather than in gensim.models.word2vec.

Originally posted by @gojomo in https://github.com/RaRe-Technologies/gensim/pull/2939#discussion_r493820305

created time in 4 days

pull request commentRaRe-Technologies/gensim

[WIP] Implement position-dependent weighting to fastText

@piskvorky @gojomo Given the above, would it be acceptable if I initialized word/n-gram vectors with

I have no intuition on the impact of Gauss vs Uniform (vs Beta) initialization on word2vec performance, sorry.

More broadly – can you give a TL;DR of the user benefits please? Something pithy, to potentially go in the release notes / tweet.

You've done a tremendous amount of work @Witiko and I'm sure it'll be an interesting paper; I see two paths forward:

  1. This is better than the current algo, for the majority of common training scenarios. We merge and change Gensim's default to this.

  2. This improves a fringe use-case, or has serious caveats, so to be only used in "special circumstances".

    In this case the PR is better served as a standalone "contributed repo" extending Gensim; I'm -1 on merging yet another "maybe" switch. We're moving Gensim in the direction of fewer switches, a more compact core, less cognitive load.

Thanks!

Witiko

comment created time in 4 days

Pull request review commentRaRe-Technologies/gensim

[MRG] *2Vec SaveLoad improvements

 def load(cls, *args, **kwargs):
             Save :class:`~gensim.models.fasttext.FastText` model.

         """
-        model = super(FastText, cls).load(*args, rethrow=True, **kwargs)
-
-        if not hasattr(model.wv, 'vectors_vocab_lockf') and hasattr(model.wv, 'vectors_vocab'):
-            # TODO: try trainables-location
-            model.wv.vectors_vocab_lockf = ones(1, dtype=REAL)
-        if not hasattr(model, 'vectors_ngrams_lockf') and hasattr(model.wv, 'vectors_ngrams'):
-            # TODO: try trainables-location
-            model.wv.vectors_ngrams_lockf = ones(1, dtype=REAL)
-        # fixup mistakenly overdimensioned gensim-3.x lockf arrays
-        if len(model.wv.vectors_vocab_lockf.shape) > 1:
-            model.wv.vectors_vocab_lockf = ones(1, dtype=REAL)
-        if len(model.wv.vectors_ngrams_lockf.shape) > 1:
-            model.wv.vectors_ngrams_lockf = ones(1, dtype=REAL)
-        if hasattr(model, 'bucket'):
-            del model.bucket  # should only exist in one place: the wv subcomponent
-        if not hasattr(model.wv, 'buckets_word') or not model.wv.buckets_word:
-            model.wv.recalc_char_ngram_buckets()
+        return super(FastText, cls).load(*args, rethrow=True, **kwargs)

unintuitive gotchas in the current SaveLoad approach

What are they? I was under the impression save/load was straightforward. Were users complaining?

piskvorky

comment created time in 5 days

PullRequestReviewEvent

Pull request review commentRaRe-Technologies/gensim

[MRG] *2Vec SaveLoad improvements

 def load(cls, *args, rethrow=False, **kwargs):             if not isinstance(model, Word2Vec):                 rethrow = True                 raise AttributeError("Model of type %s can't be loaded by %s" % (type(model), str(cls)))-            # for backward compatibility-            if not hasattr(model, 'ns_exponent'):-                model.ns_exponent = 0.75-            if model.negative and hasattr(model.wv, 'index2word'):-                model.make_cum_table()  # rebuild cum_table from vocabulary  ## TODO: ???-            if not hasattr(model, 'corpus_count'):-                model.corpus_count = None-            if not hasattr(model, 'corpus_total_words'):-                model.corpus_total_words = None-            if not hasattr(model.wv, 'vectors_lockf') and hasattr(model.wv, 'vectors'):-                model.wv.vectors_lockf = getattr(model, 'vectors_lockf', np.ones(1, dtype=REAL))-            if not hasattr(model, 'random'):-                model.random = np.random.RandomState(model.seed)-            if not hasattr(model, 'train_count'):-                model.train_count = 0-                model.total_train_time = 0-            if not hasattr(model, 'epochs'):-                model.epochs = model.iter-                del model.iter-            if not hasattr(model, 'max_final_vocab'):-                model.max_final_vocab = None-            if hasattr(model, 'vocabulary'):  # re-integrate state that had been moved-                for a in ('max_vocab_size', 'min_count', 'sample', 'sorted_vocab', 'null_word', 'raw_vocab'):-                    setattr(model, a, getattr(model.vocabulary, a))-                del model.vocabulary-            if hasattr(model, 'trainables'):  # re-integrate state that had been moved-                for a in ('hashfxn', 'layer1_size', 'seed', 'syn1neg', 'syn1'):-                    if hasattr(model.trainables, a):-                        setattr(model, a, getattr(model.trainables, a))-                if hasattr(model, 'syn1'):-                    model.syn1 = model.syn1-                    del model.syn1-                del model.trainables             return model         except AttributeError as ae:             if rethrow:                 raise ae             logger.error(                 "Model load error. Was model saved using code from an older Gensim Version? 
"-                "Try loading older model using gensim-3.8.1, then re-saving, to restore "+                "Try loading older model using gensim-3.8.3, then re-saving, to restore "                 "compatibility with current code.")             raise ae +    def _load_specials(self, *args, **kwargs):+        """Handle special requirements of `.load()` protocol, usually up-converting older versions."""+        super(Word2Vec, self)._load_specials(*args, **kwargs)+        # for backward compatibility, add/rearrange properties from prior versions+        if not hasattr(self, 'ns_exponent'):+            self.ns_exponent = 0.75+        if self.negative and hasattr(self.wv, 'index_to_key'):+            self.make_cum_table()  # rebuild cum_table from vocabulary+        if not hasattr(self, 'corpus_count'):+            self.corpus_count = None+        if not hasattr(self, 'corpus_total_words'):+            self.corpus_total_words = None+        if not hasattr(self.wv, 'vectors_lockf') and hasattr(self.wv, 'vectors'):+            self.wv.vectors_lockf = np.ones(1, dtype=REAL)+        if not hasattr(self, 'random'):+            # use new instance of numpy's recommended generator/algorithm+            self.random = np.random.default_rng(seed=self.seed)+        if not hasattr(self, 'train_count'):+            self.train_count = 0+            self.total_train_time = 0+        if not hasattr(self, 'epochs'):+            self.epochs = self.iter+            del self.iter+        if not hasattr(self, 'max_final_vocab'):+            self.max_final_vocab = None+        if hasattr(self, 'vocabulary'):  # re-integrate state that had been moved+            for a in ('max_vocab_size', 'min_count', 'sample', 'sorted_vocab', 'null_word', 'raw_vocab'):+                setattr(self, a, getattr(self.vocabulary, a))+            del self.vocabulary+        if hasattr(self, 'trainables'):  # re-integrate state that had been moved+            for a in ('hashfxn', 'layer1_size', 'seed', 'syn1neg', 'syn1'):+                if hasattr(self.trainables, a):+                    setattr(self, a, getattr(self.trainables, a))+            if hasattr(self, 'syn1'):+                self.syn1 = self.syn1+                del self.syn1+            del self.trainables++    def get_latest_training_loss(self):+        """Get current value of the training loss.++        Returns+        -------+        float+            Current training loss.++        """+        return self.running_training_loss+  class BrownCorpus(object):

I agree. This belongs under gensim.corpora – not sure why it's here.

Although I vaguely recall some discussion around this, a long time ago… I wish the code had more motivating (as opposed to technical) comments!

piskvorky

comment created time in 5 days

PullRequestReviewEvent

Pull request review commentRaRe-Technologies/gensim

[MRG] *2Vec SaveLoad improvements

 def _load_word2vec_format(cls, fname, fvocab=None, binary=False, encoding='utf8'


 def load_word2vec_format(*args, **kwargs):
-    """Alias for `KeyedVectors.load_word2vec_format(...)`"""
+    """Alias for :meth:`~gensim.models.keyedvectors.KeyedVectors.load_word2vec_format`."""
     return KeyedVectors.load_word2vec_format(*args, **kwargs)


 def pseudorandom_weak_vector(size, seed_string=None, hashfxn=hash):
-    """Get a 'random' vector (but deterministically derived from seed_string if supplied).
+    """Get a random vector, derived deterministically from `seed_string` if supplied.
+
+    Useful for initializing KeyedVectors that will be the starting projection/input layers of _2Vec models.

OK. Let's merge this ASAP, I fear divergence. CC @mpenkov.

piskvorky

comment created time in 5 days

PullRequestReviewEvent

Pull request review commentRaRe-Technologies/gensim

[MRG] *2Vec SaveLoad improvements

 def load(cls, *args, **kwargs):
             Save :class:`~gensim.models.fasttext.FastText` model.

         """
-        model = super(FastText, cls).load(*args, rethrow=True, **kwargs)
-
-        if not hasattr(model.wv, 'vectors_vocab_lockf') and hasattr(model.wv, 'vectors_vocab'):
-            # TODO: try trainables-location
-            model.wv.vectors_vocab_lockf = ones(1, dtype=REAL)
-        if not hasattr(model, 'vectors_ngrams_lockf') and hasattr(model.wv, 'vectors_ngrams'):
-            # TODO: try trainables-location
-            model.wv.vectors_ngrams_lockf = ones(1, dtype=REAL)
-        # fixup mistakenly overdimensioned gensim-3.x lockf arrays
-        if len(model.wv.vectors_vocab_lockf.shape) > 1:
-            model.wv.vectors_vocab_lockf = ones(1, dtype=REAL)
-        if len(model.wv.vectors_ngrams_lockf.shape) > 1:
-            model.wv.vectors_ngrams_lockf = ones(1, dtype=REAL)
-        if hasattr(model, 'bucket'):
-            del model.bucket  # should only exist in one place: the wv subcomponent
-        if not hasattr(model.wv, 'buckets_word') or not model.wv.buckets_word:
-            model.wv.recalc_char_ngram_buckets()
+        return super(FastText, cls).load(*args, rethrow=True, **kwargs)

SpecificClass.load() is there to allow custom logic overrides, IIRC: to clean up backward compatibility in loaded objects etc. So, exactly your item (2), "promote load functionality to model-specific top-level functions in each relevant model"?

Note I didn't make any changes to this code in this PR; the stubs must be coming from elsewhere (older).

piskvorky

comment created time in 5 days

PullRequestReviewEvent

pull request commentRaRe-Technologies/gensim

New docs theme

I'm sure most of the files are not needed. @dvorakvaclav can you trim this PR down please? Let's review & commit & maintain only what we actually need.

dvorakvaclav

comment created time in 5 days

Pull request review commentRaRe-Technologies/gensim

[MRG] *2Vec SaveLoad improvements

 from collections import defaultdict, namedtuple
 from types import GeneratorType
 import threading
-import itertools
+import itertools as it

@gojomo ping. Unless this is vital for you I'll revert to spelling out itertools in full, I concur with @mpenkov.

piskvorky

comment created time in 5 days

PullRequestReviewEvent

pull request commentRaRe-Technologies/gensim

[MRG] *2Vec SaveLoad improvements

@mpenkov ping – please review & merge, so we can move on.

piskvorky

comment created time in 5 days

issue commentRaRe-Technologies/gensim

Support streaming models split into multiple files from S3 / GCS

@hilaryp yeah that looks weird. That code seems to have been added in this commit: https://github.com/RaRe-Technologies/gensim/commit/e08af7b9d91207da3db56e3e97e65f83dafb1498, something to do with loading Python2 models in Python3, https://github.com/RaRe-Technologies/gensim/wiki/Recipes-&-FAQ#q9-how-do-i-load-a-model-in-python-3-that-was-trained-and-saved-using-python-2

I'd consider it a bug though, hardwiring separately is not the way to do it.

Will you be able to take this up & open a PR with a fix?

JensMadsen

comment created time in 5 days

issue commentRaRe-Technologies/smart_open

Fix integration tests

N/M, I see the commit here: https://github.com/RaRe-Technologies/smart_open/commit/909930e9f30ee04a609fc0c8910389b963d7dba3

mpenkov

comment created time in 6 days

issue commentRaRe-Technologies/smart_open

Fix integration tests

@mpenkov what was the issue?

mpenkov

comment created time in 6 days

pull request commentRaRe-Technologies/gensim

[MRG] *2Vec SaveLoad improvements

Yes, object dtype should be safer wrt memory, since it stores just the references. OTOH we'll lose the type safety of "all array values are of the same type"… but I guess nobody expects that from Python anyway.

If you've added hard-fails for those aliases, which I think is alright, that's in some still-pending PR.

I added them to this PR: https://github.com/RaRe-Technologies/gensim/pull/2939/files#diff-741f69634ff2e1b2c4ce544e0d82a1e6R616

piskvorky

comment created time in 6 days

pull request commentRaRe-Technologies/gensim

[MRG] *2Vec SaveLoad improvements

the wins in generality/speed/memory-efficiency [of expandos] justify the one-time pain, in my opinion.

If expandos is backed by numpy arrays, won't it suffer from the same memory issues as numpy.str_? Numpy arrays are great for primitive types (ints, floats, etc), but not the best choice for arbitrary objects.
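A quick standalone illustration of that point (not from the PR): an object-dtype numpy array stores only references, so its elements stay ordinary Python objects.

import numpy as np

arr = np.array(["alpha", "beta", "gamma"], dtype=object)
print(arr.dtype)     # object
print(type(arr[0]))  # <class 'str'> -- plain Python strings, not numpy.str_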

piskvorky

comment created time in 7 days

pull request commentRaRe-Technologies/gensim

[MRG] *2Vec SaveLoad improvements

the latest changes retain aliases for backward-compatible access:

I actually changed this in recent commits, following:

Perhaps index2word and index2entity should also loud-error to trigger old-code-updates as with .vocab and .vectors_norm.

I was thinking whether to keep those in there or not… but with so many breaking changes, I also figured it is best to "rip off the bandaid" rather than introduce deprecations (again).

The biggest pain to any advanced API users is likely to be the removal of the .vocab dict,

Can you put the rationale into https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4 ? Even if we discussed it somewhere already, it'll be useful to have all the changes and their back-story in one place. Thanks!

piskvorky

comment created time in 7 days

pull request commentRaRe-Technologies/gensim

[MRG] *2Vec SaveLoad improvements

To be clear, I have no problem with vector_size and index_to_key.

The question is: does the naming purity outweigh the (massive) breaking of existing code and API that this will cause?

Checking "previous discussion" now.

piskvorky

comment created time in 7 days

push eventRaRe-Technologies/gensim

Radim Řehůřek

commit sha dc9c3fc79797f8186f933ecce0dea6b9a9345cbf

more doc fixes

view details

push time in 8 days

PullRequestReviewEvent

push eventRaRe-Technologies/gensim

Radim Řehůřek

commit sha e5057c12838cf0156b410cd32d35d1afc0f54a14

more doc fixes

view details

push time in 8 days

Pull request review commentRaRe-Technologies/gensim

Random state ldamulticore doc

 def __init__(self, corpus=None, num_topics=100, id2word=None, workers=None,
         minimum_probability : float, optional
             Topics with a probability lower than this threshold will be filtered out.
         random_state : {np.random.RandomState, int}, optional
-            Either a randomState object or a seed to generate one. Useful for reproducibility.
+            Either a randomState object or a seed to generate one. Useful for reproducibility. Note, that results can
            Either a randomState object or a seed to generate one. Useful for reproducibility.
            Note that results can still vary due to non-determinism in OS scheduling of the worker processes.
sezanzeb

comment created time in 8 days

PullRequestReviewEvent
PullRequestReviewEvent

pull request commentRaRe-Technologies/gensim

Random state ldamulticore doc

Yes, with multiprocessing (and multithreading), training results are no longer 100% reproducible. This is because of OS scheduling of processes (threads), which introduces non-determinism.

A docstring to say as much is welcome, thanks!
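For context, a minimal usage sketch (toy data; random_state pins the seed, but as noted above results may still vary slightly across runs because of worker scheduling):

from gensim.corpora import Dictionary
from gensim.models import LdaMulticore

texts = [["human", "computer", "interaction"], ["graph", "trees"], ["graph", "minors", "survey"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = LdaMulticore(corpus, id2word=dictionary, num_topics=2, workers=2, random_state=42)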

sezanzeb

comment created time in 8 days

push eventRaRe-Technologies/gensim

Radim Řehůřek

commit sha 5509f9e07f077fff8b8b227263b60d2ed1b5f1b9

more doc fixes

view details

push time in 8 days

push eventRaRe-Technologies/gensim

Radim Řehůřek

commit sha dab59ecea003797b52b7b1849166c9286e4571a9

more doc fixes

view details

push time in 8 days

push eventRaRe-Technologies/gensim

Radim Řehůřek

commit sha e95ac0ac15dc27c256b8cba5cc1ddb0c1b655f59

fix nonsensical word2vec path examples

view details

push time in 8 days

push eventRaRe-Technologies/gensim

Radim Řehůřek

commit sha 6fa5a1b464b22bbebea8389fffa3581a04a28ae7

a few more style fixes

view details

push time in 9 days

push eventRaRe-Technologies/gensim

Radim Řehůřek

commit sha d14c47c7136f877193f8444f0fff04a07b275970

a few more style fixes

view details

push time in 9 days

push eventRaRe-Technologies/gensim

Radim Řehůřek

commit sha f0cade1af9dc9177cd2b6d47cac9b19cf295965d

Update gensim/models/word2vec.py Co-authored-by: Michael Penkov <m@penkov.dev>

view details

push time in 9 days

push eventRaRe-Technologies/gensim

Radim Řehůřek

commit sha 17da21e4e3a5d73275ac32e501bf152500a49ee5

finish method renaming - add() => add_vectors() - add_one() => add_vector()

view details

push time in 9 days

push eventRaRe-Technologies/gensim

Radim Řehůřek

commit sha 51cae68d188fabf3ba19d3dbcde32569104d8a96

more index2word => index_to_key fixes

view details

push time in 9 days

push eventRaRe-Technologies/gensim

Radim Řehůřek

commit sha d4b9cc0ad209ee5edad2f67fcbb14b681d28120c

more index2word => index_to_key fixes

view details

push time in 9 days

pull request commentRaRe-Technologies/gensim

[MRG] *2Vec SaveLoad improvements

I'm still not sure about superficial breaking changes like size => vector_size or index2word => index_to_key.

Do we really have to annoy users with that? I'll add that to the list for "IRL discussion", probably best handled & decided live.

piskvorky

comment created time in 9 days

push eventRaRe-Technologies/gensim

Radim Řehůřek

commit sha 231b613214f1f0e7404412fbee241e1560c9f6b6

more index2word => index_to_key fixes

view details

push time in 9 days

Pull request review commentRaRe-Technologies/gensim

[MRG] *2Vec SaveLoad improvements

 from collections import defaultdict, namedtuple
 from types import GeneratorType
 import threading
-import itertools
+import itertools as it

I'm also in favour of explicit itertools.xyz, not a fan of 2-letter acronyms. Especially ones that coincide with common English words.

How strongly do you feel about this @gojomo ?

piskvorky

comment created time in 9 days

PullRequestReviewEvent

push eventRaRe-Technologies/gensim

Radim Řehůřek

commit sha 5e215607636e868e27c867a140be3ba4f71a3890

fix typos + more doc fixes + fix failing tests

view details

push time in 9 days

pull request commentRaRe-Technologies/gensim

[WIP] *2Vec SaveLoad improvements

@gojomo fixed in 06aef751. Please review – there's still one outstanding FIXME left in keyedvectors.py, I wasn't sure how you meant # FIXME: Update to better match & share code with most_similar().

piskvorky

comment created time in 9 days

push eventRaRe-Technologies/gensim

Radim Řehůřek

commit sha 06aef7510f8208b32f183c1c95597ef55865826d

fix index2entity, fix docs, hard-fail deprecated properties

view details

push time in 9 days

push eventRaRe-Technologies/gensim

Vít Novotný

commit sha 09b7e9485f94c5044e0a478e1d42d331ee0503f4

Fix deprecations in SoftCosineSimilarity (#2940)

* Remove deprecated Soft Cosine Measure parameters, functions, and tests. Here is a detailed list of the deprecations:
  - Parameter `positive_definite` of `SparseTermSimilarityMatrix` has been renamed to `dominant`. Test `test_positive_definite` has been removed.
  - Parameter `similarity_matrix` of `SoftCosineSimilarity` no longer accepts unencapsulated sparse matrices.
  - Parameter `normalized` of `SparseTermSimilarityMatrix.inner_product` no longer accepts booleans.
  - Function `matutils.softcossim` has been superseded by method `SparseTermSimilarityMatrix.inner_product`. Tests in `TestSoftCosineSimilarity` have been removed.
* Remove unused imports
* Fix additional warnings from the CI test suite
* Update CHANGELOG.md

Co-authored-by: Michael Penkov <m@penkov.dev>

view details

Paul O'Leary McCann

commit sha cddf3c1243a950fcaf8c4dc4abbf9ba0383b9405

Fix "generator" language in word2vec docs (#2935) * Fix docs about Word2Vec (fix #2934) Docs say you can use a generator as the first argument, but you can't. The tempfile path was also unused, so that's been removed. * Fix langauge to make it clear streaming is supported Technically a generator is a kind of iterator, so this clarifies that a restartable iterator (as opposed to a consumable generator) is necessary. * Update gensim/models/word2vec.py * Update CHANGELOG.md Co-authored-by: Michael Penkov <m@penkov.dev>

view details

Gordon Mohr

commit sha 08a61e5eb6395ca4f03fa0bee9b2845f983fc3de

Bump minimum Python version to 3.6 (#2947)

* remove claims of Python 3.5 support brings `setup.py` up to sync with #2713 & #2715 changes
* remove py2.7 and py3.5 from web index page
* Update CHANGELOG.md

Co-authored-by: Radim Řehůřek <radimrehurek@seznam.cz> Co-authored-by: Michael Penkov <m@penkov.dev>

view details

Radim Řehůřek

commit sha c14456d2ca51f895e3202b6add531bce98952f1e

Merge remote-tracking branch 'origin/develop' into 2vec_saveload_fixes

view details

push time in 9 days

delete branch RaRe-Technologies/gensim

delete branch : test-py35-in-2020-09

delete time in 11 days

issue commentRaRe-Technologies/smart_open

Support GCS files without credentials

Hm, maybe we can move this info / recipes on various connectors (S3, GC, HTTPS…) into separate Wiki pages? And link to them from the README.

Because I'm worried the README is becoming unwieldy, CC @mpenkov. Also it's showing a red "build failing" badge ATM.

ppwwyyxx

comment created time in 11 days

issue commentRaRe-Technologies/smart_open

Support GCS files without credentials

Thanks! That makes sense, I wasn't aware of that use case.

Can you open a PR with a fix?

ppwwyyxx

comment created time in 11 days

Pull request review commentRaRe-Technologies/gensim

[WIP] save_facebook_model after update-vocab

 def _raw_word_count(self, job):         """         return sum(len(sentence) for sentence in job) -    def _check_training_sanity(self, epochs=None, total_examples=None, total_words=None, **kwargs):+    def _check_corpus_sanity(self, corpus_iterable=None, corpus_file=None, passes=1):+        """Checks whether the corpus parameters make sense."""+        if corpus_file is None and corpus_iterable is None:+            raise TypeError("Either one of corpus_file or corpus_iterable value must be provided")+        if corpus_file is not None and corpus_iterable is not None:+            raise TypeError("Both corpus_file and corpus_iterable must not be provided at the same time")+        if corpus_iterable is None and not os.path.isfile(corpus_file):+            raise TypeError("Parameter corpus_file must be a valid path to a file, got %r instead" % corpus_file)+        if corpus_iterable is not None and not isinstance(corpus_iterable, Iterable):+            raise TypeError("The corpus_iterable must be an iterable of list, got %r instead" % corpus_iterable)+        if corpus_iterable is not None and isinstance(corpus_iterable, GeneratorType) and passes > 1:+            raise TypeError(+                f"Using a generator as corpus_iterable can't support {passes} passes. Try a re-iterable sequence.")

I checked setup.py + official docs and saw Py3.5+. I simply forgot about those old CI PRs, don't read too much into it.

I hope to clean up the overall docs structure a bit, to go with 4.0.0: an explicit split into "for users" / "for contributors" (incl. this release policy, code style, docs style…); deduplicate content from wiki vs sphinx vs github README; better content on the new main page (website index).

gojomo

comment created time in 12 days

PullRequestReviewEvent

push eventRaRe-Technologies/gensim

Radim Řehůřek

commit sha 0b12d9dbfab21bb32d5be401a37f4a0380bb62ba

remove py2.7 and py3.5 from web index page

view details

push time in 12 days

issue openedRaRe-Technologies/gensim

test_lda_callback.py test stalls

I've seen multiple PRs fail (red) with the same problem in the log: https://travis-ci.org/github/RaRe-Technologies/gensim/jobs/727593115

(screenshot attached: Screen Shot 2020-09-16 at 13 49 59)

Restarting the build helps: the re-run succeeds.

This only seems to affect Travis; I never saw this in our Azure (windows) tests.

created time in 12 days

issue commentRaRe-Technologies/gensim

RFC: New website design

I'd like to keep several versions (releases) in the navigation. Not sure how to do it with Sphinx, but I saw it in other projects, so it should be possible.

All good points (except shrinking fonts), I'll pass it on to the web devs. Thanks!

piskvorky

comment created time in 12 days

Pull request review commentRaRe-Technologies/gensim

[WIP] save_facebook_model after update-vocab

 def _raw_word_count(self, job):         """         return sum(len(sentence) for sentence in job) -    def _check_training_sanity(self, epochs=None, total_examples=None, total_words=None, **kwargs):+    def _check_corpus_sanity(self, corpus_iterable=None, corpus_file=None, passes=1):+        """Checks whether the corpus parameters make sense."""+        if corpus_file is None and corpus_iterable is None:+            raise TypeError("Either one of corpus_file or corpus_iterable value must be provided")+        if corpus_file is not None and corpus_iterable is not None:+            raise TypeError("Both corpus_file and corpus_iterable must not be provided at the same time")+        if corpus_iterable is None and not os.path.isfile(corpus_file):+            raise TypeError("Parameter corpus_file must be a valid path to a file, got %r instead" % corpus_file)+        if corpus_iterable is not None and not isinstance(corpus_iterable, Iterable):+            raise TypeError("The corpus_iterable must be an iterable of list, got %r instead" % corpus_iterable)+        if corpus_iterable is not None and isinstance(corpus_iterable, GeneratorType) and passes > 1:+            raise TypeError(+                f"Using a generator as corpus_iterable can't support {passes} passes. Try a re-iterable sequence.")

4.0.0 will be the first release to NOT support Python 2.7 (never mind Python 3.5).

What we do with Py3 support going forward is up to us. Not supporting EOLed Pythons makes sense to me (with docs + setup.py updated to match).

I don't think we're making much use of any new Python 3.x features either – the switch is more externally-enforced. Once the "no more Python 2" river has been crossed, the version makes little difference in practice IMO beyond sysops.

gojomo

comment created time in 13 days

PullRequestReviewEvent

Pull request review commentRaRe-Technologies/gensim

[WIP] save_facebook_model after update-vocab

 def _raw_word_count(self, job):         """         return sum(len(sentence) for sentence in job) -    def _check_training_sanity(self, epochs=None, total_examples=None, total_words=None, **kwargs):+    def _check_corpus_sanity(self, corpus_iterable=None, corpus_file=None, passes=1):+        """Checks whether the corpus parameters make sense."""+        if corpus_file is None and corpus_iterable is None:+            raise TypeError("Either one of corpus_file or corpus_iterable value must be provided")+        if corpus_file is not None and corpus_iterable is not None:+            raise TypeError("Both corpus_file and corpus_iterable must not be provided at the same time")+        if corpus_iterable is None and not os.path.isfile(corpus_file):+            raise TypeError("Parameter corpus_file must be a valid path to a file, got %r instead" % corpus_file)+        if corpus_iterable is not None and not isinstance(corpus_iterable, Iterable):+            raise TypeError("The corpus_iterable must be an iterable of list, got %r instead" % corpus_iterable)+        if corpus_iterable is not None and isinstance(corpus_iterable, GeneratorType) and passes > 1:+            raise TypeError(+                f"Using a generator as corpus_iterable can't support {passes} passes. Try a re-iterable sequence.")

I'm very much for dropping 3.5, yes (f-strings being one of the main reasons).

That hasn't happened yet though; @mpenkov WDYT?

gojomo

comment created time in 13 days

PullRequestReviewEvent

Pull request review commentRaRe-Technologies/gensim

Fix doc entry for corpus_iterable in train

 def train(self, corpus_iterable=None, corpus_file=None, total_examples=None, tot

         Parameters
         ----------
-        sentences : iterable of list of str, optional
-            The `sentences` iterable can be simply a list of lists of tokens, but for larger corpora,
+        corpus_iterable : iterable of list of str, optional
+            The `corpus_iterable` iterable can be simply a list of lists of tokens, but for larger corpora,
            The `corpus_iterable` iterable can be simply a list of lists of tokens, but for larger corpora, to save RAM,
svenski

comment created time in 13 days

PullRequestReviewEvent
PullRequestReviewEvent

issue commentRaRe-Technologies/gensim

RFC: New website design

@mpenkov @gojomo close-to-finished draft: https://stage.friendlystudio.cz/gensim-doc

Any comments, suggestions? Now is the right time :)

Things that will still change:

  • Update logo + fix the tooltip on logo hover.
  • Fix the "main image" / code sample on top
  • Fix a few images + typo.

Otherwise the site structure, layout and content are complete. The switch should not require any changes to our deployment process: it's still static HTML generated by Sphinx.

piskvorky

comment created time in 13 days

Pull request review commentRaRe-Technologies/gensim

[WIP] save_facebook_model after update-vocab

 def _raw_word_count(self, job):         """         return sum(len(sentence) for sentence in job) -    def _check_training_sanity(self, epochs=None, total_examples=None, total_words=None, **kwargs):+    def _check_corpus_sanity(self, corpus_iterable=None, corpus_file=None, passes=1):+        """Checks whether the corpus parameters make sense."""+        if corpus_file is None and corpus_iterable is None:+            raise TypeError("Either one of corpus_file or corpus_iterable value must be provided")+        if corpus_file is not None and corpus_iterable is not None:+            raise TypeError("Both corpus_file and corpus_iterable must not be provided at the same time")+        if corpus_iterable is None and not os.path.isfile(corpus_file):+            raise TypeError("Parameter corpus_file must be a valid path to a file, got %r instead" % corpus_file)+        if corpus_iterable is not None and not isinstance(corpus_iterable, Iterable):+            raise TypeError("The corpus_iterable must be an iterable of list, got %r instead" % corpus_iterable)+        if corpus_iterable is not None and isinstance(corpus_iterable, GeneratorType) and passes > 1:+            raise TypeError(+                f"Using a generator as corpus_iterable can't support {passes} passes. Try a re-iterable sequence.")

F-strings are great but require py3.6+. IIRC Gensim supports 3.5+.

gojomo

comment created time in 13 days

Pull request review commentRaRe-Technologies/gensim

[WIP] save_facebook_model after update-vocab

 def _raw_word_count(self, job):         """         return sum(len(sentence) for sentence in job) -    def _check_training_sanity(self, epochs=None, total_examples=None, total_words=None, **kwargs):+    def _check_corpus_sanity(self, corpus_iterable=None, corpus_file=None, passes=1):+        """Checks whether the corpus parameters make sense."""+        if corpus_file is None and corpus_iterable is None:+            raise TypeError("Either one of corpus_file or corpus_iterable value must be provided")+        if corpus_file is not None and corpus_iterable is not None:+            raise TypeError("Both corpus_file and corpus_iterable must not be provided at the same time")+        if corpus_iterable is None and not os.path.isfile(corpus_file):+            raise TypeError("Parameter corpus_file must be a valid path to a file, got %r instead" % corpus_file)+        if corpus_iterable is not None and not isinstance(corpus_iterable, Iterable):+            raise TypeError("The corpus_iterable must be an iterable of list, got %r instead" % corpus_iterable)+        if corpus_iterable is not None and isinstance(corpus_iterable, GeneratorType) and passes > 1:+            raise TypeError(+                f"Using a generator as corpus_iterable can't support {passes} passes. Try a re-iterable sequence.")++    def _check_training_sanity(self, epochs=0, total_examples=None, total_words=None, **kwargs):         """Checks whether the training parameters make sense. -        Called right before training starts in :meth:`~gensim.models.base_any2vec.BaseWordEmbeddingsModel.train`-        and raises warning or errors depending on the severity of the issue in case an inconsistent parameter-        combination is detected.-         Parameters         -----------        epochs : int, optional-            Number of training epochs. Must have a (non None) value.+        epochs : int+            Number of training epochs. Must have a positive value to pass check.
            Number of training epochs. A positive integer.
gojomo

comment created time in 13 days

Pull request review commentRaRe-Technologies/gensim

[WIP] save_facebook_model after update-vocab

 def _raw_word_count(self, job):         """         return sum(len(sentence) for sentence in job) -    def _check_training_sanity(self, epochs=None, total_examples=None, total_words=None, **kwargs):+    def _check_corpus_sanity(self, corpus_iterable=None, corpus_file=None, passes=1):+        """Checks whether the corpus parameters make sense."""+        if corpus_file is None and corpus_iterable is None:+            raise TypeError("Either one of corpus_file or corpus_iterable value must be provided")+        if corpus_file is not None and corpus_iterable is not None:+            raise TypeError("Both corpus_file and corpus_iterable must not be provided at the same time")+        if corpus_iterable is None and not os.path.isfile(corpus_file):+            raise TypeError("Parameter corpus_file must be a valid path to a file, got %r instead" % corpus_file)+        if corpus_iterable is not None and not isinstance(corpus_iterable, Iterable):+            raise TypeError("The corpus_iterable must be an iterable of list, got %r instead" % corpus_iterable)
            raise TypeError("The corpus_iterable must be an iterable of lists of strings, got %r instead" % corpus_iterable)
gojomo

comment created time in 13 days

PullRequestReviewEvent
PullRequestReviewEvent

pull request commentRaRe-Technologies/gensim

[MRG] *2Vec SaveLoad improvements

OK I think I found it. Have to run now, will clean up the pop & index2word etc later.

piskvorky

comment created time in 14 days

push eventRaRe-Technologies/gensim

Radim Řehůřek

commit sha eefe9ab87ae1b0206f0c7ff7adc296961dada6bc

fix docs

view details

Radim Řehůřek

commit sha 1a9b6466d69d8cd532209e3be3af0fad3f9ec172

get rid of numpy.str_

view details

push time in 14 days

pull request commentRaRe-Technologies/gensim

[MRG] *2Vec SaveLoad improvements

This is the core of it:

In [88]: type(model.wv.index_to_key[0])
Out[88]: numpy.str_

I'll check where "numpy strings" creep in.

piskvorky

comment created time in 14 days

pull request commentRaRe-Technologies/gensim

[MRG] *2Vec SaveLoad improvements

Yes, I can replicate on text8:

3.8.3

27034172 bytes in pickle (after training on a 66515395 raw words (49739954 effective words) took 96.0s, 518211 effective words/s).

4.0.0

68765086 bytes in pickle (after training on a 66515395 raw words (49737330 effective words) took 79.3s, 627379 effective words/s).

Timings are for #2887; trained with word2vec.Word2Vec(word2vec.LineSentence('/tmp/enwik8'), size=100, window=5, min_count=2, workers=8).

piskvorky

comment created time in 15 days

pull request commentRaRe-Technologies/gensim

[MRG] *2Vec SaveLoad improvements

OK, I'll double-check tomorrow.

piskvorky

comment created time in 15 days

Pull request review commentRaRe-Technologies/gensim

[MRG] *2Vec SaveLoad improvements

 from collections import defaultdict, namedtuple
 from types import GeneratorType
 import threading
-import itertools
+import itertools as it

I agree. See the original PR for a discussion with @gojomo – he prefers it because it's more idiomatic.

piskvorky

comment created time in 15 days
