If you are wondering where the data of this site comes from, please visit https://api.github.com/users/ddangelov/events. GitMemory does not store any data; it only uses NGINX to cache data for a period of time. The idea behind GitMemory is simply to give users a better reading experience.
Dimo Angelov ddangelov Ottawa, Canada Data Scientist

ddangelov/Top2Vec 1285

Top2Vec learns jointly embedded topic, document and word vectors.

ddangelov/RESTful-Top2Vec 39

Expose a Top2Vec model with a REST API.

issue comment ddangelov/Top2Vec

Not working for "distiluse-base-multilingual-cased" model

Try passing the following to the model: hdbscan_args = {'min_cluster_size': 10, 'metric': 'euclidean', 'cluster_selection_method': 'eom'}
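A sketch of how those settings could be collected and handed to the model; the Top2Vec call is left commented because it needs the package installed and a corpus, and the hdbscan_args parameter name follows the suggestion above:

```python
# HDBSCAN settings suggested above; the keys match the hdbscan library's
# HDBSCAN() constructor arguments.
hdbscan_args = {
    "min_cluster_size": 10,             # smallest group treated as a cluster
    "metric": "euclidean",              # distance metric in the reduced space
    "cluster_selection_method": "eom",  # "excess of mass" selection
}

# Hypothetical usage (requires top2vec and a list of documents):
# from top2vec import Top2Vec
# model = Top2Vec(documents,
#                 embedding_model="distiluse-base-multilingual-cased",
#                 hdbscan_args=hdbscan_args)
```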

Yashsethi24

comment created time in a month

issue closed ddangelov/Top2Vec

2 cluster trained when small number of data entry

I used your Yahoo example from the readme.md but applied a filter so that I only use documents with 100 or fewer spaces " " (i.e. a way to select short docs):

data = [doc for doc in data if len(doc.split(" ")) <= 100]

Around 10,000 docs were left, and only 2 clusters were found.
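For reference, the filter can be reproduced with plain Python; the sample documents below are illustrative stand-ins for the Yahoo data:

```python
# Keep only documents with 100 or fewer whitespace-separated tokens.
data = [
    "short review about a phone",
    " ".join(["word"] * 150),   # a long document that should be filtered out
    "another short document",
]

filtered = [doc for doc in data if len(doc.split(" ")) <= 100]
# The 150-token document is dropped; two short documents remain.
```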

closed time in a month

humblemat810

issue comment ddangelov/Top2Vec

2 cluster trained when small number of data entry

Using doc2vec as the embedding model for small datasets may be suboptimal. Try using the universal-sentence-encoder.

humblemat810

comment created time in a month

issue comment ddangelov/Top2Vec

Multiple downloads of different versions of ```tensorflow``` while performing ```pip install top2vec[sentence_encoders]```

I would just run:

pip install torch
pip install torch sentence_transformers

Either way this sounds like an environment issue.

harshgeek4coder

comment created time in a month

issue closed ddangelov/Top2Vec

Multiple downloads of different versions of ```tensorflow``` while performing ```pip install top2vec[sentence_encoders]```

I was trying to explore top2vec and to install the universal sentence encoder. While running pip install top2vec[sentence_encoders], I observed that multiple versions of TensorFlow were downloaded during the installation.

I tried this on Google Colab. Here is the log of the installation:

Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (from top2vec[sentence_encoders]) (1.1.5)
Requirement already satisfied: gensim<4.0.0 in /usr/local/lib/python3.7/dist-packages (from top2vec[sentence_encoders]) (3.6.0)
Requirement already satisfied: hdbscan>=0.8.27 in /usr/local/lib/python3.7/dist-packages (from top2vec[sentence_encoders]) (0.8.27)
Requirement already satisfied: umap-learn>=0.5.1 in /usr/local/lib/python3.7/dist-packages (from top2vec[sentence_encoders]) (0.5.1)
Requirement already satisfied: numpy>=1.20.0 in /usr/local/lib/python3.7/dist-packages (from top2vec[sentence_encoders]) (1.21.2)
Requirement already satisfied: wordcloud in /usr/local/lib/python3.7/dist-packages (from top2vec[sentence_encoders]) (1.5.0)
Requirement already satisfied: tensorflow in /usr/local/lib/python3.7/dist-packages (from top2vec[sentence_encoders]) (2.6.0)
Requirement already satisfied: tensorflow-text in /usr/local/lib/python3.7/dist-packages (from top2vec[sentence_encoders]) (2.6.0)
Requirement already satisfied: tensorflow-hub in /usr/local/lib/python3.7/dist-packages (from top2vec[sentence_encoders]) (0.12.0)
Requirement already satisfied: scipy>=0.18.1 in /usr/local/lib/python3.7/dist-packages (from gensim<4.0.0->top2vec[sentence_encoders]) (1.4.1)
Requirement already satisfied: six>=1.5.0 in /usr/local/lib/python3.7/dist-packages (from gensim<4.0.0->top2vec[sentence_encoders]) (1.15.0)
Requirement already satisfied: smart-open>=1.2.1 in /usr/local/lib/python3.7/dist-packages (from gensim<4.0.0->top2vec[sentence_encoders]) (5.1.0)
Requirement already satisfied: joblib>=1.0 in /usr/local/lib/python3.7/dist-packages (from hdbscan>=0.8.27->top2vec[sentence_encoders]) (1.0.1)
Requirement already satisfied: cython>=0.27 in /usr/local/lib/python3.7/dist-packages (from hdbscan>=0.8.27->top2vec[sentence_encoders]) (0.29.24)
Requirement already satisfied: scikit-learn>=0.20 in /usr/local/lib/python3.7/dist-packages (from hdbscan>=0.8.27->top2vec[sentence_encoders]) (0.22.2.post1)
Requirement already satisfied: pynndescent>=0.5 in /usr/local/lib/python3.7/dist-packages (from umap-learn>=0.5.1->top2vec[sentence_encoders]) (0.5.4)
Requirement already satisfied: numba>=0.49 in /usr/local/lib/python3.7/dist-packages (from umap-learn>=0.5.1->top2vec[sentence_encoders]) (0.51.2)
Requirement already satisfied: setuptools in /usr/local/lib/python3.7/dist-packages (from numba>=0.49->umap-learn>=0.5.1->top2vec[sentence_encoders]) (57.4.0)
Requirement already satisfied: llvmlite<0.35,>=0.34.0.dev0 in /usr/local/lib/python3.7/dist-packages (from numba>=0.49->umap-learn>=0.5.1->top2vec[sentence_encoders]) (0.34.0)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas->top2vec[sentence_encoders]) (2018.9)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas->top2vec[sentence_encoders]) (2.8.2)
Requirement already satisfied: tensorflow-estimator~=2.6 in /usr/local/lib/python3.7/dist-packages (from tensorflow->top2vec[sentence_encoders]) (2.6.0)
Requirement already satisfied: typing-extensions~=3.7.4 in /usr/local/lib/python3.7/dist-packages (from tensorflow->top2vec[sentence_encoders]) (3.7.4.3)
Requirement already satisfied: gast==0.4.0 in /usr/local/lib/python3.7/dist-packages (from tensorflow->top2vec[sentence_encoders]) (0.4.0)
Requirement already satisfied: termcolor~=1.1.0 in /usr/local/lib/python3.7/dist-packages (from tensorflow->top2vec[sentence_encoders]) (1.1.0)
Collecting tensorflow
  Using cached tensorflow-2.5.1-cp37-cp37m-manylinux2010_x86_64.whl (454.4 MB)
  Using cached tensorflow-2.5.0-cp37-cp37m-manylinux2010_x86_64.whl (454.3 MB)
  Using cached tensorflow-2.4.3-cp37-cp37m-manylinux2010_x86_64.whl (394.5 MB)
Collecting tensorflow-estimator<2.5.0,>=2.4.0
  Using cached tensorflow_estimator-2.4.0-py2.py3-none-any.whl (462 kB)
Collecting grpcio~=1.32.0
  Downloading grpcio-1.32.0-cp37-cp37m-manylinux2014_x86_64.whl (3.8 MB)
     |████████████████████████████████| 3.8 MB 7.9 MB/s 
Requirement already satisfied: protobuf>=3.9.2 in /usr/local/lib/python3.7/dist-packages (from tensorflow->top2vec[sentence_encoders]) (3.17.3)
Requirement already satisfied: astunparse~=1.6.3 in /usr/local/lib/python3.7/dist-packages (from tensorflow->top2vec[sentence_encoders]) (1.6.3)
Requirement already satisfied: absl-py~=0.10 in /usr/local/lib/python3.7/dist-packages (from tensorflow->top2vec[sentence_encoders]) (0.12.0)
Requirement already satisfied: tensorboard~=2.4 in /usr/local/lib/python3.7/dist-packages (from tensorflow->top2vec[sentence_encoders]) (2.6.0)
Requirement already satisfied: keras-preprocessing~=1.1.2 in /usr/local/lib/python3.7/dist-packages (from tensorflow->top2vec[sentence_encoders]) (1.1.2)
Requirement already satisfied: opt-einsum~=3.3.0 in /usr/local/lib/python3.7/dist-packages (from tensorflow->top2vec[sentence_encoders]) (3.3.0)
Requirement already satisfied: wheel~=0.35 in /usr/local/lib/python3.7/dist-packages (from tensorflow->top2vec[sentence_encoders]) (0.37.0)
Requirement already satisfied: wrapt~=1.12.1 in /usr/local/lib/python3.7/dist-packages (from tensorflow->top2vec[sentence_encoders]) (1.12.1)
Requirement already satisfied: google-pasta~=0.2 in /usr/local/lib/python3.7/dist-packages (from tensorflow->top2vec[sentence_encoders]) (0.2.0)
Collecting gast==0.3.3
  Using cached gast-0.3.3-py2.py3-none-any.whl (9.7 kB)
Collecting tensorflow
  Using cached tensorflow-2.4.2-cp37-cp37m-manylinux2010_x86_64.whl (394.5 MB)
  Using cached tensorflow-2.4.1-cp37-cp37m-manylinux2010_x86_64.whl (394.3 MB)
  Using cached tensorflow-2.4.0-cp37-cp37m-manylinux2010_x86_64.whl (394.7 MB)
  Downloading tensorflow-2.3.4-cp37-cp37m-manylinux2010_x86_64.whl (320.6 MB)
     |████████████████████████████████| 320.6 MB 27 kB/s 
Collecting tensorflow-estimator<2.4.0,>=2.3.0
  Downloading tensorflow_estimator-2.3.0-py2.py3-none-any.whl (459 kB)
     |████████████████████████████████| 459 kB 73.3 MB/s 
Requirement already satisfied: grpcio>=1.8.6 in /usr/local/lib/python3.7/dist-packages (from tensorflow->top2vec[sentence_encoders]) (1.39.0)
Collecting tensorflow
  Downloading tensorflow-2.3.3-cp37-cp37m-manylinux2010_x86_64.whl (320.5 MB)
     |████████████████▌               | 164.8 MB 1.3 MB/s eta 0:02:03
ERROR: Exception:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/pip/_vendor/urllib3/response.py", line 438, in _error_catcher
    yield
  File "/usr/local/lib/python3.7/dist-packages/pip/_vendor/urllib3/response.py", line 519, in read
    data = self._fp.read(amt) if not fp_closed else b""
  File "/usr/local/lib/python3.7/dist-packages/pip/_vendor/cachecontrol/filewrapper.py", line 62, in read
    data = self.__fp.read(amt)
  File "/usr/lib/python3.7/http/client.py", line 465, in read
    n = self.readinto(b)
  File "/usr/lib/python3.7/http/client.py", line 509, in readinto
    n = self.fp.readinto(b)
  File "/usr/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.7/ssl.py", line 1071, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.7/ssl.py", line 929, in read
    return self._sslobj.read(len, buffer)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/cli/base_command.py", line 180, in _main
    status = self.run(options, args)
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/cli/req_command.py", line 199, in wrapper
    return func(self, options, args)
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/commands/install.py", line 319, in run
    reqs, check_supported_wheels=not options.target_dir
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/resolution/resolvelib/resolver.py", line 128, in resolve
    requirements, max_rounds=try_to_avoid_resolution_too_deep
  File "/usr/local/lib/python3.7/dist-packages/pip/_vendor/resolvelib/resolvers.py", line 473, in resolve
    state = resolution.resolve(requirements, max_rounds=max_rounds)
  File "/usr/local/lib/python3.7/dist-packages/pip/_vendor/resolvelib/resolvers.py", line 367, in resolve
    failure_causes = self._attempt_to_pin_criterion(name)
  File "/usr/local/lib/python3.7/dist-packages/pip/_vendor/resolvelib/resolvers.py", line 211, in _attempt_to_pin_criterion
    for candidate in criterion.candidates:
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 129, in <genexpr>
    return (c for c in iterator if id(c) not in self._incompatible_ids)
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 54, in _iter_built_with_prepended
    candidate = func()
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/resolution/resolvelib/factory.py", line 205, in _make_candidate_from_link
    version=version,
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/resolution/resolvelib/candidates.py", line 312, in __init__
    version=version,
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/resolution/resolvelib/candidates.py", line 151, in __init__
    self.dist = self._prepare()
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/resolution/resolvelib/candidates.py", line 234, in _prepare
    dist = self._prepare_distribution()
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/resolution/resolvelib/candidates.py", line 318, in _prepare_distribution
    self._ireq, parallel_builds=True
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/operations/prepare.py", line 508, in prepare_linked_requirement
    return self._prepare_linked_requirement(req, parallel_builds)
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/operations/prepare.py", line 552, in _prepare_linked_requirement
    self.download_dir, hashes
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/operations/prepare.py", line 243, in unpack_url
    hashes=hashes,
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/operations/prepare.py", line 102, in get_http_url
    from_path, content_type = download(link, temp_dir.path)
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/network/download.py", line 157, in __call__
    for chunk in chunks:
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/cli/progress_bars.py", line 152, in iter
    for x in it:
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/network/utils.py", line 86, in response_chunks
    decode_content=False,
  File "/usr/local/lib/python3.7/dist-packages/pip/_vendor/urllib3/response.py", line 576, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/usr/local/lib/python3.7/dist-packages/pip/_vendor/urllib3/response.py", line 541, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/usr/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.7/dist-packages/pip/_vendor/urllib3/response.py", line 455, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
pip._vendor.urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

Although restarting the runtime helped to create an instance of the top2vec model with a universal sentence encoder, this error seemed to repeat every time I tried to perform a pip install.

closed time in a month

harshgeek4coder

issue closed ddangelov/Top2Vec

How do we update Top2Vec model with more data later?

How do we update Top2Vec model with more data later? Is that possible here?

closed time in a month

saivignanm

issue comment ddangelov/Top2Vec

How do we update Top2Vec model with more data later?

Using the add_documents function.

saivignanm

comment created time in a month

push event ddangelov/Top2Vec

Dimo Angelov

commit sha f7d00feb260af328e95c09c8a378f329f1026e16

query_documents and query_topics fix

view details

push time in a month

issue closed ddangelov/Top2Vec

Can we use GPU for faster training on large data using top2vec

Hi! I have custom data and wanted to use this package to train top2vec on it. Training takes 6+ hours with deep-learn, and 35 minutes with fast-learn on just a fraction of my dataset. I want to train on all of it; can you suggest anything to speed up the training process? I have an NVIDIA GPU with 11 GB of memory. Is there any way to use this resource?

closed time in a month

Itsneuralnet

issue comment ddangelov/Top2Vec

Cannot use list of strings as an input to the model : TypeError: Cannot convert list to numpy.ndarray

I have tested this with the exact same top2vec and gensim versions and I cannot recreate it.

vedpd

comment created time in a month

issue closed ddangelov/Top2Vec

Cannot use list of strings as an input to the model : TypeError: Cannot convert list to numpy.ndarray

Tried reading one of the datasets with a column consisting of text data. Obtained a list of strings using the textual column from the dataframe.

On running the model, it throws the error:
TypeError: Cannot convert list to numpy.ndarray

Operations done :

  1. info_df = pd.read_csv('/content/info.csv')
  2. info_df.head()
  3. input_str_lst = df.Reviews.tolist()
  4. type(input_str_lst)  # output: list
  5. model = Top2Vec(documents=input_str_lst)

Attached are screenshots
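As a rough illustration (it is an assumption that Top2Vec converts its input exactly this way internally), a plain list of strings normally converts to a NumPy array without error on compatible numpy versions, which is consistent with the maintainer being unable to reproduce the problem:

```python
import numpy as np

# A plain list of strings, as obtained from a DataFrame column's .tolist().
documents = ["first review", "second review", "third review"]

# With compatible numpy/top2vec versions this conversion succeeds; the
# TypeError in the title points at a version mismatch, not the list itself.
arr = np.array(documents, dtype=object)
```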

closed time in a month

vedpd

issue comment ddangelov/Top2Vec

Is there a way to infer only the new corpus?

query_topics should do what you are looking for; use the documents as the queries. Unless you want to create new topics based on the 200 documents, in which case: what embedding_model are you using?

IanYu-GBI

comment created time in a month

issue closed ddangelov/Top2Vec

What are the model performance metrics for top2vec model?

For the LDA model there are performance metrics such as perplexity and log-likelihood. Similarly, what are the performance metrics for a top2vec model, and how can they be fetched from it?

closed time in a month

ervivek

issue comment ddangelov/Top2Vec

What are the model performance metrics for top2vec model?

see #158

ervivek

comment created time in a month

issue closed ddangelov/Top2Vec

Keywords from top2vec model are not representative of related documents

I am working with Amazon reviews data. The topic model was created as follows: model = Top2Vec(documents=df1, speed="learn", workers=2)

Sharing a topic generated by this model: [image]

Following are the top 20 documents corresponding to the above topic. The keywords in the topic are not found in the documents. Can you please suggest corrective actions?

Document: 2733, Score: 0.9990834593772888 bad screen open second press power bottun fingerprint sensor work properly

Document: 8982, Score: 0.9987523555755615 wifi connect automaticallyit search luck restart mobile connect wifii think major issuethe application load slowly

Document: 7159, Score: 0.9987006783485413 hey defective product hour fully charge fully charge drain fast rate min btry face heating issue yesterday onwards phone start switch automatically help think defective product download update help

Document: 4494, Score: 0.9983952641487122 look hardware software version use lenovo note model bad software lack basic functionality theme basic feature find samsung cast option work defect late operating system change handle mobile note note upgrade mobile latest nogut version 711 help future version oreio available diwali 2018 effort lenovo bad quality product lenovo samsung

Document: 270, Score: 0.9983184337615967 device getting heat application usage min

Document: 8283, Score: 0.9981307983398438 pros1 fantastic display colour reproduction2 design build quality3 average secondary camera selfie flash4 good ram management5 heating issue feel till datecons1 mediocre primary camera2 dedicated music button stop work unexpectedly day

Document: 4286, Score: 0.998047947883606 cost worthy good battery backup camera quality good day time poor night major disadvantage notification lead glow 3rd party application whatsapp messanger hike know issue till product release fix find sad product

Document: 8293, Score: 0.9978888630867004 buy lenovo venom black model day start use satisfied deliver product face issue device gets lock automatically unable unlock use power key new product expect quality substance case request look issue device provide solution

Document: 8204, Score: 0.9978846907615662 pretty good phone lot featured issue till good phone fingerprint sensor backside

Document: 1256, Score: 0.99785977602005 google app respond message againi phone todayand face issue

Document: 4887, Score: 0.9978034496307373 recording feature headphone provide previous lenova phone provide headphone

Document: 2752, Score: 0.9977614283561707 regular customer amazon lenovo mobile previous lenovo mobile fine problem mobile disppointment audio poor sense virtually unable hear people speak eventhough volume maximum think lenovo amazon ditch replacement throw away original packing invoice confident mobile good early lenovo mobile

Document: 6755, Score: 0.9977318644523621 bad mobile phone low battery backup2 recordings3 network problem4 picture finger print sensor keys5 set ringtone device song6 anable edit image tool etc

Document: 9286, Score: 0.9976276755332947 lenovo need maintain standard key structure upgrade find soft key traingular locate left locate right easy comfortable right hander

Document: 559, Score: 0.9976266622543335 write review month use phone phone good build qualitysupereb megapixel camera quit powerful flash performance good play lot game lag gameingcons talk con product bokeh mode blur effect good average battey backup bit good phone heavy usage charge evening

Document: 4976, Score: 0.9975117444992065 volte work battery saver mode advertise phone true sim use sim sim network signal strength good compare redmi note nexus zenfone 2dolby atmos stop work install music player gets switch charge use charge night find gets switch day instal battery temp monitor software hot ~40 deg play game charging remain hot normal use compare redmi note nexus zenfone use booster charger time charge phonefinally return product easy return product amazon hardware fault technician visit need

Document: 3509, Score: 0.9974973797798157 phone goodbut emi 666it 702 taxi end pay 18000tax emi tax upar air taxlike 680 emi tax itand interest let 30rs tax end pay 100 buck actually showso watch

Document: 540, Score: 0.9974951148033142 great phone stock android smooth phone use till date build quality descent rear camera perform good phone heavy user 4000 mah battery good day work look phone rounder range note great option thing leave sound quality good dolby atmo ram varient good ram varient issue completely fine

Document: 9603, Score: 0.9974249601364136 lenovo note awesome good picture dual camera slim metallic body bit difficult hold hand cover feature awesome android battery bit concern regular user

Document: 1817, Score: 0.9974241852760315 bad quality screenmobile fall height screen brokencustomer care lenovo jabalpur respond regard replacement screen pay service basis deposit 4200 screen replacement september 2017 today 26092017 spare screen change casual approach service teambad experience

closed time in a month

ervivek

issue comment ddangelov/Top2Vec

Keywords from top2vec model are not representative of related documents

If you are using embedding_model='doc2vec', then the quality of the document and word embeddings will depend on the size and quality of your dataset. I would recommend trying embedding_model='universal-sentence-encoder'.

ervivek

comment created time in a month

issue closed ddangelov/Top2Vec

Top2Vec fails when a single topic is generated

I'm running Top2Vec in an automated fashion on several custom corpuses. For some, Top2Vec generates a single topic, which causes an error. For a reproducible example, run:

Top2Vec(['test' for _ in range(51)], embedding_model='universal-sentence-encoder')

The loop is required as Top2Vec requires >50 inputs.

Note that I consider this a bug as I still want to use Top2Vec to get the keywords from the topic, even if there is only one.

The stack trace produced is:

2021-07-29 11:44:52,437 - top2vec - INFO - Pre-processing documents for training
2021-07-29 11:44:52,440 - top2vec - INFO - Downloading universal-sentence-encoder model
2021-07-29 11:44:56,486 - top2vec - INFO - Creating joint document/word embedding
2021-07-29 11:44:56,666 - top2vec - INFO - Creating lower dimension embedding of documents
2021-07-29 11:44:58,501 - top2vec - INFO - Finding dense areas of documents
2021-07-29 11:44:58,502 - top2vec - INFO - Finding topics

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jack/.pyenv/versions/3.9.5/lib/python3.9/site-packages/top2vec/Top2Vec.py", line 376, in __init__
    self._create_topic_vectors(cluster.labels_)
  File "/home/jack/.pyenv/versions/3.9.5/lib/python3.9/site-packages/top2vec/Top2Vec.py", line 589, in _create_topic_vectors
    np.vstack([self._get_document_vectors(norm=False)[np.where(cluster_labels == label)[0]]
  File "<__array_function__ internals>", line 5, in vstack
  File "/home/jack/.pyenv/versions/3.9.5/lib/python3.9/site-packages/numpy/core/shape_base.py", line 282, in vstack
    return _nx.concatenate(arrs, 0)
  File "<__array_function__ internals>", line 5, in concatenate
ValueError: need at least one array to concatenate

closed time in a month

will-jac

issue comment ddangelov/Top2Vec

Top2Vec fails when a single topic is generated

This usually occurs when your corpus is too small. Try increasing the corpus size and adjusting min_count (e.g. min_count=2).

will-jac

comment created time in a month

issue comment ddangelov/Top2Vec

RuntimeError: you must first build vocabulary before training the model

This usually occurs when your corpus is too small. Try increasing the corpus size and adjusting min_count (e.g. min_count=2).

Evan-wyl

comment created time in a month

issue closed ddangelov/Top2Vec

ValueError: numpy.ndarray size changed

I'm running Top2Vec in Databricks and this error appears:

ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

Code:

from top2vec import Top2Vec
model = Top2Vec(messages_lst, embedding_model='distiluse-base-multilingual-cased')

Environment: python 3.8.8, numpy==1.20.0 (1.21.x also doesn't work), scipy==1.5.2, hdbscan==0.8.27

This seems to be related to hdbscan, but running !pip install hdbscan --no-cache-dir --no-binary :all: --no-build-isolation (and the same for numpy), as suggested in other issues, doesn't solve the problem.

closed time in a month

edogab33

issue comment ddangelov/Top2Vec

ValueError: numpy.ndarray size changed

This sounds like an environment issue.

edogab33

comment created time in a month

issue comment ddangelov/Top2Vec

Getting ValueError while searching a topic for a given text (in string format)

Thank you for pointing this out. I have fixed this issue and top2vec version 1.0.27 will reflect those changes.

devsinghh

comment created time in a month

issue closed ddangelov/Top2Vec

Error while training the very large text files

Hello ddangelov, I have very large text files and want to generate topics from them. I tried to use Top2Vec as the topic modelling model, but I am facing an error and getting no topics. Could you suggest how I can use my large text files with Top2Vec? Many thanks.

closed time in a month

sandeshchand

issue comment ddangelov/Top2Vec

Error while training the very large text files

I am sorry but this is not enough detail and I cannot help you.

sandeshchand

comment created time in a month

issue closed ddangelov/Top2Vec

Training Doc2Vec with Corpus File Results in Permission Error on Windows

Hello,

When implementing top2vec under Windows 10 using embedding_model='doc2vec' and corpus_file=True, the script successfully generates the temporary file in the top2vec.py script, starting on line 290:

if use_corpus_file:
     processed = [' '.join(tokenizer(doc)) for doc in documents]
     lines = "\n".join(processed)
     temp = tempfile.NamedTemporaryFile(mode='w+t')
     temp.write(lines)
     doc2vec_args["corpus_file"] = temp.name

however, when trying to access the file later in Gensim's doc2vec.py script on lines 1313-1315

with utils.open(self.source, 'rb') as fin:
     for item_no, line in enumerate(fin):
          yield TaggedDocument(utils.to_unicode(line).split(), [item_no])

the call to utils.open(self.source, 'rb') gives the error

PermissionError: [Errno 13] Permission denied: 'C:\\Users\\MCLEME~1\\AppData\\Local\\Temp\\tmp_njo370f'

This issue appears to be related to

https://github.com/deepchem/deepchem/issues/707#issue-246546717

in that Windows has trouble reopening a file that is still held open with the default delete=True option of NamedTemporaryFile().

A workaround is to just modify the source code and set delete=False, however, this will cause errors for Unix/Linux users. Is there a way to build in a fix for this that is platform agnostic?
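One platform-agnostic pattern, sketched here with the standard library only (the function name is illustrative, not from the top2vec source): create the file with delete=False, close it before the reader reopens it by name, and remove it explicitly when done. Windows cannot reopen a NamedTemporaryFile that is still held open with delete=True, which is the failure described above:

```python
import os
import tempfile

def write_corpus_file(lines):
    """Write lines to a temp file and return its path (caller must delete)."""
    # delete=False so the file can be reopened by name, including on Windows.
    tmp = tempfile.NamedTemporaryFile(mode="w+t", delete=False, suffix=".txt")
    try:
        tmp.write("\n".join(lines))
    finally:
        tmp.close()  # close before anything else reopens the file by name
    return tmp.name

path = write_corpus_file(["doc one tokens", "doc two tokens"])
try:
    with open(path) as fin:          # reopening by name now works everywhere
        n_docs = sum(1 for _ in fin)
finally:
    os.remove(path)                  # explicit cleanup replaces delete=True
```

The trade-off is that cleanup is now the caller's responsibility, so the removal belongs in a finally block (or a context manager) to avoid leaking temp files on error.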

closed time in a month

MarkWClements

issue comment ddangelov/Top2Vec

Training Doc2Vec with Corpus File Results in Permission Error on Windows

I would just set use_corpus_file=False; it won't be significantly slower.

MarkWClements

comment created time in a month

issue comment ddangelov/Top2Vec

Not working for "distiluse-base-multilingual-cased" model

I just tried using the distiluse-base-multilingual-cased model with top2vec version 1.0.26 and it works as expected. Can you be more specific about what doesn't work?

Yashsethi24

comment created time in a month

issue closed ddangelov/Top2Vec

Pre-trained Model subtitle misleading [documentation of readme.md]

Intuition leads us to think that the section is about loading a pre-trained model as described in the Top2Vec paper; however, it actually means using pre-trained models in Top2Vec's underlying embedding components.

Also, it would be nice to have a way to load the model from the paper, to validate/verify the promising results.

closed time in a month

humblemat810

issue comment ddangelov/Top2Vec

Pre-trained Model subtitle misleading [documentation of readme.md]

Added a clarification that these are pre-trained embedding models.

humblemat810

comment created time in a month