If you are wondering where the data on this site comes from, please visit https://api.github.com/users/ruiEnca/events. GitMemory does not store any data; it only uses NGINX to cache data for a period of time. The idea behind GitMemory is simply to give users a better reading experience.
Rui Encarnação (ruiEnca) · Machine Learning Researcher @ CISUC - University of Coimbra · Guarda, Portugal · https://www.cisuc.uc.pt/people/show/race

ruiEnca/conda

OS-agnostic, system-level binary package manager and ecosystem

issue opened chartbeat-labs/textacy

scikit-learn > 0.24

scikit-learn 0.24.1 was released a while ago.

https://github.com/chartbeat-labs/textacy/blob/cdedd2351bf2a56e8773ec162e08c3188809d486/setup.cfg#L51

The Nix package is already at 0.24.1, and this breaks the build.

Do you have any plans to allow scikit-learn > 0.24?
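For anyone hitting the same build break, the relevant pin is in the setup.cfg linked above; relaxing it would look something like the fragment below (the exact bounds here are hypothetical, just to illustrate the shape of the change, not the project's actual decision):

```ini
[options]
install_requires =
    # hypothetical relaxed pin allowing scikit-learn 0.24.x
    scikit-learn>=0.19.0,<0.25.0
```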

Thanks

created time in 6 hours

issue opened chartbeat-labs/textacy

what are the `extract.matches` patterns analogous to `constants.POS_REGEX_PATTERNS` ?

what's wrong?


textacy.constants has POS_REGEX_PATTERNS:

textacy.constants.POS_REGEX_PATTERNS['en']['NP'] 

'<DET>? <NUM>* (<ADJ> <PUNCT>? <CONJ>?)* (<NOUN>|<PROPN> <PART>?)+'

What is the equivalent pattern with the newer textacy.extract.matches pattern format?

Should it be in the constants file?

relevant page or section

https://github.com/chartbeat-labs/textacy/blob/cdedd2351bf2a56e8773ec162e08c3188809d486/src/textacy/constants.py#L86
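For what it's worth, one rough translation into the dict-based, spaCy-Matcher-style token patterns that `extract.matches` accepts might look like the sketch below. This is my own approximation, not confirmed against textacy's docs: the old regex's grouping and alternation (e.g. the parenthesized `ADJ PUNCT? CONJ?` group and the trailing `PART?`) cannot be expressed one-to-one, so this is a simplification.

```python
# Approximate translation of the old NP regex
# '<DET>? <NUM>* (<ADJ> <PUNCT>? <CONJ>?)* (<NOUN>|<PROPN> <PART>?)+'
# into spaCy-Matcher-style token patterns (simplified, not exact).
np_pattern = [
    {"POS": "DET", "OP": "?"},                      # <DET>?
    {"POS": "NUM", "OP": "*"},                      # <NUM>*
    {"POS": "ADJ", "OP": "*"},                      # simplified (ADJ ...)* group
    {"POS": {"IN": ["NOUN", "PROPN"]}, "OP": "+"},  # simplified (<NOUN>|<PROPN>)+
]

# Usage would presumably look roughly like:
#   matches = textacy.extract.matches(doc, [np_pattern])
```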

created time in 6 days

issue opened chartbeat-labs/textacy

subject_verb_object_triples works on sample text, but not a more complicated example

I have tried a sample text that I have seen in this repo to extract SVO triples:

text_str = u'Startup companies create jobs and support innovation. Hilary supports entrepreneurship.'
text = nlp(text_str)
text_ext = textacy.extract.subject_verb_object_triples(text)
list(text_ext)

This runs just fine for me and produces the expected result:

[(companies, create, jobs), (Hilary, supports, entrepreneurship)]

However, I then try to do something a bit more complicated:

text_ada = (u'Ada Lovelace was an English mathematician and' 
            ' writer, chiefly known for her work on'
            ' mechanical general-purpose computer, the'
            ' Analytical Engine. She was the first to'
            ' recognise that the machine had applications'
            ' beyond pure calculation, and published the'
            ' first algorithm intended to be carried out' 
            ' by such a machine. As a result, she is'
            ' sometimes regarded as the first to recognise'
            ' the full potential of a computing machine and'
            ' one of the first computer programmers.')

So I run through the same procedure:

ada = nlp(text_ada)
text_ext = textacy.extract.subject_verb_object_triples(ada)
list(text_ext)

but I get an empty list.

Any thoughts?

created time in 13 days

PR opened chartbeat-labs/textacy

Fix Spanish excerpt

Just nitpicking: as a native speaker, I think you meant to write the second sentence, but I'm not completely sure.

  • (Old) Sí, ¡se habla español! / Yes, Spanish is spoken!
  • (New) Sí, ¡sé hablar español! / Yes, I can speak Spanish!

By the way: awesome project!

+1 -1

0 comment

1 changed file

pr created time in 20 days

issue opened chartbeat-labs/textacy

spaCy 3.0 support?

Hi textacy team,

First, thanks for the amazing library.

Any plans to support the new spaCy 3.0 release?

Thanks

created time in a month

issue comment chartbeat-labs/textacy

IndexError: [E201] Span index out of range

Hi @bdewilde, the issue is closed, but I am not sure whether it actually resolved the problem. Using "text_with_ws" extracts only the acronym itself, without its full wording. Am I missing something? Could you let me know if there is a workaround to get it right? Thanks in advance!

mzeidhassan

comment created time in a month

issue comment chartbeat-labs/textacy

IndexError: [E201] Span index out of range

Hi all,

As @mzeidhassan mentions, this could indeed be related to a bug in spaCy where calling span.text on an empty Span would result in an IndexError.

There are a few ways of dealing with this:

  • Ensure you never define empty spans if you can
  • Use self.text_with_ws as a workaround instead
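A minimal illustration of the second workaround: guard on span length and read `text_with_ws` instead of `.text`. A tiny stub stands in for `spacy.tokens.Span` here so the idea is runnable without a model; the stub and the `safe_span_text` helper are my own hypothetical names, not part of spaCy or textacy.

```python
class StubSpan:
    """Stand-in for spacy.tokens.Span (illustration only)."""
    def __init__(self, pieces):
        self._pieces = pieces  # token texts, each with trailing whitespace
    def __len__(self):
        return len(self._pieces)
    @property
    def text_with_ws(self):
        return "".join(self._pieces)

def safe_span_text(span):
    # Empty spans triggered IndexError [E201] on .text in affected spaCy
    # versions, so avoid touching .text and fall back to an empty string.
    return span.text_with_ws.strip() if len(span) > 0 else ""

print(safe_span_text(StubSpan(["General ", "Motors "])))  # General Motors
print(repr(safe_span_text(StubSpan([]))))                 # ''
```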
mzeidhassan

comment created time in a month

started mpenkov/balbes

started time in a month

issue comment chartbeat-labs/textacy

IndexError: [E201] Span index out of range

I also would like to confirm that everything was working just fine with spaCy 2.3.2

mzeidhassan

comment created time in 2 months

Pull request review comment chartbeat-labs/textacy

Support prepositional objects in SVO IE

 def test_get_subjects_of_verb(spacy_doc):
 def test_get_objects_of_verb(spacy_doc):
-    expected = [[], ["Python"], ["incompatibilities"], [], ["God"]]
+    expected = [[], ["Python"], ["incompatibilities"], ["sake"], ["God", "Overflow"]]

I think I will also have to do some research as to whether this is a better outcome, stay tuned.

8W9aG

comment created time in 2 months

issue opened chartbeat-labs/textacy

IndexError: [E201] Span index out of range

Hi @bdewilde,

I updated spaCy to the latest release '2.3.5', but the acronyms' extraction doesn't seem to be working any longer.

Here is the error I am getting:

IndexError: [E201] Span index out of range.
Traceback:
File "py37/env/lib/python3.7/site-packages/streamlit/script_runner.py", line 332, in _run_script
    exec(code, module.__dict__)
File "py37/env/deep_analysis.py", line 314, in <module>
    acronyms = textacy.extract.acronyms_and_definitions(doc)
File "py37/env/lib/python3.7/site-packages/textacy/extract.py", line 599, in acronyms_and_definitions
    window_ = window.text
File "span.pyx", line 503, in spacy.tokens.span.Span.text.__get__
File "span.pyx", line 190, in spacy.tokens.span.Span.__getitem__

Environment

  • operating system: Ubuntu 16.04
  • python version: 3.7.9
  • spacy version: 2.3.5
  • installed spacy models: en_core_web_sm
  • textacy version: 0.10.1

Any idea?

Thanks in advance!

created time in 2 months

Pull request review comment chartbeat-labs/textacy

Support prepositional objects in SVO IE

 def test_get_subjects_of_verb(spacy_doc):
 def test_get_objects_of_verb(spacy_doc):
-    expected = [[], ["Python"], ["incompatibilities"], [], ["God"]]
+    expected = [[], ["Python"], ["incompatibilities"], ["sake"], ["God", "Overflow"]]

So, this is what I was wondering about — do these changes make sense, in practice? My grammar is a bit rusty, but I believe "for textacy's sake" is an adverb clause and not an object of the verb "were". And "(Stack) Overflow" is separate from "God" wrt "Thank". I'm not totally sure if this is a bug or feature, expected or unexpected. Could you help me understand this better? 🙏

8W9aG

comment created time in 2 months

pull request comment chartbeat-labs/textacy

Support prepositional objects in SVO IE

Hi @8W9aG, thanks for the PR and your patience. What you're doing in the code makes sense, but I think it would be helpful to see tangible, expected outputs by way of a couple of unit tests. Could you add a few representative examples to a test, for reference?

8W9aG

comment created time in 2 months

push event chartbeat-labs/textacy

Burton DeWilde

commit sha eb55b03aa41f64a1ebca34963816a007969de7eb

Drop PY3.6 from CI builds

there's some irreconcilable problem going on where a version of numpy i've specifically excluded keeps getting installed, and that fails on PY3.6. so, for the sake of knowing how things are going on 3.7 and 3.8, we're just going to push forward here

view details

Burton DeWilde

commit sha cdedd2351bf2a56e8773ec162e08c3188809d486

Merge branch 'master' of github.com:chartbeat-labs/textacy

view details

push time in 2 months

push event chartbeat-labs/textacy

Tim Gates

commit sha f108883f7ef24c8fd14f15b229570138e096a8e5

docs: fix simple typo, inferrred -> inferred

There is a small typo in src/textacy/io/utils.py. Should read `inferred` rather than `inferrred`.

view details

Burton DeWilde

commit sha 116c807ea83e6a9b9b737741d58aa11effcaff85

Merge pull request #313 from timgates42/bugfix_typo_inferred

docs: fix simple typo, inferrred -> inferred

view details

push time in 2 months

PR merged chartbeat-labs/textacy

docs: fix simple typo, inferrred -> inferred

There is a small typo in src/textacy/io/utils.py.

Should read inferred rather than inferrred.

Semi-automated pull request generated by https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

+1 -1

1 comment

1 changed file

timgates42

pr closed time in 2 months

pull request comment chartbeat-labs/textacy

docs: fix simple typo, inferrred -> inferred

Did I really have only one spelling mistake across all those docs?! 😅 Thanks for the fix.

timgates42

comment created time in 2 months

started jbarlow83/OCRmyPDF

started time in 2 months

PR opened chartbeat-labs/textacy

docs: fix simple typo, inferrred -> inferred

There is a small typo in src/textacy/io/utils.py.

Should read inferred rather than inferrred.

Semi-automated pull request generated by https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

+1 -1

0 comment

1 changed file

pr created time in 2 months

started supertokens/supertokens-core

started time in 2 months

PR opened chartbeat-labs/textacy

Support prepositional objects in SVO IE

Support the handling of prepositions when performing Subject-Verb-Object Information Extraction.

Description

When running Subject-Verb-Object analysis, the current program cannot look past a preposition before a noun, and therefore cannot extract the object correctly when a preposition stands in front of it.

Motivation and Context

I was running this on the test string "Barack Obama was born in Hawaii." and failed to get a Subject-Verb-Object triple.

How Has This Been Tested?

I run a modified version of textacy 0.10.1 that isolates the "subject_verb_object" function and adds support for running on spaCy 2.1.0. Tested in this modified version, the input "Barack Obama was born in Hawaii." produces the triple ("Barack Obama", "born in", "Hawaii").

Screenshots (if appropriate):

N/A

Types of changes

  • [x] Bug fix (non-breaking change which fixes an issue)
  • [ ] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • [x] My code follows the code style of this project.
  • [ ] My change requires a change to the documentation, and I have updated it accordingly.
+7 -1

0 comment

1 changed file

pr created time in 2 months

commit comment event

push event chartbeat-labs/textacy

Burton DeWilde

commit sha 7f3bd999a79d85e967bcba66aa57816ed1172089

Play hardball with numpy and pip

well okay then

view details

push time in 3 months

push event chartbeat-labs/textacy

Burton DeWilde

commit sha a675cc5608885a0d4e4188a2f0bb553f5c472eec

Try a different numpy pin range

since github actions is unexpectedly still trying to install 1.20.0rc1

view details

push time in 3 months

push event chartbeat-labs/textacy

Burton DeWilde

commit sha a28c8c3f44772f34b9481bb5ddd2ba8b385be49a

Set upper bound on numpy dep

numpy 1.20 drops PY3.6 support, which we have not yet done

view details

push time in 3 months

issue closed chartbeat-labs/textacy

to_bag_of_terms unexpected behavior

steps to reproduce

  1. initialize doc with text containing the word "didn't"
  2. doc._.to_bag_of_terms(ngrams=(1, 2, 3), weighting="count", normalize=None, as_strings=True, filter_stops=False, filter_punct=True, filter_nums=False)

expected vs. actual behavior

Actual behavior: to_bag_of_terms outputs the word "didn't" as: "didn't", "did", "n't".
Expected behavior: to_bag_of_terms outputs the word "didn't" as: "didn't".

context

Trying to count n_grams in a block of text. If the block of text contains the word "didn't" and "did", it double counts "did". This also messes up counting n_grams because it counts "n't do" as a bigram instead of "didn't do".

environment

  • operating system: linux
  • python version: 3.7.9
  • spacy version: 2.3.4
  • installed spacy models: en_core_web_sm-2.3.1
  • textacy version: 0.10.1

Thanks :-)

closed time in 3 months

nealonhager

issue comment chartbeat-labs/textacy

to_bag_of_terms unexpected behavior

Hi @nealonhager , I think there are a couple things going on here.

First, ngrams=(1, 2, 3) tells the function to include the combination of unigrams, bigrams, and trigrams in the bag of terms (docs), which by definition will result in overlapping representations of adjacent terms. The terms "did" and "n't" are the unigrams and "didn't" is the bigram that result from the original "didn't".
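A plain-Python illustration of that overlap (a toy sketch, not textacy's actual implementation): with multiple n-gram sizes, an adjacent token appears both on its own and inside the larger n-grams that contain it.

```python
# Toy n-gram extraction over pre-tokenized text, to show how sizes (1, 2)
# make "did" appear both as a unigram and inside the bigram "did n't".
def toy_ngrams(tokens, sizes):
    out = []
    for n in sizes:
        out.extend(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return out

print(toy_ngrams(["did", "n't", "do"], (1, 2)))
# ['did', "n't", 'do', "did n't", "n't do"]
```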

Second, the word "didn't" (and contractions like it) is typically tokenized into "did" and "n't", since they are, in a sense, two words mushed together. textacy relies on spaCy's tokenization:

>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> doc = nlp("didn't")
>>> [tok for tok in doc]
[did, n't]

This is all standard and expected behavior. If it doesn't work for your use case, there is functionality for merging tokens: check out merge_spans(). Using the simple example above:

>>> textacy.spacier.utils.merge_spans([doc[0:2]], doc)
>>> [tok for tok in doc]
[didn't]
nealonhager

comment created time in 3 months

issue opened chartbeat-labs/textacy

to_bag_of_terms unexpected behavior

steps to reproduce

  1. initialize doc with text containing the word "didn't"
  2. doc._.to_bag_of_terms(
         ngrams=(1, 2, 3),
         weighting="count",
         normalize=None,
         as_strings=True,
         filter_stops=False,
         filter_punct=True,
         filter_nums=False,
     )

expected vs. actual behavior

Actual behavior: to_bag_of_terms outputs the word "didn't" as: "didn't", "did", "n't".
Expected behavior: to_bag_of_terms outputs the word "didn't" as: "didn't".

context

Trying to count n_grams in a block of text. If the block of text contains the word "didn't" and "did", it double counts "did". This also messes up counting n_grams because it counts "n't do" as a bigram instead of "didn't do".

environment

  • operating system: linux
  • python version: 3.7.9
  • spacy version: 2.3.4
  • installed spacy models: en_core_web_sm-2.3.1
  • textacy version: 0.10.1

Thanks :-)

created time in 3 months