issue opened chartbeat-labs/textacy
scikit-learn 0.24.1 was released a while ago.
https://github.com/chartbeat-labs/textacy/blob/cdedd2351bf2a56e8773ec162e08c3188809d486/setup.cfg#L51
The Nix package is already at 0.24.1, and this breaks the build.
Do you have any plans to allow scikit-learn > 0.24?
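For context, the kind of change I am asking about is a relaxed version specifier along these lines (hypothetical; I have not checked the exact range currently pinned at that line):

scikit-learn >= 0.24.1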
Thanks
created time in 6 hours
issue opened chartbeat-labs/textacy
What are the `extract.matches` patterns analogous to `constants.POS_REGEX_PATTERNS`?
what's wrong?
`textacy.constants` has `POS_REGEX_PATTERNS`:

>>> textacy.constants.POS_REGEX_PATTERNS['en']['NP']
'<DET>? <NUM>* (<ADJ> <PUNCT>? <CONJ>?)* (<NOUN>|<PROPN> <PART>?)+'
What is the equivalent pattern in the newer `textacy.extract.matches` pattern format?
Should it be in the constants file?
relevant page or section
https://github.com/chartbeat-labs/textacy/blob/cdedd2351bf2a56e8773ec162e08c3188809d486/src/textacy/constants.py#L86
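For reference, here is my best guess at a rough equivalent in the `extract.matches` per-token string syntax. It is unverified, and it drops the `<PUNCT>`/`<CONJ>` handling and the `<NOUN>|<PROPN>` alternation that the regex allowed:

>>> import spacy
>>> import textacy.extract
>>> nlp = spacy.load("en_core_web_sm")
>>> doc = nlp("The three quick brown foxes jumped over the lazy dog.")
>>> # optional determiner, any numerals/adjectives, one or more nouns
>>> list(textacy.extract.matches(doc, "POS:DET:? POS:NUM:* POS:ADJ:* POS:NOUN:+"))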
created time in 6 days
issue opened chartbeat-labs/textacy
subject_verb_object_triples works on sample text, but not a more complicated example
I tried a sample text that I'd seen in this repo to extract SVO triples:
text_str = u'Startup companies create jobs and support innovation. Hilary supports entrepreneurship.'
text = nlp(text_str)
text_ext = textacy.extract.subject_verb_object_triples(text)
list(text_ext)
This runs just fine for me and produces the expected result:
[(companies, create, jobs), (Hilary, supports, entrepreneurship)]
However, I then try to do something a bit more complicated:
text_ada = (u'Ada Lovelace was an English mathematician and'
' writer, chiefly known for her work on'
' mechanical general-purpose computer, the'
' Analytical Engine. She was the first to'
' recognise that the machine had applications'
' beyond pure calculation, and published the'
' first algorithm intended to be carried out'
' by such a machine. As a result, she is'
' sometimes regarded as the first to recognise'
' the full potential of a computing machine and'
' one of the first computer programmers.')
So I run through the same procedure:
ada = nlp(text_ada)
text_ext = textacy.extract.subject_verb_object_triples(ada)
list(text_ext)
but I get an empty list.
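In case it helps with debugging, here is how I inspected the dependency parse; I suspect the passive and copular constructions ("was an English mathematician", "is sometimes regarded") are the reason no triples come out, though I am not certain that is how the extractor works:

>>> # list the subject/object arcs spaCy found in the parse
>>> [(tok.text, tok.dep_, tok.head.text)
...  for tok in ada
...  if tok.dep_ in ("nsubj", "nsubjpass", "dobj")]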
Any thoughts?
created time in 13 days
PR opened chartbeat-labs/textacy
Just nitpicking: as a native speaker, I think you meant to write the second sentence, but I'm not completely sure.
- (Old) Sí, ¡se habla español! / Yes, Spanish is spoken!
- (New) Sí, ¡sé hablar español! / Yes, I can speak Spanish!
By the way: awesome project!
pr created time in 20 days
issue opened chartbeat-labs/textacy
Hi textacy team,
First, thanks for the amazing library.
Any plans to support the new spaCy 3.0 release?
Thanks
created time in a month
issue comment chartbeat-labs/textacy
IndexError: [E201] Span index out of range
Hi @bdewilde, the issue is closed, but I'm not sure it was actually resolved. Using "text_with_ws" extracts only the acronym itself, without its full definition. Am I missing something? Is there a workaround to get it right? Thanks in advance!
comment created time in a month
issue comment chartbeat-labs/textacy
IndexError: [E201] Span index out of range
Hi all,
As @mzeidhassan mentions, this could indeed be related to a bug in spaCy where calling `span.text` on an empty `Span` results in an `IndexError`.
There are a few ways of dealing with this:
- Ensure you never create empty spans, if you can (see the sketch below)
- Use `text_with_ws` as a workaround instead
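A minimal sketch of the first option, assuming you control where the spans are constructed (the guard and the toy text here are mine, not something from textacy itself):

>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> doc = nlp("A short example.")
>>> span = doc[2:2]  # a zero-length span
>>> text = span.text if len(span) > 0 else ""  # guard sidesteps the E201 IndexError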
comment created time in a month
started mpenkov/balbes
started time in a month
issue comment chartbeat-labs/textacy
IndexError: [E201] Span index out of range
I would also like to confirm that everything was working just fine with spaCy 2.3.2.
comment created time in 2 months
Pull request review comment chartbeat-labs/textacy
Support prepositional objects in SVO IE
def test_get_subjects_of_verb(spacy_doc):
def test_get_objects_of_verb(spacy_doc):
-    expected = [[], ["Python"], ["incompatibilities"], [], ["God"]]
+    expected = [[], ["Python"], ["incompatibilities"], ["sake"], ["God", "Overflow"]]
I think I will also have to do some research as to whether this is a better outcome; stay tuned.
comment created time in 2 months
issue opened chartbeat-labs/textacy
IndexError: [E201] Span index out of range
Hi @bdewilde,
I updated spaCy to the latest release, 2.3.5, but acronym extraction no longer seems to work.
Here is the error I am getting:
IndexError: [E201] Span index out of range.
Traceback:
File "py37/env/lib/python3.7/site-packages/streamlit/script_runner.py", line 332, in _run_script
exec(code, module.__dict__)
File "py37/env/deep_analysis.py", line 314, in <module>
acronyms = textacy.extract.acronyms_and_definitions(doc)
File "py37/env/lib/python3.7/site-packages/textacy/extract.py", line 599, in acronyms_and_definitions
window_ = window.text
File "span.pyx", line 503, in spacy.tokens.span.Span.text.__get__
File "span.pyx", line 190, in spacy.tokens.span.Span.__getitem__
Environment
- operating system: Ubuntu 16.04
- python version: 3.7.9
- spacy version: 2.3.5
- installed spacy models: en_core_web_sm
- textacy version: 0.10.1
Any idea?
Thanks in advance!
created time in 2 months
Pull request review comment chartbeat-labs/textacy
Support prepositional objects in SVO IE
def test_get_subjects_of_verb(spacy_doc):
def test_get_objects_of_verb(spacy_doc):
-    expected = [[], ["Python"], ["incompatibilities"], [], ["God"]]
+    expected = [[], ["Python"], ["incompatibilities"], ["sake"], ["God", "Overflow"]]
So, this is what I was wondering about — do these changes make sense, in practice? My grammar is a bit rusty, but I believe "for textacy's sake" is an adverb clause and not an object of the verb "were". And "(Stack) Overflow" is separate from "God" wrt "Thank". I'm not totally sure if this is a bug or feature, expected or unexpected. Could you help me understand this better? 🙏
comment created time in 2 months
pull request comment chartbeat-labs/textacy
Support prepositional objects in SVO IE
Hi @8W9aG, thanks for the PR and your patience. What you're doing in the code makes sense, but I think it would be helpful to see tangible, expected outputs by way of a couple unit tests. Could you add a few representative examples to a test, for reference?
comment created time in 2 months
push event chartbeat-labs/textacy
commit sha eb55b03aa41f64a1ebca34963816a007969de7eb
Drop PY3.6 from CI builds

there's some irreconcilable problem going on where a version of numpy i've specifically excluded keeps getting installed, and that fails on PY3.6. so, for the sake of knowing how things are going on 3.7 and 3.8, we're just going to push forward here
commit sha cdedd2351bf2a56e8773ec162e08c3188809d486
Merge branch 'master' of github.com:chartbeat-labs/textacy
push time in 2 months
push event chartbeat-labs/textacy
commit sha f108883f7ef24c8fd14f15b229570138e096a8e5
docs: fix simple typo, inferrred -> inferred

There is a small typo in src/textacy/io/utils.py. Should read `inferred` rather than `inferrred`.
commit sha 116c807ea83e6a9b9b737741d58aa11effcaff85
Merge pull request #313 from timgates42/bugfix_typo_inferred docs: fix simple typo, inferrred -> inferred
push time in 2 months
PR merged chartbeat-labs/textacy
There is a small typo in src/textacy/io/utils.py: it should read `inferred` rather than `inferrred`.
Semi-automated pull request generated by https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md
pr closed time in 2 months
pull request comment chartbeat-labs/textacy
docs: fix simple typo, inferrred -> inferred
Did I really have only one spelling mistake across all those docs?! 😅 Thanks for the fix.
comment created time in 2 months
started jbarlow83/OCRmyPDF
started time in 2 months
PR opened chartbeat-labs/textacy
There is a small typo in src/textacy/io/utils.py: it should read `inferred` rather than `inferrred`.
Semi-automated pull request generated by https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md
pr created time in 2 months
started supertokens/supertokens-core
started time in 2 months
PR opened chartbeat-labs/textacy
Support the handling of prepositions when performing Subject-Verb-Object Information Extraction.
Description
When running Subject-Verb-Object analysis, the current code cannot look past a preposition that precedes a noun, and therefore cannot extract the object correctly when a preposition comes before it.
Motivation and Context
I ran this on the test string "Barack Obama was born in Hawaii." and failed to get a Subject-Verb-Object triple.
How Has This Been Tested?
I run a modified version of textacy 0.10.1 that isolates the "subject_verb_object" function and supports running on spacy 2.1.0. I tested the change in this modified version, and the input "Barack Obama was born in Hawaii." produces the triple ("Barack Obama", "born in", "Hawaii").
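For illustration, here is a sketch of the behavior through the standard entry point; the output shown is what my modified build produces, so treat it as indicative rather than definitive:

>>> import spacy
>>> import textacy.extract
>>> nlp = spacy.load("en_core_web_sm")
>>> doc = nlp("Barack Obama was born in Hawaii.")
>>> list(textacy.extract.subject_verb_object_triples(doc))
[(Barack Obama, born in, Hawaii)]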
Screenshots (if appropriate):
N/A
Types of changes
- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
Checklist:
- [x] My code follows the code style of this project.
- [ ] My change requires a change to the documentation, and I have updated it accordingly.
pr created time in 2 months
push event chartbeat-labs/textacy
commit sha 7f3bd999a79d85e967bcba66aa57816ed1172089
Play hardball with numpy and pip

well okay then
push time in 3 months
push event chartbeat-labs/textacy
commit sha a675cc5608885a0d4e4188a2f0bb553f5c472eec
Try a different numpy pin range

since github actions is unexpectedly still trying to install 1.20.0rc1
push time in 3 months
push event chartbeat-labs/textacy
commit sha a28c8c3f44772f34b9481bb5ddd2ba8b385be49a
Set upper bound on numpy dep

numpy 1.20 drops PY3.6 support, which we have not yet done
push time in 3 months
issue closed chartbeat-labs/textacy
to_bag_of_terms unexpected behavior
steps to reproduce
- initialize doc with text containing the word "didn't"
- call doc._.to_bag_of_terms(ngrams=(1, 2, 3), weighting="count", normalize=None, as_strings=True, filter_stops=False, filter_punct=True, filter_nums=False)
expected vs. actual behavior
Actual behavior: to_bag_of_terms outputs the word "didn't" as "didn't", "did", "n't".
Expected behavior: to_bag_of_terms outputs the word "didn't" as "didn't".
context
I'm trying to count n-grams in a block of text. If the text contains both "didn't" and "did", it double-counts "did". This also messes up n-gram counting because it counts "n't do" as a bigram instead of "didn't do".
environment
- operating system: linux
- python version: 3.7.9
- spacy version: 2.3.4
- installed spacy models: en_core_web_sm-2.3.1
- textacy version: 0.10.1
Thanks :-)
closed time in 3 months
issue comment chartbeat-labs/textacy
to_bag_of_terms unexpected behavior
Hi @nealonhager, I think there are a couple of things going on here.
First, `ngrams=(1, 2, 3)` tells the function to include the combination of unigrams, bigrams, and trigrams in the bag of terms (docs), which by definition will result in overlapping representations of adjacent terms. The unigrams "did" and "n't" and the bigram "didn't" all result from the original "didn't".
Second, the word "didn't" (and contractions like it) is typically tokenized into "did" and "n't", since contractions are, in a sense, two words mushed together. `textacy` relies on `spacy`'s tokenization:
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> doc = nlp("didn't")
>>> [tok for tok in doc]
[did, n't]
This is all standard and expected behavior. If it doesn't work for your use case, there is functionality for merging tokens: check out `merge_spans()` here. Using the simple example above:
>>> textacy.spacier.utils.merge_spans([doc[0:2]], doc)
>>> [tok for tok in doc]
[didn't]
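For what it's worth, spaCy's built-in retokenizer performs the same merge, if you'd rather not go through the textacy helper (a minimal sketch on the same toy example):

>>> doc = nlp("didn't")
>>> with doc.retokenize() as retokenizer:
...     retokenizer.merge(doc[0:2])  # merge "did" + "n't" into one token
...
>>> [tok for tok in doc]
[didn't]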
comment created time in 3 months
issue opened chartbeat-labs/textacy
to_bag_of_terms unexpected behavior
steps to reproduce
- initialize doc with text containing the word "didn't"
- call doc._.to_bag_of_terms(ngrams=(1, 2, 3), weighting="count", normalize=None, as_strings=True, filter_stops=False, filter_punct=True, filter_nums=False)
expected vs. actual behavior
Actual behavior: to_bag_of_terms outputs the word "didn't" as "didn't", "did", "n't".
Expected behavior: to_bag_of_terms outputs the word "didn't" as "didn't".
context
I'm trying to count n-grams in a block of text. If the text contains both "didn't" and "did", it double-counts "did". This also messes up n-gram counting because it counts "n't do" as a bigram instead of "didn't do".
environment
- operating system: linux
- python version: 3.7.9
- spacy version: 2.3.4
- installed spacy models: en_core_web_sm-2.3.1
- textacy version: 0.10.1
Thanks :-)
created time in 3 months