I am running a spaCy Matcher line by line on a text file in which each text entry is on a separate line. For every match I am trying to extract 1) the matched instance, 2) the full sentence containing it, and 3) the previous sentence. I can get the first two, but I am having trouble getting the previous sentence, since sentences don't have an index (from this post). Here is my code:
    with open('file.txt', 'r') as f:
        for line in iter(f.readline, ''):
            doc = nlp(line)
            matcher = Matcher(nlp.vocab)
            matcher.add("pattern_of_interest", [pattern])
            matches = matcher(doc)
            for match_id, start, end in matches:
                string_id = nlp.vocab.strings[match_id]
                span = doc[start:end]
                for sent in doc.sents:
                    if matcher(sent):
                        instances.append(pd.Series({"instance": str(span.text),
                                                    "sentence": str(sent.text),
                                                    "previous_sentence": str(sent[-1].text)}))
I understand that `sent[-1].text` gives me a token (the last token of the current sentence), not the previous sentence. I tried to get around this with a list, but it doesn't work. Any advice for retrieving the previous sentence would be greatly appreciated. Thank you!
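To make the goal concrete, here is a minimal sketch of the pairing I am after, written in plain Python so it runs without a model. The helper name `with_previous` and the sample `sentences` list are hypothetical and only for illustration; the list stands in for `doc.sents`.

```python
from typing import Iterable, Iterator, Optional, Tuple, TypeVar

T = TypeVar("T")


def with_previous(items: Iterable[T]) -> Iterator[Tuple[Optional[T], T]]:
    """Yield (previous, current) pairs; previous is None for the first item."""
    previous = None
    for current in items:
        yield previous, current
        previous = current


# Hypothetical stand-in for the sentences that `doc.sents` would yield.
sentences = [
    "First sentence.",
    "Second sentence has a match.",
    "Third sentence.",
]

pairs = list(with_previous(sentences))
for prev_sent, sent in pairs:
    print(repr(prev_sent), "->", repr(sent))
```

If something like this is reasonable, I assume the inner loop could become `for prev_sent, sent in with_previous(doc.sents):`, storing `prev_sent.text` when `prev_sent` is not `None` and an empty string otherwise, but I am not sure whether this is the idiomatic way to do it in spaCy.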