I have a set of documents that are not ordinary text: they consist of scientific terminology codes.
The text in these documents looks like this:
    RepID,Txt
    1,K9G3P9 4H477 -Q207KL41 98464 ... Q207KL41
    2,D84T8X4 -D9W4S2 -D9W4S2 8E8E65 ... D9W4S2
    3,-05L8NJ38 K2DD949 0W28DZ48 207441 ... K2D28K84
I can build a feature set using a bag-of-words (BOW) approach.
Here is my code:
    import numpy as np
    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer

    def BOW(df):
        CountVec = CountVectorizer()  # to use only bigrams ngram_range=(2,2)
        Count_data = CountVec.fit_transform(df)
        Count_data = Count_data.astype(np.uint8)
        cv_dataframe = pd.DataFrame(Count_data.toarray(), columns=CountVec.get_feature_names_out(), index=df.index)  # <- HERE
        return cv_dataframe.astype(np.uint8)

    df_reps = pd.read_csv("c:\\file.csv")
    df = BOW(df_reps["Txt"])
The result will be the count of words in the "Txt" column.
    RepID  K9G3P9  4H477  -Q207KL41  98464  ...  Q207KL41
    1      2       8      3          2      ...  1
    2      0       1      2          4      ...  2
The tricky part, and where I need help, is that some of these terms have a - in front of them, and those occurrences should count as negative values.
So if a text contains the values Q207KL41 -Q207KL41 -Q207KL41, the terms that start with - should be counted as negative, and therefore the BOW value for Q207KL41 is -1.
Instead of having one feature for Q207KL41 and another for -Q207KL41, both should count towards the same term Q207KL41, one positively and the other negatively.
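To make the rule concrete, here is a minimal sketch of the counting I have in mind, in plain Python. The function name `signed_counts` is only illustrative, and it assumes terms are separated by whitespace and that a single leading - marks a negative occurrence.

```python
from collections import Counter

def signed_counts(text):
    """Count whitespace-separated terms; a leading '-' subtracts instead of adds."""
    counts = Counter()
    for token in text.split():
        if token.startswith("-"):
            counts[token[1:]] -= 1  # negative occurrence of the base term
        else:
            counts[token] += 1      # positive occurrence
    return counts

example = signed_counts("Q207KL41 -Q207KL41 -Q207KL41")
print(example["Q207KL41"])  # -1  (one positive, two negative)
```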
So the dataset after BOW would look like this:
    RepID  K9G3P9  4H477  Q207KL41  98464  ...
    1      2       8      -2        2      ...
    2      0       1      0         4      ...
How can I do that?
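
One possible direction, offered as a rough sketch and not a definitive solution: skip `CountVectorizer` for the counting step (its default tokenizer lowercases the text and splits on punctuation, so the leading - would likely be lost) and build the signed counts directly, then assemble them into a DataFrame. The function name `signed_bow` and the toy data below are hypothetical. The sketch assumes whitespace-separated terms, a single leading - as the negative marker, and counts that fit in `int16`; an unsigned dtype such as `np.uint8` cannot hold negative values.

```python
from collections import Counter

import pandas as pd

def signed_bow(texts):
    """Build a signed bag-of-words table from a Series of whitespace-separated terms."""
    rows = []
    for text in texts:
        counts = Counter()
        for token in str(text).split():
            sign = -1 if token.startswith("-") else 1
            term = token[1:] if sign < 0 else token
            if term:  # ignore a bare "-" with no term after it
                counts[term] += sign
        rows.append(counts)
    # Terms missing from a row become NaN; fill with 0 and use a signed dtype.
    return pd.DataFrame(rows, index=texts.index).fillna(0).astype("int16")

# Toy data standing in for the real CSV (hypothetical values).
df_reps = pd.DataFrame(
    {"RepID": [1, 2],
     "Txt": ["K9G3P9 Q207KL41 -Q207KL41 -Q207KL41", "K9G3P9 K9G3P9 -D9W4S2"]}
).set_index("RepID")

features = signed_bow(df_reps["Txt"])
print(features)
```

This returns a dense table, which may be too large for a big vocabulary. If a sparse matrix is needed, feeding the same per-row dictionaries to scikit-learn's `DictVectorizer` might be worth trying.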