
Bag of Words with Negative Words in Python

I have a set of documents that do not contain normal text; they consist of scientific terminology codes.

The text of these documents looks like this:

RepID,Txt
1,K9G3P9 4H477 -Q207KL41 98464 ... Q207KL41
2,D84T8X4 -D9W4S2 -D9W4S2 8E8E65 ... D9W4S2
3,-05L8NJ38 K2DD949 0W28DZ48 207441 ... K2D28K84

I can build a feature set using the bag-of-words (BOW) approach.

Here is my code:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def BOW(df):
    CountVec = CountVectorizer()  # to use only bigrams: ngram_range=(2, 2)
    Count_data = CountVec.fit_transform(df)
    Count_data = Count_data.astype(np.uint8)
    cv_dataframe = pd.DataFrame(Count_data.toarray(), columns=CountVec.get_feature_names_out(), index=df.index)  # <- HERE
    return cv_dataframe.astype(np.uint8)

df_reps = pd.read_csv("c:\\file.csv")
df = BOW(df_reps["Txt"])

The result is the count of each term in the "Txt" column:

RepID  K9G3P9  4H477  -Q207KL41  98464  ...  Q207KL41
1      2       8      3          2      ...  1
2      0       1      2          4      ...  2

The tricky part, and where I need help, is that some of these terms have a - in front of them, and those should count as negative values.

So if a text has these values: Q207KL41 -Q207KL41 -Q207KL41

the terms that start with - should be counted as negative, and therefore the BOW value for Q207KL41 is -1 (one positive occurrence minus two negative ones).

Instead of having separate features for Q207KL41 and -Q207KL41, both should count towards the same term Q207KL41, one positively and the other negatively.
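
To make the counting rule concrete, here is a minimal sketch in plain Python for a single text. The helper name `signed_counts` is hypothetical, and it assumes terms are separated by whitespace and that a single leading - marks a negative occurrence.

```python
from collections import Counter

def signed_counts(text):
    """Count each term, subtracting 1 for every occurrence prefixed with '-'."""
    counts = Counter()
    for token in text.split():
        if token.startswith("-"):
            # Negative occurrence: credit the term without its leading '-'
            counts[token[1:]] -= 1
        else:
            counts[token] += 1
    return dict(counts)

print(signed_counts("Q207KL41 -Q207KL41 -Q207KL41"))  # {'Q207KL41': -1}
```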

So the dataset after BOW should look like this:

RepID  K9G3P9  4H477  Q207KL41  98464  ...
1      2       8      -2        2      ...
2      0       1      0         4      ...

How can I do that?

