I am building an NLP model that parses resumes for names and education history, which I will then append to a larger Excel file of applicant data for broader analysis. I am adding patterns to the spaCy Matcher to identify the wide range of names that educational institutions around the world use. The end result should be a table with one row per applicant and either a single column listing the institutions they attended (possibly concatenated into one string) or multiple columns.

When I run my model on a test resume, each of the 15 patterns I pass to the matcher shows the same result, rather than a result appearing only for the one or two patterns that actually hit. I'm also wondering whether my approach is the most efficient way of going about this.

I am a beginner data scientist, and this is basically my first complex program outside of my studies. I have never used spaCy or NLP techniques before. Learning it with little experience has been challenging but very fun and rewarding.

    # Patterns to identify possible educational institutions of applicants
    x_university_pattern = [{"POS": "PROPN", "OP": "+"}, {"SPACY": True}, {"TEXT": "university", "OP": "+"}]  # not tested
    university_of_x_pattern = [{"TEXT": "university"}, {"SPACY": True}, {"POS": "PROPN", "OP": "+"}]  # tested and working
    x_college_pattern = [{"POS": "PROPN", "OP": "+"}, {"SPACY": True}, {"TEXT": "college", "OP": "+"}]  # not tested
    college_of_x_pattern = [{"TEXT": "college"}, {"SPACY": True}, {"POS": "PROPN", "OP": "+"}]  # should work
    x_community_college_pattern = [{"POS": "PROPN", "OP": "+"}, {"SPACY": True}, {"TEXT": "community college", "OP": "+"}]  # not tested
    community_college_of_x_pattern = [{"TEXT": "community college"}, {"SPACY": True}, {"POS": "PROPN", "OP": "+"}]  # should work
    csu_pattern = [{"TEXT": "csu"}, {"SPACY": True}, {"POS": "PROPN", "OP": "+"}]  # not tested
    california_state_university_pattern = [{"POS": "PROPN", "OP": "+"}, {"SPACY": True}, {"TEXT": "california state university"}]  # not tested
    x_state_university_pattern = [{"POS": "PROPN", "OP": "+"}, {"SPACY": True}, {"TEXT": "state university"}]  # not tested
    x_state_pattern = [{"POS": "PROPN", "OP": "+"}, {"SPACY": True}, {"TEXT": "state"}]  # not tested
    state_university_of_x_pattern = [{"TEXT": "state university"}, {"SPACY": True}, {"POS": "PROPN", "OP": "+"}]  # not tested
    university_of_california_pattern = [{"TEXT": "university of california"}, {"SPACY": True}, {"POS": "PROPN", "OP": "+"}]  # not tested
    uc_pattern = [{"TEXT": "uc"}, {"SPACY": True}, {"POS": "PROPN", "OP": "+"}]  # not tested
    institute_of_x_pattern = [{"TEXT": "institute"}, {"SPACY": True}, {"POS": "PROPN", "OP": "+"}]  # should work
    x_institute_pattern = [{"POS": "PROPN", "OP": "+"}, {"SPACY": True}, {"TEXT": "institute", "OP": "+"}]  # not tested

    pattern_names = [
        "x_university", "university_of_x", "x_college", "college_of_x",
        "x_community_college", "community_college_of_x", "csu",
        "california_state_university", "x_state_university", "x_state",
        "state_university_of_x", "university_of_california", "uc",
        "institute_of_x", "x_institute",
    ]

    patterns = [
        x_university_pattern,
        university_of_x_pattern,
        x_college_pattern,
        college_of_x_pattern,
        x_community_college_pattern,
        community_college_of_x_pattern,
        csu_pattern,
        california_state_university_pattern,
        x_state_university_pattern,
        x_state_pattern,
        state_university_of_x_pattern,
        university_of_california_pattern,
        uc_pattern,
        institute_of_x_pattern,
        x_institute_pattern,
    ]

    def contains_any_substring(text, substrings):
        """Check if any of the specified substrings are present in the text.
        Will be iterated through each applicant folder.
        Will check if the file has resume or cv in it."""
        for substring in substrings:
            if substring.lower() in text.lower():
                return True
        return False

    def remove_stop_words(string):
        stop_words = set(stopwords.words('english'))
        words = string.split()
        filtered_words = [word for word in words if word.lower() not in stop_words]
        new_string = ''.join(filtered_words)
        return new_string

    # Path to resume/cv, which then gets used as a doc
    test_applicant_folder = "Path/To/Applicants Directory/Doe J"
    results = {}
    nlp = spacy.load("en_core_web_sm")
    matcher = Matcher(nlp.vocab)

    # for loop which goes through each folder, finds the resume/cv, opens it,
    # joins it to a string, standardizes string, assigns that string as the doc
    # object, sets up matcher with patterns above, feeds doc object into matcher,
    # stores matches.
    for file_name in os.listdir(test_applicant_folder):
        file_path = os.path.join(test_applicant_folder, file_name)
        if file_name.endswith(".pdf") and any(substring in file_name.lower() for substring in ["resume", "cv"]):
            with open(file_path, "rb") as file:
                pdf_reader = PyPDF2.PdfReader(file)
                text = " ".join(page.extract_text() for page in pdf_reader.pages)
                text = text.lower()
                punc = "''!()-[];:',<>./#$%^&*_~`''"
                for ele in text:
                    if ele in punc:
                        text = text.replace(ele, "")
                remove_stop_words(text)
                doc = nlp(text)
                file_matches = {}
                for i, pattern_name in enumerate(pattern_names):
                    file_matches[pattern_name] = []
                # Apply each pattern to the resume/cv and store the matches
                for pattern_name, pattern in zip(pattern_names, patterns):
                    matcher.add(pattern_name, [pattern])
                    matches = matcher(doc)
                    matches.sort(key=lambda x: x[1])
                    for i, pattern_name in enumerate(pattern_names):
                        file_matches[pattern_name] = [doc[start:end].text for _, start, end in matches]
                for pattern_name in pattern_names:
                    matcher.remove(pattern_name)
                # Store the matches for this file in the results dictionary
                results[file_name] = file_matches

    print(len(results))
    test_table = pd.DataFrame(results).transpose()
    test_table

| | x_university | university_of_x | x_college | college_of_x | x_community_college | community_college_of_x | csu | california_state_university | x_state_university | x_state | state_university_of_x | university_of_california | uc | institute_of_x | x_institute |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DOE_JANE_RESUME.pdf | [university of washington] | [university of washington] | [university of washington] | [university of washington] | [university of washington] | [university of washington] | [university of washington] | [university of washington] | [university of washington] | [university of washington] | [university of washington] | [university of washington] | [university of washington] | [university of washington] | [university of washington] |
I tried adding the for loop that removes each pattern from the matcher once the matcher is done:

    for pattern_name in pattern_names:
        matcher.remove(pattern_name)

I did this because I thought that having all the patterns in the matcher at the same time was causing a match from any one pattern to be applied to every entry in pattern_names. But it made no difference. I tried it both as a nested for loop and outside the loop, with the same result.

I also tried using all those nested for loops so that each pattern would be applied to the resume one at a time. That did not help either.
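To narrow down where the identical columns might come from, here is a minimal sketch that leaves spaCy out entirely. It assumes the matcher output can be represented as hypothetical `(pattern_name, start, end)` tuples; the real spaCy Matcher returns `(match_id, start, end)`, and the name can be looked up with `nlp.vocab.strings[match_id]`. The token list and the matches are made up for illustration.

```python
# Stand-in for a tokenized resume; in the real code this would be a spaCy Doc.
tokens = ["studied", "at", "university", "of", "washington",
          "then", "dartmouth", "college"]

# Hypothetical matcher output as (pattern_name, start, end) tuples.
# spaCy's Matcher yields (match_id, start, end) instead, and the name is
# recovered with nlp.vocab.strings[match_id].
matches = [
    ("university_of_x", 2, 5),
    ("x_college", 6, 8),
]

pattern_names = ["x_university", "university_of_x", "x_college"]


def span_text(start, end):
    """Join the tokens of a matched span back into a string."""
    return " ".join(tokens[start:end])


# What the inner loop in the question does: the list comprehension ignores
# which pattern produced each match, so every key receives ALL matches.
overwritten = {}
for name in pattern_names:
    overwritten[name] = [span_text(s, e) for _, s, e in matches]

# Grouping by the name carried in each match keeps the columns separate.
grouped = {name: [] for name in pattern_names}
for name, s, e in matches:
    grouped[name].append(span_text(s, e))

print(overwritten)
print(grouped)
```

If this is indeed the cause, the patterns could all be added to the matcher once, and each match filed under its own name, with no need to add and remove patterns inside the loop.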
My hope is for the table to look like this:
| | x_university | university_of_x | x_college | college_of_x | x_community_college | community_college_of_x | csu | california_state_university | x_state_university | x_state | state_university_of_x | university_of_california | uc | institute_of_x | x_institute |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DOE_JANE_RESUME.pdf | [] | [university of washington] | [] | [] | [] | [] | [] | [] | [] | [] | [] | [] | [] | [] | [] |
| JAMES_MICHAEL_CV.pdf | [] | [] | [dartmouth college] | [] | [] | [] | [] | [] | [] | [] | [] | [] | [] | [] | [] |
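For the alternative layout with a single concatenated column per applicant, a rough sketch of the flattening step might look like the following. The `results` dictionary here is a shortened, hypothetical version of the one built in my code, and `institutions_as_string` and the `"; "` separator are names and choices I made up for illustration.

```python
# Hypothetical per-file results, shaped like the dictionaries in the question
# but with only three pattern columns to keep the example short.
results = {
    "DOE_JANE_RESUME.pdf": {
        "x_university": [],
        "university_of_x": ["university of washington"],
        "x_college": [],
    },
    "JAMES_MICHAEL_CV.pdf": {
        "x_university": [],
        "university_of_x": [],
        "x_college": ["dartmouth college"],
    },
}


def institutions_as_string(file_matches, sep="; "):
    """Flatten all pattern columns into one string.

    Duplicates are dropped and first-seen order is kept.
    """
    seen = []
    for hits in file_matches.values():
        for hit in hits:
            if hit not in seen:
                seen.append(hit)
    return sep.join(seen)


# One string per applicant file.
summary = {
    file_name: institutions_as_string(file_matches)
    for file_name, file_matches in results.items()
}
print(summary)
```

A dictionary like `summary` could then presumably be turned into a single column of the larger applicant table, for example with `pd.Series(summary)`.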