(NLP Modelling) How to get SpaCy matcher to parse names and education history from resumes to append to an Excel file

I am building an NLP model to parse names and education history from resumes, which I will then append to a larger Excel file of applicant data for broader analysis. I am adding patterns to the SpaCy matcher to identify the wide range of names used by educational institutions around the world. The end result should be a table with one row per applicant and the educational institutions they attended, either concatenated into a single string column or spread across multiple columns.

When I run the model on a test resume, every one of the 15 patterns I pass to the matcher shows the same result, instead of only the one or two patterns that actually hit. I'm also wondering whether my approach is the most efficient way of going about this. I am a beginner data scientist and this is my first complex program outside of my studies. I have never used SpaCy or NLP techniques before. Learning them with little experience has been challenging but very fun and rewarding.

import os

import pandas as pd
import PyPDF2
import spacy
from nltk.corpus import stopwords
from spacy.matcher import Matcher

# Patterns to identify possible educational institutions of applicants
x_university_pattern = [{"POS": "PROPN", "OP": "+"}, {"SPACY": True}, {"TEXT": "university", "OP": "+"}]  # not tested
university_of_x_pattern = [{"TEXT": "university"}, {"SPACY": True}, {"POS": "PROPN", "OP": "+"}]  # tested and working
x_college_pattern = [{"POS": "PROPN", "OP": "+"}, {"SPACY": True}, {"TEXT": "college", "OP": "+"}]  # not tested
college_of_x_pattern = [{"TEXT": "college"}, {"SPACY": True}, {"POS": "PROPN", "OP": "+"}]  # should work
x_community_college_pattern = [{"POS": "PROPN", "OP": "+"}, {"SPACY": True}, {"TEXT": "community college", "OP": "+"}]  # not tested
community_college_of_x_pattern = [{"TEXT": "community college"}, {"SPACY": True}, {"POS": "PROPN", "OP": "+"}]  # should work
csu_pattern = [{"TEXT": "csu"}, {"SPACY": True}, {"POS": "PROPN", "OP": "+"}]  # not tested
california_state_university_pattern = [{"POS": "PROPN", "OP": "+"}, {"SPACY": True}, {"TEXT": "california state university"}]  # not tested
x_state_university_pattern = [{"POS": "PROPN", "OP": "+"}, {"SPACY": True}, {"TEXT": "state university"}]  # not tested
x_state_pattern = [{"POS": "PROPN", "OP": "+"}, {"SPACY": True}, {"TEXT": "state"}]  # not tested
state_university_of_x_pattern = [{"TEXT": "state university"}, {"SPACY": True}, {"POS": "PROPN", "OP": "+"}]  # not tested
university_of_california_pattern = [{"TEXT": "university of california"}, {"SPACY": True}, {"POS": "PROPN", "OP": "+"}]  # not tested
uc_pattern = [{"TEXT": "uc"}, {"SPACY": True}, {"POS": "PROPN", "OP": "+"}]  # not tested
institute_of_x_pattern = [{"TEXT": "institute"}, {"SPACY": True}, {"POS": "PROPN", "OP": "+"}]  # should work
x_institute_pattern = [{"POS": "PROPN", "OP": "+"}, {"SPACY": True}, {"TEXT": "institute", "OP": "+"}]  # not tested

pattern_names = [
    "x_university",
    "university_of_x",
    "x_college",
    "college_of_x",
    "x_community_college",
    "community_college_of_x",
    "csu",
    "california_state_university",
    "x_state_university",
    "x_state",
    "state_university_of_x",
    "university_of_california",
    "uc",
    "institute_of_x",
    "x_institute",
]

patterns = [
    x_university_pattern,
    university_of_x_pattern,
    x_college_pattern,
    college_of_x_pattern,
    x_community_college_pattern,
    community_college_of_x_pattern,
    csu_pattern,
    california_state_university_pattern,
    x_state_university_pattern,
    x_state_pattern,
    state_university_of_x_pattern,
    university_of_california_pattern,
    uc_pattern,
    institute_of_x_pattern,
    x_institute_pattern,
]

def contains_any_substring(text, substrings):
    """Check if any of the specified substrings are present in the text.

    Will be iterated through each applicant folder. Will check if the file
    has resume or cv in it.
    """
    for substring in substrings:
        if substring.lower() in text.lower():
            return True
    return False

def remove_stop_words(string):
    stop_words = set(stopwords.words('english'))
    words = string.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    new_string = ''.join(filtered_words)
    return new_string

# Path to resume/cv, which then gets used as a doc
test_applicant_folder = "Path/To/Applicants Directory/Doe J"

results = {}
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# for loop which goes through each folder, finds the resume/cv, opens it,
# joins it to a string, standardizes string, assigns that string as the doc
# object, sets up matcher with patterns above, feeds doc object into matcher,
# stores matches.
for file_name in os.listdir(test_applicant_folder):
    file_path = os.path.join(test_applicant_folder, file_name)
    if file_name.endswith(".pdf") and any(substring in file_name.lower() for substring in ["resume", "cv"]):
        with open(file_path, "rb") as file:
            pdf_reader = PyPDF2.PdfReader(file)
            text = " ".join(page.extract_text() for page in pdf_reader.pages)
        text = text.lower()
        punc = "''!()-[];:',<>./#$%^&*_~`''"
        for ele in text:
            if ele in punc:
                text = text.replace(ele, "")
        remove_stop_words(text)
        doc = nlp(text)
        file_matches = {}
        for i, pattern_name in enumerate(pattern_names):
            file_matches[pattern_name] = []
        # Apply each pattern to the resume/cv and store the matches
        for pattern_name, pattern in zip(pattern_names, patterns):
            matcher.add(pattern_name, [pattern])
            matches = matcher(doc)
            matches.sort(key=lambda x: x[1])
            for i, pattern_name in enumerate(pattern_names):
                file_matches[pattern_name] = [doc[start:end].text for _, start, end in matches]
        for pattern_name in pattern_names:
            matcher.remove(pattern_name)
        # Store the matches for this file in the results dictionary
        results[file_name] = file_matches

print(len(results))
test_table = pd.DataFrame(results).transpose()
test_table

	x_university	university_of_x	x_college	college_of_x	x_community_college	community_college_of_x	csu	california_state_university	x_state_university	x_state	state_university_of_x	university_of_california	uc	institute_of_x	x_institute
DOE_JANE_RESUME.pdf	[university of washington]	[university of washington]	[university of washington]	[university of washington]	[university of washington]	[university of washington]	[university of washington]	[university of washington]	[university of washington]	[university of washington]	[university of washington]	[university of washington]	[university of washington]	[university of washington]	[university of washington]
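
A side note on the preprocessing step in the code above: the return value of `remove_stop_words(text)` is never assigned, so the stop words stay in `text`, and the function joins the remaining words with `''`, which would glue them together if the result were used. The following is a minimal sketch of one way the cleanup could be written, not a drop-in replacement. The `STOP_WORDS` set and the `strip_punctuation` helper are hypothetical stand-ins so the example runs without NLTK data, and `string.punctuation` is assumed to be an acceptable substitute for the hand-written `punc` string.

```python
import string

# Hypothetical, tiny stop-word set so the sketch runs without NLTK data;
# the code above uses nltk.corpus.stopwords.words('english') instead.
STOP_WORDS = {"the", "a", "an", "and", "at", "in"}

def strip_punctuation(text):
    """Remove every ASCII punctuation character in a single pass."""
    return text.translate(str.maketrans("", "", string.punctuation))

def remove_stop_words(text):
    """Drop stop words and rejoin the remaining words with single spaces."""
    words = text.split()
    return " ".join(word for word in words if word.lower() not in STOP_WORDS)

raw = "Studied at the University of Washington, Seattle (2015-2019)."
# The result is assigned, unlike the bare remove_stop_words(text) call above.
cleaned = remove_stop_words(strip_punctuation(raw))
print(cleaned)
```

Note that NLTK's English stop-word list contains "of", so actually applying it would also remove the "of" that the `university_of_x` style patterns rely on. The sketch also keeps the original capitalization, on the assumption that lowercasing the whole text may make it harder for the tagger to recognize proper nouns.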

I tried adding a for loop that removes the patterns from the matcher once matching is done:

for pattern_name in pattern_names:
    matcher.remove(pattern_name)

I did this because I thought that having all the patterns in the matcher at the same time was causing a match from any one pattern to be assigned to every pattern name. It made no difference, whether I placed the loop inside the pattern loop or outside it.

I also used the nested for loops so that each pattern would be applied to the resume one at a time, but that did not change the result either.
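
For reference, each tuple the matcher returns has the form `(match_id, start, end)`, and `nlp.vocab.strings[match_id]` gives back the name under which the matching pattern was added. A minimal sketch of grouping hits by that name is shown below. It assumes a blank English pipeline (tokenizer only, no model download) and two simplified, hypothetical patterns that key on `LOWER` and `IS_TITLE` instead of `POS`. The helper `matches_by_pattern` is also hypothetical.

```python
from collections import defaultdict

import spacy
from spacy.matcher import Matcher

# A blank English pipeline only tokenizes, so no trained model is required.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Simplified, hypothetical patterns: they use lowercase text and
# capitalization rather than the POS tags used in the question.
matcher.add(
    "university_of_x",
    [[{"LOWER": "university"}, {"LOWER": "of"}, {"IS_TITLE": True, "OP": "+"}]],
)
matcher.add(
    "x_college",
    [[{"IS_TITLE": True, "OP": "+"}, {"LOWER": "college"}]],
)

def matches_by_pattern(doc):
    """Group matched spans under the name of the pattern that produced them."""
    grouped = defaultdict(list)
    for match_id, start, end in matcher(doc):
        # match_id is a hash; the vocab's string store maps it back to the name.
        name = nlp.vocab.strings[match_id]
        grouped[name].append(doc[start:end].text)
    return dict(grouped)

doc = nlp("She studied at the University of Washington and later at Dartmouth College.")
result = matches_by_pattern(doc)
print(result)
```

With this approach all patterns can stay registered at once and the matcher is called a single time per document. Each hit is then filed under its own pattern name instead of the full match list being written to every key.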

My hope is for the table to look like this:

	x_university	university_of_x	x_college	college_of_x	x_community_college	community_college_of_x	csu	california_state_university	x_state_university	x_state	state_university_of_x	university_of_california	uc	institute_of_x	x_institute
DOE_JANE_RESUME.pdf	[]	[university of washington]	[]	[]	[]	[]	[]	[]	[]	[]	[]	[]	[]	[]	[]
JAMES_MICHAEL_CV.pdf	[]	[]	[dartmouth college]	[]	[]	[]	[]	[]	[]	[]	[]	[]	[]	[]	[]
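
As for the other layout mentioned at the top, one column with all institutions concatenated as a string, a rough sketch with pandas could look like the following. The `results` dictionary here is hypothetical sample data shaped like the table above, and the column names `file` and `institutions` are illustrative choices.

```python
import pandas as pd

# Hypothetical per-file results shaped like the desired table above.
results = {
    "DOE_JANE_RESUME.pdf": {"university_of_x": ["university of washington"], "x_college": []},
    "JAMES_MICHAEL_CV.pdf": {"university_of_x": [], "x_college": ["dartmouth college"]},
}

rows = []
for file_name, file_matches in results.items():
    # Flatten the hits of every pattern into one ordered list without duplicates.
    institutions = []
    for hits in file_matches.values():
        for hit in hits:
            if hit not in institutions:
                institutions.append(hit)
    rows.append({"file": file_name, "institutions": "; ".join(institutions)})

education = pd.DataFrame(rows)
print(education)

# Writing to Excel is left commented out because it needs an Excel writer
# engine (for example openpyxl) and a real output path:
# education.to_excel("education.xlsx", index=False)
```

A frame like this has one row per applicant file, so it could then be joined to the larger applicant table on whatever key identifies the applicant.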
