Python - Extract certain values from PDFs in a folder

I am using the below code to extract text from hundreds of PDF files in a specific folder:

from pypdf import PdfReaderimport osimport globpath = input("Enter the file path: ")pattern = path +"\\*.pdf"result = glob.glob(pattern)for file_name in result:   reader = PdfReader(file_name)   page = reader.pages[0]   text = page.extract_text()   print(text)

It gives me the following output:

Payslip March 2024   randomtext 123456Name of the employee

and there are numerous lines afterwards, which are irrelevant.

What I want is, to be able to extract the employees' IDs from each document, and then use that to rename each file in the folder, on a monthly basis. The files always look exactly the same, but the name of the month and the year are changing. The employee ID can be 4-6 digits, but always in the same spot.

Is there another PDF reader library, you would use? Is it possible, to extract the digits that are always in the same spot, depending on the length, which I cannot know in advance?

The rest of the code I have, to be able to use the extracted ID number as part of the file name, but I am not sure how to take the IDs from that exact spot each month. It is a string, so I was thinking of slicing, but since the month is always changing and I cannot be sure when it will be 4, 5 or 6 digits, I gave up on that idea. The randomtext is always 9 characters, including the leading space.

Thank you in advance!

I tried slicing it but it does not work in this setting. I put the code together based on a previous code that I am using to rename PDF files in a folder, but in those cases, the ID number is always present in the filename, so it is way easier.

I finished my code (I am sorry it is not nice looking, I only do this as a hobby and to try and cut out some manual processes at work):

from pypdf import PdfReaderimport osimport globimport remonth = input("Which month's reports? Type the full name of the month: ""(In the following format: MM_YYYY) ")path = input("Enter the file path: ")pattern = path +"\\*.pdf"result = glob.glob(pattern)try:   for file_name in result:      reader = PdfReader(file_name)      page = reader.pages[0]      text = page.extract_text()      mylist = text.split()      myIds = []      myIds.append(mylist[4])      mystring = ''.join(myIds)      temp = re.findall(r'\d+', mystring)      res = list(map(int, temp))      old_name = file_name      number = str(res)      number = number.replace('[', '')      number = number.replace(']', '')      new_name = path +'\\'+ number +".Company name Payslip" +"_" + \      month + old_name[-4:]      os.rename(old_name, new_name)except FileExistsError:   print("There is a duplicate payslip for EE ID " + number +". Delete one of their\ payslips.")

Python - Extract certain values from PDFs in a folder

Trending Articles

RAMAYAMPET Mandal Sarpanch | Upa-Sarpanch | Ward member Mobile Numbers Medak...

लड़कियां सेक्स के दौरान क्यों करती है उह! आह!लड़कियां सेक्स के दौरान क्यों करती...

Neem Baba Extra Questions Answer Class 6 English Poorvi

Throw Back: 4×4 — Sikilitele (Ft Castro) Prod by JQ

Rajasthan Board 10th Result 2016 Roll No wise & Name Wise

Lowe faces four theft charges

Practice Sheet of Right form of verbs for HSC Students

Mafia, Murder & Mayhem In The Motor City: Detroit Mob Hit Timeline (1937-2007)

The 10 Tennessee Cities With The Largest Black Population For 2021

Materials Around Us Class 6 Worksheet Science Chapter 6

デスクトップヒープの枯渇

Best Suvichar in Hindi |बेस्ट सुविचार |शुभ विचार हिंदी में

Kanulanu Thaake Lyrics and translation | Manam (2014)

Korean Sex Porn Videos: XXX Videos & Free Porn Movies

Teen Shot In Miami Drive-By Dies From Injuries

Download: IQ Muzatasha feat Shy D & Pmj – Ulesi NiFertilizer Yamavuto

Mahakal Attitude Status

Property developer set up cannabis factory to help pay off debts...

♡

KB: How to troubleshoot issues when adding a Hyper-V host in System Center...