I am trying to convert a PDF containing a university department's class schedule into JSON format using Python. The table in the PDF is similar to the one shown in the attached image. I have tried using simple text extraction and parsing techniques, but I am encountering two main problems:
1. Empty slots: The script is not correctly identifying and representing empty slots in the schedule.
2. Multi-line teacher names: If a teacher's name spans multiple lines in the PDF, the script is not capturing the full name correctly.

I am wondering whether this conversion is possible with ordinary Python libraries (or another language or technique), or whether I need to train an AI model for more sophisticated text understanding.
A single timetable in the PDF looks like this:
I have already tried PyPDF2 and pdfplumber for text extraction, and I am open to other libraries or tools if they can solve the two problems above. The ultimate goal is to process the resulting JSON for further analysis or visualization. How should I approach this task effectively?
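To make the target concrete, here is a minimal sketch of the post-processing step only. It assumes the table has already been extracted as a list of rows, which is the shape pdfplumber's `extract_table()` returns: empty cells come back as `None` (or an empty string) and multi-line cells contain `\n`. The layout (day names in the first column, time slots in the header row), the function name `rows_to_schedule`, and the sample data are all hypothetical and not taken from the actual PDF.

```python
import json


def rows_to_schedule(rows):
    """Turn a header row plus one row per day into a nested dict.

    Assumed (hypothetical) layout: rows[0] is the header whose first cell
    labels the day column; every later row starts with a day name.
    """
    header, *body = rows
    # Collapse any line breaks or repeated spaces inside the slot labels.
    slots = [" ".join((h or "").split()) for h in header[1:]]

    schedule = {}
    for row in body:
        day = " ".join((row[0] or "").split())
        schedule[day] = {
            # Empty or whitespace-only cells become None (JSON null);
            # multi-line cells are joined into a single line.
            slot: (" ".join(cell.split()) if cell and cell.strip() else None)
            for slot, cell in zip(slots, row[1:])
        }
    return schedule


# Hypothetical rows, shaped like the output of a table extractor.
sample_rows = [
    ["Day", "08:00-09:00", "09:00-10:00"],
    ["Monday", "CS101\nDr. Jane\nDoe", None],
    ["Tuesday", "", "MA201\nDr. John Roe"],
]

schedule = rows_to_schedule(sample_rows)
print(json.dumps(schedule, indent=2))
```

With this approach an empty slot is kept as an explicit `null` in the JSON rather than being dropped, and a name split across lines is rejoined with single spaces. Whether the extractor finds the cell boundaries correctly in the first place depends on how the table is drawn in the PDF.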