I am parsing COBOL code; an example is below. I first tried Lark, but it failed with a grammar error, so I am now using regular expressions with ElementTree, which gives incorrect output. I want the parser to be as generic as possible. I plan to use the parsed XML output as input to the StarCoder base model, to generate Java code with the same logic:
IDENTIFICATION DIVISION.
PROGRAM-ID. PAYROLL-PROCESSING.
DATA DIVISION.
WORKING-STORAGE SECTION.
01 EMPLOYEE-RECORD.
    02 EMPLOYEE-ID PIC 9(5).
    02 EMPLOYEE-NAME PIC X(30).
    02 HOURS-WORKED PIC 9(3).
    02 HOURLY-RATE PIC 9(5)V99.
    02 GROSS-SALARY PIC 9(7)V99.
    02 TAX-RATE PIC 9(3).
    02 NET-SALARY PIC 9(7)V99.
    02 BASIC-SALARY PIC 9(7)V99.
    02 HOLIDAYS PIC 9(5).
    02 HRA PIC 9(5)V99.
    02 MEDICAL-ALLOWANCE PIC 9(5)V99.
    02 TRANSPORT-ALLOWANCE PIC 9(5) VALUE "1500".
    02 LTA PIC 9(5)V99.
    02 FIXED-BONUS PIC 9(5) VALUE "15000".
    02 PERFORMANCE-BONUS PIC 9(5)V99.
    02 PROVIDENT-FUND PIC 9(5)V99.
    02 PROF-TAX PIC 9(5) VALUE "200".
    02 LWF-CONTRI PIC 9(5) VALUE "0".
    02 INCOME-TAX PIC 9(5)V99.
    02 TRUST-CONTRI PIC 9(5) VALUE "500".
PROCEDURE DIVISION.
    DISPLAY-HEADER.
    ACCEPT-EMPLOYEE-DATA.
    CALCULATE-BASIC-SALARY.
    CALCULATE-HRA.
    CALCULATE-MEDICAL-ALLOWANCE.
    CALCULATE-LTA.
    CALCULATE-PERFORMANCE-BONUS.
    CALCULATE-GROSS-SALARY.
    CALCULATE-PROVIDENT-FUND.
    CALCULATE-INCOME-TAX.
    DISPLAY-SALARY.
    STOP-RUN.
DISPLAY-HEADER.
    DISPLAY "PAYROLL PROCESSING SYSTEM".
    DISPLAY "-------------------------".
ACCEPT-EMPLOYEE-DATA.
    DISPLAY "ENTER EMPLOYEE ID: ".
    ACCEPT EMPLOYEE-ID.
    DISPLAY "ENTER EMPLOYEE NAME: ".
    ACCEPT EMPLOYEE-NAME.
    DISPLAY "ENTER HOURS WORKED: ".
    ACCEPT HOURS-WORKED.
    DISPLAY "ENTER HOURLY RATE: ".
    ACCEPT HOURLY-RATE.
    DISPLAY "ENTER HOLIDAYS TAKEN: ".
    ACCEPT HOLIDAYS.
CALCULATE-BASIC-SALARY.
    COMPUTE BASIC-SALARY = HOURS-WORKED*HOURLY-RATE.
CALCULATE-HRA.
    COMPUTE HRA = (BASIC-SALARY *10 /100).
CALCULATE-MEDICAL-ALLOWANCE.
    COMPUTE MEDICAL-ALLOWANCE = (BASIC-SALARY *5 /100).
CALCULATE-LTA.
    COMPUTE LTA = (BASIC-SALARY *12 /100).
CALCULATE-PERFORMANCE-BONUS.
    IF HOURLY-RATE < 31
        COMPUTE PERFORMANCE-BONUS = 50000
    ELSE
        COMPUTE PERFORMANCE-BONUS = 25000
    END-IF.
CALCULATE-GROSS-SALARY.
    COMPUTE GROSS-SALARY = BASIC-SALARY + HRA + MEDICAL-ALLOWANCE + LTA + PERFORMANCE-BONUS.
CALCULATE-PROVIDENT-FUND.
    COMPUTE PROVIDENT-FUND = (BASIC-SALARY *12 /100).
CALCULATE-INCOME-TAX.
    IF GROSS-SALARY > 0 AND GROSS-SALARY < 50000
        COMPUTE INCOME-TAX = GROSS-SALARY - (GROSS-SALARY * 10/100).
DISPLAY-SALARY.
    DISPLAY "EMPLOYEE ID: " EMPLOYEE-ID.
    DISPLAY "EMPLOYEE NAME: " EMPLOYEE-NAME.
    DISPLAY "BASIC SALARY: " BASIC-SALARY.
    DISPLAY "HRA :" HRA.
    DISPLAY "MEDICAL ALLOWANCE :" MEDICAL-ALLOWANCE.
    DISPLAY "TRANSPORT ALLOWANCE :" TRANSPORT-ALLOWANCE.
    DISPLAY "LTA :" LTA.
    DISPLAY "FIXED BONUS :" FIXED-BONUS.
    DISPLAY "PERFORMANCE BONUS :" PERFORMANCE-BONUS.
    DISPLAY "PROVIDENT FUND :" PROVIDENT-FUND.
    DISPLAY "PROFESSIONAL TAX :" PROF-TAX.
    DISPLAY "LWF CONTRIBUTION :" LWF-CONTRI.
    DISPLAY "INCOME TAX :" INCOME-TAX.
    DISPLAY "WELFARE TRUST CONTRI :" TRUST-CONTRI.
I am using the following Python code to parse it:
import re
import xml.etree.ElementTree as ET

def parse_cobol_code(cobol_code):
    root = ET.Element("COBOL_Code")
    current_division = None
    current_section = None
    current_paragraph = None
    for line in cobol_code.split('\n'):
        line = line.strip()
        division_match = re.match(r'^\s+\d{4}-\d{4}\s{2,}([A-Z ]+\.?)$', line)
        section_match = re.match(r'^\s{7}([A-Z ]+\.?)$', line)
        if division_match:
            current_division = ET.SubElement(root, "Division", name=division_match.group(1))
            current_section = None
            current_paragraph = None
        elif section_match and current_division is not None:
            current_section = ET.SubElement(current_division, "Section", name=section_match.group(1))
            current_paragraph = None
        elif current_section is not None and line:
            current_paragraph = ET.SubElement(current_section, "Paragraph")
            current_paragraph.text = line
    xml_content = ET.tostring(root, encoding='unicode')
    return xml_content
But I always get this XML output: <COBOL_Code />
Nothing else is added to the tree. I am new to Python, please help.
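A small check that may help narrow this down: the sketch below applies the two patterns from parse_cobol_code to a few lines after strip(), the same way the parser does. The sample lines and the helper name matches are made up for illustration and assume one statement per line; they are not taken from a real input file. If neither pattern ever matches, no Division or Section element is created, which would be consistent with the empty <COBOL_Code /> output.

```python
import re

# Patterns copied unchanged from parse_cobol_code.
DIVISION_RE = r'^\s+\d{4}-\d{4}\s{2,}([A-Z ]+\.?)$'
SECTION_RE = r'^\s{7}([A-Z ]+\.?)$'

def matches(line):
    """Hypothetical helper: return (division_hit, section_hit) for one line.

    The line is stripped first, exactly as the parser does before matching.
    """
    stripped = line.strip()
    return (re.match(DIVISION_RE, stripped) is not None,
            re.match(SECTION_RE, stripped) is not None)

# Illustrative lines only (assumed layout, one statement per line).
samples = [
    "IDENTIFICATION DIVISION.",
    "DATA DIVISION.",
    "       WORKING-STORAGE SECTION.",
]

for s in samples:
    print(repr(s), matches(s))
```

Both patterns require leading whitespace (`^\s+` and `^\s{7}`), which strip() has already removed, so they cannot match a stripped line. The division pattern also expects a digit group such as 0000-0000 that the sample code does not contain, and the character class `[A-Z ]` excludes the hyphen in WORKING-STORAGE.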