I need to parse C# code. I only need to separate the statements, taking line breaks into account. Comments, multiline comments, verbatim strings, and multiline verbatim strings must be ignored.
What I tried: I read the file into a variable and split it by line breaks (because I need the original line numbers). Then I prefix each line with its line number wrapped in a marker pattern, split the whole string on the characters ;, { and }, and remove the markers that are not needed (keeping only the first one in each piece).

    import re

    with open("./program.cs", "r") as f:
        prg = []
        # start=1 so the numbers match editor line numbers
        for number, line in enumerate(f, start=1):
            # Prefix every line with a marker that carries its line number
            prg.append(f"<#<{number}>#>{line}")

    # Split on statement terminators and block delimiters
    dotnet_lines = re.split(r'[;\{\}]', "".join(prg))
    for i in range(len(dotnet_lines)):
        dotnet_lines[i] = dotnet_lines[i].replace("\n", "")
        # Keep only the first marker of each piece
        dotnet_lines[i] = re.sub(r'(.)(\<#\<[0-9]+\>#\>)', r'\1', dotnet_lines[i])

    # Result....
    for ln in dotnet_lines:
        ocorrencia = ln.find('>#>') + 3
        line = ln[ocorrencia:]
        number = re.sub('[<#>]', '', ln[:ocorrencia])
        print(f"Ln Nr: {number} {line}")

It's a basic solution, but it doesn't handle comments or strings.
Using Pygments would be fine too, but I only want to split the code into statements:

    from pygments.lexers.dotnet import CSharpLexer
    from pygments.token import Token

    def tokenize_dotnet_file(file_path):
        with open(file_path, 'r') as file:
            code = file.read()
        lexer = CSharpLexer()
        tokens = lexer.get_tokens(code)
        # Each token is a (token_type, value) pair
        for token in tokens:
            token_type = token[0]
            token_value = token[1]
            print(f"Type: {token_type}, Value: {token_value}")

    if __name__ == "__main__":
        file_path = "./program.cs"
        tokenize_dotnet_file(file_path)

This is better, but I need the statements, not the tokens.
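
To make the goal concrete, here is a rough sketch of the kind of result I am after, written as a small hand-rolled scanner using only the standard library instead of Pygments. The names `split_statements` and `_literal_end` are made up for illustration. It assumes that splitting on ;, { and } outside comments and literals is good enough, so it still splits `for (;;)` headers, and it does not handle interpolated strings with nested quotes or C# 11 raw string literals.

```python
def _literal_end(src, i):
    """Return the index just past the string/char literal starting at i, else None."""
    for prefix in ('@$"', '$@"', '@"'):          # verbatim strings may span lines
        if src.startswith(prefix, i):
            j = i + len(prefix)
            while j < len(src):
                if src.startswith('""', j):      # doubled quote is an escaped quote
                    j += 2
                elif src[j] == '"':
                    return j + 1
                else:
                    j += 1
            return len(src)                      # unterminated: consume the rest
    if src[i] in "\"'":                          # regular string or char literal
        quote, j = src[i], i + 1
        while j < len(src) and src[j] not in (quote, "\n"):
            j += 2 if src[j] == "\\" else 1      # skip backslash escapes
        return min(j + 1, len(src))
    return None


def split_statements(source):
    """Return (line_number, text) pairs, split on ';', '{' and '}' found
    outside comments and literals. Line numbers are 1-based."""
    statements, buf = [], []
    start_line, line, i = None, 1, 0

    def emit():
        # Close the current statement, if it has any content
        nonlocal buf, start_line
        text = "".join(buf).rstrip()
        if text:
            statements.append((start_line, text))
        buf, start_line = [], None

    def space():
        # Collapse runs of whitespace (outside literals) into one space
        if buf and buf[-1] != " ":
            buf.append(" ")

    while i < len(source):
        ch = source[i]
        if source.startswith("//", i):           # line comment: skip to end of line
            j = source.find("\n", i)
            i = len(source) if j == -1 else j
            continue
        if source.startswith("/*", i):           # block comment: skip, count lines
            j = source.find("*/", i + 2)
            j = len(source) if j == -1 else j + 2
            line += source.count("\n", i, j)
            space()
            i = j
            continue
        end = _literal_end(source, i)
        if end is not None:                      # copy literals through untouched
            if start_line is None:
                start_line = line
            buf.append(source[i:end])
            line += source.count("\n", i, end)
            i = end
            continue
        if ch in ";{}":
            emit()
        elif ch.isspace():
            if ch == "\n":
                line += 1
            space()
        else:
            if start_line is None:
                start_line = line
            buf.append(ch)
        i += 1
    emit()
    return statements
```

Calling `split_statements` on the contents of ./program.cs would then give one (line number, statement) pair per statement, with comments dropped and string contents left untouched. What I am asking is whether there is a more robust way to get this, ideally on top of Pygments.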