I want to parse through a SQL Select Statement that has all the features a normal SQL dialect like MySQL has too. I looked for parsing libraries in python but couldn't find one that is doing the job. By that I mean I found some parsing libraries, but they were only able to parse through basic select statements (FROM and WHERE, not even ORDER BY). So as alternative I made my own parser (which I know is not a great solution at all). I spent a few hours working on it, but I keep getting an weird error and don't know how to approach it.Before I show the code I just want to mention that if you know a python library that is able to parse through SQL statements, not just SELECT but also CREATE TABLE, INSERT, etc., let me know.
Here is my language grammar string:
select_grammar = """ start: select_statement ";" select_statement: "SELECT" column_list "FROM" table_list join_list? where_clause? groupby_clause? having_clause? orderby_clause? column_list: "*" | column_expr ("," column_expr)* column_expr: function_call | column_name | subquery column_name: (table_name ".")? NAME ("AS" NAME)? table_name: NAME ("AS" NAME)? function_call: NAME "(" function_args ")" ("AS" NAME)? function_args: expression ("," expression)* where_clause: "WHERE" condition groupby_clause: "GROUP BY" column_expr ("," column_expr)* having_clause: "HAVING" logical_expr orderby_clause: "ORDER BY" order_column ("," order_column)* order_column: column_expr ["ASC" | "DESC"]? condition: logical_expr logical_expr: logical_term | logical_expr "AND" logical_term | logical_expr "OR" logical_term | "NOT" logical_term logical_term: comparison_expr | "(" logical_expr ")" | subquery comparison_expr: expression OPERATOR expression | expression "IS" ("NULL" | "NOT NULL") expression: (table_name ".")? NAME | INT | string | function_call | subquery table_list: table_name ("," table_name)* | subquery subquery: "(" select_statement ")" join_list: join_expr+ join_expr: join_type (table_name | subquery) "ON" condition join_type: "INNER JOIN" | "LEFT JOIN" | "RIGHT JOIN" | "FULL JOIN" string: ESCAPED_STRING | /'[^']*'/ OPERATOR: ">" | "<" | ">=" | "<=" | "=" | "!=" %import common.CNAME -> NAME %import common.INT %import common.ESCAPED_STRING %import common.WS %ignore WS"""
I also created the Transformer class, which looks like this:
@v_args(inline=True)class SelectTransformer(Transformer): def start(self, *args): print("start result: ", args) return Tree("SELECT statement", args) def column_list(self, *args): return args def column_expr(self, *args): return args[0] if len(args) == 1 else args def function_call(self, name, args, alias=None): return (name, args, alias) def subquery(self, value): print("Subquery:", value) def where_clause(self, condition=None): return condition def groupby_clause(self, *args): return args def having_clause(self, condition=None): return condition def orderby_clause(self, *args): return args def order_column(self, *args): return args def condition(self, *args): return args def logical_expr(self, *args): return args def logical_term(self, *args): return args def comparison_expr(self, *args): return args def expression(self, *args): return args[0] if len(args) == 1 else args def column_name(self, *args): if len(args) == 1: return args[0] # No alias present elif len(args) == 3: return args[0], args[2] # Alias present, return a tuple else: return args def table_list(self, *args): return args def join_list(self, *args): return args def join_expr(self, *args): return args def join_type(self, *args): return args def subquery(self, *args): return args def string(self, value): return value.strip("'") def table_name(self, *args): if len(args) == 1: return args[0] # No alias present elif len(args) == 3: return args[0], args[2] # Alias present, return a tuple else: return args
I don't know if it matters, I also created a little function that displays the final tree nicely:
def format_ast(ast, level=0): result = "" indent = " " * level if isinstance(ast, tuple): for item in ast: result += format_ast(item, level + 1) elif isinstance(ast, Token): result += f"{indent}{ast.type}, Token('{ast.value}')\n" elif isinstance(ast, Tree): result += f"{indent}Tree({ast.data}), [\n" for child in ast.children: result += format_ast(child, level + 1) result += f"{indent}]\n" else: result += f"{indent}{ast}\n" return result
Here's the statement I'm parsing:
sql_query = 'SELECT ' \'name AS alias, ' \'COUNT(age) AS age_alias, ' \'(SELECT department_name FROM departments WHERE department_id = employees.department_id) ' \'FROM employees AS emp, department ' \'INNER JOIN departments AS dep ON employees.department_id = departments.id ' \'LEFT JOIN other_table AS ot ON other_table.id = employees.table_id ' \'WHERE age > 25 ' \'GROUP BY age, name ' \'HAVING COUNT(age) > 1 ' \'ORDER BY name ASC, age DESC;'
The code I'm executing is this:
parser = Lark(select_with_joins_grammar, parser='lalr', transformer=SelectTransformer())tree = parser.parse(sql_query)# Print the custom export formatprint(format_ast(tree))
The problem is related to the method join_type() of my class SelectTransformer. Somehow *args is always empty, although it should theoretically contain (like defined in the rule) "INNER JOIN" or "LEFT JOIN" or "RIGHT JOIN" or "FULL JOIN".My output looks like this:
Tree(SELECT statement), [ Tree(select_statement), [ NAME, Token('name') NAME, Token('alias') NAME, Token('COUNT') Tree(function_args), [ NAME, Token('age') ] NAME, Token('age_alias') Tree(select_statement), [ NAME, Token('department_name') NAME, Token('departments') NAME, Token('department_id') OPERATOR, Token('=') NAME, Token('employees') NAME, Token('department_id') ] NAME, Token('employees') NAME, Token('emp') NAME, Token('department') NAME, Token('departments') NAME, Token('dep') NAME, Token('employees') NAME, Token('department_id') OPERATOR, Token('=') NAME, Token('departments') NAME, Token('id') NAME, Token('other_table') NAME, Token('ot') NAME, Token('other_table') NAME, Token('id') OPERATOR, Token('=') NAME, Token('employees') NAME, Token('table_id') NAME, Token('age') OPERATOR, Token('>') INT, Token('25') NAME, Token('age') NAME, Token('name') NAME, Token('COUNT') Tree(function_args), [ NAME, Token('age') ] None OPERATOR, Token('>') INT, Token('1') NAME, Token('name') NAME, Token('age') ]]
As you can see, no join type is displayed.I am relatively new to parsing so I don't really know what to try.