I'm new to both Airflow and Python, and I'm trying to configure a scheduled report. The report needs to pull data from Hive and email the results.
My code thus far:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.hive_operator import HiveOperator

default_args = {
    'owner': 'me',
    'depends_on_past': False,
    'start_date': datetime(2015, 1, 1),
    'email': ['email@example.com'],
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 3,
    'retry_delay': timedelta(hours=2)
}

dag = DAG(
    dag_id='hive_report',
    max_active_runs=1,
    default_args=default_args,
    schedule_interval='@once'
)

query = """
-- query goes here
"""

run_hive_query = HiveOperator(
    task_id='fetch_data',
    hql=query,
    dag=dag
)
I'm pretty sure I need to add an EmailOperator task to send the results; right now the DAG is only configured to email on failure or retry. Something like the sketch below is what I have in mind.
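This is a rough sketch, not working code: the to/subject values are placeholders, and I've left html_content as '...' because I don't know where the query results actually end up (which is the question below):

from airflow.operators.email_operator import EmailOperator

send_report = EmailOperator(
    task_id='send_report',
    to='email@example.com',
    subject='Hive report',
    html_content='...',  # this is where I'd want the query results to go
    dag=dag
)

# run the email task only after the Hive query finishes
send_report.set_upstream(run_hive_query)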
My question is this: what does the HiveOperator do with the result set? What is the best way to pass the result set from one task to another?
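From the docs I gather that tasks can share small pieces of data via XComs, so I imagine an intermediate task doing something like the sketch below. But I don't know whether HiveOperator pushes its result set to XCom at all, which is essentially what I'm asking (the function name and the assumption that 'fetch_data' pushes anything are my guesses):

from airflow.operators.python_operator import PythonOperator

def build_email_body(**context):
    # pull whatever the upstream task pushed to XCom, if anything
    results = context['ti'].xcom_pull(task_ids='fetch_data')
    return str(results)

format_results = PythonOperator(
    task_id='format_results',
    python_callable=build_email_body,
    provide_context=True,
    dag=dag
)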