Monday, June 10, 2013

Apache Oozie - Part 2: Workflow - hive action

What's covered in the blog?

1. Documentation on the Oozie hive action
2. A sample workflow that includes fs action, email action, and hive action (a query against some syslog-generated log files).

Version: 
Oozie 3.3.0

My other blogs on Oozie:

Blog 1: Oozie workflow - hdfs and email actions
Blog 2: Oozie workflow - hdfs, email and hive actions
Blog 3: Oozie workflow - sqoop action (Hive-mysql; sqoop export)
Blog 4: Oozie workflow - java map-reduce (new API) action
Blog 5: Oozie workflow - streaming map-reduce (python) action 
Blog 6: Oozie workflow - java main action
Blog 7: Oozie workflow - Pig action
Blog 8: Oozie sub-workflow
Blog 9a: Oozie coordinator job - time-triggered sub-workflow, fork-join control and decision control
Blog 9b: Oozie coordinator jobs - file triggered 
Blog 9c: Oozie coordinator jobs - dataset availability triggered
Blog 10: Oozie bundle jobs
Blog 11: Oozie Java API for interfacing with oozie workflows
Blog 12: Oozie workflow - shell action + passing output from one action to another
Blog 13: Oozie workflow - SSH action


Your thoughts/updates:
If you want to share your thoughts/updates, email me at airawat.blog@gmail.com.

About the Hive action

http://archive.cloudera.com/cdh4/cdh/4/oozie/DG_HiveActionExtension.html

Salient features of the hive action:

- Runs the specified Hive job synchronously (the workflow job will wait until the Hive job completes before continuing to the next action).
- Can be configured to create or delete HDFS directories before starting the Hive job.
- Supports Hive scripts with parameter variables; their syntax is ${VARIABLES}.
- Hive configuration needs to be specified as part of the job submission.
- Oozie EL expressions can be used in the inline configuration. Property values specified in the configuration element override values specified in the job-xml file.
- Note that the Hadoop mapred.job.tracker and fs.default.name properties must not be present in the inline configuration.
- As with Hadoop map-reduce jobs, it is possible to add files and archives in order to make them available to the Hive job. A sketch of a hive action follows this list.
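
To make these pieces concrete, here is a minimal sketch of how a hive action might look in workflow.xml; the action name, script name, paths and parameters below are illustrative assumptions, not taken from the sample application:

<action name="hiveAction">
    <hive xmlns="uri:oozie:hive-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <prepare>
            <!-- Illustrative: remove any previous output before the job runs -->
            <delete path="${nameNode}/user/hive/warehouse/eventcountreport"/>
        </prepare>
        <job-xml>hive-site.xml</job-xml>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
        </configuration>
        <script>runReport.q</script>
        <!-- Substituted for ${YEAR} and ${MONTH} inside the script -->
        <param>YEAR=2013</param>
        <param>MONTH=5</param>
    </hive>
    <ok to="end"/>
    <error to="killAction"/>
</action>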

Components of a workflow with hive action:

For a workflow with (just a) hive action, the following are required:
1.  workflow.xml
2.  job.properties
3.  Any files, archives, jars you want to add
4.  hive-site.xml
5.  Hive query scripts
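
For reference, a hedged sketch of what the job.properties for such a workflow might contain; the host names, ports and paths are placeholders, not values from the sample application:

# Cluster endpoints - replace with your own
nameNode=hdfs://cdh-nn01:8020
jobTracker=cdh-jt01:8021
queueName=default

# Make the Hive sharelib available to the hive action
oozie.use.system.libpath=true

# HDFS location of the workflow application (workflow.xml, hive-site.xml, scripts)
oozie.wf.application.path=${nameNode}/user/${user.name}/oozieProject/workflowHiveAction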

Refer to the sample program below.

Sample program

Highlights:

The workflow application runs a report on data in Hive. The input is syslog-generated log data in a Hive table; the output is a Hive table containing the report results.
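
For illustration, a hedged sketch of what such a report query might look like, saved as, say, runReport.q; the SysLogEvents table name appears in the comments below, while the columns, parameters and report definition are assumptions:

use default;

drop table if exists eventCountReport;

-- Report: count of syslog events per node for the given month;
-- ${YEAR} and ${MONTH} are supplied as <param> elements by the hive action
create table eventCountReport as
select node, count(*) as eventCount
from SysLogEvents
where year = ${YEAR} and month = ${MONTH}
group by node;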

Pictorial overview of application:

(A diagram of the workflow application was embedded here.)
Application:
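
The application files (workflow.xml, job.properties, hive-site.xml and the Hive query scripts) were embedded here. In their place, a hedged skeleton of a workflow combining the fs, hive and email actions; the action names, email address and paths are illustrative:

<workflow-app name="WorkflowWithHiveAction" xmlns="uri:oozie:workflow:0.1">
    <start to="fsAction"/>

    <!-- Clean up the prior report table's storage before rebuilding it -->
    <action name="fsAction">
        <fs>
            <delete path="${nameNode}/user/hive/warehouse/eventcountreport"/>
        </fs>
        <ok to="hiveAction"/>
        <error to="sendErrorEmail"/>
    </action>

    <!-- Run the report query; see the hive action sketch earlier in the post -->
    <action name="hiveAction">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <job-xml>hive-site.xml</job-xml>
            <script>runReport.q</script>
            <param>YEAR=2013</param>
            <param>MONTH=5</param>
        </hive>
        <ok to="sendSuccessEmail"/>
        <error to="sendErrorEmail"/>
    </action>

    <action name="sendSuccessEmail">
        <email xmlns="uri:oozie:email-action:0.1">
            <to>someone@example.com</to>
            <subject>Report workflow ${wf:id()} succeeded</subject>
            <body>The Hive report completed successfully.</body>
        </email>
        <ok to="end"/>
        <error to="killAction"/>
    </action>

    <action name="sendErrorEmail">
        <email xmlns="uri:oozie:email-action:0.1">
            <to>someone@example.com</to>
            <subject>Report workflow ${wf:id()} failed</subject>
            <body>Error message: ${wf:errorMessage(wf:lastErrorNode())}</body>
        </email>
        <ok to="killAction"/>
        <error to="killAction"/>
    </action>

    <kill name="killAction">
        <message>Hive report workflow failed</message>
    </kill>

    <end name="end"/>
</workflow-app>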


Oozie web console:

Screenshots of application execution:

(Screenshots of the job as seen in the Oozie web console were embedded here.)
13 comments:

  1. Hi,
    Can you please show me how to move files from a local directory to HDFS every hour using an Oozie workflow?

    I have tried the Oozie fs action, but it only works if I'm moving a file from one folder to another within HDFS.

    I want to move a file from a local directory to HDFS. Please help!

    Replies
    1. Hi Khatutshelo-
      Check out my blog 13 on the Oozie ssh action.

      Cheers,
      Anagha

    2. Hi Anagha, what is the difference between the ssh action and the shell action? When should I go for the ssh action over the shell action? Could you please give some idea on the same?
  2. I also have the same problem. Could you please post the process?
  3. There are several avenues for ingesting data into HDFS.
    Here is a good discussion on this topic to help arrive at the best solution based on your requirements:
    http://www.linkedin.com/groups/Data-Ingestion-Into-Hadoop-3638279.S.199779955
  4. Hive + Oozie workflow problem.

    My task is to create an Oozie workflow to load data into Hive tables every hour.

    I am using Hue 2.3.0.

    When I run the command: LOAD DATA INPATH '/user/username1/data/data3.txt' INTO TABLE raw_data; it works perfectly, and the data gets loaded into the Hive table.

    When I run the same command from the Oozie workflow, the job gets killed at 66% and the error message is: Main class [org.apache.oozie.action.hadoop.HiveMain], exit code [10001]

    ----
    But when I replace "LOAD DATA INPATH '/user/username1/data/data3.txt' INTO TABLE raw_data;" with "create external table table_data (json string) LOCATION '/user/username_k/20130925_data';" the Oozie workflow works properly.
    May you please help?
  5. Khathutshelo, can you send me your scripts/code, directory structure, sample files etc. so I can try to replicate? Thanks.
  6. Hi Anagha,

    I am using Hue 2.3.0.
    -----------------------------------
    What I did is:

    Create an external table1 where the location is the HDFS directory with the data; it has 4 columns.

    Create another external table2 with 3 columns and a partition column.

    Write a script to insert into table2 with the partition from table1.
    The problem is that the insert script works properly if I put it in the query editor, but from the workflow it throws a 'Table not found' error.

    table1 and table2 are both in the same database (adhoc).

    hive-script.q
    ---------------------
    INSERT OVERWRITE TABLE production1 PARTITION(dt, hour)
    SELECT a, b, datastring, hour FROM staging2;
    ---------------------
  7. Since the partition name will be created automatically, I decided not to use the following method:
    Alter table SysLogEvents Add IF NOT EXISTS partition(node="cdh-vms", year=2013, month=05)

    However, I make sure that the first table has one more column than the second table that has the partition. The first column of the first table becomes the partition name on the second table.

    E.g. table1 has 3 columns, and table2 has 2 columns and a partition column.
  8. Hi Anagha,

    Thanks for the blog. Would you mind giving an example of creating a workflow wherein you can pass parameters, and also how to run it?

    Thanks for the wonderful blog and helping us out!
  9. Hi,

    I'm planning to pass the Oozie workflow id as a parameter and input value to a Hive script. How do I do this? Please advise.

    create table wfl_tbl(date string, wf_id string);
    insert into wfl_tbl select '${wf:id()}', '2014-015-15' from dummy;

    I'm submitting the above from the Hue Oozie editor and it is taking ${wf:id()}. Please advise.

    Regards,
    R
  10. Hi Anagha,

    I am trying to create a directory based on the job_id created after running the Oozie job, using a shell script. Can you help me? Can we pass the JOB_ID as a parameter while executing the script? If so, how do I do it?