Wednesday, March 23, 2016

Execute Oozie step within Pentaho Data Integration (PDI)

I recently found the need to execute a shell script within a Hadoop cluster. For a client project, after processing files via Pentaho MapReduce steps, we needed to move files on HDFS to an 'archive' location within the same cluster. A shell script containing HDFS commands, executed on an edge node, would easily perform the move.

However, we were limited by the fact that we were running the Spoon client from Windows machines. We could have executed the 'archive' shell script from the Data Integration server running on Linux, but that would force us to always use the DI server for execution.

What other options exist? Enter Oozie. Oozie ships with most, if not all, Hadoop distributions. Integrated with the Hadoop ecosystem, Oozie can orchestrate a workflow of different Hadoop and operating system utilities. Oozie supports Hive, Pig, Sqoop, and MapReduce actions as well as system-level commands, including shell scripts. More information on Oozie can be found here:
http://hortonworks.com/hadoop/oozie/.

Using Oozie, we can execute the shell script from our PDI job regardless of client operating system (i.e. Spoon running on Windows, the Data Integration Server running on Linux, or the kitchen script running on Linux).

Oozie requires a directory on HDFS referred to as oozie.wf.application.path. This required property points to the location of the application components. Within this directory, you upload the components referenced from your Oozie workflow (e.g. Pig scripts, Hive SQL files, Java jar files, etc.). Alongside those component files, workflow.xml specifies the actions and flow to be orchestrated on the cluster.

For my client, I needed to execute a shell script to move files from the ingest directory to an archive directory. The following workflow.xml was uploaded to the HDFS directory specified via oozie.wf.application.path.

<workflow-app name="wf_archive_files" xmlns="uri:oozie:workflow:0.5">
    <start to="archive-files-action"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="archive-files-action">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>${scriptPath}</exec>
            <argument>${src_dir}</argument>
            <argument>${target_dir}</argument>
            <file>${scriptPath}#${script}</file>
            <capture-output/>
        </shell>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>

You'll notice the workflow is chock full of properties. In addition to workflow.xml, Oozie requires properties to be passed at execution time. Within PDI's Oozie Job Executor step, there are two ways to specify properties: via Quick Mode, you specify the location of a job.properties file; via Advanced Mode, you specify each property individually. I utilized Advanced Mode and specified the following properties:

oozie.wf.application.path=${nameNode}/app/archive_files
nameNode=hdfs://nameservice1
jobTracker=localhost:8032

scriptPath=/app/archive_files/${script}
script=archive_files.sh

src_dir=${SRC_DIR}
target_dir=${ARCHIVE_DIR}

Using Advanced Mode, I could dynamically specify the source & target folders using parameters passed to my PDI job.

Along with workflow.xml, the archive_files.sh shell script also needed to be uploaded to the same location specified via oozie.wf.application.path.
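Uploading the two components might look like the following; the application path matches the oozie.wf.application.path value used above:

```shell
# Create the application path and upload the workflow components
hdfs dfs -mkdir -p /app/archive_files
hdfs dfs -put workflow.xml archive_files.sh /app/archive_files/
```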

#!/bin/bash
# bash (not plain sh) is required for the [[ ]] tests and ${var/...} expansions below
if [ "$#" -ne 2 ]; then
   echo "Usage: $0 <src_dir> <target_dir>"
   echo "Example: $0 /data/landing/app_stream /data/archive/app"
   exit 1
fi

src_dir=$1
target_dir=$2

# List everything under the source directory; $8 is the path column of 'hdfs dfs -ls'
for i in $(hdfs dfs -ls -R "$src_dir" | awk '/a/ {print $8}'); do
   # Replace the source prefix with the target prefix
   target_folder=${i/#$src_dir/$target_dir}

   if [[ "$i" == *.txt ]]; then
      hdfs_cmd="hdfs dfs -mv $i ${target_folder%/*}"
      echo "$hdfs_cmd"
      $hdfs_cmd
   # Exclude tmp files being written by Flume
   # Only grab directories to be created
   elif [[ "$i" != *.tmp ]]; then
      # Create the target directory
      hdfs_cmd="hdfs dfs -mkdir -p $target_folder"
      echo "$hdfs_cmd"
      $hdfs_cmd
   fi
done
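The move relies on two bash parameter expansions: ${i/#$src_dir/$target_dir} replaces the leading source prefix with the target prefix, and ${target_folder%/*} strips the file name to leave just the destination directory. A quick sketch with hypothetical paths:

```shell
src_dir=/data/landing/app_stream
target_dir=/data/archive/app
i=$src_dir/2016/03/file1.txt

target_folder=${i/#$src_dir/$target_dir}   # replace leading prefix only
echo "$target_folder"        # prints /data/archive/app/2016/03/file1.txt
echo "${target_folder%/*}"   # prints /data/archive/app/2016/03
```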

With all of the components in place on HDFS and the properties specified within the Oozie Job Executor step, we can now execute our job and have files 'archived' from ${SRC_DIR} to ${ARCHIVE_DIR}. Just make sure to define ${SRC_DIR} and ${ARCHIVE_DIR} as parameters of the PDI job, or specify the properties explicitly.
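As a sanity check outside of PDI, the same workflow can be submitted from the Oozie command-line client. The Oozie server URL below is an assumption for your cluster, and job.properties would contain the same properties listed above:

```shell
# Submit and start the workflow; prints a job ID you can poll with 'oozie job -info <id>'
oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run
```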

Friday, June 14, 2013

Pentaho SSO integration with OWF/CAS

Background: Client's UI application is a dashboard consisting of a banner (with navigation bread crumbs & other controls) that calls into a Pentaho dashboard to render dashboard content below the banner. Their application will then be displayed as a widget within Ozone Widget Framework (OWF).

For their development environment & POC, OWF/CAS needed to be installed. Following the OWF installation guides (shipped with the OWF distribution), we had to create and use a self-signed certificate because the client did not have a certificate from a Certificate Authority. The Tomcat for OWF/CAS has its keystore specified as $OWF_HOME/certs/keystore.jks; the self-signed cert gets imported into that keystore.

To configure Pentaho, first ensure Pentaho is fully running and operational. OWF/CAS also uses HSQLDB, so there may be a port conflict between the Pentaho and OWF HSQLDB instances. The easiest fix, if possible, is to remove the sample data. Follow the instructions on InfoCenter, but also delete the data connection definition within the datasource table. If the datasource is NOT deleted, Tomcat hangs on startup when attempting to connect to HSQLDB: no error message is displayed or written to the log files, and Tomcat never completes startup.

The second step is to configure Pentaho to use SSL. Once again, for this client, we had to use a self-signed certificate. These instructions are also on InfoCenter. After creating and importing the certificate, remember to modify tomcat/conf/server.xml to enable the SSL connector (port 8443). Once complete, test Pentaho running on 8443.

The third step is to run the Ant script that modifies the Pentaho configuration files to perform SSO via CAS. Before proceeding, make a backup of the Pentaho directory or snapshot the VM. Once again, the steps to switch Pentaho to CAS are documented on InfoCenter. When specifying the cas.authn.provider property, I used 'memory'. I later modified Pentaho to use JDBC to retrieve user details (authorities).

After starting up, navigating to the Pentaho User Console (PUC) should result in a redirect to the CAS login page. Enter your credentials as defined in the OWF help guides (testUser1, testAdmin1). If you are using CA certificates, everything 'should' work.

But...if you see the casFailed JSP page on the browser, you may also find the following exception in the log files:

23:19:26,894 ERROR [Cas20ServiceTicketValidator] javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

Searching the net, you'll find many blog posts and notes on this exception. The gist is that communication between the two servers is not trusted. If CA certificates were used, the certificates would be trusted; but because we used self-signed certs, we have to perform additional steps. The certificates within the OWF/CAS keystore need to be imported into the Pentaho keystore. List all of the certs in the OWF/CAS keystore using the following command, executed from the $OWF_HOME/certs directory:

keytool -list -keystore keystore.jks

Then export the certificates listed using their aliases. For example:

keytool -exportcert -keystore keystore.jks -alias owf -file owf.cer

Now import those certificate files into Pentaho's keystore. Pentaho's keystore is $PENTAHO_HOME/java/lib/security/cacerts. Using the following command, import the OWF/CAS certificates into Pentaho's keystore. Repeat as necessary for each certificate.

keytool -import -keystore cacerts -storepass changeit -noprompt -alias owf -file ${PATH_TO_OWF_CERT_FILES}/owf.cer
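To verify the import took, you can list the alias back out of Pentaho's keystore (run from $PENTAHO_HOME/java/lib/security; the alias matches the export example above):

```shell
# Prints the certificate fingerprint for the imported alias if present
keytool -list -keystore cacerts -storepass changeit -alias owf
```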

Restart Pentaho and integration between OWF/CAS and Pentaho using self-signed certs is complete. Users can now create OWF widgets pointing to Pentaho content (Pentaho User Console, dashboard, report, etc) and the widget will display seamlessly, without requiring the user to log into Pentaho.

Wednesday, December 23, 2009

Monitoring in JBoss

While at a client site or within a testing environment, have you ever started to wonder how many users are on the application? How is your application running with regard to memory (heap size)? Are you close to using all of the database connections in the connection pool? For some of these questions, your application container may provide a status page (Tomcat) or monitoring screen (WebLogic).

To facilitate recording these statistics, JBoss includes the ability to log/monitor JMX MBean values. And it's not difficult to install. Once values are being logged, you no longer have to keep refreshing the JMX console to see them update.

For installing and monitoring of your web application(s), perform the following steps:
  1. Copy $JBOSS/docs/examples/jmx/logging-monitor/lib/logging-monitor.jar into $JBOSS/server/server_name/lib
  2. Create monitor XML files to monitor JMX MBeans (samples below)
  3. Copy the monitor XML files into $JBOSS/server/server_name/deploy
And there you have statistics being saved to a log file...ready for you to parse and turn into pretty little graphs for the rest of the world to read and understand. For our deployments, we used these logs to track & monitor the following items. As a side effect of monitoring, you also get to see peak usage times and, potentially, performance / memory related issues.
  • DB connections
    • In use
    • Available
    • Max Connections In Use
  • JVM activity
    • Heap size
    • Threads
For additional reading or research on monitoring within JBoss, check out the following links:
DB Connection Monitoring Sample
Here's the XML necessary to monitor a JDBC connection pool (XML comments omitted):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE server PUBLIC
    "-//JBoss//DTD MBean Service 4.0//EN"
    "http://www.jboss.org/j2ee/dtd/jboss-service_4_0.dtd">

<server>
  <mbean code="org.jboss.services.loggingmonitor.LoggingMonitor"
         name="jboss.monitor:type=LoggingMonitor,name=MY-DSMonitor">

    <attribute name="Filename">${jboss.server.home.dir}/log/my-ds.log</attribute>
    <attribute name="AppendToFile">false</attribute>
    <attribute name="RolloverPeriod">DAY</attribute>
    <attribute name="MonitorPeriod">10000</attribute>

    <attribute name="MonitoredObjects">
      <configuration>
        <monitoredmbean name="jboss.jca:name=MY-DS,service=ManagedConnectionPool"
                        logger="jca.my-ds">
          <attribute>InUseConnectionCount</attribute>
          <attribute>AvailableConnectionCount</attribute>
          <attribute>ConnectionCreatedCount</attribute>
          <attribute>ConnectionDestroyedCount</attribute>
          <attribute>MaxConnectionsInUseCount</attribute>
        </monitoredmbean>
      </configuration>
    </attribute>

    <depends>jboss.jca:name=MY-DS,service=ManagedConnectionPool</depends>
  </mbean>
</server>
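Each monitored attribute is written through the named logger (jca.my-ds above) into my-ds.log. The exact line layout depends on your log4j pattern, but pulling a single metric out for graphing is a one-liner; the sample log line below is hypothetical:

```shell
# Hypothetical log line; pipe your actual my-ds.log through the same awk filter
echo "2009-12-23 10:15:00 INFO  [jca.my-ds] InUseConnectionCount=7" |
  awk -F'=' '/InUseConnectionCount/ {print $2}'
# prints 7
```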

Friday, December 18, 2009

Utilizing OpenSymphony Caching (OSCache) with iBatis

Background
My current project is ramping up to be deployed into a centrally hosted data center, to be accessed by a large volume of users. In past deployments, we could expect, at most, 200-300 users to be logged into our web application. Translating that number to active users, we could expect somewhere in the range of 25 to 100 active users at any given point in time.

With deployment into a centrally hosted environment, our anticipated user base significantly increases to be approximately 1000 concurrent, active users. With this large number of users, we wanted to investigate caching frequently accessed objects.

The major area in our web application that currently utilizes caching is the DAO level. Our DAOs leverage iBatis and already use the iBatis in-memory cache implementation. Moving to a centrally hosted environment, the application will be clustered to support fail-over and high availability. However, using the iBatis cache, we risk users seeing different objects depending on which clustered instance they are assigned to and when that instance's iBatis cache was last refreshed. To synchronize cache flushing across the cluster, we decided to investigate incorporating a distributed caching mechanism. Since iBatis supports OSCache, we decided to start there.

Implementation
As mentioned previously, the iBatis documentation refers to using OSCache for distributed caching within a clustered environment. Assuming you are using Maven to build your project, configuring iBatis to use OSCache is extremely easy and well documented.
  1. Modify the project's pom.xml to include oscache (2.4) as a dependency
  2. Identify and modify the sqlMap / DAO that you wish to use OSCache; change the cacheModel type to "OSCACHE"
  3. Optionally include an oscache.properties file on the classpath. Within development or continuous integration (CI) environments, this file does not need to be included, as default properties will be applied. Deployments in production or other test areas can include the file.
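Step 2 above amounts to a small sqlMap change; a sketch, with hypothetical statement and cache names:

```xml
<!-- cacheModel switched from an in-memory type (MEMORY/LRU/FIFO) to OSCACHE -->
<cacheModel id="productCache" type="OSCACHE">
  <flushInterval hours="24"/>
  <flushOnExecute statement="insertProduct"/>
</cacheModel>

<select id="getProducts" resultClass="com.example.Product" cacheModel="productCache">
  select * from product
</select>
```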
For configuration options and settings, including clustering instructions, refer to the OSCache web site. The clustering option sends a flush event that causes all caches within the cluster to be flushed. Currently, OSCache can be configured to send flush events via JMS or JavaGroups (multicast). All of the settings are nicely documented.

http://www.opensymphony.com/oscache/wiki/Configuration.html
http://www.opensymphony.com/oscache/wiki/Clustering.html

Tuesday, April 7, 2009

Retrieving a list of changes for a release in SVN

Recently for our project, we needed to review the list of SVN commits on a branch. I'm sure there are several ways to do this, possibly even including date ranges. For our purposes, we wanted to review the entire list and used the following command:

svn log --stop-on-copy -v https://host/svn/project/branches/project-0.8.0902.1

And sha-bam, a nice, lengthy report showing the status of the changes committed to this branch.
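If you do want to bound the list by dates instead, svn log also accepts revision ranges expressed as dates; the range below is illustrative:

```shell
# Commits on the branch between the two dates (exclusive of the end date's later commits)
svn log -v -r {2009-03-01}:{2009-04-07} https://host/svn/project/branches/project-0.8.0902.1
```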

Thanks goes to one of our many, in-house, resident SVN experts!

Tuesday, August 5, 2008

Having to scale a web application?

In a recent sprint planning meeting, solution owners unveiled how a new customer would be using our web application. Of course, it's a web application and, as such, needs to support many concurrent users. Up until now, we were looking for our application to support roughly 500 users; probably not concurrent, but potentially.

Having learned how the new client will use the application, we may blow those numbers out of the water, in both overall and concurrent users. So, how do you program an application to support very large volumes of users? Will the application scale if we cluster the web application servers? Will performance degrade with more users? Can we just throw a bigger machine with more CPU & memory at the load?

I'm a big believer that scalability needs to be designed in from the get-go and then monitored. While some items may not be implemented immediately due to project constraints, scalability needs to be considered from day 1, and the code watched and reviewed to ensure new designs, code, etc. will not adversely affect scalability and, potentially, performance.

Two articles published on TSS about scaling JEE applications "hit the nail on the head". If you're faced with having to support large volumes of users, these articles are a must-read, as the writer had (has?) the opportunity to pound on applications to test their ability to scale under heavy load AND to analyze why they failed or succeeded.

If you're not having to program for scalability now, the articles are still an excellent read & resource!

Scaling Your Java EE Applications - Part 1
Scaling Your Java EE Applications Part 2

Tuesday, July 1, 2008

SVN Merge (Trunk to Branch)

Ever have code changes that need to be pushed into a branch? Or merged back into HEAD or the trunk?

Recently (today), I had the need to merge code changes from HEAD into a newly created branch. Given that my changes spanned a couple of weeks (no lectures, please, as I was on vacation :D), I could not remember all of the lines that were changed across 9 files. I didn't want to blindly copy the files into the branch, as I might (though I shouldn't, really) overwrite another developer's changes.

After a quick Google search and a read of a short blog posting, I found a quick path forward. For the same reasons that caused Jake to write his blog post, I'm also writing this so that I can easily find it.

I checked my files into HEAD and noted the revision number (7200). Then I changed to the directory of the branch working copy and ran the following command:

svn merge -r 7199:7200 https://phlcvs01/svn/netcds/trunk .

If you want to preview the changes, specify '--dry-run', which causes SVN to list the changes that would occur. Using '-r 7199:7200' causes Subversion to only grab the differences between those revisions. After executing the command, 'svn stat' shows the modified files that you need to check into the branch.
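For example, previewing the same merge before applying it:

```shell
# Lists the files that would change; makes no modifications to the working copy
svn merge --dry-run -r 7199:7200 https://phlcvs01/svn/netcds/trunk .
```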

Simple and easy.