Sunday, April 7, 2013

Informatica HParser and MapR for Hadoop


The Apache Hadoop software framework has become the leading solution for massive, data-intensive, distributed applications. It is more mature than the alternatives and has proven to scale better. As a generic rather than proprietary data platform, it is more useful, flexible, and affordable; it handles both structured and unstructured data well; and its many connector products have extended its reach beyond that of other frameworks used for Big Data applications.

Informatica HParser gives Hadoop developers parsing capabilities for data sources that include logs, call detail records, industry standards, documents, and binary or hierarchical data. This easy-to-use, codeless parsing software makes it possible to process any file format inside Hadoop efficiently and at scale.

·  Easily access complex data sources and develop data transformations for Hadoop parsing, with the broadest support for data formats.
·  Eliminate the time-consuming and tedious process of developing and testing data transformations in Java and Perl.

MapR provides full data protection, no single points of failure, improved performance, and dramatic ease-of-use advantages. The MapR Distribution for Apache Hadoop adds innovation to the excellent work already done by a large community of developers. With key new technology advances, MapR transforms Hadoop into a dependable and interactive system with real-time data flows.

The MapR Distribution for Apache Hadoop is 100% API-compatible with Apache Hadoop, including MapReduce, HDFS, and HBase. MapR fully tests and supports the complete distribution, combining MapR’s intellectual property with the best of the best from the community, including the latest patches.

Key features of the combination of MapR and Informatica include:

·  Bi-directional data integration with Informatica PowerCenter and Informatica PowerExchange.
·  Snapshot replication using Informatica FastClone.
·  Data streaming using Informatica Ultra Messaging.
·  Parallel parsing and transformation on MapR using Informatica HParser.

MapR has partnered with Informatica to provide the Community Edition of HParser:

·  The HParser package can be downloaded from Informatica as a Zip archive that includes the HParser Engine, the Data Transformation HParser Jar file, HParser Studio, and the HParser Operator Guide.
·  The HParser Engine is also available as an RPM via the MapR repository, making it easier to install on all nodes in the cluster.

HParser can be installed on a MapR cluster running CentOS or Red Hat Enterprise Linux.

To install HParser on a MapR cluster:

·  Register on the Informatica site.
·  Download the Zip file containing the Community Edition of HParser, and extract it.
·  Familiarize yourself with the installation procedure in the HParser Operator Guide.
·  On each node, install HParser Engine from the MapR repository by typing the following command as root or with sudo:

       yum install hparser-engine

·  Choose a Command Node, a node in the cluster from which you will issue HParser commands.
·  Following the instructions in the HParser Operator Guide, copy the HParser Jar file to the Command Node and create the HParser configuration file.
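
The per-node install step above can get repetitive on a larger cluster. The following is only a minimal sketch of how it might be scripted, not a procedure from the HParser Operator Guide. The host names node1, node2 and node3 are hypothetical, and the script assumes passwordless ssh and sudo rights on every node; the -y flag simply tells yum to answer yes to its prompts. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical host names; replace them with the nodes of your own cluster.
NODES="${NODES:-node1 node2 node3}"

# Build the per-node install command. The package name comes from the steps
# above; ssh, sudo and yum's -y flag are assumptions about the environment.
install_cmd() {
    printf 'ssh %s sudo yum install -y hparser-engine' "$1"
}

# Print each command by default; set DRY_RUN=0 to actually run them.
for node in $NODES; do
    cmd=$(install_cmd "$node")
    if [ "${DRY_RUN:-1}" = "0" ]; then
        $cmd
    else
        echo "$cmd"
    fi
done
```

Running the script as-is lists one ssh command per node so you can review them first. Copying the Jar file and creating the configuration file on the Command Node are left to the Operator Guide.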

