Friday, April 12, 2013

Why Choose MongoDB over Other Document-Oriented NoSQL Stores?

MongoDB is a document-oriented datastore. NoSQL data stores such as MongoDB and Cassandra are so different from one another that apples-to-apples comparisons are practically impossible; that is why the NoSQL world is divided into subcategories such as key-value stores, graph databases, and document-oriented stores. Here are some of the reasons to choose MongoDB over other NoSQL stores:

  • MongoDB is an open-source, schema-free document store written in C++.
  • Support for a wide array of programming languages, so applications written in Java, Ruby, PHP, and many others can leverage Mongo.
  • A rich query language. Developers can access Mongo via its own shell, which uses a JavaScript-based query language (see the shell sketch after this list).
  • Easy to learn
  • Quick Installation
  • Large community. For example, the community around Mongo has created higher-level, ORM-like libraries that build on the core drivers, providing a closer mapping between objects in code and documents.
  • Plugin support
  • In addition to strong community backing and commercial support, Mongo benefits from excellent documentation. A number of published books are available.
  • Extensive driver support
  • Support on Windows
  • Documents are stored in a binary JSON format, dubbed BSON. JSON is an easily understandable format: humans can read it readily (as opposed to XML, for example) and machines can parse it efficiently.
  • Out of the box, Mongo supports sharding, which permits horizontal scaling by partitioning a collection of documents across a cluster of nodes, spreading the read and write load.
  • Mongo offers replication in two modes: master-slave and replica sets. In a replica set there is no fixed master node; a primary is elected automatically, and another member can take over if it fails, so there is no single point of failure. Replica sets therefore bring more fault tolerance to larger environments supporting massive amounts of data.
  • No need for massive hardware expenditures. Mongo can run on commodity hardware platforms, provided there is a healthy amount of memory.
  • Mongo offers MapReduce, a batch-processing and aggregation facility somewhat similar to SQL's GROUP BY (see the mapReduce sketch after this list).
  • Role-Based Privileges allow organizations to assign more granular security policies for server, database and cluster administration.
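
To make the JSON/BSON and query-language points above concrete, here is a minimal mongo shell sketch; the collection and field names are illustrative, not part of any particular application.

// Start the shell by running "mongo", then work with JSON-like documents directly.
// The collection and field names below (articles, title, author, views) are illustrative.
db.articles.insert({
    title: "Why MongoDB?",
    author: "jdoe",
    tags: ["nosql", "document-store"],
    views: 120
});

// Roughly equivalent to: SELECT * FROM articles WHERE views > 100 ORDER BY views DESC
db.articles.find({ views: { $gt: 100 } }).sort({ views: -1 });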
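
And here is a small sketch of Mongo's mapReduce grouping and summing values much like SQL's GROUP BY; again, the orders collection and its custId/amount fields are only illustrative.

// Total order amount per customer, similar in spirit to
// SELECT custId, SUM(amount) FROM orders GROUP BY custId
var mapFn = function () { emit(this.custId, this.amount); };
var reduceFn = function (key, values) { return Array.sum(values); };

db.orders.mapReduce(mapFn, reduceFn, { out: "order_totals" });
db.order_totals.find();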

Sunday, April 7, 2013

Informatica HParser and MapR for Hadoop


The Apache Hadoop software framework has become the leading solution for massive, data-intensive, distributed applications. More mature than other solutions, it has also proven to be better at scaling; more useful, flexible, and affordable as a generic rather than proprietary data platform; and excellent at handling structured and unstructured data. Its many connector products have broadened its use well beyond that of other Big Data frameworks.

Informatica HParser provides Hadoop developers with parsing capabilities to address data sources that include logs, call data records, industry standards, documents and binary or hierarchical data. This easy-to-use, codeless parsing software enables processing of any file format inside Hadoop with scale and efficiency.

·  Easily access complex data sources and develop data transformations for Hadoop parsing, with broad support for data formats.
·  Eliminate the time-consuming and tedious process of developing and testing data transformations in Java and Perl.

MapR provides full data protection, no single points of failure, improved performance, and dramatic ease of use advantages. The MapR Distribution for Apache Hadoop adds innovation to the excellent work already done by a large community of developers. With key new technology advances, MapR transforms Hadoop into a dependable and interactive system with real-time data flows.

The MapR Distribution for Apache Hadoop is 100% API compatible with Apache Hadoop including MapReduce, HDFS, and HBase. MapR fully tests and supports the complete distribution, combining MapR’s intellectual property with the best of the best from the community, including the latest patches.

Key features of the combination of MapR and Informatica include:

·  Bi-directional data integration with Informatica PowerCenter and Informatica PowerExchange.
·  Snapshot replication using Informatica FastClone.
·  Data streaming using Informatica Ultra Messaging.
·  Parallel parsing and transformation on MapR using Informatica HParser.

MapR has partnered with Informatica to provide the Community Edition of HParser:

·  The HParser package can be downloaded from Informatica as a Zip archive that includes the HParser engine, the Data Transformation HParser Jar file, HParser Studio, and the HParser Operator Guide.
·  The HParser engine is also available as an RPM via the MapR repository, making it easier to install the HParser Engine on all nodes in the cluster.

HParser can be installed on a MapR cluster running CentOS or Red Hat Enterprise Linux.

To install HParser on a MapR cluster:

·  Register on the Informatica site.
·  Download the Zip file containing the Community Edition of HParser, and extract it.
·  Familiarize yourself with the installation procedure in the HParser Operator Guide.
·  On each node, install HParser Engine from the MapR repository by typing the following command as root or with sudo:
yum install hparser-engine
·  Choose a Command Node, a node in the cluster from which you will issue HParser commands.
·  Following the instructions in the HParser Operator Guide, copy the HParser Jar file to the Command Node and create the HParser configuration file.
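
Taken together, the steps above boil down to a short command-line sequence. The sketch below is only an outline: the actual Jar and configuration file names come from the HParser Operator Guide, so the paths shown here are placeholders.

# On every node in the MapR cluster, as root or with sudo:
yum install hparser-engine

# On the chosen Command Node, copy the HParser Jar file from the extracted
# Zip archive and create the configuration file (placeholder paths):
cp /path/to/extracted/hparser/<hparser-jar-file>.jar /opt/hparser/
vi /opt/hparser/<hparser-configuration-file>.xml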


Tuesday, April 2, 2013

Apache Hadoop ecosystem - March 2013

The Apache Hadoop ecosystem continues to evolve at a rapid pace, with new projects entering the incubator while those currently under incubation get ready to graduate. Let's visit the current state of the open source Apache Hadoop ecosystem.


Please check here for more details.