Girish Pancha is a data integration veteran who spent the better part of two decades working at integration powerhouse Informatica. In 2013, he had "done all there was in data integration." Or so he thought.
He was wrong, although he wasn't easily convinced when a former employee, Arvind Prabhakar, first pitched him on a new approach to integrating Hadoop with enterprise applications. Prabhakar had worked under Pancha before leaving the company in 2010 to join Cloudera, then a small Hadoop startup. While working with large customers there, he saw firsthand the data flow and fidelity problems that arose between enterprise applications and the Hadoop cluster.
Pancha was now dabbling in angel investing, primarily in the field of data. When a mutual acquaintance brought them back together, Prabhakar shared his experience and his idea for solving the problem with Pancha.
"One of his theses was that the problem of data integration into at least Hadoop had reverted to being a very manual, opaque and brittle process," said Pancha. " I, at first, really argued with him, obviously standing up for the work I'd done at Informatica and saying, 'Look, we solved this problem, you can use traditional ETL technologies to ingest data into Hadoop.' In fact, I gave him a number of examples of customers that I knew of from that time."
Prabhakar pressed his point until Pancha saw how the nature of Hadoop data stores caused problems in accessing the data. Legacy integration tools created what Pancha calls "data drift." Traditional ETL, with its focus on metadata, had become an Achilles' heel in Big Data applications.
"We started off calling it meta-data driven, then we talked about it as model-driven," Pancha explained. "I talk about it as schema-centric, because fundamentally what we're trying to do in every situation is capture the nature of the source and the destination through user interfaces. The user defines what these things are, then the system takes care of the actual extracting of the data and the confirmation of the data."
Big Data's Integration Problems
Big Data isn't simple, particularly when data is machine generated.
"If you have data sources that change, with respect to the infrastructure producing the data, with respect to the structure of the data, the schema of the data, the semantics of the data; when that starts changing unexpectedly, the whole thing breaks down because the assumptions that were held at development time no longer hold true at operations time," Pancha said.
This creates two big problems: a slowdown in data flow and degradation in fidelity.
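The failure mode Pancha describes can be illustrated with a toy example: a parser that hard-codes the schema assumptions held at development time breaks as soon as the upstream source drifts. (The field names and records below are invented for illustration; this is not StreamSets code.)

```python
# Toy illustration of "data drift": a parser written against a fixed,
# development-time schema fails once the source changes shape.

def parse_event(record: dict) -> tuple:
    # Development-time assumption: every record has exactly these fields.
    return (record["host"], record["ts"], float(record["latency_ms"]))

ok = parse_event({"host": "web-1", "ts": 1625000000, "latency_ms": "12.5"})
print(ok)  # ('web-1', 1625000000, 12.5)

# Later, the source system is upgraded and renames a field
# ("latency_ms" becomes "lat") -- nothing at development time predicted this.
drifted = {"host": "web-1", "ts": 1625000100, "lat": "9.8"}
try:
    parse_event(drifted)
    broke = False
except KeyError:
    broke = True
print("pipeline breaks at operations time" if broke else "still fine")
```

The pipeline doesn't degrade gracefully; it simply stops working, which is the "assumptions held at development time no longer hold true at operations time" problem in miniature.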
"I came to realize this required a fundamentally different approach and, frankly, one that if we can crack it can address all data flow problems in an enterprise, not just Big Data," he said.
Not even a year into retirement, Pancha signed on in June as a co-founder of StreamSets.
KPIs for Data Integration
He and Prabhakar began by completely rethinking how data moves in the enterprise. As is often the case with Big Data solutions, their end product doesn't fit neatly into any one category. It's not so much integration as performance management for data integration. It adds KPIs (key performance indicators) to the process of data integration, with metrics on whether data was dropped, altered or rendered incorrect.
"What we're saying is that the same methodology, if applied to integration, can deliver a very different end result around your integration flows or your jobs on an ongoing basis. What this allows you to do is really deliver a different level of service to the business in terms of guarantees around the data that they're looking at," he said.
Big Integration Ideas for Big Data
To achieve that, StreamSets DataCollector incorporates three "big ideas" into integration:
1. Defining key performance indicators (KPIs) for all data flows and providing real-time control for data operations teams and data scientists
2. Using adaptable flows or pipelines and an IDE (integrated development environment) that focuses more on actual data flow than schema or metadata
3. Building a highly containerized approach so that every movement of data, every type of processing, is isolated, which allows StreamSets to deliver continuous operations
"In the face of drift, in the face of change, in the face of unexpected data, changing business needs and logic, changing infrastructure, you're able to minimize the amount of downtime of the system and kind of keep it always on," Pancha explained.
Open Source Business Model
StreamSets DataCollector follows the Big Data tradition of open source licensing and is available under the Apache License, version 2. For most companies, that decision is based on the fact that they're leveraging an underlying open source technology.
For StreamSets' founders, it was a conscious decision based on their experience with Informatica 9. Informatica did not opt for open source, so the business model ended up being driven by the technology, Pancha said. Using open source allows StreamSets to easily share the technology, so early adopters can help harden it and enterprises can download the product without the company pushing it. Already, 3,000-plus users have downloaded DataCollector, including "well north of 100 enterprises," he said.
StreamSets makes money in the usual way -- selling support to enterprises. The starting price point is $50,000 a year, with entry-level projects closer to $100,000, he said. However, open source offers another key advantage to StreamSets: It can build new applications on top of the technology, giving the company more options for growth.
One such product is in the works already, with plans for a beta release in the second quarter and a general release in the third. This product is designed for enterprise data operations and addresses a longstanding pain point for enterprises, Pancha said.
Fast Facts about StreamSets
Founders: Arvind Prabhakar and Girish Pancha
HQ: San Francisco
Product: StreamSets DataCollector, which delivers more reliable data streaming for Big Data, coupled with performance management KPIs
Customers: Lithium uses StreamSets DataCollector to enable near-real-time data flow, and Cisco uses StreamSets as part of its InterCloud offering
Funding: $12.5 million in series A funding, led by Battery Ventures and New Enterprise Associates
Loraine Lawson is a freelance writer specializing in technology and business issues, including integration, health care IT, cloud and Big Data.