By Arsalan Tavakoli-Shiraji, Databricks
The explosion of data across the enterprise has led to an increased focus on how to get value out of that data.
The first step for most enterprises is to establish the Big Data platform itself. Predominantly this platform has centered on Hadoop-oriented technologies and most recently Spark – now the most active project in the Big Data ecosystem, with over 400 contributors and distributed by over 10 vendors, including SAP, Oracle and Teradata.
Once the platform has been established, the next step is to flesh out the application layer. In this realm, two primary design choices exist: general-purpose tools or domain-specific applications. Enterprises need to understand the benefits and tradeoffs of each of these options, and the scenarios under which either is the best fit for addressing the tasks at hand.
General Purpose Data Analysis Tools
General-purpose tools were designed with the theory that users can create value out of data if they can access it through an interface with which they are comfortable. These tools break down into three distinct categories that embody varying levels of abstraction and dependence on visual interfaces:
- Programmatic Access. This is the lowest-level abstraction, for data scientists and engineers who are comfortable with unfettered access and operate directly at the platform layer for the greatest flexibility and generality. Examples include direct Spark access, Databricks Cloud’s notebook abstraction and general SQL interfaces.
- Data Blending and Advanced Analytics Tools. These tools are for users who understand the data and desired outcomes and are analytically inclined. They typically provide a set of capabilities for building data integration, data processing and analysis pipelines (often visually). Examples include Alteryx, RapidMiner and Trifacta.
- Business Intelligence and Visualization Tools. These tools provide executive dashboarding and reporting functionality, as well as visual exploration capabilities for less technical business users with limited training in data science and statistics. They are widely popular and broad in scope, but generally limited for deep-dive exploratory work. Examples include Tableau, Qlik, ZoomData and Platfora.
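To make the programmatic-access category concrete, here is a minimal sketch of the kind of query an analyst might write directly against a general SQL interface. It uses Python's built-in sqlite3 module purely as a stand-in for Spark SQL or another platform-level SQL engine, and the sales table and figures are hypothetical:

```python
import sqlite3

# Hypothetical sales data; in practice this would live in the Big Data
# platform and be queried through Spark SQL or a similar interface.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 120.0), ("West", 80.0), ("East", 50.0)],
)

# With programmatic access, the analyst writes the aggregation directly,
# rather than working through a visual pipeline or dashboard layer.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # → [('East', 170.0), ('West', 80.0)]
```

The point of the sketch is the level of abstraction, not the engine: the same query shape carries over to any general SQL interface on the platform.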
Domain-Specific Data Analysis Applications
Domain-specific applications are designed to provide an end-to-end solution for a specific use case or a narrow range of use cases. These data-driven applications typically combine well-defined expectations for input data types and schemas, data ETL, automated advanced analytics, visual interfaces and, in some cases, integration with downstream decision support systems. These applications are tied to specific functionality within verticals such as retail trade promotion effectiveness analysis, financial services value-at-risk analysis and healthcare fraud detection, or in "horizontals" such as inventory analysis and financial performance analysis.
Traditionally, domain-specific applications have come in the form of highly custom solutions with significant professional services bundling, lengthy implementation timelines, a lack of upgrade options and non-trivial total cost. However, as enterprises continue to standardize their data platform, leverage the cloud for more agile deployments and converge on a set of common use cases, a set of domain-specific applications have emerged that help address many shortcomings of the "solution-oriented" approach. Examples include Tresata for financial services, FAIMdata for retail and Tidemark for financial business planning.
Choosing the Right Data Analysis Tool or App
When deciding on which type of tool or application to utilize for a given use case, multiple dimensions should be examined:
- Analytics Skillsets. How strong are the analytics capabilities of your organization? General-purpose tools provide greater flexibility, but they also require strong analytics capabilities to fully harness this power. The "data-science-in-a-box" delivery model of domain-specific applications is better suited to bringing complex analytics to less technical business users.
- Custom Nature of Use Case. How customized or unique is the particular use case? Domain-specific applications, given the need to scale, focus on well-trodden and well-understood use cases. They can be costly and difficult to extend to newer use cases. The flexibility of general-purpose tools tends to provide a much better fit in these scenarios.
- Data Quality and Consistency. How clean is your data, and how consistent is its structure and schema? If "dirty" data and numerous input-type edge cases mean that a significant amount of custom, complex ETL is required to prepare the data for analytics, a general-purpose tool is likely required, as extending a domain-specific application would typically necessitate expensive customizations.
Price is often also mentioned as an important dimension, but it tends to be a deceiving metric. Domain-specific applications are generally more expensive than their general-purpose counterparts when looking at just the software cost, but this is only part of the equation. When looking at the full TCO, which includes the resources required to re-implement much of the end-to-end capabilities of the domain-specific application, the picture is much murkier and should be examined on a case-by-case basis.
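The TCO point above reduces to simple arithmetic: license cost is only one term alongside implementation and ongoing operations. The sketch below uses entirely hypothetical figures to show how a cheaper general-purpose license can still carry a higher total cost once re-implementation effort is counted:

```python
def total_cost_of_ownership(software, implementation, annual_ops, years):
    """Up-front costs plus recurring operations cost over the horizon."""
    return software + implementation + annual_ops * years

# Hypothetical figures for illustration only. The general-purpose tool is
# cheaper to license, but re-implementing the end-to-end capabilities of a
# domain-specific application adds implementation and operations cost.
domain_specific = total_cost_of_ownership(500_000, 100_000, 50_000, years=3)
general_purpose = total_cost_of_ownership(150_000, 400_000, 120_000, years=3)

print(domain_specific, general_purpose)  # → 750000 910000
```

With these (made-up) numbers the domain-specific option wins on TCO despite a license more than three times as expensive, which is why the comparison has to be made case by case.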
Also, while the dimensions for evaluation are fairly stable, the application landscape is rapidly evolving. General-purpose tool vendors are increasingly adding domain-specific elements; some analytics vendors, such as SAS, already offer many vertical-oriented analytics solutions. At the same time, an increasing number of domain-specific applications are emerging as various use cases become more standardized.
Finally, it is important to note that nearly all enterprises will require multiple tools and applications, including a healthy mix of general-purpose and domain-specific options, as data-driven strategies come to encompass many diverse use cases.
What all of these applications have in common, however, is the need for access to data and a powerful data processing platform. It is therefore critical for enterprises to select or design a data platform that lets users seamlessly deploy tools and applications and consume insights through the interface they’re most comfortable with, without duplicating the data processing infrastructure or the data itself.
As head of customer engagement for Databricks, Arsalan Tavakoli-Shiraji splits his time between overseeing strategic partnerships and ecosystem development, and yes, writing code. Prior to Databricks, he was an associate partner at McKinsey & Co., where he advised leading technology companies and enterprises on next generation IT and Big Data. He holds a Ph.D. in computer science from UC Berkeley.