What Hadoop means for SAS’s product roadmap
Tony Baer, Principal Analyst, Software – Information Management
At an analyst gathering a few weeks ago, SAS termed Hadoop its strategic development platform for new analytics tools going forward. SAS is ramping up new products and enabling its existing portfolio to work natively in Hadoop. Its advantage starts with a wide and deep library of analytic functions and established data management capability.
SAS aims to exploit Hadoop’s native multithreaded, massively parallel processing capabilities to retain its edge over lower-cost, open source programming languages and point analytic tools. But open source technologies are rapidly evolving. Simply converting SAS analytics to run as multithreaded MapReduce programs natively on Hadoop won’t preserve SAS’s edge; it must exploit emerging Hadoop processing frameworks that introduce new capabilities to its analytics.
Hadoop’s impact on SAS’s portfolio
While SAS has offered adapters for its analytic programs to run on Hadoop, it is now developing products that execute natively. Its first foray into Hadoop, the LASR in-memory server, is just over a year old; SAS ported LASR to run inside Hadoop compute nodes, bringing compute directly to the data to improve performance and scalability.
SAS has several new Hadoop-oriented product releases slated for this year, beginning with the just-released Visual Statistics product, a Tableau-like self-service analytics visualization tool.
While Visual Statistics doesn’t eliminate the need for experienced analytic programmers to set up analyses, it allows business users to manipulate complex data sets using a much higher-level interface. Later this year, SAS will introduce Data Director, a Hadoop-specific product providing a catalog-like interface for navigating prepared SAS data sets on Hadoop. It will include “accelerators” for existing SAS products, such as analytics and data quality, that automate the generation of MapReduce programs.
The common thread running across each of these products is reduction, though not necessarily elimination, of the need for complex analytic programming skills.
Hadoop allows SAS customers the option to run more complex analytics on broader data sets
For SAS, Hadoop is both its strongest defensive and offensive weapon for retaining and growing the installed base. Hadoop adds scale that allows a more comprehensive approach to address familiar complex analytics problems that SAS customers typically run, such as refining customer targeting, identifying fraud, conducting genomic analysis, or dissecting social media sentiment in the context of specific events or scenarios.
Before Hadoop, SAS problems had to be run as single threads or with comparatively limited sets of data. Hadoop now opens up SAS routines to take advantage of a multithreaded computing engine that scales linearly, allowing customers to crunch more data and engage in more complex analytics.
Case in point: SAS’s current edge with in-memory compute on Hadoop
SAS’s LASR in-memory analytics server provides a good example of how SAS’s strategy works, and how it must evolve to stay ahead of the pack. In-memory processing exploits Hadoop’s massively parallel scaling without sacrificing performance. And, when compute cycles are not bound by time-consuming reads and writes back to disk, it becomes practical to address more complex analytic problems.
Released just over a year ago, SAS LASR was one of the first products to aggressively leverage in-memory computing on Hadoop. Alternatives have since emerged, though the competition is hardly an apples-to-apples comparison. On the BI tooling side, there is Platfora, which introduced its own technology to populate data views into memory. There are also emerging open source frameworks that potentially could perform what LASR does; Apache Spark (backed by Cloudera), along with Tez and Storm (backed by Hortonworks), are among the best-known contenders. For now, LASR has the advantage of product maturity. For instance, it has more machine-learning libraries already running natively than any of these frameworks, but that situation will change.
SAS must look beyond MapReduce to preserve its edge
Alternatives to MapReduce are emerging: compute processing frameworks optimized for tasks such as streaming analytics, search, graph processing, and even classic HPC (scientific high-performance computing) that provide other options for complex analytic processing. SAS’s challenge will therefore be to selectively harness these frameworks, where appropriate, into its analytic routines. For instance, SAS already offers sophisticated text search. But would it make sense for the company to leverage open source Lucene search indexes as its jumping-off point for text-based analytics, rather than expend product development cycles reinventing the wheel? Similarly, if it offers analytics for evaluating the relationships of groups of variables, would a graph analytic framework be the most efficient path to getting the answers? These are the types of questions SAS should continue posing as it evolves its Hadoop strategy.
SAS’s native edge on Hadoop is temporary
SAS’s trump card is undoubtedly the depth of its analytics. With its text-parsing technology, for example, it could combine search, to identify patterns of potential interest, with its sophisticated text-mining capability, to unearth relationships with specific events or outcomes. SAS text mining also has superior entity extraction capabilities compared to most Hadoop-oriented tools or techniques, which can tally the incidence of terms but may not easily pinpoint where, in specific records, those incidences lie. In the long run, such capabilities could be useful for one-to-one, personalized mass marketing.
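The gap between tallying terms and pinpointing them in specific records can be illustrated with a minimal Python sketch. It is not SAS’s text-mining or entity-extraction method, nor the API of any Hadoop tool; the sample records and the function names `tally_terms` and `locate_terms` are hypothetical, chosen only to show the difference in what each approach retains.

```python
from collections import Counter, defaultdict

# Hypothetical sample records, invented for illustration only.
records = {
    "r1": "customer reported fraud on the account",
    "r2": "no fraud detected after review",
    "r3": "customer praised the new account features",
}

def tally_terms(docs):
    """Count how often each term occurs across all records.

    The result says how common a term is, but not where it appears.
    """
    counts = Counter()
    for text in docs.values():
        counts.update(text.lower().split())
    return counts

def locate_terms(docs):
    """Map each term to the (record id, word position) pairs where it occurs."""
    index = defaultdict(list)
    for rec_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].append((rec_id, pos))
    return index

counts = tally_terms(records)
index = locate_terms(records)

print(counts["fraud"])  # 2
print(index["fraud"])   # [('r1', 2), ('r2', 1)]
```

The tally answers only "how often"; the positional index also answers "in which record, and where," which is the kind of detail needed to tie a term back to a specific customer or event.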
Prominent in SAS’s rearview mirror is the open source R statistical programming language, which is drawing a growing ecosystem of vendors that are now extending it from its original single-threaded, client-side operation to leverage Hadoop’s multithreaded processing capability. However, SAS still holds an edge in analytic libraries and data management, capabilities that for now often must be developed from scratch in the R world.
SAS is hardly the only analytics player to adapt its tools to take advantage of Hadoop’s massively parallel performance and scale. But on the BI side, most established players still rely on intermediary technology, using Hadoop as a “data lake,” while query results are manipulated and analyzed on external servers. The ability to execute natively on Hadoop is coming, however, thanks to a variety of interactive SQL frameworks (of which Cloudera Impala and Hortonworks’ Hive-on-steroids are the best-known alternatives) that promise to turn Hadoop into a true SQL-like MPP platform, one that more closely resembles existing enterprise data warehousing platforms.