Thursday, November 29, 2012

Amazon's Redshift Accelerates Data Warehouse As A Service

Amazon continues to play the role of disrupter in the data marketplace, announcing a new data warehousing service that could blow the doors off existing data warehousing vendors in terms of price. The question is, will lower prices be enough to change the game?


The announcement of the new Amazon Redshift service at yesterday's Amazon Web Services re:Invent conference was one of those "known unknowns" former Secretary of Defense Donald Rumsfeld used to go on about. We knew AWS would want to lead big with something at its first-ever live conference, being held in Las Vegas this week; we just didn't know what it would be. Now that the cat's out of the bag, many analysts seem pretty excited about the prospect. But Redshift also has some red flags.


Redshift 101


Let's take a look and see what's under the Redshift hood.


Amazon.com CTO Werner Vogels lauds Amazon Redshift as "a fast and powerful, fully managed, petabyte-scale data warehouse service in the cloud."


Kinda buzzwordy, but the key terms in that statement are "petabyte-scale" - which means this service is going to be easy to grow into if your data needs ever get that insanely high - and "service in the cloud" - a statement that means this is a hosted service on AWS's public cloud infrastructure - with all of the risks and rewards that come with that situation.


Vogels gets a little more specific later in his blog:


"Redshift has a massively parallel processing (MPP) architecture, which enables it to distribute and parallelize queries across multiple low cost nodes. The nodes themselves are designed specifically for data warehousing workloads. They contain large amounts of locally attached storage on multiple spindles and are connected by a minimally oversubscribed 10 Gigabit Ethernet network. This configuration maximizes the amount of throughput between your storage and your CPUs while also ensuring that data transfer between nodes remains extremely fast."

The MPP architecture is very important, because it gives some insights into Redshift's origins. Redshift is a columnar relational database that seems to be based on the open source PostgreSQL database - a hot commodity in the open source world that's been making inroads against the venerable MySQL database partly because PostgreSQL handles parallelism so well.
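
To make the MPP idea a little more concrete, here is a toy sketch in Python of the "distribute and parallelize" pattern Vogels describes: rows are spread across several nodes, each node aggregates only its own slice, and a leader merges the partial results. This is not Redshift's actual implementation; the sample data, the round-robin distribution and every function name below are assumptions made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sales rows: (region, amount). Illustrative data only.
ROWS = [
    ("east", 120), ("west", 80), ("east", 50),
    ("north", 30), ("west", 70), ("north", 10),
]

def partition(rows, node_count):
    """Deal rows round-robin across nodes (an assumed distribution strategy)."""
    slices = [[] for _ in range(node_count)]
    for i, row in enumerate(rows):
        slices[i % node_count].append(row)
    return slices

def node_partial_sum(rows):
    """Work done on one node: total amount per region for its local slice only."""
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0) + amount
    return totals

def leader_query(rows, node_count=3):
    """Scatter slices to workers in parallel, then merge the partial results."""
    slices = partition(rows, node_count)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(node_partial_sum, slices))
    merged = {}
    for partial in partials:
        for region, subtotal in partial.items():
            merged[region] = merged.get(region, 0) + subtotal
    return merged

if __name__ == "__main__":
    print(leader_query(ROWS))
```

The answer is the same no matter how many nodes share the work, which is the whole point: adding nodes should shrink the time per query without changing the result. In a real cluster the slices live on separate machines, which is why the network between nodes matters so much.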


All the bits Vogels mentions about the minimally oversubscribed network connections are critical, too, because if his claims are right, this means that Redshift will be fast. The architecture of this new service is also important, because it means that unlike Hadoop, where data just sits cheaply waiting to be batch processed, data stored in Redshift can be worked on fast - fast enough for even transactional work.


Latency will be one of only a few areas any competing vendor will be able to go after - because the competition certainly can't touch AWS on price.


Redshift's Pricing Shift


A big part of yesterday's Redshift announcement was about price: on-demand capacity costs $3,723 per terabyte (TB) annually, which sounds like a lot until you know how much a traditional on-site data warehousing solution can run. In his re:Invent keynote yesterday, senior vice president of AWS Andy Jassy claimed such solutions can run $19,000 to $25,000/TB a year. So right off the bat, if Redshift is indeed offering comparable service, customers will save 80-85% on their data warehousing bill.


But wait, there's more. If customers reserve three years of service, the price drops to a jaw-dropping $999/TB annual fee. That's a 95-96% reduction in potential costs for data storage.
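
For anyone who wants to check the math, the savings percentages can be reproduced with a few lines of Python. The prices are the ones quoted in this post; the function and variable names are just made up for this sketch.

```python
def savings_pct(new_cost, old_cost):
    """Percent saved when moving from old_cost to new_cost (same units)."""
    return (old_cost - new_cost) / old_cost * 100

# Figures quoted in this post, in dollars per TB per year.
ON_DEMAND = 3723
RESERVED_3YR = 999
TRADITIONAL_LOW, TRADITIONAL_HIGH = 19000, 25000

if __name__ == "__main__":
    for label, price in (("On-demand", ON_DEMAND), ("3-year reserved", RESERVED_3YR)):
        low = savings_pct(price, TRADITIONAL_LOW)
        high = savings_pct(price, TRADITIONAL_HIGH)
        print(f"{label}: {low:.1f}% to {high:.1f}% cheaper")
    # On-demand: 80.4% to 85.1% cheaper
    # 3-year reserved: 94.7% to 96.0% cheaper
```

The low end of the reserved range is 94.7%, which rounds to the 95% quoted above.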


In this case, disruptive may not have been hyperbole. It may have been an understatement.


Redshift's Potential Issues


On paper, this sounds pretty good, but there are some potential issues that should be raised. For one, this is a public cloud service, which means your data will be out past your corporate firewall and in some ways sitting outside of your control. If one of Amazon's data centers has a hiccup, you could be out of luck.


The public status also means you'd better have bandwidth costs, security and infrastructure figured out, because somehow your company is going to have to get its data out and back to that cloud in a timely and safe manner.


One wildcard with this new Redshift service is how easy it will be to build apps or convert existing apps to work with it. Amazon's APIs are open, but only to the point that you can point your software to them. Once you invest in Amazon's APIs, it will be more painful to pull out to another cloud-based service should you decide to down the road.


If you have been keeping your data and applications local, shifting to Redshift could also mean shifting your applications to some other part of the AWS ecosystem as well, just to keep the latency times and bandwidth costs reasonable. In some ways, Redshift may be the AWS equivalent of putting the milk in the back of the grocery store.


If it is at all reasonable in its service, though, Redshift's pricing will definitely put pressure on the data warehousing vendors to lower their prices to compete - good news for anyone looking at data warehousing.


Image courtesy of Shutterstock.





