AWS Announces New Analytics Capabilities to Help Customers Embrace Data at Scale

Amazon Web Services (AWS), an Amazon.com company, announced significant new analytics capabilities that help customers embrace data at today’s and tomorrow’s scale. AWS introduced several new Redshift capabilities that bring more than an order of magnitude better query performance and deliver greater flexibility for customers when they are working across their data storage, data warehouse, and operational databases at scale. AWS also announced a new, highly scalable, cost-saving warm storage tier for Amazon Elasticsearch Service.

Customers today are regularly trying to operate on petabytes and even exabytes of data. This new scale of data, along with new application requirements, means that analytics tools will have to change significantly to scale effectively. Customers want to be able to perform analytics across all of their data, regardless of the format or where the data lives, and scale their applications to support millions of users anywhere in the world. AWS provides the broadest and deepest set of analytics services of any cloud provider, and is constantly innovating based on customer needs for this new scale of data.

Amazon Redshift RA3 instances with Managed Storage allow customers to cost-effectively scale and run 3x faster than any other cloud data warehouse

As the scale of data continues to grow, reaching petabytes per week, customers are ingesting even more data into their Amazon Redshift data warehouse. To scale their data warehouse, customers use Redshift’s Elastic resize capability to add instances to their cluster. Today, Redshift’s instances include a fixed amount of compute and storage, so it’s possible for customers to end up over-provisioned on either, and paying for capacity they don’t use. Customers have asked for the ability to grow their storage without over-provisioning compute, and for more flexibility to grow their compute capacity without increasing their storage costs.

New Amazon Redshift RA3 instances with Managed Storage (available today) allow customers to optimize their data warehouse by scaling and paying for compute and storage independently. With Amazon Redshift RA3 instances, customers choose the number of instances they need based on their data warehousing workload’s performance requirements, and only pay for the managed storage that they use. Redshift Managed Storage uses large, high-performance SSDs in each Amazon Redshift RA3 instance for fast local storage and Amazon S3 for longer-term durable storage. If the data in an instance grows beyond the size of the large local storage, Redshift Managed Storage automatically offloads that data to Amazon S3. Customers pay the same low rate for Redshift Managed Storage regardless of whether the data sits in high-performance local storage or in Amazon S3, and they pay only for the storage they actually use, so they don’t waste spend on unused storage capacity. For workloads that require a lot of storage, but not as much compute capacity, customers can automatically scale their data warehouse storage capacity without adding and paying for additional instances. Redshift Managed Storage uses a variety of advanced data management techniques to optimize how efficiently data is offloaded to and retrieved from Amazon S3. In addition, Amazon Redshift RA3 instances are built on the AWS Nitro System and feature high bandwidth networking that further reduces the time taken for data to be offloaded to and retrieved from Amazon S3. Together, these capabilities enable Amazon Redshift RA3 instances with Managed Storage to deliver 3x the performance of any other cloud data warehouse service, and existing Amazon Redshift customers using Dense Storage (DS2) instances will get up to 2x better performance and 2x more storage capacity at the same cost.
RA3 16xlarge instances are generally available today to support workloads with petabytes of data (up to 8 PB compressed), with RA3 4xlarge instances coming early next year. To get started with Redshift RA3 instances, visit https://aws.amazon.com/redshift.
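
The offloading behavior described above can be pictured with a small toy model. The sketch below is not how Redshift Managed Storage is implemented: the class name, the block-count capacity, and the least-recently-used eviction policy are all assumptions chosen for illustration. It only shows the general idea of a bounded fast tier that spills to a durable tier, with billing based on the total data stored rather than on where it sits.

```python
from collections import OrderedDict


class TieredStore:
    """Toy model (hypothetical): bounded fast local tier backed by a durable tier."""

    def __init__(self, local_capacity_blocks):
        self.local_capacity_blocks = local_capacity_blocks
        self.local = OrderedDict()  # stands in for local SSD, ordered cold -> hot
        self.remote = {}            # stands in for Amazon S3

    def write(self, block_id, data):
        """Store a block in the local tier, offloading colder blocks if needed."""
        self.local[block_id] = data
        self.local.move_to_end(block_id)
        self._offload_if_needed()

    def read(self, block_id):
        """Return a block, pulling it back into the local tier if it was offloaded."""
        if block_id in self.local:
            data = self.local[block_id]
        else:
            data = self.remote.pop(block_id)  # raises KeyError for unknown blocks
            self.local[block_id] = data
        self.local.move_to_end(block_id)
        self._offload_if_needed()
        return data

    def _offload_if_needed(self):
        # Assumed policy: evict the least recently used block first.
        while len(self.local) > self.local_capacity_blocks:
            cold_id, cold_data = self.local.popitem(last=False)
            self.remote[cold_id] = cold_data

    def billed_blocks(self):
        """Billing depends on total data stored, not on which tier holds it."""
        return len(self.local) + len(self.remote)


store = TieredStore(local_capacity_blocks=2)
for block_id in ("a", "b", "c"):
    store.write(block_id, block_id.encode())
store.read("a")  # "a" returns to the local tier and "b" is offloaded
```

In this model, adding data never requires adding local capacity: the durable tier absorbs the overflow, which mirrors the release's point about growing storage without adding instances.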

AQUA (Advanced Query Accelerator) for Amazon Redshift brings compute to the storage layer for 10x faster performance than any other cloud data warehouse

Rapid growth in the volume of data that customers need to process in their data warehouse has led to a difficult balancing act between performance and cost-effective scaling. The prevailing approach to data warehousing has been to build out an architecture in which large amounts of data are moved from centralized storage to waiting compute nodes for processing. The challenge with this approach is that there is a lot of data movement between the shared storage and the compute nodes. As data volumes continue to grow at a rapid clip, this data movement saturates available networking bandwidth and slows down performance. Additionally, even if the networking bottleneck can be overcome, SSD storage throughput to and from storage nodes has scaled 6x faster over the last seven years than the ability of CPUs to process data from memory. Absent some significant change, CPUs won’t be able to keep up with the faster storage, which will either make the CPU a performance bottleneck or create more cost as customers are forced to provision more compute to get the work done quickly.

AQUA (Advanced Query Accelerator) for Amazon Redshift (available mid-2020) is a new distributed and hardware-accelerated cache for Amazon Redshift that provides the next phase of performance improvement and innovation for analytics at the new scale of data. AQUA brings compute to the storage layer, so data doesn’t have to move back and forth between the two, enabling Redshift to run 10x faster than any other cloud data warehouse. AQUA is a big, high-speed cache architecture on top of Amazon S3 that can scale out and process data in parallel across many nodes. Each node has a hardware module composed of AWS-designed analytics processors that dramatically accelerate data compression, encryption, and data processing (including filtering and aggregation). This new architecture makes queries run so much faster than today’s cloud data warehouses that customers will be able to query raw data directly, even at scale, giving them more up-to-date dashboards, less development time, and easier-to-maintain systems. AQUA-powered Amazon Redshift will remain 100% compatible with the current version of Amazon Redshift, so customers can easily migrate existing data warehouses with no code changes. To learn more about AQUA, visit https://pages.awscloud.com/AQUA_Preview.html.

Amazon Redshift Data Lake Export makes it easy to save query results directly to a data lake

Customers require data to be combined across their data warehouse and data lake, and don’t want data locked in silos and proprietary formats. For example, an organization may want to understand what their customer was browsing before they made a purchase, which requires them to combine the order history sitting in the data warehouse with the clickstream data sitting in an Amazon S3 data lake. Amazon Redshift enables customers to directly query and join data across both their Amazon Redshift data warehouse and Amazon S3 data lake, giving customers a ‘lake house’ approach to data warehousing. In this lake house world, where data is stored both in Amazon Redshift and Amazon S3, customers also need an easy way to get the results from Amazon Redshift queries back into Amazon S3 in an open format that can be used by other services.

Amazon Redshift Data Lake Export (available today) allows customers to export data directly from Amazon Redshift to Amazon S3 in an open data format (Apache Parquet) that is optimized for analytics. Customers can now save the results of a query they have run in Amazon Redshift into their data lakes in open formats so that they can analyze that data with other analytics services like Amazon SageMaker, Amazon Athena, and Amazon EMR. No other cloud data warehouse makes it as easy to both query data and write data back to a data lake in open formats. To get started with Amazon Redshift Data Lake Export, visit https://aws.amazon.com/redshift.
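
As a rough illustration of what such an export can look like, the sketch below assembles a Redshift UNLOAD statement that writes query results to Amazon S3 as Parquet. The helper name, table name, bucket, and IAM role ARN are hypothetical placeholders, and the helper only builds the SQL text; running it would require a SQL client connected to a cluster with suitable permissions.

```python
def build_unload_to_parquet(query, s3_prefix, iam_role_arn):
    """Build an UNLOAD statement that exports query results as Parquet files.

    Hypothetical helper for illustration; it performs no validation of inputs.
    """
    # UNLOAD takes the query as a quoted string, so single quotes inside the
    # query are doubled.
    escaped_query = query.replace("'", "''")
    return (
        f"UNLOAD ('{escaped_query}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )


# All identifiers below are placeholders, not real resources.
statement = build_unload_to_parquet(
    query="SELECT order_id, total FROM orders WHERE status = 'shipped'",
    s3_prefix="s3://example-bucket/exports/orders_",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
print(statement)
```

Once written, the Parquet files under the chosen prefix can be read by other services that understand the format, which is the interoperability benefit described above.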

Amazon Redshift Federated Query allows customers to analyze data across data warehouses, data lakes, and operational databases

Aggregating, transforming, and uploading large amounts of data from a relational database to a data warehouse can be resource-intensive and time-consuming, which is why many customers choose to do so only once a day. This can create problems when customers need to query their data warehouse for certain types of timely information that is initially stored in an operational database. For example, a customer service representative helping a customer resolve an issue with a recent order might be served day-old results when they pull up the customer’s purchase history, making the information irrelevant. Customers can work around this problem today by writing custom application code to query the operational database directly, but building integrated systems that do this is expensive, time-consuming, and difficult to maintain.

Amazon Redshift Federated Query (available in preview) gives customers the ability to run queries in Amazon Redshift on live data across their Amazon Redshift data warehouse, their Amazon S3 data lake, and their Amazon RDS and Amazon Aurora (PostgreSQL) operational databases. This simplifies application development by allowing customers to use familiar SQL statements to combine all of this data across their various data stores. With this capability, Amazon Redshift queries can now provide timely and up-to-date data from operational databases to drive better insights and decisions. To get the best possible performance, the Redshift query optimizer intelligently distributes as much work as possible to the underlying databases. To learn more about Amazon Redshift Federated Query, visit https://aws.amazon.com/redshift.
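
To make the idea concrete, the sketch below builds two SQL statements: one that maps a PostgreSQL schema into Redshift as an external schema, and one that joins a live operational table with a warehouse table. The DDL follows the CREATE EXTERNAL SCHEMA ... FROM POSTGRES form in the Amazon Redshift documentation, which may differ from what was offered during the preview. The helper name and every schema, table, host, and ARN are hypothetical placeholders.

```python
def build_external_schema_ddl(local_schema, database, remote_schema, host,
                              iam_role_arn, secret_arn, port=5432):
    """Build DDL that exposes a PostgreSQL schema to Redshift (hypothetical helper)."""
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {local_schema}\n"
        "FROM POSTGRES\n"
        f"DATABASE '{database}' SCHEMA '{remote_schema}'\n"
        f"URI '{host}' PORT {port}\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"SECRET_ARN '{secret_arn}';"
    )


# All identifiers below are placeholders, not real resources.
ddl = build_external_schema_ddl(
    local_schema="ops",
    database="orders_db",
    remote_schema="public",
    host="example-cluster.cluster-abc123.us-east-1.rds.amazonaws.com",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
    secret_arn="arn:aws:secretsmanager:us-east-1:123456789012:secret:example",
)

# Join live rows from the operational database (ops.recent_orders) with
# historical rows already in the warehouse (warehouse.customers).
federated_query = (
    "SELECT c.customer_id, c.segment, o.order_id, o.status\n"
    "FROM warehouse.customers AS c\n"
    "JOIN ops.recent_orders AS o ON o.customer_id = c.customer_id\n"
    "WHERE o.created_at > DATEADD(hour, -1, GETDATE());"
)
print(ddl)
print(federated_query)
```

In the customer service scenario above, a query of this shape would return the most recent orders without waiting for a nightly load.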

UltraWarm for Amazon Elasticsearch Service provides fast, interactive analytics on log data at one-tenth the cost

As more and more applications are built using microservices, containers, and purpose-built data stores, they produce an ever-increasing amount of log data. Amazon Elasticsearch Service makes it simple to collect, analyze, and visualize machine-generated log data from websites, mobile devices, and sensors. Amazon Elasticsearch Service is fully managed, so customers can deploy production-ready clusters in minutes, scale clusters up and down, and secure data at rest and in transit. However, given the explosive growth of log data, storing and analyzing months’ or years’ worth of data is cost-prohibitive at scale. This has led customers to use multiple analytics tools, or delete valuable data, missing out on important insights that the longer-term data could yield.

To address this challenge, AWS built a new storage tier for Amazon Elasticsearch Service called UltraWarm, which finally gives Elasticsearch customers a warm storage tier that both stores large amounts of data cost-effectively and provides the type of snappy, interactive experience that Elasticsearch customers expect. UltraWarm offers a distributed cache for more frequently accessed data, while using advanced placement techniques to determine which blocks of data are less frequently accessed and should be moved outside of the cache to Amazon S3. UltraWarm also uses high-performance EC2 instances to interact with data stored in S3, providing 50% faster query execution versus competing warm-tier solutions, and giving customers the same interactive analytics experience with all their log data. UltraWarm reduces the cost of storing the same amount of data in Elasticsearch today by up to 90%, and costs 80% less than warm-tier storage from other managed Elasticsearch offerings. With UltraWarm, customers can manage up to 3 PB of log data with a single Amazon Elasticsearch Service cluster, and with the ability to query across multiple clusters, customers can effectively retain any amount of current and historical log data for interactive operational analysis and visualization. UltraWarm is a seamless extension of Amazon Elasticsearch Service. Customers can easily query and visualize across both their recent and longer-term operational data, all from their Kibana interface, at a fraction of today’s cost. This allows developers, DevOps engineers, and InfoSec experts to use Amazon Elasticsearch Service for the analysis of recent (weeks) and longer-term (months or years) operational data without needing to spend days restoring data from archives (Amazon S3 or Amazon Glacier) to an active searchable state in an Elasticsearch cluster. UltraWarm is available in preview today.
To learn more about UltraWarm, visit https://aws.amazon.com/elasticsearch-service/features.

“Our customers tell us they are regularly dealing with petabytes, and even exabytes, of data, and their existing analytics systems can’t keep up,” said Raju Gulabani, Vice President, Database Services, AWS. “These customers want to perform fast analytics on all of their raw data across their data warehouse and data lake, and cost-effectively deal with the explosion in log data to retain information that might help them run their businesses better. With today’s announcements we are helping AWS customers do all of this and fearlessly embrace data at scale.”

Duolingo is the most popular language-learning platform and the most downloaded education app in the world, with more than 300 million users. The company’s mission is to make education free, fun, and accessible to all. “We use Amazon Redshift to analyze the events from our app to gain insight into how users learn with Duolingo. We load billions of events each day into Amazon Redshift, have hundreds of terabytes of data, and that is expected to double every year. While we store and process all of our data, most of the analysis only uses a subset of that data,” said Jonathan Burket, Senior Software Engineer, Duolingo. “The new Redshift RA3 instances with Managed Storage deliver 2x performance for most of our queries compared to our previous DS2 instances-based Redshift cluster. Redshift Managed Storage automatically adapts to our usage patterns. This means we don’t need to manually maintain hot and cold data tiers, and we can keep our costs flat when we process more data.”

Yelp’s mission is to connect people with great local businesses; to do so, data mining and efficient data analysis are important for building the best user experience. “We continue to adopt new Redshift features and are thrilled with the new RA3 instance type,” said Stephen Moy, Software Engineer, Yelp. “We have observed a 1.9x performance improvement over DS2 and 1.5x performance improvement over DC2 in our workload, while keeping the same costs and providing scalable managed storage. This allows us to keep pace with explosive data growth and have the necessary fuel to train our machine learning systems.”

Western Digital (WD) is a leading global data storage brand that empowers users to create, experience, and preserve digital content across a range of devices. WD enables users to be in control and smartly save what matters to them most in one secure place. “At WD we use Amazon Redshift to enable the enterprise to gain value and insights from large, complex, and dispersed datasets,” said Fayaz Syed, Sr. Manager, Big Data Platform, Western Digital. “Our data is nearly doubling every year and we run six Redshift clusters with a total of 78 nodes and 631+ TB of compressed data stored to get insights that our business analysts and leadership depend on. The new Redshift RA3 instances offer us the ability to process our growing data more cost-effectively while we double our storage capacity compared to our previous Redshift cluster. We also like that our ETL, BI, and data ingestion process did not have to change to take advantage of the RA3 instances with Managed Storage.”

NTT DOCOMO is the largest mobile service provider in Japan, serving more than 79 million customers. “Migrating to Amazon Redshift in 2014 allowed us to scale to over ten petabytes of uncompressed data with a 10x performance improvement over our prior on-premises system. Today, it is the center of our analytics environment,” said Takaaki Sato, General Manager of Service Innovation Department, NTT DOCOMO. “Since we started using Amazon Redshift, both our data and number of users have increased dramatically. We are impressed with the flexibility and ease of use, even as we scale users and data. The new Amazon Redshift Data Lake Export feature allows us to simplify our workflows to make use of more data across our data lake. We are excited about the new Amazon Redshift RA3 instances with Managed Storage, enabling us to scale compute and storage separately. We also look forward to realizing the benefits of AQUA (Advanced Query Accelerator) for Amazon Redshift as we continue to increase the performance and scale of our Amazon Redshift data warehouse. We appreciate AWS’s continual innovation on behalf of its customers.”

Intuit, makers of TurboTax, QuickBooks and Mint, is a global financial platform company designed to empower consumers, self-employed, and small businesses to improve their financial lives. “We are looking forward to exploring how AQUA can empower our team to spend more time innovating on behalf of customers,” said Alex Balazs, Chief Architect, Intuit. “These new capabilities complement our strategy to create more data-driven insights at scale with speed and efficiency across our platform.”

Warner Bros. Interactive Entertainment is a premier worldwide publisher, developer, licensor, and distributor of entertainment content for the interactive space across all platforms, including console, handheld, mobile, and PC-based gaming for both internal and third-party game titles. “We utilize many AWS and third party analytics tools, and we are pleased to see Amazon Redshift continue to embrace the same varied data transform patterns that we already do with our own solution,” said Kurt Larson, Technical Director of Analytics Marketing Operations, Warner Bros. Analytics. “We’ve harnessed Amazon Redshift’s ability to query open data formats across our data lake with Redshift Spectrum since 2017, and now with the new Redshift Data Lake Export feature, we can conveniently write data back to our data lake. This all happens with consistently fast performance, even at our highest query loads. We look forward to leveraging the synergy of an integrated big data stack to drive more data sharing across Amazon Redshift clusters, and derive more value at a lower cost for all our games.”

FOX Corporation produces and distributes content through some of the world’s leading and most valued brands, including: FOX News, FOX Sports, the FOX Network, and the FOX Television Stations. FOX empowers a diverse range of story creators to imagine and develop culturally significant content, while building an organization that thrives on creative ideas, operational expertise, and strategic thinking. “Amazon Redshift allows us to ingest, optimize, transform, and aggregate billions of transactional events per day at scale, coming to us from a variety of first and third party sources,” said Alex Tverdohleb, Vice President Data Services, Consumer Products & Engineering, FOX Corporation. “We query live data across our data warehouse and data lake, and now with the new Amazon Redshift Federated Query feature we can easily query and analyze live data across our relational databases as well. Our petabyte scale data is rapidly growing, and with the innovation in Amazon Redshift RA3 instances and AQUA (Advanced Query Accelerator) for Amazon Redshift we’re thrilled about the prospect of getting 10x faster performance for our most demanding workloads, while keeping our costs flat. AQUA for Amazon Redshift is a great example of how AWS innovates across every layer of the stack to deliver the best solution for their customers.”

Sophos is a worldwide leader in next-generation cybersecurity. “Amazon Web Services, including Amazon Redshift, give us the power to make live data generated by our range of next-gen security solutions available to more than 409,000 organizations for analysis,” said John Peterson, Vice President, Central Content Group, Sophos. “The new Federated Query feature in Amazon Redshift could help us take this to the next level, allowing us to query data directly across our Aurora and RDS PostgreSQL databases without having to setup workflows for data movement. We’re excited to see how this could speed up our time to insight and help to make it easier to incorporate the most up to date data from a number of transactional databases with the data in our data warehouse and our data lake.”

Ancestry is the global leader in family history and consumer genomics, empowering journeys of personal discovery to enrich lives. “With Amazon Elasticsearch Service we collect and analyze our company’s operational logs in real time,” said Clint Smith, Senior Manager, Engineering Development, Ancestry. “Now UltraWarm for Amazon Elasticsearch Service will help us identify correlations between logging events and quickly root cause application problems. Before UltraWarm for Amazon Elasticsearch Service, our cost constraints meant we could only store five days of data. With UltraWarm for Amazon Elasticsearch Service we will be able to extend that window to 90 days, and analyze the data via Kibana at a significantly lower cost. This extra data will help us identify application problems that we just couldn’t see with the five days of data we were storing before.”
