AWS Lake Formation now generally available; enables users to build secure data lakes in days instead of months

Amazon Web Services Inc. (AWS), an Amazon.com company, announced the general availability of AWS Lake Formation, a fully managed service that makes it much easier for customers to build, secure, and manage data lakes. AWS Lake Formation simplifies and automates many of the complex manual steps usually required to create a data lake, including collecting, cleaning, and cataloging data, and securely making that data available for analytics.

Users can bring their data into a data lake from a variety of sources using pre-defined templates, automatically classify and prepare the data, and centrally define granular data access policies to govern access by different groups within an organization. 

Customers can then analyze this data using their choice of AWS analytics and machine learning services, including Amazon Redshift, Amazon Athena, and AWS Glue, with Amazon EMR, Amazon QuickSight, and Amazon SageMaker following in the next few months. There are no additional charges required to use AWS Lake Formation, and customers pay only for the underlying AWS services used. 

A data lake is a centralized, curated, and secured repository that stores data both in its original form and in a form prepared for analysis. A data lake enables users to break down data silos and combine different types of analytics to gain insights and guide better business decisions.

However, setting up and managing data lakes involves a lot of manual, complicated, and time-consuming tasks. This work includes loading data from diverse sources, monitoring those data flows, setting up partitions, turning on encryption and managing keys, defining transformation jobs and monitoring their operation, re-organizing data into a columnar format, configuring access control settings, deduplicating redundant data, matching linked records, granting access to data sets, and auditing access over time.

Creating a data lake with Lake Formation is as simple as defining the data sources and the data access and security policies to apply. Lake Formation also helps users collect and catalog data from databases and object storage, move the data into a new Amazon S3 data lake, clean and classify data using machine learning algorithms, and secure access to sensitive data. Users can access a centralized data catalog, which describes available data sets and their appropriate usage.

Customers want to be able to perform analytics and machine learning across all of their data, regardless of the format or where the data lives. A data lake removes data silos and allows data to reside in a central place so customers can more easily apply different types of analytics and machine learning across all of their data.

Amazon Simple Storage Service (Amazon S3) has become a popular place for customers to build data lakes because of its scale, cost-effectiveness, durability, and easy integration with AWS’s analytics and machine learning services. However, even with those significant benefits, building and managing a data lake can still be a complex and time-consuming process. 

Customers need to provision and configure storage, move data from disparate sources into the data lake, and extract the schema and add metadata tags so the data can be found through a searchable data catalog. They must also clean and prepare the data (including partitioning, indexing, and transforming it) to optimize the performance and cost of running analytics on it. Then they have to set up data access roles and enforce security policies across their storage and each of their different analytics engines, and update those policies whenever permissions change or new end users are added.

Customers also have to make the data securely available to their data analysts, who can then analyze and process it using any of the available analytics engines. These steps involve a lot of manual work, and as a result, it can take most customers several months to set up a data lake.

AWS Lake Formation simplifies the process and removes the heavy lifting from setting up a data lake. AWS Lake Formation automates manual, time-consuming steps, like provisioning and configuring storage, crawling the data to extract schema and metadata tags, automatically optimizing the partitioning of the data, and transforming the data into formats like Apache Parquet and ORC that are ideal for analytics. 
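
To illustrate what partitioning means in practice, here is a minimal Python sketch of a date-partitioned object key layout. The announcement does not say which layout Lake Formation produces; the Hive-style `key=value` folder convention shown here is an assumption, chosen because it is widely used for data stored in Amazon S3. The function name, prefix, and file name are hypothetical.

```python
from datetime import date


def partition_key(prefix: str, day: date, filename: str) -> str:
    """Return an object key under a Hive-style year/month/day partition.

    The key=value folder naming is an assumed convention, not something
    the announcement specifies.
    """
    base = prefix.rstrip("/")  # avoid a doubled slash if the prefix ends in "/"
    return (
        f"{base}/year={day.year:04d}/month={day.month:02d}/"
        f"day={day.day:02d}/{filename}"
    )


if __name__ == "__main__":
    # Hypothetical table prefix and Parquet file name.
    print(partition_key("sales/orders/", date(2019, 8, 8), "part-0000.parquet"))
```

With a layout like this, a query engine that filters on date can skip folders outside the requested range, which is where the performance and cost benefit of partitioning comes from.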

AWS Lake Formation cleans and deduplicates data using machine learning to improve data consistency and quality. To simplify data access and security, AWS Lake Formation provides a single, centralized place to set up and manage data access policies, governance, and auditing across Amazon S3 and multiple analytics engines. 
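
As a rough sketch of what a centrally defined, granular access policy might look like in code, the Python example below builds a request for the Lake Formation `GrantPermissions` API, which the AWS SDK for Python (boto3) exposes as `grant_permissions`. The role ARN, database, table, and column names are hypothetical, and the helper functions are illustrative rather than part of any AWS library. Sending the request requires boto3 and valid AWS credentials, so that step is kept in a separate function.

```python
from typing import Any, Dict, List, Optional


def build_grant_request(
    principal_arn: str,
    database: str,
    table: str,
    columns: Optional[List[str]] = None,
    permissions: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for a Lake Formation GrantPermissions call."""
    if columns:
        # Column-level grant: the principal sees only the listed columns.
        resource: Dict[str, Any] = {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        }
    else:
        # Table-level grant: applies to the whole table.
        resource = {"Table": {"DatabaseName": database, "Name": table}}
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": resource,
        "Permissions": list(permissions or ["SELECT"]),  # default: read-only
    }


def apply_grant(request: Dict[str, Any], region: str = "us-east-1") -> None:
    """Send the grant to Lake Formation (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder runs without the AWS SDK

    client = boto3.client("lakeformation", region_name=region)
    client.grant_permissions(**request)


if __name__ == "__main__":
    # Hypothetical analyst role limited to three non-sensitive columns.
    request = build_grant_request(
        "arn:aws:iam::123456789012:role/AnalystRole",
        database="sales",
        table="orders",
        columns=["order_id", "order_date", "total"],
    )
    print(request)
```

Separating the request builder from the call that sends it lets a policy be reviewed or unit-tested before it is applied to a live account.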

“Our customers tell us that Amazon S3 is the ideal place to house their data lakes, which is why AWS hosts more data lakes than anyone else – with tens of thousands and growing every day. They’ve also told us that they want it to be easier and faster to set up and manage their data lakes,” said Raju Gulabani, vice president for databases, analytics, and machine learning at AWS. “That’s why we built AWS Lake Formation, so customers can spend more time learning from their data and innovating, rather than wrestling that data into functioning data lakes. AWS Lake Formation is available today and we’re excited to see how customers use it as one of the building blocks for growing and transforming their businesses and customer experiences.”

AWS Lake Formation is available in US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), and Europe (Ireland) with additional regions coming soon.
