Over the years, organizations have invested in creating purpose-built, cloud-based data lakes that are siloed from one another. A major challenge is enabling cross-organization discovery and access to data across these multiple data lakes, each built on different technology stacks.
Apache Iceberg is an Apache-licensed, 100% open-source data table format that helps simplify data processing on large datasets stored in data lakes. Data engineers use Apache Iceberg because it’s fast, efficient, and reliable at any scale and keeps records of how datasets change over time.
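As an illustration of that change tracking, here is a minimal PySpark sketch, assuming an Iceberg-enabled Spark session backed by the AWS Glue Data Catalog with the Iceberg Spark runtime on the classpath; the catalog name, database, table, and bucket below are hypothetical.

```python
from pyspark.sql import SparkSession

# Spark session with the Iceberg extensions and a Glue-backed catalog
# (catalog name "glue_catalog" and the S3 warehouse path are assumptions).
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://example-bucket/warehouse/")
    .getOrCreate()
)

# Every commit to an Iceberg table produces a snapshot, so the table's history
# can be inspected (and queried with time travel) at any later point.
spark.sql("CREATE TABLE IF NOT EXISTS glue_catalog.demo_db.orders "
          "(order_id BIGINT, amount DOUBLE) USING iceberg")
spark.sql("INSERT INTO glue_catalog.demo_db.orders VALUES (1, 19.99)")
spark.sql("SELECT snapshot_id, committed_at, operation "
          "FROM glue_catalog.demo_db.orders.snapshots").show()
```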
Unlocking the true value of data often gets impeded by siloed information. Traditional data management—wherein each business unit ingests raw data in separate data lakes or warehouses—hinders visibility and cross-functional analysis. Amazon DataZone natively supports data sharing for Amazon Redshift data assets.
Hive metastore federation for Amazon EMR is applicable to the following use cases: governance of Amazon EMR-based data lakes – producers generate data within their AWS accounts using an Amazon EMR-based data lake supported by EMRFS on Amazon Simple Storage Service (Amazon S3) and HBase.
Additionally, you can use the power of SQL in a view to express complex boundaries in data across multiple tables that can’t be expressed with simpler permissions. Data lakes provide customers the flexibility required to derive useful insights from data across many sources and many use cases.
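For example, a view can encode a row-level boundary that spans a join of two tables, something a plain table- or column-level grant cannot express. A minimal sketch is below, written as Spark SQL with hypothetical database, table, and column names (the original post may target a different query engine).

```python
# Hypothetical view restricting access to EU orders joined with customer data;
# a simple per-table permission could not express this cross-table condition.
spark.sql("""
    CREATE OR REPLACE VIEW sales_db.eu_orders_v AS
    SELECT o.order_id, o.order_date, o.amount, c.customer_name
    FROM sales_db.orders o
    JOIN sales_db.customers c ON o.customer_id = c.customer_id
    WHERE c.region = 'EU'
""")
```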
At the same time, they need to optimize operational costs to unlock the value of this data for timely insights and do so with a consistent performance. With this massive data growth, data proliferation across your data stores, data warehouse, and data lakes can become equally challenging.
One of the bank’s key challenges related to strict cybersecurity requirements is to implement field level encryption for personally identifiable information (PII), Payment Card Industry (PCI), and data that is classified as high privacy risk (HPR). Only users with required permissions are allowed to access data in clear text.
AWS Lake Formation helps you centrally govern, secure, and globally share data for analytics and machine learning. With Lake Formation, you can manage access control for your data lake data in Amazon Simple Storage Service (Amazon S3) and its metadata in the AWS Glue Data Catalog in one place with familiar database-style features.
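As a concrete illustration of those database-style permissions, here is a small boto3 sketch of granting table-level access through Lake Formation; the account ID, IAM role, database, and table names are placeholders, not values from the post.

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Grant SELECT on a Glue Data Catalog table to an IAM role, database-style,
# instead of managing S3 bucket policies (all identifiers below are placeholders).
lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/AnalystRole"
    },
    Resource={
        "Table": {"DatabaseName": "sales_db", "Name": "orders"}
    },
    Permissions=["SELECT"],
)
```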
Amazon Q Developer can now generate complex data integration jobs with multiple sources, destinations, and data transformations. His team works on distributed systems & new interfaces for data integration and efficiently managing data lakes on AWS. Configure an IAM role to interact with Amazon Q.
This solution also allows you to update certain fields of the account object in the data lake and push it back to Salesforce. To achieve this, you create two ETL jobs using AWS Glue with the Salesforce connector, and create a transactional data lake on Amazon S3 using Apache Iceberg.
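A minimal sketch of the update path follows, assuming an Iceberg-enabled Spark session and hypothetical table and column names; the Salesforce read and the push-back step are omitted.

```python
# DataFrame `account_updates` is assumed to hold changed Salesforce account records.
account_updates.createOrReplaceTempView("account_updates")

# Upsert the changed fields into the transactional Iceberg table on Amazon S3.
spark.sql("""
    MERGE INTO glue_catalog.sales_db.account AS t
    USING account_updates AS s
    ON t.account_id = s.account_id
    WHEN MATCHED THEN UPDATE SET t.phone = s.phone, t.billing_city = s.billing_city
    WHEN NOT MATCHED THEN INSERT *
""")
```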
Set up EMR Studio In this step, we demonstrate the actions needed from the data lake administrator to set up EMR Studio enabled for trusted identity propagation and with IAM Identity Center integration. On the Lake Formation console, choose Data lake permissions under Permissions in the navigation pane.
These business units have varying landscapes, where a data lake is managed by Amazon Simple Storage Service (Amazon S3) and analytics workloads are run on Amazon Redshift, a fast, scalable, and fully managed cloud data warehouse that allows you to process and run your complex SQL analytics workloads on structured and semi-structured data.
New feature: Custom AWS service blueprints Previously, Amazon DataZone provided default blueprints that created AWS resources required for data lake, data warehouse, and machine learning use cases. You can build projects and subscribe to both unstructured and structured data assets within the Amazon DataZone portal.
You might be modernizing your data architecture using Amazon Redshift to enable access to your data lake and data in your data warehouse, and are looking for a centralized and scalable way to define and manage the data access based on IdP identities. For IAM role, choose a Lake Formation user-defined role.
In this blog post, there are three personas: a Data Lake Administrator (with admin-level access), user Silver from the Data Engineering group, and user Lead Auditor from the Auditor group. You will see how different personas in an organization can access the data without the need to modify their existing enterprise entitlements.
Sesha Sanjana Mylavarapu is an Associate Data Lake Consultant at AWS Professional Services. She specializes in cloud-based data management and collaborates with enterprise clients to design and implement scalable data lakes. For instructions, see Creating an IAM role (console).
Jumia is a technology company born in 2012, present in 14 African countries, with its main headquarters in Lagos, Nigeria. After the tables are created, the second task transfers HDFS data to a landing bucket in Amazon S3 using AWS DataSync to sync customer data. The following diagram illustrates the architecture.
That was the Science, here comes the Technology… A Brief Hydrology of Data Lakes. Overlapping with the above, from around 2012, I began to get involved in also designing and implementing Big Data Architectures; initially for narrow purposes and later Data Lakes spanning entire enterprises.
As the volume and complexity of analytics workloads continue to grow, customers are looking for more efficient and cost-effective ways to ingest and analyse data. Attach the AWS managed policy GlueServiceRole. Attach the following policy to the role.
Modern applications store massive amounts of data on Amazon Simple Storage Service (Amazon S3) data lakes, providing cost-effective and highly durable storage, and allowing you to run analytics and machine learning (ML) from your data lake to generate insights on your data.
Many customers need an ACID transaction (atomic, consistent, isolated, durable) data lake that can log change data capture (CDC) from operational data sources. There is also demand for merging real-time data into batch data. The Delta Lake framework provides these two capabilities. Choose Create policy.
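A hedged PySpark sketch of that CDC-merge pattern with the Delta Lake API is shown below; the S3 paths, key column, and the `op` flag used to mark deletes are assumptions, not details from the post.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Spark session with Delta Lake enabled (paths and names below are placeholders).
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

target = DeltaTable.forPath(spark, "s3://example-bucket/delta/customers/")
cdc_batch = spark.read.json("s3://example-bucket/cdc/customers/")

# Merge the CDC batch into the Delta table: delete, update, or insert per record.
(
    target.alias("t")
    .merge(cdc_batch.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedDelete(condition="s.op = 'D'")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```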
Abishek Shankar is a software engineer on the AWS Lake Formation team, working on providing managed optimization solutions for Iceberg tables. Shyam Rathi is a Software Development Manager on the AWS Lake Formation team, working on delivering new features and enhancements related to modern data lakes. Choose Permissions.
The company has integrated data analysis throughout its organization to power decision making. From a startup in 2012, it is now valued at $3.2 Optimizing data pipelines: How Kongregate uses Periscope Data. Diving deeper into the datasphere: Data lakes — best practices. A true unicorn.
Note that the extra package (delta-iceberg) is required to create a UniForm table in the AWS Glue Data Catalog. The extra package is also required to generate Iceberg metadata along with Delta Lake metadata for the UniForm table. He’s passionate about helping customers use Apache Iceberg for their data lakes on AWS.
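A hedged sketch of creating such a table from Spark SQL follows; the database, table, location, and the exact table property values are best-effort assumptions, and the delta-iceberg package must be on the Spark classpath as noted above.

```python
# Create a Delta table with UniForm enabled so Iceberg metadata is written
# alongside the Delta metadata (names, location, and property values are assumptions).
spark.sql("""
    CREATE TABLE uniform_db.events (
        event_id BIGINT,
        payload  STRING
    )
    USING delta
    LOCATION 's3://example-bucket/uniform/events/'
    TBLPROPERTIES (
        'delta.enableIcebergCompatV2' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
```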
In recent years, data lakes have become a mainstream architecture, and data quality validation is a critical factor to improve the reusability and consistency of the data. Create and attach a new inline policy (AWSGlueDataQualityBucketPolicy) with the following content.
Amazon Redshift is a fast, scalable, and fully managed cloud data warehouse that allows you to process and run your complex SQL analytics workloads on structured and semi-structured data. It also helps you securely access your data in operational databases, data lakes, or third-party datasets with minimal movement or copying of data.
Eren Baydemir, a Technical Product Manager at AWS, has 15 years of experience in building customer-facing products and is currently focusing on data lake and file ingestion topics in the Amazon Redshift team. He was the CEO and co-founder of DataRow, which was acquired by Amazon in 2020.
For sales across multiple markets, the product sales data such as orders, transactions, and shipment data is available on Amazon S3 in the data lake. The data engineering team can use Apache Spark with Amazon EMR or AWS Glue to analyze this data in Amazon S3, creating a Spark session with Hive support enabled via enableHiveSupport().getOrCreate(), as in the sketch below.
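A complete, minimal version of that session-creation fragment in PySpark might look like this; the application name, database, table, and filter are hypothetical.

```python
from pyspark.sql import SparkSession

# Spark session with Hive support so catalog tables over the S3 data
# can be queried directly (app name and table below are placeholders).
spark = (
    SparkSession.builder
    .appName("product-sales-analysis")
    .enableHiveSupport()
    .getOrCreate()
)

orders = spark.sql("SELECT * FROM sales_db.orders WHERE market = 'EU'")
orders.show()
```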
Founded in 2012, SumUp is the financial partner for more than 4 million small merchants in over 35 markets worldwide, helping them start, run and grow their business. Unless, of course, the rest of their data also resides in the Google Cloud. The Data Science teams also use this data for churn prediction and CLTV modeling.
He is passionate about big data and data analytics. Sandeep Singh is a Lead Consultant at AWS ProServe, focused on analytics, data lake architecture, and implementation. He helps enterprise customers migrate and modernize their data lake and data warehouse using AWS services.
AWS Data Pipeline helps customers automate the movement and transformation of data. With Data Pipeline, customers can define data-driven workflows, so that tasks can be dependent on the successful completion of previous tasks. You can visually create, run, and monitor ETL pipelines to load data into your data lakes.
His background is in data warehouse/data lake architecture, development, and administration. He has been in the data and analytics field for over 14 years. Ramesh Raghupathy is a Senior Data Architect with WWCO ProServe at AWS. While not at work, Ramesh enjoys traveling, spending time with family, and yoga.
Vamsi Bhadriraju is a Data Architect at AWS. He works closely with enterprise customers to build datalakes and analytical applications on the AWS Cloud. Srikanth Baheti is a Specialized World Wide Principal Solutions Architect for Amazon QuickSight.
About the Authors Chiho Sugimoto is a Cloud Support Engineer on the AWS Big Data Support team. She is passionate about helping customers build data lakes using ETL workloads. Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. Choose the created IAM role.
Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. We deploy the Debezium MySQL source Kafka connector on Amazon MSK Connect, configured along the lines sketched below.
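The sketch below shows the kind of connector configuration involved, expressed as a Python dictionary; every hostname, credential, and topic name is a placeholder rather than a value from the post, and older Debezium versions use `database.server.name` instead of `topic.prefix`.

```python
# Placeholder Debezium MySQL source connector configuration for MSK Connect.
debezium_mysql_config = {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "tasks.max": "1",
    "database.hostname": "mysql.example.internal",   # placeholder endpoint
    "database.port": "3306",
    "database.user": "cdc_user",
    "database.password": "REPLACE_ME",
    "database.server.id": "184054",
    "database.include.list": "salesdb",
    "topic.prefix": "salesdb",  # Debezium 2.x; 1.x uses database.server.name
    "schema.history.internal.kafka.bootstrap.servers": "b-1.example.kafka:9092",
    "schema.history.internal.kafka.topic": "schema-changes.salesdb",
}
```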
Users can also raise requests to producers to improve the way the data is presented or to enrich the data with new data points for generating a higher business value. At the same time, each team can also map other catalogs to their own account and use their own data, which they produce along with the data from other accounts.
2012: Amazon Redshift, the first cloud-based data warehouse service of its kind, comes into existence. Google launches BigQuery, its own data warehousing tool, and Microsoft introduces Azure SQL Data Warehouse and Azure Data Lake Store. Data lakes or data lakehouses alone cannot solve the efficiency problem.
Optionally, specify the Amazon S3 storage class for the data in Amazon Security Lake. For more information, refer to Lifecycle management in Security Lake. Review the details and create the data lake. Choose Next. Additionally, the principal must have permission to pass the pipeline role to OpenSearch Ingestion.
In this example, the analytics tool accesses the data lake on Amazon Simple Storage Service (Amazon S3) through Athena queries. As the data mesh pattern expands across domains covering more downstream services, we need a mechanism to keep IdPs and IAM role trusts continuously updated.
To answer these questions we need to look at how data roles within the job market have evolved, and how academic programs have changed to meet new workforce demands. In the 2010s, the growing scope of the data landscape gave rise to a new profession: the data scientist. The data scientist.
And so I actually transitioned out of that group and into the Big Data Appliance group at Oracle, but soon realized that if that was what I wanted to keep doing, this up and coming company called Cloudera might be a better place to do it since these new technologies weren’t just a hobby at Cloudera. As you mentioned, Qlik is in there.
To learn more about using the interactive data preparation authoring experience in AWS Glue Studio, check out the following video and read the AWS News Blog. About the Authors Chiho Sugimoto is a Cloud Support Engineer on the AWS Big Data Support team. She is passionate about helping customers build data lakes using ETL workloads.
Download the code zip bundle for the Lambda function used to populate the data lake data (datalake-population-function.zip). For s3KeyLambdaDataPopulationCode, enter the Amazon S3 location containing the code zip bundle for the Lambda function used to populate the data lake data (datalake-population-function.zip).
Somehow, the gravity of the data has a geological effect that forms data lakes. Also, data science workflows begin to create feedback loops from the big data side of the illo above over to the DW side. DG emerges for the big data side of the world, e.g., the Alation launch in 2012.