A data lake is a centralized repository that you can use to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure it, and run different types of analytics for better business insights.
Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback. AWS Glue 3.0 and later supports the Apache Iceberg framework for data lakes. The following diagram illustrates the solution architecture.
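To make time travel and rollback concrete, here is a minimal PySpark sketch, assuming an Iceberg-enabled Spark session; the catalog, database, table, and snapshot ID are all hypothetical:

```python
from pyspark.sql import SparkSession

# Minimal sketch: catalog, database, table, and snapshot ID are hypothetical.
# Assumes a Spark runtime (e.g., AWS Glue 3.0+ or Amazon EMR) with the Iceberg
# runtime JAR, SQL extensions, and a catalog named "glue_catalog" configured.
spark = SparkSession.builder.appName("iceberg-time-travel").getOrCreate()

# Current state of the table.
spark.sql("SELECT count(*) FROM glue_catalog.sales_db.orders").show()

# Time travel: query the table as it existed at an earlier point in time
# (Spark 3.3+ SQL syntax; also available via VERSION AS OF <snapshot-id>).
spark.sql("""
    SELECT count(*)
    FROM glue_catalog.sales_db.orders
    TIMESTAMP AS OF '2023-06-01 00:00:00'
""").show()

# Rollback: restore the table to an earlier snapshot with an Iceberg
# stored procedure (the snapshot ID here is a placeholder).
spark.sql("""
    CALL glue_catalog.system.rollback_to_snapshot(
        'sales_db.orders', 1234567890123456789)
""")
```

Time travel reads are ordinary queries, so they can feed the same analytics jobs that run against the current table state.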
Initially, data warehouses were the go-to solution for structured data and analytical workloads but were limited by proprietary storage formats and their inability to handle unstructured data. Eventually, transactional data lakes emerged to add the transactional consistency and performance of a data warehouse to the data lake.
Amazon Redshift enables you to efficiently query and retrieve structured and semi-structured data from open-format files in an Amazon S3 data lake without having to load the data into Amazon Redshift tables. Amazon Redshift extends SQL capabilities to your data lake, enabling you to run analytical queries.
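As an illustration of extending SQL to the data lake, here is a hedged sketch using Redshift Spectrum; the endpoint, credentials, IAM role, and schema/table names are all hypothetical:

```python
import redshift_connector  # pip install redshift-connector

# Minimal sketch: host, credentials, IAM role, and names are hypothetical.
conn = redshift_connector.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    database="dev",
    user="awsuser",
    password="...",
)
conn.autocommit = True  # external-schema DDL runs outside a transaction
cur = conn.cursor()

# Map a Glue Data Catalog database as an external schema (Redshift Spectrum).
cur.execute("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS lake
    FROM DATA CATALOG DATABASE 'sales_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
""")

# Query open-format files on S3 with plain SQL, without loading them.
cur.execute("""
    SELECT l.order_date, sum(l.amount)
    FROM lake.orders l
    GROUP BY l.order_date
    ORDER BY 1
""")
print(cur.fetchall())
```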
These features allow efficient data corrections, gap-filling in time series, and historical data updates without disrupting ongoing analyses or compromising data integrity. Unlike direct Amazon S3 access, Iceberg supports these operations on petabyte-scale data lakes without requiring complex custom code.
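For example, a historical correction or a late-arriving backfill can each be expressed as one SQL statement against the Iceberg table; a minimal sketch with hypothetical table and column names, assuming an Iceberg-enabled Spark session:

```python
from pyspark.sql import SparkSession

# Minimal sketch; table and column names are hypothetical. Assumes the
# Iceberg SQL extensions and a catalog named "glue_catalog" are configured.
spark = SparkSession.builder.appName("iceberg-corrections").getOrCreate()

# Record-level correction: fix a mis-keyed sensor reading in place.
spark.sql("""
    UPDATE glue_catalog.iot_db.readings
    SET value = 21.4
    WHERE device_id = 'sensor-42'
      AND reading_ts = TIMESTAMP '2023-06-01 10:15:00'
""")

# Gap-filling: upsert late-arriving rows from a staging table.
spark.sql("""
    MERGE INTO glue_catalog.iot_db.readings t
    USING glue_catalog.iot_db.late_arrivals s
    ON t.device_id = s.device_id AND t.reading_ts = s.reading_ts
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```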
When you build your transactional data lake using Apache Iceberg to solve your functional use cases, you also need to focus on operational use cases for your S3 data lake to optimize the production environment for availability. Note the configuration parameters s3.write.tags.write-tag-name and s3.delete.tags.delete-tag-name.
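These are Iceberg S3FileIO properties that can be set on the catalog so that data files are tagged on write and on delete, for example so that S3 Lifecycle rules can expire them later; a hedged PySpark sketch in which the catalog name, bucket, and tag values are hypothetical:

```python
from pyspark.sql import SparkSession

# Minimal sketch: catalog name, warehouse path, and tag values are
# hypothetical. s3.write.tags.* / s3.delete.tags.* are Iceberg S3FileIO
# options; objects receive these S3 tags on write/delete.
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.glue_catalog",
            "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse",
            "s3://my-bucket/warehouse/")
    .config("spark.sql.catalog.glue_catalog.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")
    # Tag newly written data files.
    .config("spark.sql.catalog.glue_catalog.s3.write.tags.write-tag-name",
            "created")
    # Tag files on logical deletion instead of hard-deleting them,
    # together with s3.delete-enabled=false.
    .config("spark.sql.catalog.glue_catalog.s3.delete.tags.delete-tag-name",
            "deleted")
    .config("spark.sql.catalog.glue_catalog.s3.delete-enabled", "false")
    .getOrCreate()
)
```

Because the tags ride on the S3 objects themselves, lifecycle transitions can target only the files Iceberg has logically deleted.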
As organizations across the globe modernize their data platforms with data lakes on Amazon Simple Storage Service (Amazon S3), handling slowly changing dimensions (SCDs) in those data lakes can be challenging.
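As one common pattern, an SCD Type 2 change can be applied to an Iceberg dimension table in two SQL steps; a minimal sketch, assuming an Iceberg-enabled Spark session and hypothetical table and column names:

```python
from pyspark.sql import SparkSession

# Minimal sketch: SCD Type 2 on an Iceberg dimension table. All table and
# column names are hypothetical; assumes Iceberg SQL extensions configured.
spark = SparkSession.builder.appName("scd2-merge").getOrCreate()

# Step 1: close out the current version of any customer whose tracked
# attribute changed, marking the old row inactive.
spark.sql("""
    MERGE INTO glue_catalog.dim_db.customer_dim t
    USING glue_catalog.staging_db.customer_updates s
    ON t.customer_id = s.customer_id AND t.is_current = true
    WHEN MATCHED AND t.address <> s.address THEN
      UPDATE SET is_current = false, end_date = current_date()
""")

# Step 2: insert new current rows for changed and brand-new customers.
# The anti-join skips customers whose current row survived step 1.
spark.sql("""
    INSERT INTO glue_catalog.dim_db.customer_dim
    SELECT s.customer_id, s.address,
           current_date() AS start_date,
           CAST(NULL AS DATE) AS end_date,
           true AS is_current
    FROM glue_catalog.staging_db.customer_updates s
    LEFT JOIN glue_catalog.dim_db.customer_dim t
      ON t.customer_id = s.customer_id AND t.is_current = true
    WHERE t.customer_id IS NULL
""")
```

Running the close-out before the insert keeps the is_current flag unique per business key.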
These announcements drive forward the AWS Zero-ETL vision to unify all your data, enabling you to better maximize the value of your data with comprehensive analytics and ML capabilities, and innovate faster with secure data collaboration within and across organizations.
Key statistics highlight the severity of the issue: 57% of respondents in a 2024 dbt Labs survey rated data quality as one of the three most challenging aspects of data preparation (up from 41% in 2023). 73% of data practitioners do not trust their data (IDC).
Data-driven organizations treat data as an asset and use it across different lines of business (LOBs) to drive timely insights and better business decisions. This leads to having data across many instances of data warehouses and data lakes using a modern data architecture in separate AWS accounts.
Juergen Sussner, Lead Cloud Platform Engineer at DATEV eG, advises organizations to implement small use cases and test them well: if they work, scale them; if not, try another use case. For example, litigation has surfaced against companies for training AI tools on data lakes containing thousands of unlicensed works.
Save the date: AWS re:Invent 2023 is happening from November 27 to December 1 in Las Vegas, and you cannot miss it. Reserve your seat now! All work and no play … not at re:Invent! In today’s data-driven landscape, the quality of data is the foundation upon which the success of organizations and innovations stands.
In our previous post, Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes, we discussed how you can implement solutions to improve the operational efficiency of your Amazon Simple Storage Service (Amazon S3) data lake that uses the Apache Iceberg open table format and runs on the Amazon EMR big data platform.
Defining a strategic relationship: in July 2023, Dener Motorsport began working with Microsoft Fabric to get at that data in real time, specifically the Fabric components Synapse Real-Time Analytics for data streaming analysis and Data Activator to monitor and trigger actions in real time.
These tables are then joined with tables from the Enterprise Data Lake (EDL) at runtime. During feature development, data engineers require a seamless interface to the EDW. In the previous solution, product team data engineers spent 30 minutes per run manually exposing Redshift data to Spark.
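One way to remove such a manual step is to read Redshift tables directly into Spark through the community Spark-Redshift connector; a hedged sketch in which the JDBC URL, IAM role, S3 scratch path, and table names are all hypothetical:

```python
from pyspark.sql import SparkSession

# Minimal sketch: JDBC URL, IAM role, temp S3 path, and table names are
# hypothetical. Assumes the spark-redshift connector JAR is on the
# classpath (bundled on recent Amazon EMR releases).
spark = SparkSession.builder.appName("redshift-to-spark").getOrCreate()

df = (
    spark.read
    .format("io.github.spark_redshift_community.spark.redshift")
    .option("url", "jdbc:redshift://my-cluster.abc123.us-east-1"
                   ".redshift.amazonaws.com:5439/dev")
    .option("dbtable", "analytics.feature_inputs")
    # The connector unloads via S3, so it needs a scratch location.
    .option("tempdir", "s3://my-bucket/redshift-tmp/")
    .option("aws_iam_role", "arn:aws:iam::123456789012:role/RedshiftUnload")
    .load()
)

# The Redshift table is now a Spark DataFrame, ready to join with EDL tables.
df.createOrReplaceTempView("feature_inputs")
```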
Amazon Q generative SQL for Amazon Redshift was launched in preview during AWS re:Invent 2023. Safety features: Amazon Q generative SQL has built-in safeguards that warn if a generated SQL statement would modify data, and statements run only according to user permissions. To test this, let’s ask Amazon Q to “delete data from web_sales table.”
If anything, 2023 has proved to be a year of reckoning for businesses, and IT leaders in particular, as they attempt to come to grips with the disruptive potential of this technology — just as debates over the best path forward for AI have accelerated and regulatory uncertainty has cast a longer shadow over its outlook in the wake of these events.
Amazon Redshift integrates with AWS HealthLake and data lakes through Redshift Spectrum and Amazon S3 auto-copy features, enabling you to query data directly from files on Amazon S3. This means you no longer have to create an external schema in Amazon Redshift to use the data lake tables cataloged in the Data Catalog.
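With this integration, Data Catalog tables can be queried through the automatically mounted awsdatacatalog database; a hedged sketch with hypothetical endpoint, credentials, and table names:

```python
import redshift_connector  # pip install redshift-connector

# Minimal sketch: endpoint, credentials, and names are hypothetical.
conn = redshift_connector.connect(
    host="my-workgroup.123456789012.us-east-1"
         ".redshift-serverless.amazonaws.com",
    database="dev",
    user="awsuser",
    password="...",
)
cur = conn.cursor()

# Query a Glue Data Catalog table directly (no CREATE EXTERNAL SCHEMA
# needed) via the auto-mounted "awsdatacatalog" database.
cur.execute("""
    SELECT patient_id, count(*)
    FROM awsdatacatalog.healthlake_db.observations
    GROUP BY patient_id
    LIMIT 10
""")
print(cur.fetchall())
```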
Uber understood that digital superiority required the capture of all their transactional data, not just a sampling. They stood up a file-based data lake alongside their analytical database. Because much of the work done on their data lake is exploratory in nature, many users want to execute untested queries on petabytes of data.
I took the free version of ChatGPT for a test drive (in March 2023) and asked some simple questions about the data lakehouse and its components. Hopefully this blog will give ChatGPT an opportunity to learn and correct itself while counting towards my 2023 contribution to social good.
Tens of thousands of customers use Amazon Redshift to gain business insights from their data. With Amazon Redshift, you can use standard SQL to query data across your data warehouse, operational data stores, and data lake.
Use Lake Formation to grant permissions to users to access data. Test the solution by accessing data with a corporate identity. Audit user data access. On the Lake Formation console, choose Data lake permissions under Permissions in the navigation pane. Select Named Data Catalog resources.
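The same grant can be scripted instead of clicked through; a hedged boto3 sketch in which the account ID, role ARN, database, and table names are hypothetical:

```python
import boto3

# Minimal sketch: principal ARN, database, and table names are hypothetical.
lf = boto3.client("lakeformation", region_name="us-east-1")

# Grant SELECT on a specific Data Catalog table to an IAM role, the
# scripted equivalent of the console's Data lake permissions -> Grant flow.
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier":
            "arn:aws:iam::123456789012:role/AnalystRole"
    },
    Resource={
        "Table": {
            "DatabaseName": "sales_db",
            "Name": "orders",
        }
    },
    Permissions=["SELECT"],
)
```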
On May 3, 2023, Cloudera kicked off a contest called “Best in Flow” for NiFi developers to compete to build the best data pipelines. The flow he built differentiates between a test and a true API call before initiating a secure login. Completeness is estimated by comparing a test result with the “estimated total.”
The latest Clusit Report counted 2,779 serious incidents globally in 2023 (+12% over 2022), of which 310 were in Italy, i.e., 11% of the worldwide total and an increase of as much as 65% in a single year. “In cybersecurity I am proceeding this way, with tests for the control room,” the manager reveals.
Indeed, according to Istat’s “Report Imprese e Ict 2023,” the lack of skills is the main brake on the adoption of AI technologies in Italy: 55.1% of the companies that considered using AI without then adopting it gave up due to a shortage of skills and of understanding of the possibilities for their business.
In the era of data, organizations are increasingly using data lakes to store and analyze vast amounts of structured and unstructured data. Data lakes provide a centralized repository for data from various sources, enabling organizations to unlock valuable insights and drive data-driven decision-making.
To make all this possible, the data had to be collected, processed, and fed into the systems that needed it in a reliable, efficient, scalable, and secure way. Data warehouses then evolved into data lakes, and then data fabrics and other enterprise-wide data architectures.
Although this approach works well for many use cases, it requires data to be moved, and therefore duplicated, before it can be visualized. When enriching data with reference data in another data store, keep in mind that with ksqlDB queries the source and destination are always Kafka topics. Choose Create data source.
DataRobot on Azure accelerates the machine learning lifecycle with advanced capabilities for rapid experimentation across new data sources and multiple problem types. This generates reliable business insights and sustains AI-driven value across the enterprise. For more information, visit [link]. DataRobot launch event: From Vision to Value.
Performance with materialized views: to evaluate the performance of queries in the presence of materialized views in Iceberg table format, we used a TPC-DS data set at the 1 TB scale factor, partitioned on the d_year column. We ran the ANALYZE command to gather both table and column statistics on all the base tables.
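For reference, a statistics-gathering step of this kind looks roughly like the following Spark SQL sketch; the catalog and table names are hypothetical, the post's engine is unspecified, and ANALYZE support for Iceberg tables varies by engine and version:

```python
from pyspark.sql import SparkSession

# Minimal sketch: gathers table- and column-level statistics on some
# TPC-DS base tables. Names are hypothetical; other engines (Trino,
# Athena, etc.) use their own ANALYZE dialects.
spark = SparkSession.builder.appName("tpcds-analyze").getOrCreate()

for table in ["store_sales", "date_dim", "item", "customer"]:
    # Table-level stats (row counts, size) plus per-column stats
    # (min/max, null counts, distinct values) for the cost-based optimizer.
    spark.sql(
        f"ANALYZE TABLE glue_catalog.tpcds.{table} "
        "COMPUTE STATISTICS FOR ALL COLUMNS"
    )
```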
Italian companies are investing in infrastructure, software, and services for data management and analysis (+18% in 2023, equal to 2.85 billion euros, according to the Big Data & Business Analytics Observatory of the Politecnico di Milano School of Management), but how many have reached data maturity?
Companies experimenting with GenAI usually create enterprise-level accounts with cloud-based services, such as OpenAI’s ChatGPT or Anthropic’s Claude, and the first field tests and productivity gains lead them to look for further opportunities to deploy the technology.
Watsonx.data is built on three core integrated components: multiple query engines, a catalog that keeps track of metadata, and storage and relational data sources that the query engines access directly. How you can get started today: test out watsonx.ai and watsonx.data for yourself with our watsonx trial experience.
Showpad also struggled with data quality issues around consistency and ownership, and with insufficient data access across its targeted user base due to a complex BI access process, licensing challenges, and insufficient education. As of January 2023, Showpad’s QuickSight instance includes over 2,433 datasets and 199 dashboards.
Web scraping can be direct (carried out by the same party that develops the model) or indirect (carried out on datasets created through web-scraping techniques by third parties other than the model’s developer, thus drawing on third-party data lakes previously created through scraping).
But Barnett, who started work on a strategy in 2023, wanted to continue using Baptist Memorial’s on-premises data center for financial, security, and continuity reasons, so he and his team explored options that allowed for keeping that data center as part of the mix. This is a new way to interact with the web and search.
Iceberg manages large collections of files as tables, and it supports modern analytical data lake operations such as record-level insert, update, delete, and time travel queries. Iceberg also helps guarantee data correctness under concurrent write scenarios. On the Code tab, choose Test, then Configure test event.
Data lakes were originally designed to store large volumes of raw, unstructured, or semi-structured data at a low cost, primarily serving big data and analytics use cases. Enabling automatic compaction on Iceberg tables reduces metadata overhead and improves query performance.
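Automatic compaction can be enabled per table through the AWS Glue table optimizer API; a hedged boto3 sketch with hypothetical account ID, IAM role, and table names:

```python
import boto3

# Minimal sketch: account ID, IAM role, and table names are hypothetical.
glue = boto3.client("glue", region_name="us-east-1")

# Enable managed compaction for an Iceberg table in the Data Catalog.
# Glue then periodically rewrites small data files into larger ones.
glue.create_table_optimizer(
    CatalogId="123456789012",
    DatabaseName="sales_db",
    TableName="orders",
    Type="compaction",
    TableOptimizerConfiguration={
        "roleArn": "arn:aws:iam::123456789012:role/GlueOptimizerRole",
        "enabled": True,
    },
)
```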
Although S3 Lifecycle policies could move data to S3 Glacier, EMR jobs couldn’t easily incorporate this archived data into their processing without manual intervention or separate data retrieval steps. Later Amazon EMR releases offer improved integration with S3 Glacier storage, enabling cost-effective analysis of archived data.
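For context, pulling an archived object back from S3 Glacier before processing typically starts with a restore request; a hedged boto3 sketch with hypothetical bucket and key:

```python
import boto3

# Minimal sketch: bucket and key are hypothetical. Initiates a restore of
# a Glacier-archived object so downstream jobs can read it from S3 again.
s3 = boto3.client("s3")

s3.restore_object(
    Bucket="my-bucket",
    Key="warehouse/orders/year=2019/part-0000.parquet",
    RestoreRequest={
        "Days": 7,  # keep the restored copy available for 7 days
        "GlacierJobParameters": {"Tier": "Bulk"},  # cheapest retrieval tier
    },
)

# Poll the restore status via head_object; ongoing-request="false" in the
# Restore header means the temporary copy is ready to read.
resp = s3.head_object(Bucket="my-bucket",
                      Key="warehouse/orders/year=2019/part-0000.parquet")
print(resp.get("Restore"))
```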
AI agents will arrive in 2025. The Startup Thinking and Digital Transformation Academy observatories of the Politecnico di Milano forecast a 1.5% increase in companies’ ICT budgets for 2025, in line with the trend of the past nine years, albeit with a slightly lower growth rate than in 2023 (+1.9%).
The mega-vendor era: by 2020, the basis of competition for what are now referred to as mega-vendors was interoperability, automation, intra-ecosystem participation, and unlocking access to data to drive business capabilities and value and to manage risk, alongside edge-compute data distribution connecting broad, deep PLM ecosystems.
Our customers ranked us #1 with a rating of 4.9 for “Ease of Use” in the latest BPM Pulse Survey 2023. The platform connects to data lakes and warehouses like Cloudera, Google BigQuery, etc. Scalability: your source systems, data volumes, and calculation complexities change as your business evolves.