Why: Data Makes It Different. In contrast to traditional software, a defining feature of ML-powered applications is that they are directly exposed to large amounts of messy, real-world data that is too complex to be understood and modeled by hand. However, the concept remains quite abstract. Can't we just fold it into existing DevOps best practices?
How dbt Core helps data teams test, validate, and monitor complex data transformations and conversions. dbt Core, an open-source framework for developing, testing, and documenting SQL-based data transformations, has become a must-have tool for modern data teams as the complexity of data pipelines grows.
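As a minimal sketch of running those tests from Python (assuming dbt-core 1.5 or later; the `orders` selector is a hypothetical model name, not one from the post):

```python
# Minimal sketch: invoking dbt tests programmatically (dbt-core >= 1.5).
# The "orders" model selector is hypothetical.
from dbt.cli.main import dbtRunner, dbtRunnerResult

dbt = dbtRunner()
res: dbtRunnerResult = dbt.invoke(["test", "--select", "orders"])

# Each result reports the pass/fail/error status of one test node.
for r in res.result:
    print(f"{r.node.name}: {r.status}")
```

In day-to-day use the same tests are usually run from the CLI with `dbt test`; the programmatic entry point is handy inside orchestration code.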
Big data is revolutionizing many fields of business, and logistics analytics is no exception. The complex and ever-evolving nature of logistics makes it an essential use case for big data applications.
AI is transforming how senior data engineers and data scientists validate data transformations and conversions. AI-based verification approaches help detect anomalies, enforce data integrity, and optimize pipelines for improved efficiency.
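As an illustrative sketch of the idea (not the specific techniques from the article), a validation step can flag a suspicious load by comparing it against recent history; here a simple z-score check in pandas with made-up row counts:

```python
import pandas as pd

# Hypothetical daily row counts from recent successful loads.
history = pd.Series([10_120, 10_233, 9_987, 10_410, 10_055])
todays_rows = 55_000

# Flag today's load if it sits more than 3 standard deviations
# away from the historical mean.
z = abs(todays_rows - history.mean()) / history.std()
if z > 3:
    print(f"Anomaly: today's row count is {z:.0f} sigma from history")
```

Production systems use richer models (seasonality-aware forecasts, learned thresholds), but the validation principle is the same.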
Apache Iceberg is an open table format for data lakes that manages large collections of files as tables. It supports modern analytical data lake operations such as create table as select (CTAS), upsert and merge, and time travel queries. Time travel, however, requires knowledge of a table's current snapshots.
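A hedged sketch of what that looks like with Iceberg's Spark integration (the table name and snapshot ID are illustrative, and the session is assumed to have an Iceberg catalog configured):

```python
from pyspark.sql import SparkSession

# Assumes the session is already configured with an Iceberg catalog.
spark = SparkSession.builder.getOrCreate()

# Iceberg exposes a table's snapshots as a metadata table.
spark.sql("SELECT snapshot_id, committed_at FROM db.orders.snapshots").show()

# Time travel: read the table as of a specific snapshot ID.
df = (spark.read
      .option("snapshot-id", 10963874102873)  # illustrative ID
      .format("iceberg")
      .load("db.orders"))
```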
The general availability covers Iceberg running within some of the key data services in CDP, including Cloudera Data Warehouse (CDW), Cloudera Data Engineering (CDE), and Cloudera Machine Learning (CML). One of the most common data management tasks that follows is modifying a table's schema.
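For instance, Iceberg supports in-place schema evolution through standard DDL; a sketch via Spark SQL (table and column names are illustrative, and CDW or CDE users would issue the equivalent statements through their engine of choice):

```python
from pyspark.sql import SparkSession

# Assumes an Iceberg catalog is configured on the session.
spark = SparkSession.builder.getOrCreate()

# Iceberg applies these schema changes in place, without rewriting data files.
spark.sql("ALTER TABLE db.flights ADD COLUMNS (carrier_name STRING)")
spark.sql("ALTER TABLE db.flights RENAME COLUMN carrier_name TO carrier")
spark.sql("ALTER TABLE db.flights DROP COLUMN carrier")
```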
Packaging Apache Airflow and exposing it as a managed service within CDE alleviates the typical operational overhead of managing security and uptime, while providing data engineers a job management API to schedule and monitor multi-step pipelines. It also enables sharing other directories with full audit trails.
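A minimal sketch of such a multi-step pipeline as a plain Airflow DAG (the task commands are placeholders; in CDE the DAG file is submitted through the managed job API):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Two-step pipeline; the bash commands are placeholders.
with DAG(
    dag_id="multi_step_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")

    # transform runs only after extract succeeds.
    extract >> transform
```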
We carried out the migration as follows: we created a new cluster with eight ra3.4xlarge nodes from the snapshot of our four-node dc2.8xlarge cluster. The migration produced a significant decrease in the delivery time of our critical data analytics processes. We then removed the DC2 cluster and completed the migration.
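For reference, that kind of restore can be scripted with boto3 (the cluster and snapshot identifiers below are hypothetical):

```python
import boto3

redshift = boto3.client("redshift")

# Restore an 8-node RA3 cluster from the snapshot of the old DC2 cluster.
# Identifiers are hypothetical.
redshift.restore_from_cluster_snapshot(
    ClusterIdentifier="analytics-ra3",
    SnapshotIdentifier="dc2-final-snapshot",
    NodeType="ra3.4xlarge",
    NumberOfNodes=8,
)
```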
The Orca Platform is powered by a state-of-the-art anomaly detection system that uses cutting-edge ML algorithms and big data capabilities to detect potential security threats and alert customers in real time, ensuring maximum security for their cloud environment. Why did Orca build a data lake? Why did Orca choose Apache Iceberg?
Customers now get the same consistent view of their data with the analytic processing engine of their choice, without compromises. Within CDP, Shared Data Experience (SDX) provides centralized governance, security, cataloging, and lineage, including fine-grained access control (FGAC) with Spark through the new Spark Secure Access Mode.
To grow the power of data at scale for the long term, it’s highly recommended to design an end-to-end development lifecycle for your data integration pipelines. The following are common asks from our customers: Is it possible to develop and test AWS Glue data integration jobs on my local laptop?
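One common pattern (a sketch, not necessarily the post's exact approach) is to write the job against the AWS Glue libraries and run it inside the official Glue Docker image, so the same script works on a laptop and in the Glue service; the S3 path is a placeholder:

```python
# job.py -- runnable locally inside the AWS Glue Docker image
# (amazon/aws-glue-libs) and unchanged in the Glue service.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glue_ctx = GlueContext(sc)
spark = glue_ctx.spark_session

# Placeholder input path; point it at a small local file when testing.
df = spark.read.json("s3://my-bucket/raw/events/")
df.filter("event_type = 'purchase'").show()
```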
dbt is an open source, SQL-first templating engine that lets you write repeatable and extensible data transforms in Python and SQL. dbt is predominantly used by customers of data warehouses (such as Amazon Redshift) who are looking to keep their data transformation logic separate from the storage and query engine.
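A minimal sketch of a dbt model; since this digest's examples use Python, here is a dbt Python model (supported on adapters such as Snowflake, Databricks, and BigQuery; on Redshift, models are written as templated SQL instead). The referenced staging model is hypothetical, and the DataFrame API details vary by adapter:

```python
# models/orders_daily.py -- a hypothetical dbt Python model.
def model(dbt, session):
    # dbt.ref() resolves another model and records the dependency,
    # just like {{ ref() }} does in a SQL model.
    orders = dbt.ref("stg_orders")

    # Return one row per day; dbt materializes the returned DataFrame.
    return orders.groupBy("order_date").count()
```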
The lift and shift migration approach is limited in its ability to transform businesses because it relies on outdated, legacy technologies and architectures that limit flexibility and slow down productivity. The example architecture features a call center streaming data source that sends the latest call center feed every 15 seconds.
The Amazon EMR record server component supports table-, column-, row-, cell-, and nested attribute-level data filtering. The solution can be deployed via an AWS CloudFormation template that automatically sets up the required services and components, including an S3 bucket for the data lake.
However, you might face significant challenges when planning for a large-scale data warehouse migration. Identify all upstream and downstream applications, as well as business processes that rely on the data warehouse. You will also need data transformation experts to convert database stored functions in the producer or consumer.
This post is co-written by Anish Moorjani, Data Engineer at SafetyCulture. Amazon Redshift is a fully managed data warehouse service that tens of thousands of customers use to manage analytics at scale. A source of unpredictable workloads is dbt Cloud, which SafetyCulture uses to manage data transformations in the form of models.
Although Tricentis has amassed such data over a decade, the data remains untapped for valuable insights. Each of these tools has its own reporting capabilities, which makes it difficult to combine the data for integrated and actionable business insights.
You can then apply transformations and store data in Delta format for managing inserts, updates, and deletes. Amazon EMR Serverless is a serverless option in Amazon EMR that makes it easy for data analysts and engineers to run open-source big data analytics frameworks without configuring, managing, and scaling clusters or servers.
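A sketch of writing Delta from such a Spark job (these are the standard Delta Lake session settings; the bucket path is a placeholder):

```python
from pyspark.sql import SparkSession

# Standard Delta Lake settings; on EMR Serverless these can also be
# supplied as job configuration.
spark = (SparkSession.builder
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

df = spark.createDataFrame([(1, "insert"), (2, "update")], ["id", "op"])

# Append to a Delta table; bucket and prefix are placeholders.
df.write.format("delta").mode("append").save("s3://my-bucket/delta/events/")
```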
To build a data-driven business, it is important to democratize enterprise data assets in a data catalog. With a unified data catalog, you can quickly search datasets and figure out data schema, data format, and location. The Amazon EMR Flink CDC connector reads the binlog data and processes it.
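As a hedged sketch of declaring such a CDC source with PyFlink's Table API (connection details are placeholders, and the MySQL CDC connector jar must be on the Flink classpath):

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Declare a table backed by the MySQL binlog via the CDC connector.
# Hostname, credentials, and table names are placeholders.
t_env.execute_sql("""
    CREATE TABLE orders_cdc (
        order_id INT,
        amount DECIMAL(10, 2),
        PRIMARY KEY (order_id) NOT ENFORCED
    ) WITH (
        'connector' = 'mysql-cdc',
        'hostname' = 'mysql.example.com',
        'port' = '3306',
        'username' = 'flink',
        'password' = 'secret',
        'database-name' = 'shop',
        'table-name' = 'orders'
    )
""")
```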
Add in the de facto requirement to combine all your reporting data, and it presents quite a challenge. As more companies move their data into the cloud, methods for storing and managing that data also adapt and grow. This growth is caused, in part, by the increasing use of cloud platforms for data storage and processing.
These include managing complex extract, transform, and load (ETL) processes, handling schema validation, providing reliable delivery, and maintaining custom code for data transformations. Firehose delivers streaming data with configurable buffering options that can be optimized for near-zero latency.
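For a sense of the producer side, a record can be pushed to a delivery stream with boto3 (the stream name and payload are hypothetical); Firehose then batches records according to the stream's configured buffering size and interval:

```python
import json

import boto3

firehose = boto3.client("firehose")

# Send one newline-delimited JSON record; the stream name is hypothetical.
firehose.put_record(
    DeliveryStreamName="clickstream-to-s3",
    Record={"Data": (json.dumps({"event": "page_view"}) + "\n").encode()},
)
```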
As businesses generate more data from a variety of sources, they need systems to effectively manage that data and use it for business outcomes, such as providing better customer experiences or reducing costs. Such a system also allows customers to read and write data concurrently using different frameworks.
What is data lineage? Data lineage traces data’s origin, history, and movement through various processing, storage, and analysis stages. It is used to understand the provenance of data and how it is transformed and to identify potential errors or issues. What is missing in data lineage?