Do you buy a solution from a big integration company like IBM, Cloudera, or Amazon? Integrated all-in-one platforms assemble many tools and can therefore cover common workflows end to end. However, some assembly is still required, because these platforms often need to be used alongside other products to form a complete solution.
Delta Lake UniForm can meet this requirement. After the Studio Workspace is created, you are redirected to Jupyter Notebook. Upload the Jupyter notebook, then complete the following steps to configure it to use Delta Lake UniForm with Amazon EMR. First, download delta-lake-uniform-on-aws.ipynb.
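As context for what the notebook configures, here is a minimal sketch of enabling UniForm on a Delta table so it also exposes Iceberg metadata. It assumes a SparkSession on EMR with Delta Lake 3.x and its catalog extensions already configured; the table name and schema are illustrative, not taken from the post.

    # Sketch: create a Delta table with UniForm enabled so Iceberg clients can read it.
    # Assumes `spark` is a SparkSession with Delta Lake 3.x configured.
    spark.sql("""
      CREATE TABLE sales_uniform (id BIGINT, amount DOUBLE)
      USING DELTA
      TBLPROPERTIES (
        'delta.enableIcebergCompatV2' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
      )
    """)

With these properties set, every Delta commit also generates Iceberg metadata, which is what lets engines on the Iceberg side query the same files.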
Solution overview: This post demonstrates text-to-SQL generation for Athena using an example implemented with Amazon Bedrock. The solution architecture and workflow, the relevant CloudFormation template, Jupyter notebooks, and details of launching the necessary AWS services are all covered in this section.
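To make the text-to-SQL step concrete, here is a hedged sketch of prompting a Bedrock model with a table schema and a question. The model ID, table schema, and prompt wording are assumptions for illustration, not the post's exact code.

    import boto3, json

    # Sketch: ask a Bedrock-hosted model to produce an Athena-compatible SQL query.
    bedrock = boto3.client("bedrock-runtime")
    prompt = ("Given the table reviews(product_id string, rating int, review_date date), "
              "write an Athena (Presto) SQL query that counts reviews per product.")
    resp = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # assumed model ID
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 512,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )
    print(json.loads(resp["body"].read())["content"][0]["text"])

In a full solution the generated SQL would then be submitted to Athena and validated before use.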
There is a decades-long tradition of data-centric programming: developers who have been using data-centric IDEs such as RStudio, MATLAB, Jupyter Notebooks, or even Excel to model complex real-world phenomena should find this paradigm familiar. To plug this gap, frameworks like Metaflow or MLflow provide a custom solution for versioning.
    aws redshift-data execute-statement --sql "select count(*) from dev.stage_stores" --session-id 5a254dc6-4fc2-4203-87a8-551155432ee4 --session-keep-alive-seconds 10

Solution walkthrough: You will use AWS Step Functions to call the Data API because this is one of the more straightforward ways to create a codeless ETL.
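For reference, the same Data API pattern from Python with boto3 might look like the sketch below; the workgroup name and SQL are placeholders, and only the session-reuse behavior shown in the CLI example above is assumed.

    import boto3

    client = boto3.client("redshift-data")
    # Open a session and keep it alive briefly, mirroring --session-keep-alive-seconds.
    first = client.execute_statement(
        WorkgroupName="my-workgroup",   # or ClusterIdentifier=... for a provisioned cluster
        Database="dev",
        Sql="select count(*) from dev.stage_stores",
        SessionKeepAliveSeconds=10,
    )
    # Subsequent statements can reuse the returned session, as the CLI call does.
    follow_up = client.execute_statement(
        Sql="select 1",
        SessionId=first["SessionId"],
    )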
In recent years, driven by the commoditization of data storage and processing solutions, the industry has seen a growing number of systematic investment management firms switch to alternative data sources to drive their investment decisions. The bulk of our data scientists are heavy users of Jupyter Notebook.
Data quality solutions almost always boil down to two big issues: politics and cost. Another one-fifth use a notebook environment (such as Jupyter). The problem (and partial solution) is that they need quality data to power their AI projects. This includes deciding what is not worth addressing.
The sheer volume of data captured daily continues to grow, calling for platforms and solutions to evolve. Services such as Amazon Simple Storage Service (Amazon S3) offer a scalable solution that remains cost-effective as datasets grow. This solution was inspired by work with a key AWS customer, the UK Met Office.
Jupyter Notebooks let readers do more than absorb. Today, the standard Jupyter Notebook supports more than 40 programming languages, and it’s common to find R, Julia, or even Java or C within them. Jupyter Notebooks don’t just run themselves.
Overview of solution: In this post, we go through the steps to apply ML-based fuzzy matching to harmonize customer data across two different datasets for auto and property insurance. The following diagram shows our solution architecture. Prerequisites: To follow along with this walkthrough, you must have an AWS account.
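The post's approach is ML-based; as a simpler stand-in that illustrates what "fuzzy matching" means here, the sketch below scores string similarity between two customer records with the rapidfuzz library. The records and the match threshold are illustrative assumptions.

    from rapidfuzz import fuzz

    # Two records for (possibly) the same customer, one from each dataset.
    auto = {"name": "Jon A. Smith", "address": "42 Elm Street"}
    prop = {"name": "John Smith",   "address": "42 Elm St."}

    # Average the name and address similarity scores (0-100 scale).
    score = (fuzz.token_sort_ratio(auto["name"], prop["name"])
             + fuzz.token_sort_ratio(auto["address"], prop["address"])) / 2
    if score > 85:  # threshold chosen for illustration
        print(f"Likely the same customer (score={score:.1f})")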
AWS Glue interactive sessions offer a powerful way to iteratively explore datasets and fine-tune transformations using Jupyter-compatible notebooks. Solution overview: You can quickly provision new interactive sessions directly from your notebook without needing to interact with the AWS Command Line Interface (AWS CLI) or the console.
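A typical first cell of a Glue interactive-session notebook looks like the sketch below: the %-magics configure the session before it starts, then ordinary PySpark code runs in it. The specific values and the S3 path are placeholders.

    # Session configuration magics (values are illustrative).
    %glue_version 4.0
    %worker_type G.1X
    %number_of_workers 2
    %idle_timeout 30

    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())
    df = glue_context.spark_session.read.parquet("s3://my-bucket/raw/")  # hypothetical path
    df.printSchema()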
In recognition of the diverse workloads that data scientists face, Cloudera’s library of Applied ML Prototypes (AMPs) provides data scientists with pre-built reference examples and end-to-end solutions, using some of the most cutting-edge ML methods, for a variety of common data science projects. One example is AutoML with TPOT.
This technology is enabled by the use of notebook IDEs such as the AWS Glue Studio notebook, Amazon SageMaker Studio, or your own Jupyter notebooks. Users can run AWS Glue interactive sessions from AWS Glue Studio notebooks via the AWS Glue console, as well as from Jupyter notebooks that run on their local machine.
Amazon EMR, with its open-source Hadoop modules and support for Apache Spark and Jupyter and JupyterLab notebooks, is a good choice for solving this multi-cloud data access problem. Overview of solution: Amazon EMR includes Apache Hadoop at its core and integrates other related open-source modules.
Kubernetes has emerged as the de facto solution for orchestrating services and microservices in cloud-native design patterns. Meanwhile, terms such as “relational database,” “Oracle database solutions,” “Hive,” “database administration,” “data models,” and “Spark” declined in usage, year over year, in 2019.
Composite AI mixes statistics and machine learning, with industry-specific solutions. SageMaker is a full-service platform with data preparation tools such as Data Wrangler, a nice presentation layer built out of Jupyter notebooks, and an automated option called Autopilot.
This blog post provides a step-by-step guide to building a multimodal search solution using OpenSearch Service. Multimodal search solution architecture: We will provide the steps required to set up multimodal search using OpenSearch Service; the following image depicts the solution architecture. The OpenSearch version used is 2.13.
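Once the multimodal embedding model is deployed, a query combining text and an image might look like the sketch below. The index name, vector field name, model ID, and endpoint are all placeholders for values created during setup; this is a sketch of the neural query shape, not the post's exact code.

    import base64
    from opensearchpy import OpenSearch

    client = OpenSearch(hosts=["https://my-domain:9200"])  # placeholder endpoint, no auth shown
    with open("query.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    # Neural query mixing text and image against a knn_vector field.
    response = client.search(index="products", body={
        "query": {
            "neural": {
                "vector_embedding": {              # placeholder vector field name
                    "query_text": "red running shoes",
                    "query_image": image_b64,
                    "model_id": "<multimodal-model-id>",  # placeholder
                    "k": 5,
                }
            }
        }
    })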
Open table formats, such as Apache Iceberg, provide a solution to this issue. In this post, we show you how to convert existing data in an Amazon S3 data lake from Apache Parquet format to Apache Iceberg format, to support transactions on the data, using Jupyter Notebook-based interactive sessions over AWS Glue 4.0. To get started, choose ETL Jobs.
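One way such a conversion can avoid rewriting data is Iceberg's add_files procedure, which registers existing Parquet files into an Iceberg table. The sketch below assumes a Glue 4.0 session with the Iceberg connector enabled; the catalog, database, table, and S3 path are placeholders, and this is not necessarily the exact method the post uses.

    # Create the target Iceberg table, then register existing Parquet files into it.
    spark.sql("""
      CREATE TABLE IF NOT EXISTS glue_catalog.analytics.events_iceberg (
        event_id string, event_ts timestamp
      ) USING iceberg
    """)
    spark.sql("""
      CALL glue_catalog.system.add_files(
        table => 'analytics.events_iceberg',
        source_table => '`parquet`.`s3://my-bucket/events/`'
      )
    """)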
In this post, we present a solution that uses Uber’s Hexagonal Hierarchical Spatial Index (H3) to divide the globe into equally sized hexagons. Solution overview: The solution extends Athena’s built-in geospatial capabilities by creating a UDF powered by AWS Lambda. Open the notebook instance by choosing Jupyter or JupyterLab.
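To show what H3 indexing does, here is a minimal sketch with the h3-py library (v4 API); the coordinates and resolution are illustrative.

    import h3

    # Index a point into a hexagonal cell; resolution 7 cells cover roughly 5 km^2 each.
    cell = h3.latlng_to_cell(51.5074, -0.1278, 7)   # London at resolution 7
    neighbors = h3.grid_disk(cell, 1)               # the cell plus its ring of neighbors
    print(cell, len(neighbors))

Joining records on their cell IDs is what turns expensive point-in-polygon geometry into a simple equality comparison inside the UDF.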
Solution overview: In this post, we demonstrate how to implement fine-grained access control (FGAC) on Apache Hudi tables using Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2) integrated with Lake Formation. The following diagram illustrates the solution architecture. For example, users can only access data rows that belong to their country.
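As a sketch of how that row-level rule can be expressed, Lake Formation supports data cells filters with a row filter expression; all names and the expression below are placeholders for illustration.

    import boto3

    lf = boto3.client("lakeformation")
    # Define a filter so that queries through Lake Formation only see US rows.
    lf.create_data_cells_filter(
        TableData={
            "TableCatalogId": "111122223333",        # placeholder account ID
            "DatabaseName": "insurance_db",          # placeholder
            "TableName": "hudi_policies",            # placeholder
            "Name": "us_rows_only",
            "RowFilter": {"FilterExpression": "country = 'US'"},
            "ColumnWildcard": {},                    # all columns visible
        }
    )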
Solution overview: To achieve our goal, we use parallel Lambda functions. Deployment of this solution: In this post, we provide step-by-step instructions to deploy each part of the architecture manually. The following screenshot illustrates running the preceding code in a Jupyter notebook.
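A common way to fan work out across parallel Lambda invocations from Python looks like the sketch below; the function name and payloads are placeholders, not the post's actual code.

    import boto3
    from concurrent.futures import ThreadPoolExecutor

    lam = boto3.client("lambda")
    payloads = [b'{"partition": 0}', b'{"partition": 1}', b'{"partition": 2}']

    def invoke(payload):
        # InvocationType="Event" is asynchronous, so all partitions run in parallel.
        return lam.invoke(FunctionName="process-partition",  # hypothetical function
                          InvocationType="Event",
                          Payload=payload)

    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(invoke, payloads))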
BI developers, with an average salary of $83,091, work with databases and software to develop and fine-tune IT solutions. The certification demonstrates that you are up to date with BI technologies and are knowledgeable about best practices, solutions, and emerging trends.
It’s that practice of engineering at Netflix which I find so interesting – really since Ariel Tseitlin led Cloud Solutions there. For example, while enjoying dinner the evening before Rev, a close friend questioned the use of Jupyter beyond exploratory data science. Jupyter fits brilliantly for that purpose. — Randi R.
Domino provides data scientists with the ability to run code either in workspace environments that provide IDEs such as Jupyter, RStudio, or VS Code, or to create a job that runs a particular piece of code. About Domino Data Lab: Domino Data Lab is the system of record for enterprise data science teams.
In our previous blog posts in this series, we talked about how to ingest data from different sources into GraphDB, validate it, and infer new knowledge from the extant facts, as well as how to adapt and scale our basic solution. With Ontotext’s products, LAZY is firmly on the way toward finding a solution.
While that may involve one-off projects, more typically data science teams seek to identify key data assets that can be turned into data pipelines that feed maintainable tools and solutions. Examples include credit card fraud monitoring solutions used by banks, or tools used to optimize the placement of wind turbines in wind farms.
Today, Blackstone’s data analytics stack includes Fivetran, Snowflake, Sigma, Jupyter Notebooks, Alation, and various other tools. Pologruto solved this problem by giving people the option to analyze data in the tool of their choice: Jupyter Notebooks for data scientists and more technical teams, and Sigma for business teams.
Founded in 2016 by the creator of Apache Zeppelin, Zepl provides a self-service data science notebook solution for advanced data scientists to do exploratory, code-centric work in Python, R, and Scala. It supports both Zeppelin and Jupyter notebooks, new or imported. The perfect complement: DataRobot + Zepl.
- A Jupyter notebook to run using Amazon EMR Studio with Amazon EMR on an EC2 cluster
- A PySpark script to run using Amazon EMR Studio and Amazon EMR Serverless
After the stack creation is complete, choose the stack name redshift-spark and navigate to the Outputs tab. We use these output values later in this post.
This example provides a solution for enterprises looking to enhance their AI capabilities. Solution overview: The following diagram illustrates the solution architecture. In an actual solution, you would encapsulate the code in classes and pass the values where needed; use these scripts as examples to pull from.
Solution overview: In the following sections, we first introduce the Common Crawl dataset and how to explore and filter the data we need. SageMaker JumpStart provides a set of solutions for the most common use cases that can be deployed with just a few clicks. The following diagram illustrates the architecture of this solution.
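Exploring a single Common Crawl WARC file from Python often uses the warcio library, as in the sketch below; the file path is a placeholder (real paths come from the crawl's warc.paths listings), and the filtering step is left schematic.

    import requests
    from warcio.archiveiterator import ArchiveIterator

    # Placeholder URL; substitute a real path from the crawl's warc.paths file.
    url = "https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/.../example.warc.gz"
    with requests.get(url, stream=True) as r:
        for record in ArchiveIterator(r.raw):
            if record.rec_type == "response":
                html = record.content_stream().read()
                # ...filter records here, e.g. by language or content length...
                break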
“Then the data is consumed by SaaS-based computational tools, but it still sits within our organization and sits within the controls of our cloud-based solutions.” From a language perspective, scientists use Python and Jupyter Notebooks. Much of Regeneron’s data, of course, is confidential.
Solution overview: Our solution demonstrates how financial analysts can use generative artificial intelligence (AI) to adapt their investment recommendations based on financial reports and earnings transcripts, applying Retrieval Augmented Generation (RAG) so that the LLMs generate factually grounded content. This makes RAG well suited to situations where the facts could evolve over time.
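The RAG flow itself reduces to a retrieve-then-generate loop, sketched below at a high level; every function here is a hypothetical stand-in, not the post's implementation.

    # High-level RAG sketch: retriever and llm are hypothetical components.
    def answer_with_rag(question, retriever, llm):
        # 1. Retrieve relevant passages from the reports/transcripts index.
        passages = retriever.search(question, top_k=5)
        # 2. Ground the model: it answers only from the retrieved context, so
        #    updating the documents changes the answers without retraining.
        prompt = ("Answer using only this context:\n"
                  + "\n".join(passages)
                  + f"\n\nQuestion: {question}")
        return llm.generate(prompt)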
You can use Eclipse technology, Jupyter, and Zeppelin. In the data science world, Jupyter is very common. Apache Spark is an open source software solution that allows you to work on data held in memory, and we can use it with Jupyter notebooks as well. So, let’s take a look!
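A minimal sketch of Spark's in-memory processing from a Jupyter notebook follows; the CSV path is a placeholder.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("demo").getOrCreate()
    df = spark.read.csv("s3://my-bucket/data.csv", header=True, inferSchema=True)
    df.cache()          # keep the dataset in memory across actions
    print(df.count())   # the first action materializes the cache
    df.groupBy("category").count().show()  # reuses the cached, in-memory data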
Your “simple” pipeline involves a toolchain that features Fivetran, dbt, SQL, a Jupyter notebook, and Tableau. While quite valuable, these solutions all produce lagging indicators. For more information about DataKitchen's Observability solution, contact us!
If we can crack the nut of enabling a wider workforce to build AI solutions, we can start to realize the promise of data science. In the future it will be as easy to train on hundreds of GPUs as it is to train in a Jupyter notebook in a managed workspace. Important parts of these foundational layers are Keras Tuner and AutoKeras.
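As a taste of what Keras Tuner automates, here is a minimal random-search sketch; x_train and y_train are assumed to exist, and the model shape is illustrative.

    import keras
    import keras_tuner as kt

    def build_model(hp):
        # The tuner samples the hidden-layer width from this search space.
        model = keras.Sequential([
            keras.Input(shape=(784,)),
            keras.layers.Dense(hp.Int("units", 32, 256, step=32), activation="relu"),
            keras.layers.Dense(10, activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model

    tuner = kt.RandomSearch(build_model, objective="val_accuracy", max_trials=5)
    tuner.search(x_train, y_train, validation_split=0.2, epochs=3)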
By observing and analyzing data, we can develop more accurate theories and formulate more effective solutions. But sending Jupyter notebooks or Python scripts to non-technical people will undoubtedly cause problems, which prevents these tools from being widely adopted in the traditional business units of large enterprises.
This post proposes a solution to this challenge by introducing the Batch Processing Gateway (BPG), a centralized gateway that automates job management and routing in multi-cluster environments. Solution overview: Martin Fowler describes a gateway as an object that encapsulates access to an external system or resource.
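A toy illustration of that gateway idea, in the spirit of BPG, might look like the sketch below: callers talk to one object, which hides the routing across clusters. Class and method names are illustrative, not BPG's actual API.

    class BatchProcessingGateway:
        def __init__(self, clusters):
            # e.g. {"emr-east": client1, "emr-west": client2}; clients are hypothetical.
            self.clusters = clusters

        def submit(self, job):
            # Callers never touch cluster clients directly; access is encapsulated here.
            cluster = self.pick_cluster(job)
            return self.clusters[cluster].submit_job(job)

        def pick_cluster(self, job):
            # The routing policy lives in one place instead of in every caller;
            # here, a simple least-loaded choice.
            return min(self.clusters, key=lambda name: self.clusters[name].queue_depth())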
Solution overview: The solution is designed to help you track the cost of your Spark applications running on EMR on EC2. It uses a scheduled AWS Lambda function that runs on a daily basis. Note that using these AWS services incurs additional costs for implementing this solution.
We do not and cannot have a “one size fits all” solution for data science training. NASA persistently misspells Jupyter. In terms of teaching and learning data science, Project Jupyter is probably the biggest news over the past decade – even though Jupyter’s origins go back to 2001! Translated: MOOCs are no panacea.
Through this series of blog posts, we’ll discuss how to best scale and branch out an analytics solution using a knowledge graph technology stack. Our main weapons when beating that beast will be GraphDB, Ontotext Platform, Kafka, Elasticsearch, Kibana, and Jupyter.
Often it is easy to collect data manually; that is, downloading it from a website and cleaning it up by hand in Excel, Jupyter Notebook, or RStudio. Install the packages that we will use for this chapter: pandas and Jupyter. To inspect the data, start a Jupyter notebook using the command: jupyter notebook.
No-code and low-code solutions for time series data exploration: IBM introduced Downer to the realm where no-code and low-code solutions could build predictive models to provide faster insights. Speed and scalability: Downer efficiently adapted models to diverse data signals using tools like the Jupyter Notebook.
Even if your company is already using these tools, they are often used through Jupyter Notebooks or RStudio. This means that every exercise becomes a complex data engineering challenge, and even when the work is done, the results remain removed from your visualization and reporting solutions.