Machine Learning with Heatwave AutoML and Lakehouse

Overview

This lab is designed to guide you through the process of loading unstructured data into a Lakehouse table using MySQL and leveraging this data for machine learning applications. We assume that you have already uploaded your unstructured data to an object storage and have a Pre-Authenticated Request (PAR) URL to access this data.

Objectives:

Prerequisites:

Part 1: Heatwave Lakehouse for your unstructured data

Building on the strengths of MySQL and the HeatWave query accelerator, the MySQL HeatWave Lakehouse allows direct querying of data across both MySQL databases and data lakes (like those on Amazon S3 or Oracle Object Storage).

This means that it can perform analytics on a combination of transactional data stored in MySQL and massive datasets stored in data lakes without moving or transforming the data. This integration effectively turns MySQL into a lakehouse, supporting both operational and analytical workloads at scale, thereby eliminating the complexity and delays involved in traditional ETL processes.

LakeHouse tables allow you to query data stored in external storage systems as if it were inside your MySQL database. They are especially useful for managing large, unstructured datasets without importing them directly into your database, offering flexibility and efficiency.

Traditionally, loading data into a high-performance analytics engine could involve several steps, including data transformation, staging, and finally loading, each possibly requiring manual intervention or scripting. heatwave_load simplifies this by providing a direct way to load data from MySQL tables into the HeatWave cluster.

The ability of MySQL HeatWave to automatically infer the schema during the data loading process is a significant feature that enhances its usability and efficiency. When you use the heatwave_load function, the system’s capability to automatically determine the schema of the data being loaded simplifies the preparation and integration of data for analysis

For more information about Heatwave Lakehouse, have a look at the Technical Overview.

To locate the data file, a PAR (Pre-Authenticated Request) URL is used to securely access the data in the object storage, ensuring that data access is controlled and secure. More information about PAR can be found here.

Below is an example of how to use heatwave_load to load the unstructed banking data we have in our bucket. All you have to set is the correct PAR_URL:

CREATE DATABASE ml_data;

USE ml_data; 

SET @db_list = '["ml_data"]';

SET @ext_tables = '[
{
  "db_name": "ml_data",
  "tables": [{
    "table_name": "bank_marketing",
    "dialect": {
        "format": "csv",
         "field_delimiter": ";",
          "record_delimiter": "\\n",
          "has_header": true,
          "skip_rows": 0
      },
    "file": [{
      "par": "<PAR_URL>"
        }]
   }]
}
]';

SET @options = JSON_OBJECT('external_tables', CAST(@ext_tables AS JSON));

CALL sys.heatwave_load(@db_list, @options);

DESCRIBE ml_data.bank_marketing;

SELECT * FROM ml_data.bank_marketing limit 10;

Part 2: Heatwave AutoML to train your Machine Learning model

Once the data is loaded, you can use Heatwave AutoML to train a model on your loaded data.

HeatWave AutoML is deeply integrated with MySQL HeatWave, allowing users to leverage their existing database infrastructure for machine learning tasks without the need for external tools or platforms.

This integration means that users can easily execute machine learning workflows directly on their operational data stored in MySQL databases, benefiting from HeatWave’s high-performance analytics capabilities.

Heatwave AutoML fully automated the entire cycle of data processing and Machine Learning model deployment:

SET @model_bank = "model_bank";

-- Train the model
CALL sys.ML_TRAIN('ml_data.bank_marketing', 'y', JSON_OBJECT('task', 'classification'), @model_bank);

-- Describe the Modal catalog
DESCRIBE ML_SCHEMA_admin.MODEL_CATALOG ;

-- Check the model metadata 
SELECT JSON_PRETTY(model_metadata) FROM ML_SCHEMA_admin.MODEL_CATALOG WHERE model_handle=@model_bank \G

-- Load the model into HeatWave
CALL sys.ML_MODEL_LOAD(@model_bank, NULL);

-- Score the model on the test data
CALL sys.ML_SCORE('ml_data.bank_marketing', 'y', @model_bank, 'accuracy', @score_bank, null);

-- Print the score
SELECT @score_bank;

After training the model we are ready to use it to make some predictions.

CREATE TABLE bank_marketing_test
        AS SELECT * from bank_marketing LIMIT 20;
    
CALL sys.ML_PREDICT_TABLE('ml_data.bank_marketing_test', @model_bank, 
        'ml_data.bank_marketing_predictions', NULL);

SELECT * FROM ml_data.bank_marketing_predictions;

Part 3: Interpret your model predictions with explainability tools

Explainability in machine learning, including within the context of MySQL HeatWave AutoML, is a critical aspect that addresses the need to understand and trust the decisions made by machine learning models.

As machine learning models, particularly complex ones, can often act as “black boxes,” where the decision-making process is not transparent, explainability aims to make the workings of these models more understandable to humans. This is crucial for diagnosing issues, ensuring fairness, and complying with regulatory requirements, among other reasons.

The main benefits of explainability in the development of Machine Learning models are:

HeatWave AutoML supports both model explainer and prediction explainer functionalities, covering different use cases.

Model Explainer

Focus: Provides a broad overview of how the model generally makes decisions.

Insights: Highlights overall feature importance and model behavior across the dataset.

Use Case: Useful for understanding and improving the model, and for explaining its general decision-making process.

Check the Heatwave AutoML documentation for training explainers.

CALL sys.ML_EXPLAIN('ml_data.bank_marketing', 'y', @model_bank, JSON_OBJECT('prediction_explainer', 'permutation_importance'));

Prediction Explainer

Focus: Offers detailed reasons behind individual predictions made by the model.

Insights: Details the impact of each feature on a specific prediction.

Use Case: Essential for analyzing specific decisions, especially in high-stakes or personalized contexts.

Check the Heatwave AutoML documentation for prediction explanations.

CALL sys.ML_EXPLAIN_TABLE('ml_data.bank_marketing_test', @model_bank, 
        'ml_data.bank_marketing_test_explanations', 
        JSON_OBJECT('prediction_explainer', 'permutation_importance'));

SELECT JSON_PRETTY(ml_results) FROM ml_data.bank_marketing_test_explanations limit 1 \G
SELECT JSON_PRETTY(sys.ML_PREDICT_ROW(JSON_OBJECT(
    'age', bank_marketing_test.age,
    'job', bank_marketing_test.job,
    'marital', bank_marketing_test.marital,
    'education', bank_marketing_test.education,
    'default', bank_marketing_test.default,
    'balance', bank_marketing_test.balance,
    'housing', bank_marketing_test.housing,
    'loan', bank_marketing_test.loan,
    'contact', bank_marketing_test.contact,
    'day', bank_marketing_test.day,
    'month', bank_marketing_test.month,
    'duration', bank_marketing_test.duration,
    'campaign', bank_marketing_test.campaign,
    'pdays', bank_marketing_test.pdays,
    'previous', bank_marketing_test.previous,
    'poutcome', bank_marketing_test.poutcome),
     @model_bank, NULL))
    FROM bank_marketing_test LIMIT 1 \G

SELECT JSON_PRETTY(sys.ML_EXPLAIN_ROW(JSON_OBJECT(
    'age', bank_marketing_test.age,
    'job', bank_marketing_test.job,
    'marital', bank_marketing_test.marital,
    'education', bank_marketing_test.education,
    'default', bank_marketing_test.default,
    'balance', bank_marketing_test.balance,
    'housing', bank_marketing_test.housing,
    'loan', bank_marketing_test.loan,
    'contact', bank_marketing_test.contact,
    'day', bank_marketing_test.day,
    'month', bank_marketing_test.month,
    'duration', bank_marketing_test.duration,
    'campaign', bank_marketing_test.campaign,
    'pdays', bank_marketing_test.pdays,
    'previous', bank_marketing_test.previous,
    'poutcome', bank_marketing_test.poutcome),
     @model_bank, JSON_OBJECT('prediction_explainer', 'permutation_importance')))
    FROM bank_marketing_test LIMIT 1 \G