Using Amazon SageMaker Debugger on DeepRacer stack — Part 1
Richard Fan
My initial goal was to try out Amazon SageMaker Debugger and see if I could get some useful information beyond what the DeepRacer stack already provides.
However, after much trial and error, I found that it is not as easy as AWS's sample code suggests. Still, I think my journey is a good example of how to make SageMaker Debugger work in customised environments.
What is Amazon SageMaker Debugger
Amazon SageMaker Debugger is a tool for debugging ML training. It does much of the heavy lifting for us, such as collecting data, monitoring the training process, and detecting abnormal behaviour.
How does Amazon SageMaker Debugger work?
Amazon SageMaker Debugger consists of 2 parts: Collections/Hooks and Rules.
Collections/Hooks
Collections are groups of artifacts (a.k.a. tensors) generated by the training. They can be the tensors storing model losses, weights, etc.
To debug the models, we need to retrieve those artifacts. SageMaker Debugger therefore uses hooks to emit the tensors from the SageMaker container to other storage (most commonly, S3).
Rules
Besides the training job itself, SageMaker will spin up another processing job if you choose to use SageMaker Debugger for that training.
The processing job reads the artifacts emitted by the hooks and runs the analysis based on the rule definition. It can check whether losses are not decreasing, detect vanishing gradients, or apply whatever rule you can think of based on the artifacts you have.
Built-in hook
To make things easier, the hook is already included inside certain versions of the AWS Deep Learning Containers. If we use those container versions for training, we can simply add the rules or debugger_hook_config parameters and the training job will automatically emit the tensors.
from sagemaker.debugger import Rule
from sagemaker.tensorflow import TensorFlow
from smdebug_rulesconfig import vanishing_gradient

estimator = TensorFlow(
    role=role,
    train_instance_count=1,
    train_instance_type=train_instance_type,
    rules=[Rule.sagemaker(vanishing_gradient())]
)
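The debugger_hook_config parameter works in a similar way; here is a minimal sketch of how it could look (the bucket path and collection settings are placeholders, not taken from the DeepRacer stack):
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig

# Control what the hook emits and where the tensors are stored
debugger_hook_config = DebuggerHookConfig(
    s3_output_path="s3://my-bucket/debug-output",  # placeholder bucket
    collection_configs=[
        CollectionConfig(name="losses", parameters={"save_interval": "50"})
    ]
)
It is then passed to the estimator as debugger_hook_config=debugger_hook_config, just like rules.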

Image from Amazon SageMaker Debugger documentation
DeepRacer training stack
DeepRacer is an autonomous racing car trained using reinforcement learning.
In short, it does not use TensorFlow, MXNet, or PyTorch directly. Instead, it uses the RL Coach framework on top of TensorFlow to do the training.

Image from AWS DeepRacer documentation
Plugging debugger hook into the stack
Because DeepRacer training does not use TensorFlow directly, we cannot rely on the built-in hook of the SageMaker container to get training artifacts out of the training job. Instead, we need to change the DeepRacer training code.
smdebug
smdebug is a Python library that does the hook job we have discussed.
Think of it as an agent: if we want to get information out of a black box (in this case, the training environment), we need to install an agent inside it and, at some point, invoke that agent to send the information out.
For those “unofficial” training environments (including RL Coach, which DeepRacer uses), we can plug this library in and put the hook into the code in order to use SageMaker Debugger.
Deep Dive into the code
Thanks to the efforts of the AWS DeepRacer community, especially Matt and Larsll, we already have a robust stack for running customised DeepRacer training.
In my experiment, I am using:
The SageMaker container is the environment in which the training runs. In order to use smdebug, I rebuilt the image with TensorFlow 1.15 instead of the original 1.13.
Changed line 10:
Changed line 49–59:
-t $PREFIX/sagemaker-tensorflow-container:cpu \
-f ../1.15.0/py3/Dockerfile.cpu \
--build-arg py_version=3 \
--build-arg framework_support_installable='sagemaker_tensorflow_*.tar.gz' \
--build-arg TF_URL=$TF_PATH
To use the smdebug library, we also need to install it into the container image through pip.
Dockerfile (added this line):
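It is essentially a single pip install (possibly with a pinned version in the actual Dockerfile):
RUN pip install smdebug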
After building the training environment, we need to modify our training code to plug the hook in.
According to the smdebug documentation, there are several steps to put the hook inside our training code (see the sketch after this list):
- Create a hook
- Register the hook to your model
- Wrap the optimizer
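For a plain TensorFlow 1.x training script, these steps roughly translate into the following sketch (the model, out_dir, and optimizer here are placeholders to keep the example self-contained, not the DeepRacer code):
import tensorflow as tf
import smdebug.tensorflow as smd

# A trivial model, just to make the sketch self-contained
x = tf.Variable(5.0)
loss = tf.square(x - 1.0)

# 1. Create a hook (outside SageMaker we pass an output directory explicitly;
#    inside SageMaker, create_from_json_file() reads it from the job config)
hook = smd.SessionHook(out_dir="/tmp/smdebug", save_config=smd.SaveConfig(save_interval=50))
hook.set_mode(smd.modes.TRAIN)

# 2. Wrap the optimizer so gradients can be captured
optimizer = hook.wrap_optimizer(tf.train.GradientDescentOptimizer(0.1))
train_op = optimizer.minimize(loss)

# 3. Register the hook to the model by passing it to the (monitored) session
with tf.train.MonitoredTrainingSession(hooks=[hook]) as sess:
    for _ in range(100):
        sess.run(train_op)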
The RL Coach framework uses the following hierarchy to control the training:
Graph Manager > Level Manager > Agent
In the Level Manager layer, a Session is created for each level, so I created a SessionHook inside the Graph Manager and changed the session creation code to inject the smdebug hook:
Initialize smdebug hook (Insert into line 120):
# requires: import smdebug.tensorflow as smd (at the top of the file)
self.smdebug_hook = smd.SessionHook.create_from_json_file()
self.smdebug_hook.set_mode(smd.modes.TRAIN)
Use MonitoredTrainingSession instead of Session (Changed line 226):
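Conceptually, the change is just swapping tf.Session for tf.train.MonitoredTrainingSession and passing the hook in; a sketch of what it looks like (the actual variable names in the RL Coach code may differ):
# Before (roughly): self.sess = tf.Session(config=config)
self.sess = tf.train.MonitoredTrainingSession(config=config, hooks=[self.smdebug_hook])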
Run the training
To run my training, I used Amazon SageMaker Studio and cloned the AWS workshop example.
To make use of my modified SageMaker container image and training code, I have made a few changes:
1. Replaced the src directory with my modified training code
2. Pushed my own container image to ECR, then changed the custom_image_name variable
3. Added a code block to declare the SageMaker Debugger rules to use:
from sagemaker.debugger import Rule, CollectionConfig, rule_configs

rules = [
    Rule.sagemaker(
        base_config=rule_configs.loss_not_decreasing(),
        collections_to_save=[
            CollectionConfig(
                name="losses",
                parameters={
                    "train.save_interval": "50"
                }
            )
        ]
    )
]
4. Put the rules into the RLEstimator:
estimator = RLEstimator(
    source_dir='src',
    ...
    rules=rules,
)
Result
After running the code, we will spin up a SageMaker training job.
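Because we attached the loss_not_decreasing rule, SageMaker also starts a separate rule evaluation job alongside the training. We can check its status from the estimator (a minimal sketch using the SageMaker Python SDK; the summary keys may differ slightly between SDK versions):
# Query the status of each attached Debugger rule evaluation job
for summary in estimator.latest_training_job.rule_job_summary():
    print(summary["RuleConfigurationName"], "-", summary["RuleEvaluationStatus"])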
If we go to the output S3 bucket, we will see a new directory called debug-output; this is where SageMaker puts the emitted information. If we click into the collections directory, we will see a collections JSON file.

debug-output stores the information collected by SageMaker Debugger
From this file, we can see how much information SageMaker Debugger has collected from the training.

SageMaker Debugger has collected the loss tensors
From the file I got, we can see that SageMaker Debugger has collected the loss of the PPO agent network.
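We can also list the collected tensors programmatically with the smdebug library (a small sketch; create_trial is used again in the next section):
from smdebug.trials import create_trial

trial = create_trial(estimator.latest_job_debugger_artifacts_path())
print(trial.tensor_names())  # should include the PPO loss tensors seen in the JSON file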
Plot graph
We can also use the smdebug library to create a trial and plot a graph to see the trend of the values we got:
import matplotlib.pyplot as plt
import seaborn as sns
from smdebug.trials import create_trial

s3_output_path = estimator.latest_job_debugger_artifacts_path()
trial = create_trial(s3_output_path)
fig, ax = plt.subplots(figsize=figsize)  # figsize and tensors are defined elsewhere in the notebook
sns.despine()
for tensor_name in tensors:
    steps, data = get_data(trial, tensor_name)
    ax.plot(steps, data, label=tensor_name)
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
ax.set_xlabel('Iteration')
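The get_data helper is not part of smdebug itself; a possible implementation based on the smdebug trial API looks like this:
def get_data(trial, tensor_name):
    # Collect (step, value) pairs for one tensor from the smdebug trial
    tensor = trial.tensor(tensor_name)
    steps = tensor.steps()
    values = [tensor.value(step) for step in steps]
    return steps, values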

Plotting a graph to see the trend of training
Code repository I’m working on: https://github.com/richardfan1126/deepracer-experiment
What’s next
In this blog post, I haven't really made use of the SageMaker Debugger rules, and the information I collected is not very useful on its own yet.
However, plugging SageMaker Debugger into a training stack as complex as DeepRacer's was a good chance for me to learn both of them.
My next step is to dive even deeper into the agent layer (the bottom-most layer of the RL Coach framework) and try to generate more customised information, so that using the Debugger makes more sense.