Using Amazon SageMaker Debugger on DeepRacer stack — Part 1
Richard Fan
My initial goal was to try out Amazon SageMaker Debugger and see if I could get some useful information beyond what the DeepRacer stack already provides.
However, after much trial and error, I found that it is not as easy as AWS's sample code suggests. Still, I think my journey is a good example of how to make SageMaker Debugger work in customised environments.
What is Amazon SageMaker Debugger
Amazon SageMaker Debugger is a tool for debugging ML training. It does much of the heavy lifting for us, such as collecting data, monitoring the training process, and detecting abnormal behaviour.
How does Amazon SageMaker Debugger work?
Amazon SageMaker Debugger consists of 2 parts: Collections/Hooks and Rules.
Collections/Hooks
Collections are groups of artifacts (a.k.a. tensors) generated by the training. They can be the tensors storing model losses, weights, etc.
To debug the models, we need to retrieve those artifacts. SageMaker Debugger therefore uses hooks to emit the tensors from the SageMaker container to other storage (most commonly, S3).
Rules
Besides the training job itself, SageMaker will spin up another processing job if you choose to use SageMaker Debugger for that training.
The processing job reads the artifacts emitted by the hooks and runs the analysis based on the rule definition. It can check whether losses are not decreasing, detect vanishing gradients, or apply whatever rule you can think of based on the artifacts you have.
Built-in hook
To make things easier, the hook is already included inside certain versions of the AWS Deep Learning Containers. If we use those container versions for training, we can simply add the rules or debugger_hook_config parameters and the training job will automatically emit the tensors.
from sagemaker.debugger import Rule
from sagemaker.tensorflow import TensorFlow
from smdebug_rulesconfig import vanishing_gradient

estimator = TensorFlow(
    role=role,
    train_instance_count=1,
    train_instance_type=train_instance_type,
    rules=[Rule.sagemaker(vanishing_gradient())]
)
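The debugger_hook_config parameter works in a similar way; here is a minimal sketch of how it could look (the bucket path and collection settings are placeholders, not taken from the DeepRacer stack):
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig

# Control what the hook emits and where the tensors are stored
debugger_hook_config = DebuggerHookConfig(
    s3_output_path="s3://my-bucket/debug-output",  # placeholder bucket
    collection_configs=[
        CollectionConfig(name="losses", parameters={"save_interval": "50"})
    ]
)
It is then passed to the estimator as debugger_hook_config=debugger_hook_config, just like rules.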

Image from Amazon SageMaker Debugger documentation
DeepRacer training stack
DeepRacer is an autonomous racing car trained using reinforcement learning.
In short, it does not use TensorFlow, MXNet, or PyTorch directly. Instead, it uses the RL Coach framework on top of TensorFlow to do the training.

Image from AWS DeepRacer documentation
Plugging debugger hook into the stack
Because DeepRacer training does not use TensorFlow directly, we cannot rely on the built-in hook of the SageMaker container to get training artifacts out of the training job. Instead, we need to change the DeepRacer training code.
smdebug
smdebug is a Python library that does the hook job we have discussed.
Think of it as an agent: if we want to get information out of a black box (in this case, the training environment), we need to install an agent inside it and, at some point, invoke that agent to send the information out.
For those “unofficial” training environments (including RL Coach, which DeepRacer uses), we can plug this library in and put the hook into the code in order to use SageMaker Debugger.
Deep Dive into the code
Thanks to the efforts of the AWS DeepRacer community, especially Matt and Larsll, we already have a robust stack for running customised DeepRacer training.
In my experiment, I am using:
The SageMaker container is the environment in which the training runs. In order to use smdebug, I rebuilt the image with TensorFlow 1.15 instead of the original 1.13.
Changed line 10:
Changed line 49–59:
-t $PREFIX/sagemaker-tensorflow-container:cpu \
-f ../1.15.0/py3/Dockerfile.cpu \
--build-arg py_version=3 \
--build-arg framework_support_installable='sagemaker_tensorflow_*.tar.gz' \
--build-arg TF_URL=$TF_PATH
To use the smdebug library, we also need to install it into the container image through pip.
Dockerfile (added this line):
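It is essentially a single pip install (possibly with a pinned version in the actual Dockerfile):
RUN pip install smdebug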
After building the training environment, we need to modify our training code to plug the hook in.
According to the smdebug documentation, there are several steps to put the hook inside our training code (see the sketch after this list):
- Create a hook
- Register the hook to your model
- Wrap the optimizer
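For a plain TensorFlow 1.x training script, these steps roughly translate into the following sketch (the model, out_dir, and optimizer here are placeholders to keep the example self-contained, not the DeepRacer code):
import tensorflow as tf
import smdebug.tensorflow as smd

# A trivial model, just to make the sketch self-contained
x = tf.Variable(5.0)
loss = tf.square(x - 1.0)

# 1. Create a hook (outside SageMaker we pass an output directory explicitly;
#    inside SageMaker, create_from_json_file() reads it from the job config)
hook = smd.SessionHook(out_dir="/tmp/smdebug", save_config=smd.SaveConfig(save_interval=50))
hook.set_mode(smd.modes.TRAIN)

# 2. Wrap the optimizer so gradients can be captured
optimizer = hook.wrap_optimizer(tf.train.GradientDescentOptimizer(0.1))
train_op = optimizer.minimize(loss)

# 3. Register the hook to the model by passing it to the (monitored) session
with tf.train.MonitoredTrainingSession(hooks=[hook]) as sess:
    for _ in range(100):
        sess.run(train_op)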
The RL Coach framework uses the following hierarchy to control the training:
Graph Manager > Level Manager > Agent
In the Level Manager layer, a Session is created for each level, so I created a SessionHook inside the Graph Manager and changed the session creation code to inject the smdebug hook:
Initialize smdebug hook (Insert into line 120):
# requires: import smdebug.tensorflow as smd (at the top of the file)
self.smdebug_hook = smd.SessionHook.create_from_json_file()
self.smdebug_hook.set_mode(smd.modes.TRAIN)
Use MonitoredTrainingSession instead of Session (Changed line 226):
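Conceptually, the change is just swapping tf.Session for tf.train.MonitoredTrainingSession and passing the hook in; a sketch of what it looks like (the actual variable names in the RL Coach code may differ):
# Before (roughly): self.sess = tf.Session(config=config)
self.sess = tf.train.MonitoredTrainingSession(config=config, hooks=[self.smdebug_hook])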
Run the training
To run my training, I used Amazon SageMaker Studio and cloned the AWS workshop example.
To make use of my modified SageMaker container image and training code, I have made a few changes:
1. Replaced the src directory with my modified training code
2. Pushed my own container image to ECR, then changed the custom_image_name variable
3. Added a code block to declare the SageMaker Debugger rules to use:
from sagemaker.debugger import Rule, CollectionConfig, rule_configs

rules = [
    Rule.sagemaker(
        base_config=rule_configs.loss_not_decreasing(),
        collections_to_save=[
            CollectionConfig(
                name="losses",
                parameters={
                    "train.save_interval": "50"
                }
            )
        ]
    )
]
4. Put the rules into the RLEstimator:
estimator = RLEstimator(
    source_dir='src',
    ...
    rules=rules,
)
Result
After running the code, we will spin up a SageMaker training job.
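Because we attached the loss_not_decreasing rule, SageMaker also starts a separate rule evaluation job alongside the training. We can check its status from the estimator (a minimal sketch using the SageMaker Python SDK; the summary keys may differ slightly between SDK versions):
# Query the status of each attached Debugger rule evaluation job
for summary in estimator.latest_training_job.rule_job_summary():
    print(summary["RuleConfigurationName"], "-", summary["RuleEvaluationStatus"])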
If we go to the output S3 bucket, we will see a new directory called debug-output; this is where SageMaker puts the emitted information. If we click into the collections directory, we will see a collections JSON file.

debug-output stores the information collected by SageMaker Debugger
From this file, we can see how much information SageMaker Debugger has collected from the training.

SageMaker Debugger has collected the loss tensors
From the file I got, we can see that SageMaker Debugger has collected the loss of the PPO agent network.
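We can also list the collected tensors programmatically with the smdebug library (a small sketch; create_trial is used again in the next section):
from smdebug.trials import create_trial

trial = create_trial(estimator.latest_job_debugger_artifacts_path())
print(trial.tensor_names())  # should include the PPO loss tensors seen in the JSON file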
Plot graph
We can also use the smdebug library to create a trial and plot a graph to see the trend of the values we got:
import matplotlib.pyplot as plt
import seaborn as sns
from smdebug.trials import create_trial

s3_output_path = estimator.latest_job_debugger_artifacts_path()
trial = create_trial(s3_output_path)
fig, ax = plt.subplots(figsize=figsize)  # figsize and tensors are defined elsewhere in the notebook
sns.despine()
for tensor_name in tensors:
    steps, data = get_data(trial, tensor_name)
    ax.plot(steps, data, label=tensor_name)
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
ax.set_xlabel('Iteration')
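The get_data helper is not part of smdebug itself; a possible implementation based on the smdebug trial API looks like this:
def get_data(trial, tensor_name):
    # Collect (step, value) pairs for one tensor from the smdebug trial
    tensor = trial.tensor(tensor_name)
    steps = tensor.steps()
    values = [tensor.value(step) for step in steps]
    return steps, values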

Plotting a graph to see the trend of training
Code repository I’m working on: https://github.com/richardfan1126/deepracer-experiment
What’s next
In this blog post, I haven't really made use of the SageMaker Debugger rules, and the information I collected is not very useful on its own yet.
However, plugging SageMaker Debugger into a training stack as complex as DeepRacer's was a good chance for me to learn both of them.
My next step is to dive even deeper into the agent layer (the bottom-most layer of the RL Coach framework) and try to generate more customised information, so that using the Debugger makes more sense.