Troubleshooting the fast.ai AWS setup scripts (setup_p2.sh)

Kevin Firko
Published:

Fast.ai offers a well-regarded free online course on Deep Learning that I thought I’d check out.

It seems that a lot of people struggle getting the fast.ai setup scripts running. Complaints and requests for help are on reddit, in forums, etc. This doesn’t surprise me because the scripts are not very robust. On top of that, AWS has a learning curve so troubleshooting following a script failure can be a challenge.

Hopefully this post helps other people that have hit snags. It is based on my experience on MacOS, however should be very compatible for those running Linux or Windows with Cygwin.

Understanding the setup script’s behaviour

It leaves a mess when it fails

If running the setup script fails, which is possible for a number of reasons, it will potentially have created a number of AWS resources in your account and a local copy of an SSH key at ~/.ssh/aws-key-fast-ai.pem. It does not clean up after itself in failure cases.

The setup script doesn’t check for existing fast-ai tagged infrastructure, so subsequent runs can create additional VPC’s and related resources on AWS, especially as you attempt to resolve the reason(s) it failed. The setup script might generate fast-ai-remove.sh and fast-ai-commands.txt but it overwrites these each time its run with only its current values, potentially leaving “orphan” infrastructure.

Thankfully all AWS resources are created with the same “fast-ai” tags so they are easy to spot within the AWS Console.

It makes unrealistic assumptions

The setup script assumes your aws config’s defaults specify a default region in one of its three supported regions: us-west-2, eu-west-1, and us-east-1.

I’m not sure why the authors assumed that a global tech crowd interested machine learning would be unlikely to have worked with AWS in the past and thus no existing aws configuration that might conflict.

The commands in the script do not use the --region argument to specify an explicit region so they will use whatever your default is. If your default happens to be one of the three supported ones, but you don’t have a sufficient InstanceLimit or there’s another problem, more issues could follow.

Troubleshooting

If you encountered an error after running the script, prior to re-running the script, take note of the following checks when attempting to resolve:

Check 1: Ensure you have an InstanceLimit > 0

Most AWS users will have a default InstanceLimit of 0 on P2 instances. You may need to apply for an increase and get it approved (this is covered in the fast.ai setup video).

If a first run of the script gave you something like the following, there was an issue with your InstanceLimit:

Error: *An error occurred (InstanceLimitExceeded) when calling the RunInstances operation: You have requested more instances (1) than your current instance limit of 0 allows for the specified instance type. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.*

InstantLimits are specific to a given resource in a given region. Take note of which region your InstanceLimit increase request was for and verify that it was granted in the same region.

Check 2: Ensure the right region

Verify your current default aws region by running: aws configure get region. The script assumes this is one of three supported regions: us-west-2, eu-west-1, or us-east-1.

The script also assumes that you have an InstanceLimit > 0 for P2 instances in whichever region you would like to use (or T2 instances if you are using setup_t2.sh).

To get things running quickly, I personally found it easiest to make the script happy and temporarily set my aws default to a supported region in ~/.aws/config, i.e.:

[default]
region=us-west-2

Another option is to modify the scripts and add an explicit --region argument to every aws command that will override the default region. If you have multiple aws profiles defined as named profiles, and the profile that you wish to use for fast.ai specifies a default region, you can use the --profile PROFILENAME argument instead.

For example, the following hypothetical aws config file (~/.aws/config) specifies a profile called “fastai”. A --profile fastai argument could then be added to every aws command in the setup script:

[default]
region=ca-central-1

[profile fastai]
region=us-west-2

Check 3: Delete cruft from previous failed runs

This check is what inspired me to write this post!

Delete AWS resources

Review any resources were created in your AWS Console, and delete any VPC’s (and any dependencies) that were spun up. They can be identified because they were created with the “fast-ai” tag which is shown in any tables of resources in the AWS Console.

Cruft resources will have been created in any region that the setup script was working with (i.e. whatever your default region was at the time you ran it).

If you’ve found cruft, start by trying to delete the VPC itself, as this generally will delete most if not all dependencies. If this fails because of a dependency issue, you will need to find and delete those dependencies first.

IMPORTANT: AWS creates a default VPC and related dependencies (subnets, etc.) in every region available to your account. Do NOT delete any region’s default VPC. Only delete resources tagged with “fast-ai”.

Delete SSH keys

Check to see if ~/.ssh/aws-key-fast-ai.pem was created, and if so, delete it before running the script again.

The setup script has logic that checks for this pem file. We do not want the script to find the file on a fresh run.

After a successful run

After the setup script ran successfully, I got output similar to:

{
    "Return": true
}
Waiting for instance start...

All done. Find all you need to connect in the fast-ai-commands.txt file and to remove the stack call fast-ai-remove.sh
Connect to your instance: ssh -i /Users/username/.ssh/aws-key-fast-ai.pem ubuntu@ec2-XX-YY-ZZ-XXX.us-west-2.compute.amazonaws.com

Reference fast-ai-commands.txt for information about your VPC and EC2 instance. An ssh command to connect is in the file, and you can find your “InstanceUrl”.

I suggest picking up the video from here and following along from the point where you connect to your new instance. It guides you through checking the video card with the nvidia-smi command and running jupyter: http://course.fast.ai/lessons/aws.html

Starting and stopping your instance

The fast-ai-commands.txt file outlines the commands to start and stop your instance after the setup has completed successfully, e.g.:

aws ec2 start-instances --instance-ids i-0XXXX
aws ec2 stop-instances --instance-ids i-0XXXX

Its important to stop instances when you are finished using them so that you don’t get charged hourly fees for their continued operation. P2 instances run about $0.90/hr at the time of writing.