When EC2 Hardware Changes Underneath You…

Comments:"When EC2 Hardware Changes Underneath You… | PiCloud Blog"

URL:http://blog.picloud.com/2013/01/08/when-ec2-hardware-changes-underneath-you/

At PiCloud, we’ve accumulated over 100,000 instance requests on Amazon EC2. Our scale has exposed us to many odd behaviors and outright bugs, which we’ll be sharing in a series of blog posts to come. In this post, I’ll share one of the strangest we’ve seen.

The Bug

It started with a customer filing a support ticket about code that had been working flawlessly for months suddenly crashing. Some, but not all, of his jobs were failing with an error that looked something like:

Fatal Python error: Illegal instruction

File "/usr/local/lib/python2.6/dist-packages/numpy/linalg/linalg.py", line 1319 in svd

File "/usr/local/lib/python2.6/dist-packages/numpy/linalg/linalg.py", line 1546 in pinv

That’s odd, I thought. I had never before seen the Python interpreter die on an illegal instruction! Naturally, I checked the line that was crashing:

results = lapack_routine(option, m, n, a, m, s, u, m, vt, nvt, work, lwork, iwork, 0)

A call into numpy’s compiled C extension, lapack_lite. Great, the normally robust numpy was crashing.

More surprising was that only a minority of jobs were failing, even though the customer indicated that all jobs were executing the problematic line. We did notice that the failures were confined to just a few servers, and that those servers ran none of the customer’s jobs successfully. Unfortunately, our automated scaling systems had already torn those servers down.

Debugging

The first thing I did was Google the error. Most results were unhelpful, but one old, since-fixed bug in Intel’s Math Kernel Library (MKL) seemed notable. MKL would crash with an illegal instruction error when AVX (Advanced Vector Extensions, a 2011 extension to x86) instructions were executed on CPUs that lacked support for them. Why notable? We compile our numpy and scipy libraries with MKL support to give the best possible multi-threading performance, especially on the hyperthreading- and AVX-capable f2 core.

Still, why did only a few servers crash? With not much to go on, I launched a hundred High-Memory m2.xlarge EC2 instances (200 m2 cores in PiCloud nomenclature) and reran all of the user’s jobs across the nodes. A few jobs, all on the same server, failed.

As I compared the troublesome instance to the sane ones, one difference stood out. The correctly operating m2.xlarge instances were running 2009-era Intel Xeon X5550 CPUs, but the troublesome instance was running a more modern (2012) Xeon E5-2665 CPU. And, returning to the MKL bug noted earlier, this new chip supported AVX.

Examining /proc/cpuinfo showed as much: AVX was advertised on the failing instance, but not on the older ones. To test it out, I compiled some code from stackoverflow with ‘g++ -mavx’. Sure enough, running the binary on the failing instance produced an Illegal Instruction.
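The comparison between instances can be sketched in a few lines of Python. This is only an illustration, not the tooling we actually used; the function names summarize_cpuinfo and summarize_local are made up for this example, and it assumes the usual Linux /proc/cpuinfo layout of "key : value" lines.

```python
def summarize_cpuinfo(text):
    """Return (model name, set of flags) for the first processor entry.

    `text` is the raw contents of /proc/cpuinfo (assumed "key : value" lines).
    """
    model, flags = None, set()
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator line between processor entries
        key = key.strip()
        if key == "model name" and model is None:
            model = value.strip()
        elif key == "flags" and not flags:
            flags = set(value.split())
    return model, flags


def summarize_local(path="/proc/cpuinfo"):
    """Hypothetical convenience wrapper: summarize the machine we are running on."""
    with open(path) as f:
        return summarize_cpuinfo(f.read())
```

Calling summarize_local() on each instance and checking whether "avx" is in the returned flag set is enough to tell the two CPU generations apart.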

From my perspective as an instance user, the processor was lying, claiming to support AVX but actually crashing when any AVX code would run.

Analysis

Turns out the actual answer was subtle. Per the Intel manual, it is possible for the operating system to disable AVX instructions by disabling the processor’s OSXSAVE feature. By the spec, any application wishing to use AVX first must check if OSXSAVE is enabled.
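The detection order the manual describes can be sketched as a small decision function. This is a simplified illustration, not MKL’s actual logic: the bit positions (CPUID leaf 1, ECX bits 27 and 28, and XCR0 bits 1 and 2) come from the Intel manual, but the function avx_usable and its read_xcr0 callback are hypothetical, since Python cannot execute CPUID or XGETBV directly.

```python
# Bit positions from the Intel manual: CPUID leaf 1, ECX register.
CPUID_ECX_OSXSAVE = 1 << 27  # the OS has enabled XSAVE/XGETBV
CPUID_ECX_AVX = 1 << 28      # the CPU implements AVX
# XCR0 bits 1 and 2: the OS saves and restores XMM and YMM state.
XCR0_SSE_AVX_STATE = 0b110


def avx_usable(cpuid1_ecx, read_xcr0):
    """Decide whether AVX instructions are safe to execute.

    `cpuid1_ecx` is the ECX value returned by CPUID leaf 1.
    `read_xcr0` is a callable standing in for XGETBV; it is only invoked
    once OSXSAVE is known to be set, because XGETBV itself faults otherwise.
    """
    if not cpuid1_ecx & CPUID_ECX_OSXSAVE:
        return False
    if not cpuid1_ecx & CPUID_ECX_AVX:
        return False
    return (read_xcr0() & XCR0_SSE_AVX_STATE) == XCR0_SSE_AVX_STATE
```

The point of the ordering is that the AVX bit alone says nothing about whether the operating system (or hypervisor) will preserve the wider registers; code that skips the OSXSAVE and XCR0 checks can be fooled by hardware that is newer than its environment.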

Amazon seems to have disabled the OSXSAVE feature at the hypervisor layer on their new Xeon E5-2665 based m2.* series of instances. This may just be because their version of the Xen hypervisor that manages these instances lacks support for handling AVX registers in context switching. But even if support does exist in the hypervisor, it makes sense to disable AVX for the m2.* family as long as there are Xeon X5550 based instances. Imagine compiling a program on an m2.xlarge EBS instance, thinking you had AVX support, and then upon stopping/starting the instance, finding that the program crashes, because your instance now runs on older hardware that doesn’t have AVX support! A downside of VM migration is that all your hardware must advertise the least common denominator of capabilities.

Unfortunately, Amazon did not ensure that the Guest OS saw that OSXSAVE was disabled. This led to MKL thinking it had the capabilities to run AVX code, when it actually didn’t.

Ultimately, there was not much to do but:

Given how rare the Xeon E5-2665 instances are, we now simply self-destruct if an m2.* instance’s /proc/cpuinfo claims that both avx and xsave are enabled.

File a support case with Amazon. They have been quite responsive and, as I publish this post, it seems that a fix has at least been partially pushed.

So, if you use instances in the m2.* family, be sure to check /proc/cpuinfo. If the instance claims it has both avx and xsave, it is probably lying to you.
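A minimal sketch of such a check follows. It is not our production code: the function names are invented for this example, the instance type is assumed to be supplied by the caller (for instance from the EC2 metadata service), and what to do with a suspect instance is left to you.

```python
def claims_avx_and_xsave(cpuinfo_text):
    """True if any processor entry in /proc/cpuinfo lists both avx and xsave."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            if {"avx", "xsave"} <= set(value.split()):
                return True
    return False


def is_suspect_instance(instance_type, cpuinfo_text):
    """Flag m2.* instances whose CPU flags advertise both avx and xsave.

    `instance_type` is a string such as "m2.xlarge" (assumed to come from
    the caller); other instance families are never flagged by this check.
    """
    return instance_type.startswith("m2.") and claims_avx_and_xsave(cpuinfo_text)
```

On a running instance you would pass in the contents of /proc/cpuinfo, for example is_suspect_instance("m2.xlarge", open("/proc/cpuinfo").read()), and retire the machine if it returns True.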

Alternatively, if you are doing high performance computation in the cloud, you may just want to pass on the responsibility for such dirty details to us at PiCloud.

Tags: avx, ec2 bugs

Categories: Battle Stories
