10.11.17

OracleVM 3.3.4 and Data Corruption When Cloning VM from Template

We ran into a serious problem when creating a virtual server from a template stored on an NFS mount, with the target repository on iSCSI storage.

For some reason, OracleVM 3.3.4 (kernel 3.8.13) corrupted the image while cloning the virtual server from the template. After the clone operation everything looked fine from the OracleVM Manager point of view, but when starting the server it failed with an error stating that there was no bootable operating system.

During the cloning operation, a huge number of the following errors appeared in /var/log/messages. The errors were the same even after I switched to a different utility server, so this is not a hardware issue:

Nov  9 18:35:17 vs9 kernel: sd 5:0:0:0: [sdd] CDB:
Nov  9 18:35:17 vs9 kernel: Write(10): 2a 00 26 bf f0 00 00 0a 00 00
Nov  9 18:35:17 vs9 kernel: sd 5:0:0:0: [sdd] Invalid command failure
Nov  9 18:35:17 vs9 kernel: sd 5:0:0:0: [sdd]
Nov  9 18:35:17 vs9 kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Nov  9 18:35:17 vs9 kernel: sd 5:0:0:0: [sdd]
Nov  9 18:35:17 vs9 kernel: Sense Key : Illegal Request [current]
Nov  9 18:35:17 vs9 kernel: sd 5:0:0:0: [sdd]
Nov  9 18:35:17 vs9 kernel: Add. Sense: Invalid field in cdb
Nov  9 18:35:17 vs9 kernel: sd 5:0:0:0: [sdd] CDB:
Nov  9 18:35:17 vs9 kernel: Write(10): 2a 00 26 bf fa 00 00 0a 00 00
Nov  9 18:35:17 vs9 kernel: JBD2: Detected IO errors while flushing file data on dm-3-617

While searching for an explanation, it looks like this is an issue with using iSCSI and jumbo frames on certain 3.9 kernel versions. It could be that the OVM 3.8 kernel is affected as well.

What makes this particularly nasty is that we’ve made several copies of virtual servers for backup purposes, and there is no guarantee that those copies are valid and functional any more.
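To make the rejected commands in the log excerpt easier to read, here is a minimal Python sketch that decodes the two Write(10) CDBs shown there. It relies only on the standard WRITE(10) layout (opcode 0x2a in byte 0, a 4-byte big-endian logical block address in bytes 2-5, a 2-byte transfer length in bytes 7-8). The function name decode_write10 is my own and is not part of any tool.

```python
def decode_write10(cdb_hex):
    """Decode a SCSI WRITE(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    return {
        # Bytes 2-5: logical block address, big-endian
        "lba": int.from_bytes(cdb[2:6], "big"),
        # Bytes 7-8: transfer length in logical blocks, big-endian
        "blocks": int.from_bytes(cdb[7:9], "big"),
    }


if __name__ == "__main__":
    # The two CDBs copied from the log excerpt
    for cdb_hex in ("2a 00 26 bf f0 00 00 0a 00 00",
                    "2a 00 26 bf fa 00 00 0a 00 00"):
        print(cdb_hex, "->", decode_write10(cdb_hex))
```

Both commands ask to write 2560 blocks, and the second one starts exactly 2560 blocks after the first, so these are consecutive large sequential writes from the clone operation. Assuming 512-byte logical blocks (the log does not show the block size), that is 1.25 MiB per request.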

Possible Solution

To test a fix, I decided to upgrade the whole OracleVM environment to the latest OVM 3.4.4, which uses kernel 4.1.

After upgrading all the pools and OVM Manager to 3.4.4, it looks like we got rid of this nasty behaviour.

I tried cloning in exactly the same way, using the same servers: no errors, and the cloned virtual server works just fine.

Recommendation

I strongly recommend upgrading to OracleVM 3.4.x as soon as possible if you are using 3.3.x, iSCSI and jumbo frames AND you are seeing these errors.
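If you are not sure whether jumbo frames are in use on a server, the following Python sketch lists the interfaces whose MTU is above the standard Ethernet value of 1500. It reads the per-interface mtu files that Linux exposes under /sys/class/net. The function names and the threshold parameter are my own choices for illustration, and I have not run this on an OracleVM 3.3.4 server specifically.

```python
from pathlib import Path


def interface_mtus(sysfs_net="/sys/class/net"):
    """Return {interface name: MTU} read from the Linux sysfs tree."""
    mtus = {}
    for mtu_file in Path(sysfs_net).glob("*/mtu"):
        try:
            mtus[mtu_file.parent.name] = int(mtu_file.read_text().strip())
        except (OSError, ValueError):
            # Skip interfaces whose MTU cannot be read or parsed
            continue
    return mtus


def jumbo_interfaces(mtus, threshold=1500):
    """Names of interfaces with an MTU above the threshold, loopback excluded."""
    return sorted(name for name, mtu in mtus.items()
                  if mtu > threshold and name != "lo")


if __name__ == "__main__":
    for name in jumbo_interfaces(interface_mtus()):
        print("jumbo frames enabled on", name)
```

The loopback interface is skipped because it normally reports a very large MTU that has nothing to do with the storage network.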

Check your OracleVM servers: if you see any of these errors in /var/log/messages, you might have data corruption issues in the images.
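A quick way to run that check is to scan the log for the message texts from the excerpt earlier in this post. Below is a minimal Python sketch, assuming the log is at /var/log/messages (or a path given as the first argument) and that the messages appear with exactly the wording shown above. The names SIGNATURES and scan_lines are my own.

```python
import re
import sys

# Message fragments copied from the log excerpt in this post
SIGNATURES = (
    "Invalid command failure",
    "Sense Key : Illegal Request",
    "Add. Sense: Invalid field in cdb",
    "Detected IO errors while flushing file data",
)

# Matches the device tag the kernel prints, e.g. "[sdd]"
DEVICE_RE = re.compile(r"\[(sd[a-z]+)\]")


def scan_lines(lines):
    """Count signature hits and collect the sd devices named on those lines."""
    counts = {sig: 0 for sig in SIGNATURES}
    devices = set()
    for line in lines:
        for sig in SIGNATURES:
            if sig in line:
                counts[sig] += 1
                match = DEVICE_RE.search(line)
                if match:
                    devices.add(match.group(1))
    return counts, devices


if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/messages"
    try:
        with open(path, errors="replace") as log:
            counts, devices = scan_lines(log)
    except OSError as exc:
        print("cannot read", path, "-", exc)
    else:
        for sig, hits in counts.items():
            print(hits, "x", sig)
        print("devices:", ", ".join(sorted(devices)) or "none")
```

A non-zero count only tells you the errors occurred on that server. It does not tell you which images were damaged, so any clone made while the errors were being logged should be treated as suspect.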
