I recently started working with a new client whose hosting provider runs their Oracle database on Linux under VMware. An excellent choice, but this client is experiencing major performance issues – data for forms taking a minute or more to come up is just one example.
As I learned more about their environment I found that virtualization (VMware in this case, though the issue isn’t specific to any particular virtualization vendor) actually made their system performance worse. I know, I’m a VMware groupie (heck, a VMware vExpert!) and we’re all amazed I’d write such a thing, but alas, it’s true.
The database is around 80GB in size. Each day this hosting provider would take a full (level 0 incremental) backup of the Oracle database via RMAN. The hosting provider wrote this RMAN backup to the same mount point in the VM that the database uses.
Please take a moment to catch your breath and stop clenching your hands into fists over this very, very bad idea.
So why is this such a bad idea? For a couple of reasons.
One is performance – you’re now greatly degrading the performance of your database by writing a full backup to the same disks that are trying to handle database requests. You have, at the least, doubled the amount of I/O going to those disks.
Two is the ability to recover. If your ESX host or your VM experiences an issue (running out of disk space, disk corruption, fire, whatever), you can no longer access the mount point in the VM where you backed up the data, so you lose the backup along with the database it was supposed to protect.
Best practice for implementing RMAN in a situation like this is to back up your database to another set of disks on another machine in another physical location. A typical example is to have an NFS export on your backup destination server (in another datacenter) and have RMAN write directly to that NFS mount. This way you aren’t writing your backup to the same disks (so production performance isn’t impacted much) and you’re covered in the case of issues with the hardware or the VM itself.
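A quick sanity check for the "same mount point" mistake can be sketched in a few lines of Python. This is only a minimal sketch: the function names are made up for illustration, and it assumes you know the directory holding the datafiles and the directory RMAN writes to.

```python
import os


def same_device(path_a, path_b):
    """Return True when both paths live on the same mounted filesystem."""
    # st_dev identifies the device (filesystem) a path resides on.
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def check_backup_target(datafile_dir, backup_dir):
    """Describe whether the backup directory shares a filesystem with the datafiles."""
    if same_device(datafile_dir, backup_dir):
        return "WARNING: backups and datafiles share a filesystem"
    return "OK: backups and datafiles are on different filesystems"
```

You would call it with something like check_backup_target("/u01/oradata", "/mnt/rman_nfs"), where both paths are hypothetical. Keep in mind that a different filesystem is necessary but not sufficient: two mount points inside a VM can still sit on the same datastore (LUN), so this only catches the most obvious case.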
So where does VMware fit into this? I mentioned that the hosting provider was also performing VM-level backups. In particular, they were performing VM-level backups at the same time they were running RMAN backups. All to the same set of disks.
Now I’ve got the VMware Admins and the Oracle DBAs cringing.
When you initiate a VM-level backup, VMware takes a snapshot of the VM. This means it makes a delta file on the same ESX datastore and stops writing to the VMDK(s) that make up the VM. All changes to the VM get written to the delta file instead. That delta file can grow (8 megs at a time) until it’s the same size as the original VMDK.
When you are taking a VM-level backup, you want to choose a time when you’re not doing many writes to the VM. This way the delta file won’t grow so big that you run out of space on the datastore (LUN), and the performance impact is smaller.
So here they are, writing their full 80GB Oracle backup out to a mount point inside their VM. That’s 80GB of writes. VMware sees those writes and has to write them to its snapshot (delta file). So now not only are you serving up database queries from those disks, you’re also scanning every block of your database on those disks for changes (this database did not employ Oracle Block Change Tracking), you’re writing a full RMAN backup to those disks, and VMware is having to copy all those writes into a delta file on those same disks.
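To put rough numbers on the snapshot growth, here is a back-of-the-envelope sketch in Python. It uses the figures from this post (80GB of backup writes, delta growth of 8 megs at a time) and assumes the worst case, where every write lands on a block the delta file hasn't seen yet. The function names are hypothetical, and the real backup may be smaller than the database if RMAN skips unused blocks.

```python
import math


def delta_growth(write_gb, increment_mb=8):
    """Return (growth steps, delta size in GB) for a given volume of new writes."""
    write_mb = write_gb * 1024
    # The delta file grows in whole increments, so round up.
    steps = math.ceil(write_mb / increment_mb)
    return steps, steps * increment_mb / 1024


def datastore_survives(free_gb, write_gb, increment_mb=8):
    """Return True when the datastore has room for the worst-case delta file."""
    _, delta_gb = delta_growth(write_gb, increment_mb)
    return delta_gb <= free_gb


steps, delta_gb = delta_growth(80)
print(steps, delta_gb)  # 10240 80.0
```

In other words, an 80GB backup written while the snapshot is open could force the delta file to grow more than ten thousand times and eat up to 80GB of datastore space, on top of the space the backup itself takes inside the VM.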
Virtualization can be wonderful and solve or simplify many of the issues an administrator faces, but it can also magnify fundamental architecture flaws.