Tag Archives: snapshots

Which would your CEO prefer?

Would your CEO choose to give up a negligible amount of performance at peak load in exchange for reduced risk during system upgrades and an alternative to hours of downtime?

I suspect the answer is yes.

What’s the cost of such a technology? Believe it or not, it’s free. It’s the snapshot functionality built into VMware’s vSphere Hypervisor (based on ESXi), which you can leverage once you virtualize your Oracle database or any other x86 system. Is the performance impact truly negligible? Don’t take it from me; you can read it in Oracle’s own words.

Recently RedHat and Oracle released updates to their mainstream distributions (RedHat Enterprise Linux 5.6 and Oracle Linux 5.6), and each has also shipped an entirely new version (RedHat Enterprise Linux 6 and Oracle Linux 6). System and database administrators all around the world are updating their critical systems, and applying those updates to critical systems requires testing and downtime.

The best practice for operating system upgrades is to test the upgrade on the exact same hardware, software and data as production. But are your test and production systems identical? Same motherboard? Same processors? Same amount and brand of RAM, and the same exact operating system packages installed? Pretty unlikely. With virtualization, the virtual hardware presented to your test and production systems is identical.

With VMware virtualization, I can take a clone of the entire production server while it’s live and being used by users. Now I have a truly identical copy of production on which to test my upgrade. Note that cloning a live production server requires VMware vCenter, which isn’t free, but vSphere Essentials (which includes vCenter) starts at $1,000 as of this writing.

With snapshots (which are free and don’t require vCenter), you take a snapshot of the virtual machine (3 mouse clicks and less than 10 seconds) and then do the upgrade. If you run into issues, you just revert the snapshot to roll the machine back to the state it was in when the snapshot was taken. The time to do that rollback is only the time needed to reboot your virtual machine: 5 minutes? If the upgrade and testing go smoothly, you merge the snapshot into the virtual machine while the system is up and available to users. Total time spent on non-upgrade activities such as backups or restores? Essentially zero.
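If you’d rather script it than click, the same workflow is available from the ESXi tech-support shell via vim-cmd. A minimal sketch; the VM name oradb01, the ID 42 and the snapshot names are all placeholders:

    # Find the VM's ID, then snapshot it before the upgrade
    vim-cmd vmsvc/getallvms | grep oradb01      # note the Vmid, e.g. 42
    vim-cmd vmsvc/snapshot.create 42 "pre-upgrade" "before OS update" 0 1

    # ...do the upgrade and test...

    # If it goes badly, find the snapshot's Id and revert:
    vim-cmd vmsvc/snapshot.get 42
    vim-cmd vmsvc/snapshot.revert 42 <snapshotId> 0

    # If it goes well, merge (delete) the snapshot with the VM still running:
    vim-cmd vmsvc/snapshot.removeall 42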

Compare this to how you would handle a critical system without virtualization: take the system offline, take a full backup, do the upgrade (hoping it works just like it did on the similar but most likely not identical test system), have the users test, and, if there’s an issue, possibly spend hours restoring from backup. You do trust your backups… right?

Enterprise DBAs and the companies that employ them tend to be risk averse when it comes to losing data or experiencing downtime. Virtualizing the hardware lets you ensure your test and development systems are exactly the same as production, thereby reducing the risk of unforeseen issues during the upgrade. Snapshots let you take a save point of your entire server in seconds and roll back to it in minutes rather than hours, almost entirely eliminating the downtime associated with recovering from an unforeseen issue.

Too big to fail: Virtualizing Big Iron databases

Recently I was talking with a company in Houston, TX running Oracle E-Business Suite (EBS) 11i on Solaris with an Oracle 9i database. They run VMware for other parts of their business and wanted to leverage the features of VMware vSphere and Site Recovery Manager (SRM) to virtualize their EBS environment, giving them the ability to quickly move it to their geographically distant DR site in the event of a hurricane or other natural disaster bearing down on Houston.

This call had lots of moving parts. They were on an Oracle 9i database and wanted to move to Oracle 11g to ease their support costs. They also wanted to move from Solaris to commodity x86 hardware running RedHat Linux, again to ease support costs. Their existing Oracle 9i database was consuming 8 SPARC processors at peak levels throughout the business day.

In VMware vSphere, the maximum you can present to a VM is 8 virtual x86 processors. Is a virtual x86 processor as fast as a physical SPARC processor? Would their SQL run faster on Oracle 11g than it did on Oracle 9i? Will RedHat Linux let the database process requests as fast as Solaris did? Will their SAN storage and LUN layout be fast enough? Will their file system be a limiter?

Besides building up the environment and just going for it in production, how can you know?

By leveraging some very cool tools from both Oracle and VMware.

For the Oracle database, Oracle offers an add-on called Real Application Testing, which includes a feature called Database Replay. Database Replay allows you to capture the workload on your production database server and replay it on another environment to see whether things are faster or slower. Although this was introduced as a new feature of the Oracle 11g database, Oracle made backports of the capture piece available for 9i and 10g databases, exactly for purposes like this (well, maybe not to aid in virtualizing to VMware, but you get my drift).

Using Database Replay and Real Application Testing (a separately licensable option for Oracle Enterprise Edition) allows companies to test SAN changes, hardware changes, database upgrades, OS changes, etc., all under a production load but without risking actual production issues.
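For the curious, the capture-and-replay cycle is driven by two PL/SQL packages, DBMS_WORKLOAD_CAPTURE and DBMS_WORKLOAD_REPLAY. A rough sketch; the directory paths, names and one-hour capture window are purely illustrative, and the replay side also needs one or more wrc workload replay clients running against the test database:

    # On production: capture an hour of workload
    sqlplus / as sysdba <<'EOF'
    CREATE DIRECTORY capture_dir AS '/u01/replay/capture';
    EXEC DBMS_WORKLOAD_CAPTURE.START_CAPTURE(name => 'peak_load', -
         dir => 'CAPTURE_DIR', duration => 3600);
    EOF

    # On the test clone: preprocess the captured workload, then replay it
    sqlplus / as sysdba <<'EOF'
    EXEC DBMS_WORKLOAD_REPLAY.PROCESS_CAPTURE(capture_dir => 'CAPTURE_DIR');
    EXEC DBMS_WORKLOAD_REPLAY.INITIALIZE_REPLAY(replay_name => 'peak_test', -
         replay_dir => 'CAPTURE_DIR');
    EXEC DBMS_WORKLOAD_REPLAY.PREPARE_REPLAY;
    EXEC DBMS_WORKLOAD_REPLAY.START_REPLAY;
    EOF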

Where does VMware fit into this? The way Real Application Testing and Database Replay work is by capturing all the transactions generated in production, massaging them a little bit, and then playing them back against a clone of production. That clone needs to be at the exact same point in time (or SCN, System Change Number in database speak) as PROD so that the replay is playing back against an exact replica of the database. Although setting up a clone at an exact SCN isn’t hard for an Oracle DBA, it does require time: time to build the test system, time to restore a backup of the database and time to apply archive logs and roll the database forward to match PROD’s SCN.
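With RMAN, bringing the clone to the capture’s exact starting SCN looks roughly like this; a sketch that assumes the auxiliary instance is already prepared, with the credentials and SCN as placeholders:

    rman target sys/pwd@prod auxiliary sys/pwd@clone <<'EOF'
    # Roll the clone forward to exactly the SCN the capture began at
    DUPLICATE TARGET DATABASE TO clone UNTIL SCN 8437651;
    EOF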

Even in cases such as this where the production database isn’t virtualized, by making the test system virtual we can not only test all these changes but also leverage VMware snapshot technology to very quickly take our database back to the SCN we want to run Database Replay against, without having to restore the database over and over. Using snapshots, you go through that setup effort once, take a snapshot, and then just keep rolling back to your snapshot as many times as necessary to test performance.

Of course, you may find that the 8-processor limit in VMware, or the OS, or the SAN can’t handle your production load. Time to give up and stay physical? No. In Oracle 10g, and further refined in Oracle 11g, Oracle greatly improved the database’s ability to help a DBA manage system load and even to tune itself. By leveraging features such as Advanced Compression and SecureFiles (to reduce physical I/O) and Automatic Optimizer Statistics Collection and the Automatic SQL Tuning Advisor (to tune queries to use less CPU and/or disk resources), you can give your database more room to grow yet still stay on the same (or less!) hardware.
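To give a flavor of what those features look like in practice, here’s a sketch using 11gR2 syntax; the table names are made up:

    sqlplus / as sysdba <<'EOF'
    -- Advanced Compression: compress a hot table for OLTP workloads
    ALTER TABLE app.orders MOVE COMPRESS FOR OLTP;

    -- SecureFiles: store LOBs compressed and deduplicated to cut physical I/O
    CREATE TABLE app.docs (id NUMBER, body CLOB)
      LOB (body) STORE AS SECUREFILE (COMPRESS MEDIUM DEDUPLICATE);

    -- Confirm the Automatic SQL Tuning Advisor task is enabled
    EXEC DBMS_AUTO_TASK_ADMIN.ENABLE(client_name => 'sql tuning advisor', -
         operation => NULL, window_name => NULL);
    EOF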

Lessons learned from a virtualized Oracle upgrade

So about a week ago we did a rather massive upgrade to the Oracle E-Business infrastructure at my main client. The main items in this upgrade were:

License the modules necessary for a full installation of Oracle HR
Upgrade Oracle database from 11.1.0.7 64-bit to 11.2.0.1 64-bit
Apply all CPU security patches through April 2010
Upgrade memory on DB server from 8G to 12G
Upgrade server side java from 1.6.0_16 to 1.6.0_20
Upgrade client side java from 1.6.0_16 to 1.6.0_20b5 (see this link on why the special b5 version)
Apply approximately 350 (not a typo) individual E-Business patches, covering the following:
o Minimum Baseline Patch Requirements for Extended Support on Oracle E-Business Suite 11.5.10 (Note 883202.1)
o Upgrading from Financials Family Pack F to Family Pack G (FIN_PF.G)
o Recommended 11i Apps patches for all our products
o Java related patches
o Latest DST v11 related patches (see here)
o Implement WebADI

As you might gather from this list, it was a rather large upgrade. The apps patches alone totaled about 10GB once merged into one patch, and the backup directory for the merged patch ended up totaling 6GB. Test runs had the upgrade taking about 24 hours with 8 CPUs on some scratch disk storage I had in the SAN. As I mentioned in previous posts, we took VMware snapshots of our boxes at various points in the upgrade in case we hit an unforeseen issue and needed to roll back.
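For anyone wondering how 350 patches become one: Oracle’s AD Merge Patch utility combines the individual patch directories into a single patch that adpatch applies in one pass. A sketch with example paths:

    # Merge every unzipped patch under /u01/patches/src into a single patch
    admrgpch -s /u01/patches/src -d /u01/patches/merged -merge_name upg_merged
    # Then run adpatch against the merged driver as usual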

One of the VMware best practices we follow is to split each VM’s boot “disk” and data “disk” into their own virtual disks. Aside from booting up or shutting down a VM, the boot disk generally sees very little traffic. So it’s pretty typical, especially with a replicated SAN system such as ours, to put the boot “disks” (VMDKs) for a bunch of VMs on one VMware datastore, possibly with slower disks, and the data “disks” (VMDKs) on another, dedicated datastore. In our case, the boot disk datastore is a 2-disk RAID 1 (mirrored) set of Fibre Channel drives and the data disk datastore is a 9-disk (8+1) RAID 5 datastore of SSDs (aka EFDs, aka super super fast disks).

Although I had done multiple dry runs before the upgrade, one thing I failed to realize is that by default VMware stores snapshots where the VM lives, or more specifically, where the VM’s configuration file lives… in this case, on my slowest disks.
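For what it’s worth, the snapshot location can be redirected: adding a workingDir entry to the VM’s .vmx file (edited with the VM powered off) moves where the delta files are created. As I understand it, this also relocates the VM’s swap file unless sched.swap.dir is set separately. The datastore path here is an example:

    workingDir = "/vmfs/volumes/ssd_datastore/oradb01"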

The cost of having snapshots on my slowest disks became extremely clear during our large merged patch of 330+ Apps patches: things got slower and slower. At that point, shutting down the VM and moving the snapshots wasn’t really an option. It was just a matter of suffering through it and learning for next time. Luckily the business had planned on the patching taking the full 24 hours, even though I had expected roughly half that time with the SSDs.

By the time the upgrade was done and the business analysts had finished their testing and called the upgrade good (and hence we were ready to delete the 5 sets of snapshots), the snapshots for my two VMs, which together use about 450GB of space, had grown to about 200GB. It took about 5 hours for the snapshots to be merged into the base VMDKs. Although the system was usable during that time, it was quite laggy. Luckily it was still the weekend for most of our users and they weren’t too inclined to use Oracle.

On the subject of VMware snapshot deletions, I recently came across two notes that should be of use to other VMware admins:
1) With the latest version of vSphere (4.0 Update 2), VMware has greatly improved the speed and efficiency of deleting all the snapshots for a VM. You can read more about it here. Unfortunately, at the time of my Oracle upgrade I was on vSphere 4.0 Update 1.
2) When you delete a large snapshot, it will frequently appear to “hang” at 95%. Check out this knowledge base article on how to monitor snapshot deletions.
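On point 2, the gist of the monitoring is simply watching the delta files from the ESXi shell as they’re consolidated into the base disks. Something like this, with the datastore and VM names as examples:

    # The *-delta.vmdk files change and disappear as each snapshot is merged
    watch -n 60 "ls -lh /vmfs/volumes/fc_datastore/oradb01/*-delta.vmdk"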

Overall the upgrade was a success and, minus the occasional user issue, Monday morning (the first business day after the upgrade) was pretty much a non-event.

These are the sorts of situations that make sending your people to training, or giving them the time and inclination to read manuals and blogs, so essential. Not as a result of this, but somewhat related: I’ll be attending the VMware vSphere troubleshooting class in the next month or two, will (assuming I pass the test) be earning my VCP, and may try for a VCAP-DCA by the end of the year.

How virtualization can magnify your architecture problems

I recently started working with a new client whose hosting provider hosts their Oracle database on Linux under VMware. An excellent choice, but this client was experiencing major performance issues; forms taking a minute or more to come up was just one example.

As I learned more about their environment, I found that virtualization (VMware in this case, though the issue isn’t specific to any particular virtualization vendor) had actually made their system performance worse. I know, I’m a VMware groupie (heck, a VMware vExpert!) and we’re all amazed I’d write such a thing, but alas, it’s true.

The database is around 80GB in size. Each day the hosting provider would take a full (level 0 incremental) backup of the Oracle database via RMAN, writing that backup to the same mount point in the VM that the database itself uses.

Please take a moment to catch your breath and stop clenching your hands into fists over this very very bad idea.

So why is this such a bad idea? For a couple of reasons.

One is performance: you’re greatly degrading the performance of your database by writing a full backup to the same disks that are trying to handle database requests. You have, at the least, doubled the amount of I/O going to those disks.

Two is recoverability. If your ESX host or your VM experiences an issue (running out of disk space, disk corruption, fire, whatever), you can no longer access the mount point in the VM where you backed up the data.

Best practice for implementing RMAN in a situation like this is to back up your database to another set of disks on another machine in another physical location. A typical example is to have an NFS export on your backup destination server (in another datacenter) and have RMAN write directly to that NFS mount. This way you aren’t writing your backup to the same disks (thereby barely impacting production performance) and you’re covered in the case of issues with the hardware or the VM itself.
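Concretely, that’s a one-line mount plus pointing RMAN’s FORMAT at the remote mount. A sketch; the hostnames and paths are made up:

    # Mount a backup export from a server in the other datacenter...
    mount -t nfs backuphost:/export/orabackup /mnt/orabackup

    # ...and have RMAN write there instead of to the database's own disks
    rman target / <<'EOF'
    BACKUP INCREMENTAL LEVEL 0 DATABASE FORMAT '/mnt/orabackup/%d_%U.bkp';
    BACKUP ARCHIVELOG ALL FORMAT '/mnt/orabackup/%d_arch_%U.bkp';
    EOF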

So where does VMware fit into this? I mentioned that the hosting provider was also performing VM-level backups. In particular, they were performing VM-level backups at the same time they were running RMAN backups. All to the same set of disks.

Now I’ve got the VMware Admins and the Oracle DBAs cringing.

When you initiate a VM-level backup, VMware takes a snapshot of the VM. This means it creates a delta file on the same ESX datastore and stops writing to the VMDK(s) that make up the VM; all changes to the VM get written to the delta file instead. That delta file grows in 16MB increments and can end up as large as the original VMDK.

When you take a VM-level backup, you want to choose a time when you’re not doing many writes to the VM. That way the delta file won’t grow so big that you risk running out of space on the datastore (LUN), and the performance impact is reduced.

So here they are, writing their full 80GB Oracle backup to a mount point inside their VM. That’s 80GB of writes. VMware sees those writes and has to write them to its snapshot (delta file). So now you’re not only serving up database queries from those disks: you’re also scanning every block of your database on those disks for changes (this database did not employ Oracle Changed Block Tracking), you’re writing a full RMAN backup to those disks, and VMware is having to copy all those writes into a delta file on those same disks.
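As an aside, enabling Changed Block Tracking is a one-liner, and paired with a weekly level 0 / daily level 1 strategy it would keep RMAN from re-reading every block of the database each day. The file path is an example:

    sqlplus / as sysdba <<'EOF'
    ALTER DATABASE ENABLE BLOCK CHANGE TRACKING
      USING FILE '/u01/oradata/PROD/bct.chg';
    EOF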

Virtualization can be wonderful and solve or simplify many of the issues an administrator faces, but it can also magnify fundamental architecture flaws.