With our most recent upgrade, we’ve implemented some of the Human Resources modules.
We’re now storing some sensitive employee data (stuff like salaries, Social Security Numbers, banking info, etc.) and one of the business mandates is to scramble this data in our cloned instances.
I knew from reading Steven Chan’s blog (this article by Sanchit Jindal) that Oracle offers two plugins for Oracle Enterprise Manager (OEM) that might serve our needs: the Data Masking Pack and the Application Management Pack for E-Business Suite.
While reviewing which of these might be best for our environment, I made note of the following resources for those interested:
Application Management Pack for E-Business Suite
Data Sheet- http://www.oracle.com/technology/products/oem/pdf/apps_mgmt_ebiz.pdf
Download- My Oracle Support, Patches, Search for Patch Number- 8333939
Data Masking Pack
Data Sheet- http://www.oracle.com/technology/products/manageability/database/pdf/ds/data-masking-pack-11gr2-datasheet.pdf
Canned Demo- http://download.oracle.com/technology/products/oem/screenwatches/data_masking/index.html
Download- http://www.oracle.com/technology/software/products/oem/index.html
I also found the comparison matrix in the Steven Chan blog post above (direct link is http://blogs.oracle.com/stevenChan/images/comparisonmatrix.jpg) to be extremely useful.
For an EBS environment, AMP looks to be the better fit – it has pre-defined masking for EBS, which was exactly what I needed since I was going to be modifying data in pre-defined EBS tables / objects. AMP also allows you to hot-clone an Oracle EBS instance, something many newer Apps DBAs don’t have methods / scripts written to do themselves.
In our environments, however, using AMP’s data scrambling would have required significant changes with a real performance impact on PROD.
The data scrambling can only take place as part of a clone, and AMP only supports cloning through RapidClone integration; it specifically does NOT support cloning from an RMAN backup. This means that utilizing AMP would require cloning directly from PROD, causing a non-negligible performance impact.
Another issue was cost. AMP is licensed per processor at a list price of $7k per processor. To get the number of processor licenses you need to buy, you add up your Database, Application Server and Web Server (if on a separate system from the Application Server) processors. In our case we utilize 8 Oracle Database licenses and 4 Oracle Application Server / Web Server licenses, so 12 * $7k, or $84,000 list price, plus yearly support fees.
As a result of these two issues with AMP, the business asked me to see if I could come up with a cheaper, home-grown solution that would let us keep our existing cloning strategies without costing tens of thousands of dollars.
We had 3 main things we needed to scramble: National Identifiers (aka Social Security Numbers or SSNs), Salary data, and banking direct deposit information like routing and account numbers.
National Identifiers (SSNs):
National identifiers are stored in the per_all_people_f table, in the national_identifier column. This column is NOT the primary key, however. The primary key for per_all_people_f is
(PERSON_ID, EFFECTIVE_START_DATE, EFFECTIVE_END_DATE), and each time a person changes positions or leaves / comes back to the company, a new row is added with an appropriate effective date range. This means I had to change the SSN to the same value in each of a person's records. For example, Bob might have 4 records in this table due to leaving and coming back or switching positions. In all those records in PROD he'd have the same SSN of '123-45-6789', and I needed to ensure that when I cloned the instance and scrambled the data, his new SSN was consistently the same (but of course different from PROD) in the clone.
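To see this in practice (and to sanity-check consistency after scrambling), a query along these lines lists people with multiple date-tracked rows; ssn_count should always be 1:

select person_id, count(*) as row_count,
count(distinct national_identifier) as ssn_count
from per_all_people_f
where national_identifier IS NOT NULL
group by person_id
having count(*) > 1;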
First I needed to generate a bunch of random but valid SSNs. There are actually quite a few rules associated with this (for example, if you receive your SSN while living in Delaware, the first three digits must be 221 or 222) and luckily I found some scripts at fakenamegenerator.com which allowed me to create a long list of fake random SSNs. The fngssn.class.php file is GPL and is available at this link. I also modified their example.php script to call the fngssn.class.php file and output an SSN. I then just looped through this program however many hundreds or thousands of times I needed, outputting and concatenating the SSNs to a dump file.
I then bring this list of SSNs into an Excel spreadsheet as column B.
Column A is update per_all_people_f set national_identifier = ZZ
Column B is my list of fake SSNs
Column C is ZZ where national_identifier = ZZ
Column D is my list of real SSNs – this can be obtained with the following SQL: select distinct(national_identifier) from per_all_people_f where national_identifier IS NOT NULL;
Column E is ZZ;
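Concatenated together, each spreadsheet row comes out looking like this (before the ZZ replacement):
update per_all_people_f set national_identifier = ZZ567-42-3477ZZ where national_identifier = ZZ123-45-6789ZZ;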
I then save this spreadsheet out to a plain text file and edit the file, replacing all the ZZ with ' (Excel handles ' weirdly, so I just use vi afterward to do a global replace of ZZ with ', e.g. :%s/ZZ/'/g; it's one command and takes a couple of seconds).
At this point I'm left with a SQL file with commands like
update per_all_people_f set national_identifier = '567-42-3477' where national_identifier = '123-45-6789';
I just need to run these and poof, the SSNs are now scrambled.
Salary data:
This was the easiest to change. Some people are paid hourly, some get a salary. In either case, their salary or hourly rate is stored in per_pay_proposals.proposed_salary_n . Someone with a salary of $100k a year would have 100000 in this field, while someone paid $10 an hour would have 10 . It's perfectly OK for people to have the same salary in real life, and I figured for the purpose of our scrambled clone I'd go with a socialist ideal and just give everyone the same value of 50. Admittedly, this means salaried people now get $50 a year while the hourly people get $50 an hour. It's good to be an hourly employee, at least in my scrambled instances, and maybe now the hourly employees can afford health care!
update per_pay_proposals set proposed_salary_n = 50 where proposed_salary_n IS NOT NULL;
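A quick sanity check afterward should return the single value of 50:
select distinct proposed_salary_n from per_pay_proposals where proposed_salary_n IS NOT NULL;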
Banking data:
There are two parts to the banking data: the routing number (think of it as the bank number) and the account number. These are the first two sets of numbers at the bottom of your checks (the third set is the check number, but that's not relevant here).
The pay_external_accounts table holds the routing and account number info in some of the segment columns – where exactly depends on your implementation. We only wanted to update this data for people’s personal accounts, so our SQL looks like this:
update pay_external_accounts
set segment6 = '44444444444', segment7 = '1111111111'
where external_account_id IN (select b.external_account_id
from pay_personal_payment_methods_f a, pay_external_accounts b
where a.external_account_id = b.external_account_id);
—
There you go, all done. I know it sounds like a lot of manual steps, but once you've built the Excel spreadsheet and generated your list of fake SSNs, all you really need to do is run the SQL for Column D, paste those values in, change the ZZ to ' and run the file. The SQL statements for the salary and banking data take a few seconds each. Call it 10 or 15 minutes total per clone.
Now could this all be better? Sure. I could generate my fake SSNs via a SQL cursor and never need Excel or PHP. But for now, this works. Please feel free to improve on my system and I'll update the blog with the changes.
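For the curious, a minimal PL/SQL sketch of that cursor idea might look something like the block below. It only enforces the universally invalid SSN patterns (no 000, 666 or 900-999 area, no 00 group, no 0000 serial), skips the state-specific area rules, and doesn't guard against the rare collision between two generated values:

declare
  l_area number;
  l_ssn  varchar2(11);
begin
  -- walk each distinct real SSN and map it to one fake SSN, so every
  -- one of a person's date-tracked rows stays consistent
  for r in (select distinct national_identifier
              from per_all_people_f
             where national_identifier is not null)
  loop
    loop
      l_area := trunc(dbms_random.value(1, 900)); -- 001-899: never 000 or 9xx
      exit when l_area <> 666;                    -- 666 is never issued
    end loop;
    l_ssn := lpad(l_area, 3, '0') || '-' ||
             lpad(trunc(dbms_random.value(1, 100)), 2, '0') || '-' ||  -- group 01-99
             lpad(trunc(dbms_random.value(1, 10000)), 4, '0');         -- serial 0001-9999
    update per_all_people_f
       set national_identifier = l_ssn
     where national_identifier = r.national_identifier;
  end loop;
  commit;
end;
/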
Interesting post. From an ‘absolute’ security point of view there may be some small holes. For example, and especially for a clone to be used as ‘read-only’, data may sit around in UNDO and be queryable through a flashback query (AS OF SCN…). In the underlying data files, some of the old values may still be left around. This can come from deleted records (probably unlikely in your case) or from indexes that have been jiggled around. For example, an index on SSN may have had a lot of activity with the updates, and there may be some blocks which become unused (and unreferenced) with the new data but which contain the ‘residue’ of the old values. This is more likely if you decide to drop and rebuild the index.
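For illustration, such a flashback query would look roughly like this (the one-hour window is arbitrary and limited by your UNDO retention):

select national_identifier
from per_all_people_f as of timestamp (systimestamp - interval '1' hour)
where rownum <= 10;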
It would be an interesting experiment to take a dozen sample pre-update SSNs and grep the data files for them to gauge the success of the masking.
You raise good points. I guess I should point out our goals with the scrambling. The main one is to allow developers and business analysts who access the development instances to use the apps username/password without being able to see sensitive data.
They have no OS level access to the servers themselves or the data files / archivelogs / backups.
We also do not currently use flashback logs.
I think for the goals of the business we’re good, but let me know if you think of something I’ve missed.