Little Help with data de-identification/ data masking / data obfuscation
-
Wednesday, April 25, 2012 12:44 PM
Hi All,
I am Harsha, a java dev and the company I work for develops scheduling application for hospitals.
In our database we have a table called client which stores patients info in columns such as first name last name, address1, address2, etc.,
we have to take the data from the production system for testing purposes.
I was given a task to write a sql query which will de-identify/mask/obfuscate this tables data.The data should be randomly shuffled in between rows.
Can anybody help me with the best way to do this. Any code bits will help a lot. What's the process you follow to mask sensitive data?
Thanks in advance,
Harsha
All Replies
-
Wednesday, April 25, 2012 10:02 PM
When I've done this in the past, I've grabbed a list of names from somewhere, and divided into first names and last names. I've also created a list of street names and cities.
Make each list into a table, then run a cursor and foreach row do:
UPDATE client
SET firstname = (SELECT TOP 1 name FROM firstnames ORDER BY newid()),
lastname = (SELECT TOP 1 name FROM firstnames ORDER BY newid()),
...
WHERE client_id = @current_idThe point is that ORDER BY newid() will give you a random choice.
You could make it more efficient by doing it set-based, but as long as it does not take too long to run, there is no reason to make it overly complex.
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se- Marked As Answer by Maggie LuoMicrosoft Contingent Staff, Moderator Monday, May 07, 2012 5:41 AM
-
Thursday, April 26, 2012 4:38 AM
Thanks Erland,
It's a great idea.
I want to use the data inside the table itself, I will make the sensitive columns into a table and then do the process that you suggested.
I just have one question, What's the @current_id in your query? do you mean the unique id for each row in the client table.
And in our client table the unique column is client_id.
Thanks again.
-
Thursday, April 26, 2012 7:32 AM
I just have one question, What's the @current_id in your query? do you mean the unique id for each row in the client table.
I suggested that you should run a cursor over the table (to keep it simple), and @current_id is just a variable for the cursor loop.
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

