locked
Data Masking RRS feed

  • Question

  •  

    How can we mask the actual data stored in sql server 2000 tables.

     

     

    for ex: if we are storing employees personal information in EMPLOYEE table, but if we are moving the data to another table EMPLOYEE_MASK table with the masked data, how to perform this action, whether the sql server 2000 provides any function or stored procedures to mask the existing in EMPLOYEE table.

     

     

    or else masking the data while inserting it in to table.

     

    Tuesday, October 21, 2008 1:04 PM

Answers

  • There are no data masking features in SQL Server 2000 - you need to either implement data masking functions yourself or rely on a third party solution.

     

    In SQL Server 2005 and in SQL Server 2008, you may be able to use cryptographic hashing functions to perform data masking, but you would still need some level of custom coding to implement a masking function - for example, if you want the masked data to be of the same data type as the original data.

     

    Data masking is tricky, if you expect security type of guarantees - there are publicized accounts of data being made available in masked form only to be broken by correlating the non-masked data portions with other related public databases. To my knowledge, there is no generic data masking technique that can be considered "secure" in the same way in which an encryption technique is considered secure and I don't think we'll see one soon - I would treat each data masking problem as a special scenario that would need a specialized solution.

    Wednesday, October 22, 2008 1:32 AM

All replies

  • There are no data masking features in SQL Server 2000 - you need to either implement data masking functions yourself or rely on a third party solution.

     

    In SQL Server 2005 and in SQL Server 2008, you may be able to use cryptographic hashing functions to perform data masking, but you would still need some level of custom coding to implement a masking function - for example, if you want the masked data to be of the same data type as the original data.

     

    Data masking is tricky, if you expect security type of guarantees - there are publicized accounts of data being made available in masked form only to be broken by correlating the non-masked data portions with other related public databases. To my knowledge, there is no generic data masking technique that can be considered "secure" in the same way in which an encryption technique is considered secure and I don't think we'll see one soon - I would treat each data masking problem as a special scenario that would need a specialized solution.

    Wednesday, October 22, 2008 1:32 AM
  •  

    Is it any features or stored procedures available to shuffle the sql server 2000 database tables contents in column wise.

     

    For ex:- EMPLOYEE table , fileds like {empl_id,name,addrs,ph_no,email} all the contents of these columns should be shuffled. and stored another table EMPLOYEE_CHG table. with the shuffled values. (ie) the empl_id of one employee should contain the information of another similarly all the column values.

    Wednesday, October 22, 2008 6:42 AM
  •  

    Is it any features or stored procedures available to shuffle the sql server 2000 database tables contents in column wise.

     

    For ex:- EMPLOYEE table , fileds like {empl_id,name,addrs,ph_no,email} all the contents of these columns should be shuffled. and stored another table EMPLOYEE_CHG table. with the shuffled values. (ie) the empl_id of one employee should contain the information of another similarly all the column values.




    PeopleSoft Developer
    Wednesday, October 22, 2008 6:44 AM
  • No, there are no builtin functions for doing such shuffles. It's actually the first time I hear of someone looking for something like this - what would you use such functions for?

     

    Wednesday, October 22, 2008 5:34 PM
  •   Laurentiu is correct, data masking is tricky as it will potentially leak information about the real data.

     

      If possible, it is better to provide randomly generated data that is meaningful/consistent instead of use a transformation on top of the real data, this will protect the real data and allow the development/test teams to work with a testing data set that does not contain sensitive data (or statistics on sensitive data).

     

      There is a video resource that may help you in this case How Do I: Generate Test Data using Visual Studio Team System Database Edition?.

     

      -Raul Garcia

      SDE/T

      SQL Server Engine

     

    Wednesday, October 22, 2008 5:47 PM
  •   Please correct me if my assumption is incorrect, but this question is related to this other thread: http://forums.microsoft.com/msdn/ShowPost.aspx?PostID=4027414&SiteID=1, correct?

     

      If that is the case, please let us know so I can mark this thread as a duplicate post and avoid confusion.

     

      Thanks,

       -Raul Garcia

      SDE/T

      SQL Server Engine

     

    Wednesday, October 22, 2008 5:59 PM
  • Our issue is that we need this data to be able to test reports / BI. We need to maintain the relationships that exist without exposing the "real" data. There are certain aspects of the data that must be retained when creating our "test" data.

     

    I have found a third party tool that will mask columns while keeping relationships inttact, without expsoing the real data.

     

    Can you mask in VS2008 DB Pro's? I'd rather replace names / data than insert new rows into an empty database...

    Sunday, October 26, 2008 2:11 PM
  •  ansonee wrote:

    I have found a third party tool that will mask columns while keeping relationships inttact, without expsoing the real data.

     

    Well, that's the trick - if relationships are intact, they may be all that is needed to identify the masked data. Note that unlike encryption where there are standardized algorithms available, data masking is a problem with no proven solutions.

     

    Here's a more eye-opening discussion on the risk of trusting anonymizing tools (anonymization is another name for data masking):

     

    http://www.schneier.com/crypto-gram-0801.html#1

     

    Hope this helps

     

    Tuesday, October 28, 2008 10:03 PM
  • That's all well and good, but we are in the medical arena. Not only is our data just used to track patients, it is also used to drive future offerings and services, so just creating a bunch of random data is not going to foot the bill.

     

    We also not only use medical conditions, but geographic regions to drive what services will be offered where.

     

    If this were some Northwinds type or Adventureworks type of data and we were just checking how many sales took place that would be fine. But we have different departments that need to actually know WHAT happened without knowing WHO it happened to...that way we see thatthere were x number of vaccinations in Y region across so many dimensions...

    Tuesday, October 28, 2008 11:36 PM
  • I never said you need to generate random data. I am saying that relationships can betray the data. This means that how secure the anonymization is depends not only on the masking method but also on the data to which you apply it. This means that even if you have in your hands the most perfect data masking solution, you have still not solved the problem - you still need to consider whether your anonymized data is secure.

    Wednesday, October 29, 2008 1:10 AM
  • I follow ya!  ;-)

    Wednesday, October 29, 2008 1:43 AM
  • I was in the same dilemma, spent a lot of time but what saved me was Dataguise. They have a solution for SQL server, very easy to deploy. Tell them Hari from Austin Texas recommended and to give you a trial POC.

    Thursday, November 13, 2008 8:32 AM
  • http://www.calsql.com/2009/09/ms-sql-server-data-masking.html

    check the above link to mask the data including the relational keys.
    Monday, October 5, 2009 1:53 PM
  • http://www.calsql.com/2009/09/ms-sql-server-data-masking.html
    checkout the above link contains all the steps to do data masking in sql server.
    Tuesday, October 27, 2009 10:25 PM
  • this threads a bit old, but the topic is pertinent to what I'm searching for.

    As I'm also in the med industry, I think I get what Ansonee is asking for.

    The crux of it seems to be that for demonstration purposes, a copy of the database is needed with scrubbed identifiers, such that a user can interact with the program, and see example data.  But not violate HIPPAA in seeing patient information.

    I'm not familiar with masking or anonymization, but I gather the intent was it would act like a view and blank out certain columns.  The problem being, hooking that into existing code.  So I suspect the OP, like me, simply wanted to randomize some of the scary parts of the data, so that the basic shape of it was still comparable to the production server.

    For my part, I wrote UPDATE scripts to replace SSNs, Group numbers, Subscriber IDs with the value of their containing table's primary key.  Fairly trivial, but only half way there.  Ideally, I'd like the user to see "names" of patients, I just don't want them to be real patients.  For that, (and this is where the OP was headed, I think), I want to shuffle all the first names, then shuffle all the last names, then shuffle all the addresses.

    The result is, the original data simply provided me with credible names, but not matching up to anybody.  Now, if somebody got ahold of a copy of this database, I don't think they'd have anything (unless there's logs they could check).  If they got to my production DB, well, I'd be hosed for different reasons.

    I'm still googling an answer, but the gist is, can we find SQL to do:

    update Patients set FirstName=GetRandom('Patients','FirstName'), LastName=GetRandom('Patients','LastName')

    I suspect if I wrote a udf for GetRandom() I'd be set...

    From there, the question should be, is that good enough, assuming all identifiable real-world numbers like SSN have been replaced, and names and addresses are scrambled?

     

     

     

    Monday, November 8, 2010 10:19 PM
  • The US Census Bureau has a list of names that you can use.  www.census.gov/genealogy/www/freqnames.html

    There are 3 lists: female first names, male first names, and last names.  You could easily use these to substitute names in your current data set. Since your data doubtless has gender, you can even apply gender appropriate first names.

    Here is a sample code I wrote once. It ignores gender and is based on substitution from the existing data (not the Census data), but it provides an idea:

    -- You need a sequential ID for this to work
    CREATE TABLE #scramble
    (ID INT IDENTITY,
     PersonID INT,
     FirstName NVARCHAR(100),
     LastName NVARCHAR(100))
    
    INSERT INTO #scramble (PersonID, FirstName, LastName)
    SELECT PersonID, FirstName, LastName
    FROM YourSourceTable
    
    DECLARE @HowMany INT
    DECLARE @LastNameModulo INT
    DECLARE @FirstNameModulo INT
    
    -- For the modulo to work correctly, you 
    -- must know how many rows you are working with
    SELECT @HowMany = COUNT(*) FROM #scramble
    SELECT @LastNameModulo = CAST(RAND() * @HowMany AS INT)
    SELECT @FirstNameModulo = CAST(RAND() * @HowMany AS INT)
    
    UPDATE a 
      SET a.FirstName = b.FirstName, a.LastName = c.LastName
    FROM #scramble a
     JOIN #scramble b ON b.ID = ((a.ID + @HowMany) % @FirstNameModulo) + 1
     JOIN #scramble c ON c.ID = ((a.ID + @HowMany) % @LastNameModulo) + 1
    

    This just scrambles the temp table of course, but if satisfactory, could then be applied to the source table.  You could modify this to use the names provided by the census.

    All strictly FWIW,
    RLF

    Tuesday, November 9, 2010 2:17 PM
  • I know this thread is OLD, but I saw someone just replied and thought I would provide the answer I finally found...and have been using for years!!

    DataMasker...it is an AMAZING piece of software!! It gave me everythign I wanted: anonymization, data generation, etc. And even more...all at a great price!!

    I HIGHLY reccomend this software to anyone who has a need for data masking / obfuscation!


    A. M. Robinson
    Thursday, November 18, 2010 5:29 PM
  • I have explored some of the third party software's for Data obfuscation and I do agree with ansonee on DataMasker (http://www.datamasker.com/). Its very easy to use, have almost all the features that an enterprise needs, and performance is awesome. The only disadvantage i can see with it is that the client can be installed on windows only.

    On the other hand the one referred by harieta i.e. Dataguise has a very good number of features but it runs very slow while working on the large databases and at some points the application crashes showing some run time errors.

    I am looking at some other solutions also and will update here with my feedback.

    Friday, November 26, 2010 8:01 AM