locked
Returning Random Row based on Subset of Data within Table RRS feed

  • Question

  • Hi,

    Please see below.  Running SQL Server 2008 R2.

    Sample DDL:

    CREATE TABLE [dbo].[TestPersons]
    	(
    	[TestPersonID] [int] NOT NULL IDENTITY(1,1),
    	[FirstName] [varchar](50) NULL,
    	[LastName] [varchar](50) NULL,
    	[AreaID] [varchar](50) NULL,
    	CONSTRAINT [PK_TestPersons_TestPersonID] PRIMARY KEY CLUSTERED ([TestPersonID] ASC)
    	WITH
    		(
    		PAD_INDEX = OFF,
    		STATISTICS_NORECOMPUTE = OFF,
    		IGNORE_DUP_KEY = OFF,
    		ALLOW_ROW_LOCKS = ON,
    		ALLOW_PAGE_LOCKS = ON
    		)
    	ON [PRIMARY]
    	)
    ON [PRIMARY]
    GO

    Sample Data:

    INSERT INTO
    	[dbo].[TestPersons] ([FirstName], [LastName], [AreaID])
    VALUES
    	('Carlos', 'Matlock', 'A0009'),
    	('William', 'Rivas', 'A0001'),
    	('Kathryn', 'Rice', 'A0008'),
    	('John', 'Ball', 'A0009'),
    	('Robert', 'Barnhill', 'A0009'),
    	('Timothy', 'Stein', 'A0009'),
    	('Christopher', 'Smith', 'A0001'),
    	('Brian', 'Speakman', 'A0001'),
    	('Harold', 'Clark', 'A0009'),
    	('Tim', 'Henson', 'A0009'),
    	('Victor', 'Chilson', 'A0009')

    The above insert statement is a small example of the data contained in this table.  Normally the table would contain several thousand rows.  We use the following query to replace actual data with random rows from our test table:

    UPDATE
    	[P]
    SET
    	[P].[FirstName] = [T].[FirstName],
    	[P].[LastName] = [T].[LastName],
    	[P].[AreaID] = [T].[AreaID]
    FROM
    	[dbo].[Persons] [P]
    INNER LOOP JOIN
    	[dbo].[TestPersons] [T] ON ([T].[TestPersonID] = (1 + ABS(CRYPT_GEN_RANDOM(8)%5000)))
    This query works as it selects a random row from the entire set of data in the table.  However there are cases where we need to specify a restricted subset to randomize from.  For example, we may need to randomize data only for Persons with an AreaID of A0001 or A0008.  So in that case, and using the sample data above, we would want the  randomization to only select from rows in TestPersons that have an AreaID of A0001 or A0008.  How would I go about accomplishing this?  I've tried adding a WHERE clause but it seems it's ignored because of the INNER LOOP JOIN.  I've also tried including [P].[AreaID] = [T].[AreaID] in the join hint but to no avail.

    I also realize having sample data with only the set that we need would solve this problem but for our needs we need a large test set as our randomization requirements depend on the situation.

    Any assistance is greatly appreciated!

    Best Regards
    Brad
    Tuesday, November 4, 2014 5:35 PM

Answers

  • DECLARE @TestPersons TABLE (TestPersonID int NOT NULL IDENTITY(1,1), FirstName varchar(50),	LastName varchar(50), AreaID varchar(50))
    
    INSERT INTO	@TestPersons (FirstName, LastName, AreaID)
    VALUES	('Carlos', 'Matlock', 'A0009'),	('William', 'Rivas', 'A0001'),	('Kathryn', 'Rice', 'A0008'),	('John', 'Ball', 'A0009'),	('Robert', 'Barnhill', 'A0009'),	('Timothy', 'Stein', 'A0009'),	
            ('Christopher', 'Smith', 'A0001'),	('Brian', 'Speakman', 'A0001'),	('Harold', 'Clark', 'A0009'),	('Tim', 'Henson', 'A0009'),	('Victor', 'Chilson', 'A0009')
    
    ;WITH subset AS (
    SELECT ROW_NUMBER() OVER (ORDER BY TestPersonID) AS sID, *
              FROM @TestPersons
    		 WHERE FirstName LIKE '%e%'
    )
    
    SELECT *
      FROM subset
     WHERE sID = round((((SELECT COUNT(*) FROM subset) - 1 - 1)*rand())+1,0)

    This will grab a random row from a subset (defined in the CTE).

    Tuesday, November 4, 2014 6:50 PM

All replies

  • Since you didn't provide full DDL, I will only go through "verbally" how I'd do this.

    To get one random row:

    SELECT TOP(1) col1, col2, col3 FROM tbl ORDER BY NEWID()

    Since I loathe update based on join, I suggest you use MERGE instead. Something like

    MERGE Persons AS p1

    USING (SELECT TOP(1) col1, col2, col3 FROM tbl ORDER BY NEWID()) AS p2 ON ...

    WHEN MATCH

    UPDATE SET p1.FirstName = p2.FirstName

    ...

    If your TestPersons table is huge, then you want to combine NEWID() with TABLESAMPLE, so SQL Server don't have to order all rows in the table before picking a random row.

    Please post back with full DDL if you want executable code :-)


    Tibor Karaszi, SQL Server MVP | web | blog

    Tuesday, November 4, 2014 6:01 PM
  • DECLARE @TestPersons TABLE (TestPersonID int NOT NULL IDENTITY(1,1), FirstName varchar(50),	LastName varchar(50), AreaID varchar(50))
    
    INSERT INTO	@TestPersons (FirstName, LastName, AreaID)
    VALUES	('Carlos', 'Matlock', 'A0009'),	('William', 'Rivas', 'A0001'),	('Kathryn', 'Rice', 'A0008'),	('John', 'Ball', 'A0009'),	('Robert', 'Barnhill', 'A0009'),	('Timothy', 'Stein', 'A0009'),	
            ('Christopher', 'Smith', 'A0001'),	('Brian', 'Speakman', 'A0001'),	('Harold', 'Clark', 'A0009'),	('Tim', 'Henson', 'A0009'),	('Victor', 'Chilson', 'A0009')
    
    ;WITH subset AS (
    SELECT ROW_NUMBER() OVER (ORDER BY TestPersonID) AS sID, *
              FROM @TestPersons
    		 WHERE FirstName LIKE '%e%'
    )
    
    SELECT *
      FROM subset
     WHERE sID = round((((SELECT COUNT(*) FROM subset) - 1 - 1)*rand())+1,0)

    This will grab a random row from a subset (defined in the CTE).

    Tuesday, November 4, 2014 6:50 PM