Fetching multiple random rows from a database

May 5, 2009

Tutorials

As a follow up to my earlier article about fetching a single random row from MySQL I will tell you today, how you can fetch multiple random rows from a table without any hassle. Compared to the solution with fetching a single random row, fetching multiple random rows requires some tricks. Let us start with creating two new test tables:

CREATE TABLE `test` (
  `id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
  `name` VARCHAR( 32 ) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE = InnoDB;

CREATE TABLE `test_gapless` (
  `test_id` INT UNSIGNED NOT NULL,
  `random_id` INT UNSIGNED NOT NULL,
  PRIMARY KEY (`test_id`),
  UNIQUE (`random_id`)
) ENGINE = InnoDB;

Now we have two tables. Our primary table containing the usual stuff like a primary ID and a name. Additionally we have a gapless table related to it. The tricky part here is the random_id. The idea is, that independent of deleting and adding rows, the random_id column always goes from 1 to the number of existent rows in the test table, without any holes or duplicates. To approach this, we are going to write two triggers for the main table. Let's start with the insert trigger:

DELIMITER //
CREATE TRIGGER `test_insert` AFTER INSERT ON `test`
  FOR EACH ROW BEGIN
    DECLARE `next_random_id` INT;
    
    SELECT (COALESCE(MAX(`random_id`), 0) + 1)
    INTO `next_random_id`
    FROM `test_gapless`;
    
    INSERT INTO `test_gapless` (
      `test_id`,
      `random_id`
    ) VALUES (
      NEW.`id`,
      `next_random_id`
    );
END
//
DELIMITER ;

This insert trigger will always assign the next highest random_id to the newly inserted rows. Now we have to keep track of deleted rows with another trigger:

DELIMITER //
CREATE TRIGGER `test_delete` AFTER DELETE ON `test`
  FOR EACH ROW BEGIN
    DECLARE `current_random_id` INT;
    
    SELECT `random_id`
    INTO `current_random_id`
    FROM `test_gapless`
    WHERE `test_id` = OLD.`id`;
    
    DELETE FROM `test_gapless` WHERE `test_id` = OLD.`id`;
    
    UPDATE `test_gapless`
    SET `random_id` = `random_id` - 1
    WHERE `random_id` > `current_random_id`;
END
//
DELIMITER ;

Now as we have our triggers ready, we can insert some dummy data into the table:

INSERT INTO `test` (
  `name`
) VALUES
  ('Foo'),
  ('Bar'),
  ('Baz'),
  ('PHP'),
  ('MySQL'),
  ('Zend Framework'),
  ('HTML'),
  ('JavaScript'),
  ('CSS'),
  ('AJAX'
);

Now let's take a look at the generated data:

mysql> SELECT * FROM `test` AS `t`
    -> INNER JOIN `test_gapless` AS `g`
    -> ON `g`.`test_id` = `t`.`id`;
+----+----------------+---------+-----------+
| id | name           | test_id | random_id |
+----+----------------+---------+-----------+
|  1 | Foo            |       1 |         1 | 
|  2 | Bar            |       2 |         2 | 
|  3 | Baz            |       3 |         3 | 
|  4 | PHP            |       4 |         4 | 
|  5 | MySQL          |       5 |         5 | 
|  6 | Zend Framework |       6 |         6 | 
|  7 | HTML           |       7 |         7 | 
|  8 | JavaScript     |       8 |         8 | 
|  9 | CSS            |       9 |         9 | 
| 10 | AJAX           |      10 |        10 | 
+----+----------------+---------+-----------+
10 rows in set (0.00 sec)

Everything looks fine and the random_id was always assigned correctly. Now let's try to delete a some rows from the table:

mysql> DELETE FROM `test` WHERE `id` BETWEEN 3 AND 4;

Then we check the random table again to make sure that there are no holes in the random_id:

mysql> SELECT * FROM `test` AS `t`
    -> INNER JOIN `test_gapless` AS `g`
    -> ON `g`.`test_id` = `t`.`id`;
+----+----------------+---------+-----------+
| id | name           | test_id | random_id |
+----+----------------+---------+-----------+
|  1 | Foo            |       1 |         1 | 
|  2 | Bar            |       2 |         2 | 
|  5 | MySQL          |       5 |         3 | 
|  6 | Zend Framework |       6 |         4 | 
|  7 | HTML           |       7 |         5 | 
|  8 | JavaScript     |       8 |         6 | 
|  9 | CSS            |       9 |         7 | 
| 10 | AJAX           |      10 |         8 | 
+----+----------------+---------+-----------+
8 rows in set (0.01 sec)

And as you can see, everything is still fine. We don't care about updates to the id-column for now, but that can easily be accomplished by with an update trigger or a constraint. Our next part is to use the random_id to actually select some random rows from out main table. We can do this in multiple ways, like completly in a stored procedure, but for this tutorial we will do the entire thing on client side for simplicity.

First we need to get as many random IDs as we want to fetch rows from the database Let's say we want to fetch three random rows:

<?php
$highestRandomId = /* SQL: SELECT MAX(`random_id`) FROM `test_gapless` */;
$numRows         = min(3, $highestRandomId);
$randomIds       = array();

while (count($randomIds) < $numRows) {
    $randomId = rand(1, $highestRandomId);
    
    if (!in_array($randomId, $randomIds)) {
        $randomIds[] = $randomId;
    }
}

Now we have an array filled with all random IDs we need. Now we can fetch our rows:

SELECT * FROM `test` AS `t`
INNER JOIN `test_gapless` AS `g`
ON `g`.`test_id` = `t`.`id`
WHERE `g`.`random_id` IN ($randomIds);

You will now get something like that:

+----+----------------+---------+-----------+
| id | name           | test_id | random_id |
+----+----------------+---------+-----------+
|  6 | Zend Framework |       6 |         4 | 
|  8 | JavaScript     |       8 |         6 | 
|  5 | MySQL          |       5 |         3 | 
+----+----------------+---------+-----------+

Hopefully this tutorial gave you some ideas, not only how to fetch random rows from MySQL while keeping a good performance, but also how you can trick around in MySQL to do stuff which is normally not easily possible.

8 Responses to “Fetching multiple random rows from a database”

  1. terminate Says:

    I was thinking about your problem and the overhead of keeping the random_id field sequential and gap-less and then came up with the, "why bother", answer.

    If you need 4 rows select 16 (just in case) and have the selection on a try-catch so that you can recover from the event of not having 4 valid fields.

    Some ID’s may have been deleted already but the majority will not. Even if you can’t pull the 4 rows in the first SQL pull, you’ll grab them over the next pulls, ideally your number of ‘extra’ ID’s will prevent multiple connections and everything will come on the first pull.

    It’s a bit dirtier but even more efficient – or am i wrong?

  2. _____anonymous_____ Says:

    Is there an advantage to using this elaborate method over a simple "ORDER BY RAND() LIMIT 3" in the MySQL query itself?

  3. dasprid Says:

    @Franky: Actually that was the approach I suggested in my previous article, but that can lead to more reads than necessary. And usually you have more reads on a database than writes (and caching random results is not really in the mind of randomness).

  4. dasprid Says:

    @Christine: Take a look at my previous article and you will know the difference.

  5. _____anonymous_____ Says:

    Thanks a lot for refreshing my data base procedural knowledge :)

    IN clause is expensive. Why not performing instead N select statments? This way you don’t need an auxiliar table, triggers or InnoDB. By the way, as far as I know MyISAM is faster than InnoDB when fetching data.

    Let’s assume $randomIds has a list of N distinct numbers between 1 and total items (Same to yours). So now:

    foreach($randomIds as $position) {
    $SQL = ‘SELECT * FROM table ORDER BY id LIMIT ‘.$position.’, 0′;

    // Code to store each row
    }

    I have never perform a benchmark… How did you get that friendly graph? :)

  6. dasprid Says:

    @Cuso:

    I don’t know if you are refering to older versions of MySQL, but IN-clauses aren’t really expensive. Also, InnoDB is way faster for reading since some time than MyISAM, see:
    http://www.mysqlperformanceblog.com/2007/01/08/innodb-vs-myisam-vs-falcon-benchmarks-part-1/

    Doing the selection by position does sound interesting for one in the first place, but always keep in mind; when doing a LIMIT statement, MySQL actually has to fetch X number of records, just to return the last one. In a huge table, this may take some time. And since you are reepating that query multiple times, it will get even slower.

    About the graph, I simply generated an CSV with PHP when executing the queries and imported those into OpenOffice Calc, created the graph out of it and made a screenshot.

  7. _____anonymous_____ Says:

    I have just executed a select to fetch one row with "LIMIT 1000000, 1". The table has 4 millions of registers. It took 3,68 sec.

    Thanks for the link. I have always heard MyISAM was faster than InnoDB. I will take in mind for future projects :)

  8. ilanhazan Says:

    There is another way to produce random rows using only a query and without order by rand().
    It involves User Defined Variables.
    See http://www.mysqldiary.com/how-to-produce-random-rows-from-a-table/