Samuel Sjöberg's weblog

Skip to navigation

Searching multiple tables using fulltext

Ever needed to search multiple tables and rank the results independent of origin? Multi-table search is an important feature in a web project I am working on and the solution I chose is using a temporary table, as proposed in a comment in the MySQL documentation, to collect the multiple search results.

A temporary table is a table that only exists during a session. This means that as soon as the session is over (i.e., the connection is closed), the table is dropped. However, the greatest thing with temporary tables is that they are locked to the current session. This means that each connection can use the temporary table without interfering with others (i.e., the table is private).

The basic idea is to create a temporary table when a search is to be performed. Then, we populate the temporary table with the matching rows from each searched table. This means that one SQL query is needed for each table that is searched.

Now, because the results are stored in one single table, we can fetch the results ordered by relevance, independent of the origin table. It is also easy to implement, for example paging by using LIMIT when fetching rows from the temporary table.

Creating the temporary table

A temporary table is created in basically the same way as a regular table with the addition of the TEMPORARY keyword. The table that I used looked like this:

CREATE TEMPORARY TABLE search (
    type int NOT NULL,
    url varchar(255) NOT NULL,			
    title varchar(255) NOT NULL,
    excerpt varchar(255) NULL,
    relevance float NOT NULL,
    KEY (type),
    KEY (relevance)
)

Above, the temporary table search is created. It has a type that indicates from which table the row originates. This is used to show the users what kind of information that is found. The url is simply the absolute path to the hit (e.g. /archive/2005/01/my-entry). The title of the hit must be provided and an excerpt should also be added if possible. Finally, the relevance is kept to make it possible to sort the temporary table.

The use of indexes is a payoff between insert and retrieval time. I indexed every column that is used in ORDER BY clauses to speed things up, but I have not tested what takes up most time, to insert with indexes, or retrieve without indexes.

Performing the searches

The columns in the temporary table defines how the output of the search queries must be structured. For each table that is searched, a type, url, title, excerpt and relevance should be stored. No more and no less.

A table might not necessary have a title, url or excerpt field. It will definitely not have a type field and the url might need to be a string concatenated by many other fields or constants. This means that extensive use of conditions and concatenation rules are necessary. It is also important to return the values in the correct order (however, this depends on the insert syntax being used).

A search statement constructed with PHP looks something like this.

$sql = "INSERT INTO search 
SELECT ". TABLE_PAGES ." as type, 
    CONCAT('/', p.url) as url, 
    p.title, 
    LEFT(p.search, ". EXCERPT_LENGTH .") as excerpt,
    MATCH(p.title, p.search) 
        AGAINST('". $this->query ."') as relevance
FROM published p
WHERE MATCH(p.title, p.search) 
    AGAINST('". $this->query ."')
HAVING relevance > ". RELEVANCE_THRESHOLD ."
LIMIT ". $this->limit;

This is an example of a table called pages. It actually contains fields named url and title already, but the url must still be constructed with concatenation. In this case, the excerpt is also straightforward to produce. It is simply the first 250 (the value of EXCERPT_LENGTH) characters of the searchable content. TABLE_PAGES is a PHP constant that is used to store the table type in the type field.

A side-note worth addressing is that I chose to use separate columns for the fulltext indexing. The search column in the example above is a replicate of a column with HTML formatted text called content. I took this approach to searching to avoid risking that tags mess up the index. I am not sure how MySQL handles XML in the indexes.

The LIMIT clause is used for performance tuning in two ways. The first case is if only a few hits is necessary to display. A key feature in the project is a AJAX-based search function that displays the top five hits in all categories together with a best hit, much like Apple's spotlight. The other reason for using LIMIT is to avoid fetching too many rows. A user will never find any useful results beyond the, let's say, 100 first hits. This becomes true for sure when multiple tables are added together, each contributing with maximum 100 rows.

HAVING is used as a way to reduce the result set. I use a threshold value of 0.2.

When fetching for example persons, the url and title is not present and must be constructed from other fields. This procedure is illustrated below and also shows how a query looks once passed to MySQL.

INSERT INTO search
SELECT 3 as type,
    CONCAT('/organisation/', p.firstname, 
        ' ', p.lastname) as url,
    CONCAT(p.firstname, ' ', p.lastname) as title,
    LEFT(p.description, 250) as excerpt,
    MATCH (p.firstname, p.lastname, 
            p.email, p.description) 
        AGAINST('Samuel Sjöberg') as relevance
FROM persons p
WHERE MATCH (p.firstname, p.lastname, 
        p.email, p.description) 
    AGAINST('Samuel Sjöberg')
HAVING relevance > 0.2
LIMIT 100

Fetching the results

Once the results are in the temporary table they can be fetched, deleted or updated just like in any ordinary table. The only difference is that the table will be complete gone once the session ends. In PHP, this means that the table will be gone once the processing is done and the results are returned to the browser.

To get the results ordered by the type field, this query is used:

SELECT * FROM search ORDER BY type ASC, relevance DESC

The result of this is the same as if a separate query had been made for each table. The real useful application of this approach is when you need to do the following:

SELECT * FROM search ORDER BY relevance DESC

This query sorts the rows by relevance, with the most relevant hit first. This means that the table types are mixed. When data is fetched in this way, the type can be used to notify the user of which type each search hit is. The query above can easily be extended to support paging.

SELECT * FROM search ORDER BY relevance DESC 
LIMIT ". PAGE_LIMIT . " OFFSET ". $offset

The query above limits the result set to PAGE_LIMIT number of rows, starting the count at $offset (which often is fetched via GET).

Pages linking to this entry

Pingback is enabled on all archived entries. Read more about pingback in the Pingback 1.0 Specification.

About this post

Created 8th February 2006 22:49 CET. Filed under PHP.

0 Comments
0 Pingbacks