In part III of this series, I’m going to cover a few more topics including Delta Indexing, Index Merging, and Distributed Indexes. If you used [part I](http://vocecommunications.com/blog/2010/07/extending-wordpress-search-with-sphinx-part-i/) to get Sphinx up and running, and [part II](http://vocecommunications.com/blog/2010/07/extending-wordpress-search-with-sphinx-part-ii/) to get it integrated with WordPress, we’ll continue from there.

## Delta Indexing

Using the `sphinx.conf` from part I, we’re going to build on our main source and index to create a second, separate index. This is called ‘delta indexing’ since we’re going to tell Sphinx to only index records that were created since the last time we ran the `indexer` command.

So why use delta indexing? Well, if your data is small you can run the full re-index as often as you’d like. Over time as your application grows with new posts, comments, users, etc. – or if your application already has a large amount of data – Sphinx will take longer and longer to index the data. Delta indexing keeps you from having to rebuild the entire index by keeping track of the last indexed ID that was added.

To start, we’ll need to create a “counter” table that will hold the last used ID that Sphinx added to the index. Here is an example create table statement:

[cc lang=”mysql”]
CREATE TABLE `wp_sph_counter` (
`counter_id` int(11) NOT NULL,
`max_doc_id` int(11) NOT NULL,
PRIMARY KEY (`counter_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
[/cc]

Replace `ENGINE=MyISAM` and `CHARSET=utf8` with your respective environment values.

[cc lang=”text”]
#
# INDEX GROUP:
# MY BLOG
#
# INDEX NAME:
# my_blog
#
# SOURCES:
# src_my_blog
# src_delta_my_blog
#
# INDEXES:
# idx_my_blog
# idx_delta_my_blog
#
#
source src_my_blog {
type = mysql
sql_host = localhost
sql_user = mysql_user
sql_pass = mysql_user_password
sql_db = mysql_table

sql_query_pre = SET NAMES utf8
sql_query_pre = REPLACE INTO wp_sph_counter SELECT 1, MAX(id) FROM wp_posts

sql_query =
SELECT
p.ID*2+1 AS ID,
p.ID as post_ID,
p.post_title as title,
p.post_content as body,
UNIX_TIMESTAMP(p.post_date) AS date_added
FROM
wp_posts as p
WHERE
p.id ( SELECT max_doc_id FROM wp_sph_counter WHERE counter_id=1 ) AND
p.post_type = ‘post’ AND
p.post_status = ‘publish’;
}

index idx_delta_my_blog : idx_my_blog {
source = src_delta_my_blog
path = /var/data/idx_delta_my_blog
}
[/cc]

**Worth noting:**

* Before the `idx_my_blog` index is built, Sphinx will get the max used ID from the `wp_posts` table and insert that into the newly created “counter” table.
* The `sql_query` here now selects from our counter table. The main index will select all the records less than or equal to the last indexed ID, and the delta index will index all the records greater than the last indexed ID.
* The `src_delta_my_blog` and `idx_delta_my_blog` both ‘inherit’ from the main `src_my_blog` and `idx_my_blog` respectively. This prevents you from having to specify the same settings on both the main index and the delta index.

The delta index will index the new rows faster than re-indexing the main index. This is because we’re only getting records that have changed. Because of this, we can run our delta index much more frequently than our main index, especially when your tables may consist of millions of rows.

Here is an example cron tab to re-index both the main index (daily) and the delta index (every 15 minutes). These values can easily be changed to suit your environment or your application needs.

[cc lang=”text”]
*/15 * * * * /usr/bin/indexer –config /etc/sphinx/sphinx.conf idx_delta_my_blog –rotate
00 00 * * * /usr/bin/indexer –config /etc/sphinx/sphinx.conf –all –rotate
[/cc]

## Distributed Indexes

Looking at the above configuration – we now have two indexes; a delta index and a main index. It doesn’t make much sense to query these separately and then somehow combine the results. This is where a distributed index comes in. Update the above configuration with the below example:

[cc lang=”text”]
index my_blog {
type = distributed
local = idx_my_blog
local = idx_delta_my_blog
}
[/cc]

This is a distributed index. We’ve taken the delta index (`idx_delta_my_blog`) and the main index (`idx_my_blog`) and have instructed sphinx to treat them as one index called `my_blog`.

Now we have to tell Sphinx to use this index to search against. The `$sphinx->query` function takes a 2nd argument that can be used to specify the name of the index to search against. Update your call to `$sphinx->query`, using the example from part II to match:

[cc lang=”php”]

$result = $sphinx->query($search, ‘my_blog’);

[/cc]

## Index Merging

Every so often, you’ll want to merge your delta index and your main index together. From the [Sphinx documentation on index merging](http://www.sphinxsearch.com/docs/manual-0.9.9.html#index-merging):

> Merging two existing indexes can be more efficient that indexing the data from scratch, and desired in some cases (such as merging ‘main’ and ‘delta’ indexes instead of simply reindexing ‘main’ in ‘main+delta’ partitioning scheme). So indexer has an option to do that. Merging the indexes is normally faster than reindexing but still not instant on huge indexes. Basically, it will need to read the contents of both indexes once and write the result once. Merging 100 GB and 1 GB index, for example, will result in 202 GB of IO (but that’s still likely less than the indexing from scratch requires).

So index merging can save you the database overhead (useful when your indexes consist of millions of rows), but will cost you on the I/O side.

In this cron tab example, we’ll merge our delta index into our main index once a week (midnight on Sunday):

[cc lang=”text”]
00 00 * * 0 /usr/bin/indexer -c /etc/sphinx/sphinx.conf –rotate –merge idx_tkblog idx_delta_tkblog
[/cc]

Hint: you can also drop these cron commands into your `/etc/cron.weekly` or `/etc/cron.daily` folders.

## Also…

* Make sure your queries return unique IDs to Sphinx. You can easily test your query against your MySQL instance before putting it into Sphinx to make sure the results are expected.
* Your query can be as complex or as simple as your application needs it to be. Sometimes you’ll need to use complex joins, the `CAST` function, the `CONCAT` and `GROUP_CONCAT` functions, and control flow functions like `IF` or `CASE`.
* In a specific application we developed, setting `morphology = none` provided more accurate results than `morphology = stem_enru`. You can find out more about morphology in the [Sphinx documentation](http://www.sphinxsearch.com/docs/manual-0.9.9.html#conf-morphology). Be sure to _test_ your searches and results to ensure the accuracy of your index.