We Are Communication Architects

Building brand awareness through content creation and community engagement.

September 22nd, 2010

Extending WordPress search with Sphinx (Part III)

In part III of this series, I’m going to cover a few more topics including Delta Indexing, Index Merging, and Distributed Indexes. If you used part I to get Sphinx up and running, and part II to get it integrated with WordPress, we’ll continue from there.

Delta Indexing

Using the sphinx.conf from part I, we’re going to build on our main source and index to create a second, separate index. This is called ‘delta indexing’ since we’re going to tell Sphinx to only index records that were created since the last time we ran the indexer command.

So why use delta indexing? Well, if your data is small you can run the full re-index as often as you’d like. Over time as your application grows with new posts, comments, users, etc. – or if your application already has a large amount of data – Sphinx will take longer and longer to index the data. Delta indexing keeps you from having to rebuild the entire index by keeping track of the last indexed ID that was added.

To start, we’ll need to create a “counter” table that will hold the last used ID that Sphinx added to the index. Here is an example create table statement:

CREATE TABLE `wp_sph_counter` (
  `counter_id` int(11) NOT NULL,
  `max_doc_id` int(11) NOT NULL,
  PRIMARY KEY  (`counter_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8

Replace ENGINE=MyISAM and CHARSET=utf8 with your respective environment values.

#
#  INDEX GROUP:
#  MY BLOG
#
#  INDEX NAME:
#  my_blog
#
#  SOURCES:
#  src_my_blog
#  src_delta_my_blog
#
#  INDEXES:
#  idx_my_blog
#  idx_delta_my_blog
#
#
source src_my_blog {
    type = mysql
    sql_host = localhost
    sql_user = mysql_user
    sql_pass = mysql_user_password
    sql_db = mysql_table

    sql_query_pre = SET NAMES utf8
    sql_query_pre = REPLACE INTO wp_sph_counter SELECT 1, MAX(id) FROM wp_posts

    sql_query = \
      SELECT \
        p.ID*2+1 AS ID, \
        p.ID as post_ID, \
        p.post_title as title, \
        p.post_content as body, \
        UNIX_TIMESTAMP(p.post_date) AS date_added \
      FROM \
        wp_posts as p \
      WHERE \
        p.id <= ( SELECT max_doc_id FROM wp_sph_counter WHERE counter_id=1 ) AND \
        p.post_type = 'post' AND \
        p.post_status = 'publish';

    sql_attr_uint = post_id
    sql_attr_uint = blog_id
    sql_attr_timestamp = date_added

    # document info query, ONLY for CLI search (ie. testing and debugging)
    # must contain $id macro and must fetch the document by that id
    sql_query_info = SELECT ID, post_title FROM wp_posts WHERE id=($id - 1)/2;
}

index idx_my_blog {
    source = src_my_blog
    path = /var/data/idx_my_blog

    docinfo = extern
    mlock = 0
    morphology = stem_enru
    min_stemming_len = 4
    min_word_len = 1
    charset_type = sbcs # or utf-8
    html_strip = 0
    html_index_attrs = img=alt,title; a=title;
    html_remove_elements = style, script, object, embed, span
}

source src_delta_my_blog : src_my_blog {
    sql_query_pre = SET NAMES utf8
   
    sql_query = \
      SELECT \
        p.ID*2+1 AS ID, \
        p.ID as post_ID, \
        p.post_title as title, \
        p.post_content as body, \
        UNIX_TIMESTAMP(p.post_date) AS date_added \
      FROM \
        wp_posts as p \
      WHERE \
        p.id > ( SELECT max_doc_id FROM wp_sph_counter WHERE counter_id=1 ) AND \
        p.post_type = 'post' AND \
        p.post_status = 'publish';
}

index idx_delta_my_blog : idx_my_blog {
    source = src_delta_my_blog
    path = /var/data/idx_delta_my_blog
}

Worth noting:

  • Before the idx_my_blog index is built, Sphinx will get the max used ID from the wp_posts table and insert that into the newly created “counter” table.
  • The sql_query here now selects from our counter table. The main index will select all the records less than or equal to the last indexed ID, and the delta index will index all the records greater than the last indexed ID.
  • The src_delta_my_blog and idx_delta_my_blog both ‘inherit’ from the main src_my_blog and idx_my_blog respectively. This prevents you from having to specify the same settings on both the main index and the delta index.

The delta index will index the new rows faster than re-indexing the main index. This is because we’re only getting records that have changed. Because of this, we can run our delta index much more frequently than our main index, especially when your tables may consist of millions of rows.

Here is an example cron tab to re-index both the main index (daily) and the delta index (every 15 minutes). These values can easily be changed to suit your environment or your application needs.

*/15 * * * * /usr/bin/indexer --config /etc/sphinx/sphinx.conf idx_delta_my_blog --rotate
00 00 * * * /usr/bin/indexer --config /etc/sphinx/sphinx.conf --all --rotate

Distributed Indexes

Looking at the above configuration – we now have two indexes; a delta index and a main index. It doesn’t make much sense to query these separately and then somehow combine the results. This is where a distributed index comes in. Update the above configuration with the below example:

index my_blog {
    type = distributed
    local = idx_my_blog
    local = idx_delta_my_blog
}

This is a distributed index. We’ve taken the delta index (idx_delta_my_blog) and the main index (idx_my_blog) and have instructed sphinx to treat them as one index called my_blog.

Now we have to tell Sphinx to use this index to search against. The $sphinx->query function takes a 2nd argument that can be used to specify the name of the index to search against. Update your call to $sphinx->query, using the example from part II to match:

  ...
  $result = $sphinx->query($search, 'my_blog');
  ...

Index Merging

Every so often, you’ll want to merge your delta index and your main index together. From the Sphinx documentation on index merging:

Merging two existing indexes can be more efficient that indexing the data from scratch, and desired in some cases (such as merging ‘main’ and ‘delta’ indexes instead of simply reindexing ‘main’ in ‘main+delta’ partitioning scheme). So indexer has an option to do that. Merging the indexes is normally faster than reindexing but still not instant on huge indexes. Basically, it will need to read the contents of both indexes once and write the result once. Merging 100 GB and 1 GB index, for example, will result in 202 GB of IO (but that’s still likely less than the indexing from scratch requires).

So index merging can save you the database overhead (useful when your indexes consist of millions of rows), but will cost you on the I/O side.

In this cron tab example, we’ll merge our delta index into our main index once a week (midnight on Sunday):

00 00 * * 0 /usr/bin/indexer -c /etc/sphinx/sphinx.conf --rotate --merge idx_tkblog idx_delta_tkblog

Hint: you can also drop these cron commands into your /etc/cron.weekly or /etc/cron.daily folders.

Also…

  • Make sure your queries return unique IDs to Sphinx. You can easily test your query against your MySQL instance before putting it into Sphinx to make sure the results are expected.
  • Your query can be as complex or as simple as your application needs it to be. Sometimes you’ll need to use complex joins, the CAST function, the CONCAT and GROUP_CONCAT functions, and control flow functions like IF or CASE.
  • In a specific application we developed, setting morphology = none provided more accurate results than morphology = stem_enru. You can find out more about morphology in the Sphinx documentation. Be sure to test your searches and results to ensure the accuracy of your index.

About the Author
Sean O'Shaughnessy is the Systems Administrator for our Voce Connect Platforms team. He ensures the platforms team has access to an efficient development environment and that all production environments for clients are running smoothly. Follow Sean on Twitter @seanosh.

Filed in Development, Programming, Search, WordPress

Add Your Comment4 Responses to “Extending WordPress search with Sphinx (Part III)”

mrlemonade on December 5th, 2010 at 5:06 pm

hello,

im currently using your wordpress sphinx plugin, but is there any way to include sphinx search based on post content too not just only searching by post title?

thx.

Sean O'Shaughnessy on December 6th, 2010 at 11:23 am

@mrlemonade,

The configuration example above will search ‘post_title’ and ‘post_content’. You can have Sphinx index anything you can write a MySQL query for (for example if you needed to pull in tags/categories or user information).

Joe on January 5th, 2011 at 1:57 am

Do you have a mysql query that can fetch a post’s title, content and tags? I’m having a bit of trouble with the last one, tags, since you need joins etc.

edtechre on January 9th, 2011 at 8:07 pm

I’m confused about something here. Why is an index merge necessary if main and delta get reindexed already? Wouldn’t the rebuilt main index already contain changes from the delta index?