We Are Communication Architects

Building brand awareness through content creation and community engagement.

July 12th, 2010

Extending WordPress search with Sphinx (Part I)

Sphinx is a powerful open source SQL full-text search engine. It runs as a single process in the background, and can be connected to over a specified IP and port.

It supports weighted ranking of search results, different search matching modes (all words in the query, any words in the query, exact phrase), and filtering on specific attributes. Sphinx also supports distributed searching, and phrase proximity ranking for better relevance.

Keep an eye out tomorrow – Mike Pretty will be providing some code examples in Part II on how to tie this setup and configuration into WordPress…

So why Sphinx? What’s wrong with the default WordPress search?

Nothing is wrong with the default WordPress search, it suits plenty of use-cases out of the box. Sometimes, you just need more features than the standard search can provide with post_content LIKE ‘%phrase%’.

What if searching comments was desired? Or custom post types? Using Sphinx can help extend the default search to include posts, pages, comments, custom post types and any other data that can be retrieved with a MySQL query.

There are plenty of resources out there that can help you install Sphinx. RPMforge carries a Sphinx package, and you can also download, compile, and install yourself.

Sources and Indexes

The two fundamental components to Sphinx are a source and an index. The source tells Sphinx where to get the data from and an index uses the source to define how to store the data. There are also various settings you can apply to the indexer process itself (such as port to listen on, memory limit, and where to store logs).

For now, we’re going to start with a simple configuration file that will allow us to search through posts in our blog.

#
#  GENERAL SETTINGS
#
indexer {
mem_limit = 32M
}

searchd {
listen = 9312

log = /var/log/searchd.log
query_log = /var/log/query.log

read_timeout = 5
client_timeout = 300

max_children = 30

pid_file = /var/run/searchd.pid

max_matches = 1000
seamless_rotate = 1
preopen_indexes = 0
unlink_old = 1

mva_updates_pool = 1M
max_packet_size = 8M

max_filters = 256
max_filter_values = 4096
}

#
#  INDEX GROUP:
#  MY BLOG
#
#  SOURCES:
#  src_my_blog
#
#  INDEXES:
#  idx_my_blog
#
#
source src_my_blog {
type = mysql
sql_host = localhost
sql_user = mysql_user
sql_pass = mysql_user_password
sql_db = mysql_table

sql_query_pre = SET NAMES utf8

sql_query = \
SELECT \
p.ID*2+1 AS ID, \
p.ID as post_ID, \
p.post_title as title, \
p.post_content as body, \
UNIX_TIMESTAMP(p.post_date) AS date_added \
FROM \
wp_posts as p \
WHERE \
p.post_type = 'post' AND \
p.post_status = 'publish';

sql_attr_uint = post_id
sql_attr_str2ordinal = title
sql_attr_timestamp = date_added

sql_query_info = SELECT ID, post_title FROM wp_posts WHERE id=($id - 1)/2
}

index idx_my_blog {
source = src_my_blog
path = /var/data/idx_my_blog

docinfo = extern
mlock = 0
morphology = stem_enru
min_stemming_len = 4
min_word_len = 1
charset_type = sbcs # or utf-8
html_strip = 0
html_index_attrs = img=alt,title; a=title;
html_remove_elements = style, script, object, embed, span
}

For the sake of example, we’re going to save this file in /etc/sphinx/sphinx.conf.

You’ll need to change your sql_host, sql_user, sql_pass, sql_db values accordingly to match your environment.

A couple configuration settings worth noting:

  • searchd { listen = 9312 } – this tells the Sphinx daemon what port to listen on.
  • source src_my_blog { sql_query } – this tells Sphinx using the SQL connection info above, what data from what table(s) to index. The important thing to note here is that your query must return a unique ID for every row. You can test your query first using the MySQL command line, or something like phpMyAdmin
  • source src_my_blog { sql_query_info } – the sql_query_info directive allows you to use the command line tool search to test the index(es).
  • index idx_my_blog { html_index_attrs = img=alt,title; a=title; } and index idx_my_blog { html_remove_elements = style, script, object, embed, span } – These two settings tell Sphinx that we want to index the alt and/or title attributes of an image, and the title attribute of links. The html_remove_elements tells Sphinx that we don’t want to index those HTML tags or anything in between them (JavaScripts, embed tags, etc.)

You can read up on all the available configuration settings and recommended values in the Sphinx API reference.

Building the Index

Now that we have defined a basic source and a basic index, we’ll want to get the data indexed. We’re going to use the indexer command which is responsible for gathering the index and storing it where you defined path in your index (in the above example: /var/data/idx_my_blog):

$ indexer --config /etc/sphinx/sphinx.conf --all

Start searchd

So we have a working configuration, and have built our first index. Now, we’ll want to start the searchd process that will interface our application to the previously built indexes.

$ searchd --config /etc/sphinx/sphinx.conf

Re-indexing

At this point, any new posts you publish won’t be added to your Sphinx index. How come? Well, you have to tell Sphinx to rebuild the index. Since we don’t want to manually have to build our index, we’ll add it to cron to rebuild every 5 minutes:

*/5 * * * * /usr/bin/indexer --config /etc/sphinx/sphinx.conf --all --rotate
  • note: the path to your ‘indexer’ command may vary…

This time, we passed in the –rotate option. Since it’s not practical to take your index offline to rebuild it, the –rotate option will build your index in parallel and send a SIGHUP to your searchd process.

For larger indexes, you can also use ‘delta’ indexing. ‘Delta’ indexing will create a ‘master’ index, that will take some time to build initially (depending on your query/size of your data). A second ‘delta’ index is maintained that only adds records to it that are greater than the defined field.

For example, you can build your ‘master’ index, and upon completion, have Sphinx save the highest auto-incrementing ID of the table. You then instruct your ‘delta’ index to only index records greater than the previously stored ID. See “Live index updates” in the Sphinx docs for more info.

Read Extending WordPress search with Sphinx Part II for some code examples on how to tie your above setup and configuration into WordPress…

Filed in Development, Programming, Search, WordPress

Add Your Comment14 Responses to “Extending WordPress search with Sphinx (Part I)”

anton on July 20th, 2010 at 8:16 pm

I got an error, help please, thanks

indexer –config sphinx.conf –all Sphinx 1.10-beta (r2420) Copyright (c) 2001-2010, Andrew Aksyonoff Copyright (c) 2008-2010, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file ‘sphinx.conf’… indexing index ‘idx_my_blog’… ERROR: index ‘idx_my_blog’: sql_query: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ‘FROM wp_posts as p WHERE p.post_type = ‘post’ AND p.post_status = ‘publish” at line 1 (DSN=mysql://user_x:***@localhost:3306/db_x). total 0 docs, 0 bytes total 0.003 sec, 0 bytes/sec, 0.00 docs/sec total 0 reads, 0.000 sec, 0.0 kb/call avg, 0.0 msec/call avg total 0 writes, 0.000 sec, 0.0 kb/call avg, 0.0 msec/call avg

Sean O'Shaughnessy on July 21st, 2010 at 12:33 pm

Hey Anton,

There was a mistake in my above configuration. The line that reads “UNIX_TIMESTAMP(p.post_date) AS date_added \” should not have a comma after “date_added”. This has been corrected above, but remove that comma from your index and try again.

anton on July 25th, 2010 at 2:42 am

Yaw… now work..

using config file ‘sphinx.conf’… indexing index ‘idx_my_blog’… collected 18 docs, 0.3 MB sorted 0.0 Mhits, 100.0% done total 18 docs, 260195 bytes total 0.080 sec, 3215181 bytes/sec, 222.42 docs/sec total 6 reads, 0.000 sec, 19.7 kb/call avg, 0.0 msec/call avg total 12 writes, 0.000 sec, 23.9 kb/call avg, 0.0 msec/call avg

Thanks very very much….

Dave Weiss on August 8th, 2010 at 3:36 am

Very nice!

The Sphinx install was pretty straight forward. One small issue with your plugin – there’s no reference to sphinxapi.php anywhere, so the call to instantiate a new SphinxClient fails, and the search function falls back to the standard WordPress search.

No problems, though – I grabbed a copy of the sphinxapi.php from the api directory and stuck it in the plugin’s directory, then added a ‘require_once’ for the file.

All works splendidly!

Sean O'Shaughnessy on August 9th, 2010 at 10:55 am

@Dave Weiss,

Thanks for pointing that out – the sphinxapi.php works great, or you could also install the Sphinx library from PECL (http://pecl.php.net/package/sphinx) – assuming you/your setup have access/capabilities to do so.

Michael Pretty (prettyboymp) on August 9th, 2010 at 11:19 am

@Dave Weiss, I’ll look into adding the sphinxapi.php as a fall back for when the library isn’t installed. Thanks for the feedback.

Dave Weiss on August 10th, 2010 at 10:04 pm

Didn’t know about the PECL library. I’ll check that out.

One additional thing I found – the 1000 search results in the conf file is a bit high for me, so I went into the sphinx.conf file and cranked it down to 100. I don’t really need more than 100 or maybe 200 results.

But when I did that, something went a little haywire, with the results coming from the search exceeding the number of results allowed via the conf file, which caused an error. The search would fall back to the standard WP search nicely, but the WP search is not configurable to handle custom post types, which is a feature I need.

I would not have figured out the number of results issue without fiddling with deactivating and reactivating, removing and reinstalling the plugin. A little error message popped up at that point.

But all-in-all, this works very well, and I like that the search index is automatically re-indexed through a cron job. There are other WP search plugins where this is not the case, or the plugins are not compatible with the new post type feature in 3.0 – a feature I use heavily.

I’m going to dig into the search parameters a bit and see what I come up with. Some of my search results are not as narrowly tied to the search terms as I would like them to be, although I still give Sphinx high marks for search relevancy.

I shall sing the praises of Sphinx and your plugin as much as possible. Thanks for the good work.

Sarah Johnson on August 31st, 2010 at 1:53 pm

Hello! Sean,

Since I’m new to the whole programming thing can you please provide the code for “Delta indexing” or an instruction? I would really appreciate your kind help.

Regards, Sarah

ovidiu on September 3rd, 2010 at 9:34 pm

I am interested in using this on a wp 3.0.1 installation with multisite enabled so I need the search to search all blogs not just the main blog. Could you maybe expand your tutorial a bit to cover this?

basically it means not only searching wp_posts but rather %prefix%_5blog_id%_posts but I don’t know what those variables would be :-) I just made these up to explain my case

Sean O'Shaughnessy on September 8th, 2010 at 8:31 am

@Sarah -

I’m in the process of putting together a post that covers delta indexing, distributed indexing, index merging and some other tips and hints.

@ovidiu -

You’ll need to manually add a new source and index for each blog since there is seemingly no way at this point to wildcard the table names in the sql_query config option.

poer on December 5th, 2010 at 3:23 am

i would like to use sphinx in several wp-blog in my vps.

do i need to install sphinx for each user/blog or can i install one sphinx as root, then point it to different conf file for each website?

thanks.

Sean O'Shaughnessy on December 6th, 2010 at 11:20 am

@poer

You only need to install sphinx on the system once, but have each configuration file listen on a different port. Then, have the individual applications connect to their designated port.

Jerry Craig on May 6th, 2011 at 10:09 pm

Have you figured out how to wildcard the table names yet like ovidiu was looking for. I’m in the same boat. I’m basically going to be using multisite for the directory structure and to allow some customization in each companies info. But, it’s really just 1 site and I need to have 1 main search for all of the listings (which are custom post_type thanks to CMS Press :) ) Thanks,

Cupbearer

Pvicky on January 15th, 2012 at 9:40 am

Is there a way to get this Google suggest and did you mean… functions in WP?