Sphinx is a powerful open source SQL full-text search engine. It runs as a single process in the background, and your application connects to it on a specified IP address and port.

It supports weighted ranking of search results, different search matching modes (all words in the query, any words in the query, exact phrase), and filtering on specific attributes. Sphinx also supports distributed searching and phrase proximity ranking for better relevance.

Keep an eye out tomorrow – Mike Pretty will be providing some code examples in Part II on how to tie this setup and configuration into WordPress…

So why Sphinx? What’s wrong with the default WordPress search?

Nothing is wrong with the default WordPress search; it suits plenty of use cases out of the box. Sometimes, though, you need more features than the standard search can provide with post_content LIKE '%phrase%'.

What if you want to search comments? Or custom post types? Sphinx can extend the default search to include posts, pages, comments, custom post types and any other data that can be retrieved with a MySQL query.

There are plenty of resources out there that can help you install Sphinx. RPMforge carries a Sphinx package, and you can also download the source, then compile and install it yourself.

Sources and Indexes

The two fundamental components of Sphinx are a source and an index. The source tells Sphinx where to get the data, and the index uses that source and defines how the data is stored. There are also various settings you can apply to the indexer and searchd processes themselves (such as the memory limit, the port to listen on, and where to store logs).

For now, we’re going to start with a simple configuration file that will allow us to search through posts in our blog.

[cc lang="text"]
#
# GENERAL SETTINGS
#
indexer {
mem_limit = 32M
}

searchd {
listen = 9312

log = /var/log/searchd.log
query_log = /var/log/query.log

read_timeout = 5
client_timeout = 300

max_children = 30

pid_file = /var/run/searchd.pid

max_matches = 1000
seamless_rotate = 1
preopen_indexes = 0
unlink_old = 1

mva_updates_pool = 1M
max_packet_size = 8M

max_filters = 256
max_filter_values = 4096
}

#
# INDEX GROUP:
# MY BLOG
#
# SOURCES:
# src_my_blog
#
# INDEXES:
# idx_my_blog
#
#
source src_my_blog {
type = mysql
sql_host = localhost
sql_user = mysql_user
sql_pass = mysql_user_password
sql_db = mysql_database

sql_query_pre = SET NAMES utf8

# A multi-line value needs a trailing backslash on every line but the last
sql_query = \
SELECT \
p.ID*2+1 AS ID, \
p.ID AS post_id, \
p.post_title AS title, \
p.post_content AS body, \
UNIX_TIMESTAMP(p.post_date) AS date_added \
FROM \
wp_posts AS p \
WHERE \
p.post_type = 'post' AND \
p.post_status = 'publish'

sql_attr_uint = post_id
sql_attr_str2ordinal = title
sql_attr_timestamp = date_added

sql_query_info = SELECT ID, post_title FROM wp_posts WHERE ID=($id - 1)/2
}

index idx_my_blog {
source = src_my_blog
path = /var/data/idx_my_blog

docinfo = extern
mlock = 0
morphology = stem_enru
min_stemming_len = 4
min_word_len = 1
charset_type = sbcs # or utf-8
html_strip = 1 # must be on for the two html_* settings below to apply
html_index_attrs = img=alt,title; a=title;
html_remove_elements = style, script, object, embed, span
}
[/cc]

For the sake of example, we’re going to save this file in /etc/sphinx/sphinx.conf.

You’ll need to change your sql_host, sql_user, sql_pass, sql_db values accordingly to match your environment.

A few configuration settings worth noting:

  • searchd { listen = 9312 } – this tells the Sphinx daemon what port to listen on.
  • source src_my_blog { sql_query } – this tells Sphinx, using the SQL connection info above, what data to index and from which table(s). The important thing to note here is that your query must return a unique ID for every row, as the first column. You can test your query first using the MySQL command line, or something like phpMyAdmin.
  • source src_my_blog { sql_query_info } – the sql_query_info directive lets the command line tool search look up and display the matching rows when you test the index(es).
  • index idx_my_blog { html_index_attrs = img=alt,title; a=title; } and index idx_my_blog { html_remove_elements = style, script, object, embed, span } – The first setting tells Sphinx that we want to index the alt and/or title attributes of images, and the title attribute of links. The second tells Sphinx that we don’t want to index those HTML tags or anything in between them (JavaScript, embed tags, etc.). Both settings only take effect when html_strip is enabled.

You can read up on all the available configuration settings and recommended values in the Sphinx API reference.
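
To make the earlier point about searching comments more concrete, here is a rough sketch of a second source that feeds approved comments into the same index. Treat it as an illustration under a few assumptions rather than a tested setup: the source name src_my_blog_comments is made up for this example, using comment_author as the title is just a stand-in so the columns line up with the post source, and giving comments even IDs (comment_ID*2) so they cannot collide with the odd post IDs produced by p.ID*2+1 is our reading of why the sample query maps IDs that way.

[cc lang="text"]
# Hypothetical second source; inherits the connection settings,
# sql_query_pre and attribute declarations from src_my_blog
source src_my_blog_comments : src_my_blog {
# Same column list as the post source so both share one schema
sql_query = \
SELECT \
c.comment_ID*2 AS ID, \
c.comment_post_ID AS post_id, \
c.comment_author AS title, \
c.comment_content AS body, \
UNIX_TIMESTAMP(c.comment_date) AS date_added \
FROM \
wp_comments AS c \
WHERE \
c.comment_approved = '1'

# sql_query_info is inherited and only knows about posts;
# it is used solely by the search command line tool
}

index idx_my_blog {
source = src_my_blog
source = src_my_blog_comments
# ... remaining index settings unchanged ...
}
[/cc]

With this layout, an odd document ID in the results points at a post and an even one at a comment, and post_id gives you the parent post in both cases.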

Building the Index

Now that we have defined a basic source and a basic index, we’ll want to get the data indexed. We’re going to use the indexer command, which gathers the data from the source and stores the index at the path you defined in your index (in the above example: /var/data/idx_my_blog):

[cc lang="text"]
$ indexer --config /etc/sphinx/sphinx.conf --all
[/cc]

Start searchd

So we have a working configuration, and have built our first index. Now, we’ll want to start the searchd process, which our application will query to search the indexes we just built.

[cc lang="text"]
$ searchd --config /etc/sphinx/sphinx.conf
[/cc]

Re-indexing

At this point, any new posts you publish won’t be added to your Sphinx index. How come? Well, you have to tell Sphinx to rebuild the index. Since we don’t want to rebuild the index by hand, we’ll add a cron job that rebuilds it every 5 minutes:

[cc lang="text"]
*/5 * * * * /usr/bin/indexer --config /etc/sphinx/sphinx.conf --all --rotate
[/cc]
* note: the path to your ‘indexer’ command may vary…

This time, we passed in the --rotate option. Since it’s not practical to take your index offline to rebuild it, the --rotate option builds the new index alongside the live one and then sends a SIGHUP to your searchd process so that it switches over to the new files.

For larger indexes, you can also use ‘delta’ indexing. With ‘delta’ indexing you create a ‘master’ index, which will take some time to build initially (depending on your query and the size of your data), and maintain a second, much smaller ‘delta’ index that only contains records whose value in a chosen field is greater than the value recorded when the ‘master’ index was built.

For example, you can build your ‘master’ index, and upon completion, have Sphinx save the highest auto-incrementing ID of the table. You then instruct your ‘delta’ index to only index records greater than the previously stored ID. See “Live index updates” in the Sphinx docs for more info.
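
As a rough sketch of how that could look for the configuration above, modelled on the pattern in the Sphinx docs: the helper table sph_counter and the names src_my_blog_delta and idx_my_blog_delta are assumptions for this example, and you would need to create the table yourself before indexing.

[cc lang="text"]
# Assumed helper table, created once in MySQL (the name is illustrative):
# CREATE TABLE sph_counter (counter_id INT PRIMARY KEY, max_doc_id INT NOT NULL);

source src_my_blog {
# ... connection settings and attributes as before ...

sql_query_pre = SET NAMES utf8
# Remember the highest post ID at the moment the master index is built
sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(ID) FROM wp_posts

sql_query = \
SELECT p.ID*2+1 AS ID, p.ID AS post_id, p.post_title AS title, \
p.post_content AS body, UNIX_TIMESTAMP(p.post_date) AS date_added \
FROM wp_posts AS p \
WHERE p.post_type = 'post' AND p.post_status = 'publish' \
AND p.ID <= (SELECT max_doc_id FROM sph_counter WHERE counter_id = 1)
}

source src_my_blog_delta : src_my_blog {
# Override the inherited pre-queries so the counter is not reset here
sql_query_pre = SET NAMES utf8

sql_query = \
SELECT p.ID*2+1 AS ID, p.ID AS post_id, p.post_title AS title, \
p.post_content AS body, UNIX_TIMESTAMP(p.post_date) AS date_added \
FROM wp_posts AS p \
WHERE p.post_type = 'post' AND p.post_status = 'publish' \
AND p.ID > (SELECT max_doc_id FROM sph_counter WHERE counter_id = 1)
}

index idx_my_blog_delta : idx_my_blog {
source = src_my_blog_delta
path = /var/data/idx_my_blog_delta
}
[/cc]

With something like this in place, the cron job would rebuild only idx_my_blog_delta every few minutes (by naming that index instead of passing --all) and rebuild idx_my_blog far less often, and your application would search both indexes. One caveat of keying on the ID: a post that is edited, or published from an older draft, after the master build is only picked up at the next master rebuild.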

Read Extending WordPress search with Sphinx Part II for some code examples on how to tie your above setup and configuration into WordPress…