Sphinx Search Engine – Overview

Sphinx is an open source full-text search server. Working with Sphinx is pretty much as with a database server . If you consider how to upgrade search engine in your application it can be an easiest way and number of relevance functions ensures you can tweak search quality as well.

Crew

Sphinx Technologies is US company, created in late 2007 by Andrew Aksyonoff, creator and primary developer of Sphinx, and Peter Zaitsev, former head of High-Performance group in MySQL AB, and a world-class expert in database technologies. Today, they are an office-less company with about 10 employees spanning across all the time zones, working online.

Sphinx API’s

  1. Binary API – not recommended, it can add an extra CPU load on the client.
  2. SphinxQL   – with MySQL protocol – I use it in my example.
  3. SphinxSE – plugin for MySQL – MySQL storage engine (SphinxSE)
  4. HTTP – since version 2.3.2 (October 2016) – currently in developing stage, information from official page.

SphinxQL   – example with WordPress DB

Installation on Ubuntu

sudo add-apt-repository ppa:builds/sphinxsearch-beta
sudo apt-get update
sudo apt-get install sphinxsearch

Configuration with SphinxQL

# Simple WordPress posts search
source wp
{
	type			= mysql
	sql_host		= 127.0.0.1
	sql_user		= wp-user
	sql_pass		= wp-pass
	sql_db			= ch
	sql_port		= 3306	# optional, default is 3306
	sql_query		= select ID, post_title, post_content from wp_posts where post_type = 'post'
	#sql_attr_uint		= group_id
	sql_attr_timestamp	= date_added
	sql_ranged_throttle	= 0

}

## index definition
index wp
{
	source			= wp
	path			= /var/lib/sphinxsearch/data/test1
	docinfo			= extern
	dict			= keywords
	mlock			= 0
	morphology		= none
	html_strip		= 1 # this is important for WP

}


# searchd settings
searchd
{
	listen			= 127.0.0.1:9306:mysql41 #for mysql(with SphinxQL) client on 9306
	log			= /var/log/sphinxsearch/searchd.log
	query_log		= /var/log/sphinxsearch/query.log
	read_timeout		= 5
	pid_file		= /var/run/sphinxsearch/searchd.pid
	workers			= threads # for RT to work

}

Run It.

# Run indexer
/usr/bin/indexer wp
# Start service
service sphinxsearch start
# connect to sphinx via mysql client
mysql -h 127.0.0.1 -P 9306
# Test result via SphinxQL
select id, WEIGHT() from wp where match('Search*')

Check results in PHP.

<?php
$search = 'John';

$wp = new PDO(
	'mysql:host=127.0.0.1;dbname=ch;port:3307',
	'ch-local',
	'test'
);

$sphinx = new mysqli(
	'127.0.0.1','','','','9306'
);

$sphinxQl = "select * from wp where match ('".$search."*')";
$res = $sphinx->query($sphinxQl);
$sphinxIds = [];
foreach($res as $row)$sphinxIds[] = $row['id'];

echo "Sphinx results:";
print_r($sphinxIds);
$wpSql = 'select ID, post_title from wp_posts where ID in ('.implode(',',$sphinxIds).') order by FIELD(ID,'.implode(',',$sphinxIds).')';

echo "WP results:";
foreach ($wp->query($wpSql) as $row) {
	echo $row['ID'] . ":".$row['post_title'] .PHP_EOL;        
}

To see results we must run 2 queries, on 2 databases to have one result. In my example I use MySQLi object to connect with Sphinx because I can’t connect with Mysql PDO without user and password. First query ($sphinxQl) is specific for sphinx but it’s based on MySQL and it’s very intuitive to use. Second query is lunched on WP MySQL. Last problem to solve is how to get our posts in correct order? MySQL answer – simply by order by FIELD operator. That’s it.

Rules

  • First thing with Sphinx is that we must use integer numbers as indexes  – there is no other way.
  • Second – we have only index in our results (maybe with extra calculations) without text values from our data.

Moreover, it’s impossible to reconstruct original text, because it’s not stored that way.  Sphinx is doing his job and it stores data in optimized way ( specific method for whitespace, capitalization, punctuation, etc.). But the sentence of sphinx is in it, we have a lot of operators to use with match() function.

Match operators

To tweak results we have huge amount of operators: OR, MAYBE, NOT, NEAR, SENTENCE, field search, field position limit modifier, multiple-field search, ignore field search, ignore multiple-field search, all-field search, phrase search, proximity search, quorum matching, strict order, exact form modifier, field-start and field-end modifier, and more…

select * from wp where 
match( "John Doe" @post_title "results" @post_content "List" -(Johnatan Johnnie) )

The full meaning of this search is:

  • Find the words ‘John’ and ‘Doe’ adjacently in any field in a post;
  • Additionally, the same post must also contain the words ‘results’ in post_title field
  • Additionally, the same post must contain the word ‘List’ in the post_content field, but not contain exact phrase  ‘Johnatan’ or ‘Johnnie’;

I hope it helps to consider Sphinx as your search engine.

Have nice results

Leave a Reply

Your email address will not be published. Required fields are marked *