Menu Close

How to: MySQL InnoDB search engine

It seems like all world besides me knew that InnoDB storage system (of MySQL) does not support full text indexes, but MyIsam does. Since MyIsam does not support transactions it was not an acceptable solution in my case. While searching through the web for solution I’ve found that quite common one is to use Sphinx. I’ve never used this software, and since target hosting environment was external with no guarantee to adapt to my requests I couldn’t rely on Sphinx being installed. Other commonly applied solution is to build twin table based on MyIsam for every table that you wont to search in. That approach is not what I particularly like, so I’ve decided to build my search engine without full text indexes.

The goal

I my particular case demands to the product were as follows:

  • Engine allows to search text columns for occurrences of given keywords.
  • Engine allows to estimate accuracy of results. Estimation is based on number of occurrences of keywords (details follows in implementation section, some additional factors are also taken under consideration).
  • Engine allows highlighting parts of result text, that matched keywords.
  • Assumption was made, that search phrase is structured as follows: keyword1 keyword2…keywordN[, Location] .Location in my case was city name, but it can be any keyword that is more important than others. Location part is optional. If phrase contained more than one or none “,” all keywords are treated equally.

Implementation

Search phrase preparation

First thing is to find “Location”(designated keyword). Then I prepared keywords as array. I assumed that keywords must be longer than 2 characters, otherwise keyword is considered not important and aliviated.

$rawKeywords = array();
$keywords = array();
$cityName = null;
$parts = explode(",", $phrase);
if(count($parts)==2)
{
	$rawKeywords = explode(" ", $parts[0]);
	//there is Location given
	$cityName = trim($parts[1]);
}
else
{
	//more than one or none "," - no Location, all keywords treated equaly
	foreach($parts as $part)
		$rawKeywords = array_merge($rawKeywords, explode(" ", $part));
}
//limiting keywords number to 10(may need adjusting), this limits maximum complexity of final query
$keywordsLimit =10;
foreach($rawKeywords as $rawKeyword)
{
	if(strlen($rawKeyword)>2)
		$keywords[] = trim($rawKeyword);

	if(count($keywords)==$keywordsLimit) break;
}

Building query

Where

This part is quite simple, just iterate over keywords table and glue some SQL like this (UPPER means that search is case-invariant):

WHERE UPPER(Name) LIKE UPPER('%keyword1%') OR Name LIKE UPPER('%keyword2%') OR ...

and consider Location:

if($cityName)
	$where.=" AND UPPER(CityName) LIKE UPPER('%$cityName%')";

Mind that Location is combined into query using AND, not OR like in others keywords case. This means that if Location keyword is specified all results that have other location than given will not be included in results, event if they match all other keywords. This is what makes Location more important.

Also mind that I’ve used ‘%word%’ construction. This allows user to “forget” few first or last characters in given location (nonetheless in this case accuracy of result will be affected, see Order By section)

Order By

Off course you can sort results any way you want (for example by searched item’s name), but most common is to sort them descending by accuracy. So lets estimate accuracy. Rules are:

  • + one point for each match of any keyword
  • + one point if Location exactly matches location of item found
  • – difference in length of searched phrase and searched field divided by length of longer one of these two. What this factor does is making sure that item with name “some item” matches phrase “some item” better than one with name “some other item” – keywords match count for this two is the same, but first one’s name is better covered by search phrase. To count this factor I use $rawKeywords (containing all keywords entered even if they are too short to consider them as valid keywords)

So the code is:

$orderByExpression = '';
$orderBySubExpressions = array();
//+one point for each match of any keyword
foreach($keywords as $keyword)
$orderBySubExpressions[] = '((LENGTH('.$searchedFieldMarker.') - LENGTH(REPLACE(UPPER('.$searchedFieldMarker.'), UPPER('.$adapter->quote($keyword).'), ""))) / '.strlen($keyword).")";

$orderByExpression = "( ".implode("+", $orderBySubExpressions);

//matching length
$rawKeywordsPhraseLength = strlen(implode(" ",$rawKeywords));
$orderByExpression.='-(ABS(LENGTH('.$searchedFieldMarker.')-'.$rawKeywordsPhraseLength.')/IF(LENGTH('.$searchedFieldMarker.')>'.$rawKeywordsPhraseLength.', LENGTH('.$searchedFieldMarker.'), '.$rawKeywordsPhraseLength.'))';

//+one point if Location exacly matches location of item found
if($cityName)
	$orderByExpression.='+ IF(UPPER('.$adapter->quote($cityName).')=UPPER('.$cityNameMarker.'), 1, 0)';

$orderByExpression.=") DESC";

 

Processing results

Estimating accuracy to display

Now we have our result back from database. Lets count maximum possible accuracy. 15 characters long string can contain 3 characters long keyword maximum 5 times (five accuracy points), no keyword overlapping is taken into account. If it contains this keyword only once it means that it reached just 1 accuracy point for this keyword. Assuming that single keyword can cover given string as many times as it can fit in string’s length is mathematically correct, but does not reflect human language. I assumed that single match is as good as multiple matches(there is room for improvement here):

$maximumScore = count($keywords)+($cityName?1:0);

This assumption is made only when estimating results accuracy in code, refer to Order By section to find that item with multiple matches will be in higher positions in results set fetched from database.

Now this method below calculates accuracy as a number from 0 to 1 (results with 0 are impossible in results set, because of WHERE part in query)

public function GetSearchEngineScore($maxScore, array $keywords, $phraseLength)
{
	$name = $this->Name;
	$score = 0;
	foreach($keywords as $word)
	{
		$numOfOccurences = 0;
		str_replace($word, '', $name, $numOfOccurences);
		if($numOfOccurences>0) $score+=1;
	}
	$score -= abs($phraseLength-strlen($name))/max(array(strlen($name), $phraseLength));
	return $score/$maxScore;
}

 

Highlighting matched parts of result

This one is really simple – just use str_replace to do like so:

$highlightedName = str_replace(array(keyword1, keyword2),
array('<span class="match">keyword1</span>', '<span>keyword1</span>')
$this->Name);

 

Possible improvements

What I’ve done so far is base for further development. Most obvious todos for the future are:

– Current solution to ignore keywords shorter than 3 characters is simplification of finding not important keyword. Such keywords can be held in set and used to filter search phrase, or at least get smaller weight, witch lead me to next point.

– Keywords weighting – match of keyword with more weight gives more accuracy points. Weight can be based on few factors, for example position in phrase.

– If keyword matches all word in searched string it should be considered more accurate match than when it matches part of word.

– When calculating accuracy to display multiple keywords matches should result in higher accuracy.

– Finding user typos like Google does, and suggesting right searching phrase. First thought is to make it work by splitting keywords in a half, matching these halves instead of all keyword. Then finding half, witch was matched significantly less times than others and suggest to the user that he made a mistake.

– Autocompletter should be added.

 

Disclaimer

Due to lack of time(last post almost 3 months ago!) I did not measure efficiency of this solution, so I can not make any claim in this subject. If you will, please let me know.
Case in this tutorial is slightly simplified to show main concepts.

Leave a Reply

Your email address will not be published. Required fields are marked *

*