sphinx 全文搜索应用(二)

默北 Sphinxsphinx 全文搜索应用(二)已关闭评论9,469字数 10008阅读33分21秒阅读模式

一.安装配置

安装参见: https://www.ttlsa.com/html/1236.html文章源自运维生存时间-https://www.ttlsa.com/sphinx/sphinx-search-engine-2/

二.索引说明文章源自运维生存时间-https://www.ttlsa.com/sphinx/sphinx-search-engine-2/

在查询语句中选择许多列包含在索引中。文档ID(documnet ID)必须位于第一列,并且是一个唯一的无符号的正整数。在这种情况下表中的id字段将被视为创建的索引。文章源自运维生存时间-https://www.ttlsa.com/sphinx/sphinx-search-engine-2/

文档ID在SQL查询语句中是第一个字段,它必须是一个唯一的非零无符号的非负整数。如果使用多个索引来源,所有来源的文档ID必须唯一。文章源自运维生存时间-https://www.ttlsa.com/sphinx/sphinx-search-engine-2/

除文档ID字段外,所有查询选择字段,默认情况下都视为创建全文索引字段。文章源自运维生存时间-https://www.ttlsa.com/sphinx/sphinx-search-engine-2/

如果想一个或多个字段作为索引属性,那么可以使用sql_arrt_*选项来定义。文章源自运维生存时间-https://www.ttlsa.com/sphinx/sphinx-search-engine-2/

下面的选项可用于不同类型的属性声明:文章源自运维生存时间-https://www.ttlsa.com/sphinx/sphinx-search-engine-2/

sql_attr_unit: 无符号整数属性(32位)文章源自运维生存时间-https://www.ttlsa.com/sphinx/sphinx-search-engine-2/

sql_attr_bool: 布尔属性文章源自运维生存时间-https://www.ttlsa.com/sphinx/sphinx-search-engine-2/

sql_attr_bigint: 符号整数属性(64位)文章源自运维生存时间-https://www.ttlsa.com/sphinx/sphinx-search-engine-2/

sql_attr_timestamp: unix时间戳属性文章源自运维生存时间-https://www.ttlsa.com/sphinx/sphinx-search-engine-2/

sql_attr_str2ordinal: 序串号属性文章源自运维生存时间-https://www.ttlsa.com/sphinx/sphinx-search-engine-2/

sql_attr_float: 浮点属性文章源自运维生存时间-https://www.ttlsa.com/sphinx/sphinx-search-engine-2/

sql_attr_multi: 多值属性(MVA)文章源自运维生存时间-https://www.ttlsa.com/sphinx/sphinx-search-engine-2/

sql_attr_multi格式如下:文章源自运维生存时间-https://www.ttlsa.com/sphinx/sphinx-search-engine-2/

sql_attr_multi = ATTR-TYPE ATTR-NAME 'from' SOURCE-TYPE [;QUERY] [;RANGE-QUERY]文章源自运维生存时间-https://www.ttlsa.com/sphinx/sphinx-search-engine-2/

选项参数如下:文章源自运维生存时间-https://www.ttlsa.com/sphinx/sphinx-search-engine-2/

ATTR-TYPE: uint或timestamp文章源自运维生存时间-https://www.ttlsa.com/sphinx/sphinx-search-engine-2/

ATTR-NAME: 属性名称文章源自运维生存时间-https://www.ttlsa.com/sphinx/sphinx-search-engine-2/

SOURCE-TYPE: field或query或ranged-query文章源自运维生存时间-https://www.ttlsa.com/sphinx/sphinx-search-engine-2/

QUERY: 通过一个SQL语句查询获取所有文档ID,属性值对

RANGE-QUERY: 通过SQL语句查询来获取最小和最大的ID值,类似于sql_query_range

例如:参见https://www.ttlsa.com/html/1346.html

sql_query = SELECT id, group_id, UNIX_TIMESTAMP(date_added) AS date_added, title, content FROM documents

sql_attr_uint = group_id

sql_attr_timestamp = date_added

id作为文档ID,group_id作为一个无符号整数属性, date_added作为时间戳属性。 而剩下的字段titel, content将作为全文索引字段。

当对索引进行搜索时,搜索字词匹配的title和content字段,而这两个属性可用于排序和筛选。

sql_query_info是可选的。主要用于在命令行界面(CLI)下搜索查询时,显示索引中匹配的属性。

如果加上sql_query_info = SELECT * FROM documents WHERE id=$id ($id为查询到的文档ID)

# search test
displaying matches:
1. document=1, weight=2578, date_added=Fri Jun 15 15:58:41 2012
id=1
group_id=1
group_id2=5
date_added=2012-06-15 15:58:41
title=test one
content=this is my test document number one. also checking search within phrases.

2. document=2, weight=1557, date_added=Fri Jun 15 15:58:41 2012
id=2
group_id=1
group_id2=6
date_added=2012-06-15 15:58:41
title=test two
content=this is my test document number two

words:
1. ‘test’: 2 documents, 3 hits

当把sql_query_info注释掉,查询结果如下:

# search test
displaying matches:
1. document=1, weight=2578, date_added=Fri Jun 15 15:58:41 2012
2. document=2, weight=1557, date_added=Fri Jun 15 15:58:41 2012

words:
1. 'test': 2 documents, 3 hits

 

三.blog系统下使用

1.新建相关表

mysql> CREATE DATABASE ttlsa_com ;

Query OK, 1 row affected (0.00 sec)

#帖子表

mysql> CREATE TABLE `ttlsa_com`.`posts` (
-> `id` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY ,
-> `title` VARCHAR( 255 ) NOT NULL ,
-> `content` TEXT NOT NULL ,
-> `author_id` INT UNSIGNED NOT NULL ,
-> `publish_date` DATETIME NOT NULL
-> ) ENGINE = INNODB;
Query OK, 0 rows affected (0.08 sec)

#作者表,每个post属于一个作者,一个作者可以有多个post

mysql> CREATE TABLE `ttlsa_com`.`authors` (
-> `id` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY ,
-> `name` VARCHAR( 50 ) NOT NULL
-> ) ENGINE = INNODB;
Query OK, 0 rows affected (0.06 sec)

#类别表,一个post可以属于多个类别,一个类别可以有多个post

mysql> CREATE TABLE `ttlsa_com`.`categories` (
-> `id` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY ,
-> `name` VARCHAR( 50 ) NOT NULL
-> ) ENGINE = INNODB;
Query OK, 0 rows affected (0.10 sec)

#posts表和categories表关联表

mysql> CREATE TABLE `ttlsa_com`.`posts_categories` (
-> `post_id` INT UNSIGNED NOT NULL ,
-> `category_id` INT UNSIGNED NOT NULL ,
-> PRIMARY KEY ( `post_id` , `category_id` )
-> ) ENGINE = INNODB;
Query OK, 0 rows affected (0.04 sec)

mysql> INSERT INTO `posts` (`id`, `title`, `content`, `author_id`, `publish_date`) VALUES (1, 'Electronics For You', 'EFY- Electronics For You is a magazine for people with a passion for Electronics and Technology. Since the first issue in 1969, EFY has delivered the best in Product Reviews, Hardware and Software comparisons, Technical News, Analysis, Electronics news, about Products, Components, Computer Hardware,Power Supply, Industry Automation, Circuit Designing provided by electronicsforu.com.', 2, '2010-08-02 10:29:28'), (2, 'What is PHP?', 'PHP Hypertext Preprocessor (the name is a recursive acronym) is a widely used, general-purpose scripting language that was originally designed for web development to produce dynamic web pages.', 3, '2010-03-09 10:31:01'),(3, 'Nintendo', 'Games that are easy to play and fun for anyone. Nintendo are one of them major players in gaming world. They also develop computer games these days.', 4, '2010-01-05 10:39:21'),(4, 'Sony PlayStation - Full of life', 'Sony Playstation is one of the leading gaming console of modern times. They are fun to play and people of all age groups enjoy it.', 1, '2010-08-17 10:48:23'),(5, 'Namespaces in PHP 5.3', 'One of the most significant and welcome features added in PHP 5.3 was that of namespaces. While this has been around in other programming languages, namespaces have finally found their place starting with PHP 5.3.', 2, '2010-04-19 10:50:11'),(6, 'Leadership Skills', 'Leadership skill is the key to success in any field, be it software industry, automobile industry or any other business.', 2, '2009-02-09 10:55:32'),(7, 'Ruby on Rails', 'RoR is a rapid web application development framework. It was one of the first framework for developing web applications.', 4, '2010-08-13 13:44:32'),(8, 'Sphinx search engine', 'Sphinx was created by Andrew Aksyonoff and it can be used along with any programming language.', 1, '2009-04-13 13:46:11');
Query OK, 8 rows affected (0.01 sec)
Records: 8 Duplicates: 0 Warnings: 0

mysql> INSERT INTO `authors` (`id`, `name`) VALUES
-> (1, 'Amit Badkas'),
-> (2, 'Aditya Mooley'),
-> (3, 'Rita Chouhan'),
-> (4, 'Dr.Tarique Sani');
Query OK, 4 rows affected (0.00 sec)
Records: 4 Duplicates: 0 Warnings: 0

mysql> INSERT INTO `categories` (`id`, `name`) VALUES
-> (1, 'Programming'),
-> (2, 'Games'),
-> (3, 'Electronics'),
-> (4, 'PHP'),
-> (5, 'Search'),
-> (6, 'Misc');
Query OK, 6 rows affected (0.00 sec)
Records: 6 Duplicates: 0 Warnings: 0

mysql> INSERT INTO `posts_categories` (`post_id`, `category_id`) VALUES
-> (1, 1),
-> (1, 2),
-> (1, 3),
-> (2, 1),
-> (2, 4),
-> (3, 2),
-> (3, 3),
-> (4, 2),
-> (4, 3),
-> (5, 1),
-> (5, 4),
-> (6, 6),
-> (7, 1),
-> (8, 1),
-> (8, 5);
Query OK, 15 rows affected (0.00 sec)
Records: 15 Duplicates: 0 Warnings: 0

2.sphinx.conf配置如下:

# vi sphinx.conf
source blog
{
type = mysql
sql_host = localhost
sql_user = root
sql_pass =
sql_db = ttlsa_com
sql_query = \
SELECT id, title, content, UNIX_TIMESTAMP(publish_date) \
AS publish_date, author_id FROM posts
sql_attr_uint = author_id
sql_attr_multi = uint category_id from query; SELECT post_id, category_id FROM posts_categories
sql_attr_timestamp = publish_date
sql_query_info = SELECT id, title FROM posts WHERE ID=$id
}

index posts
{
source = blog
path = /data/sphinx/
docinfo = extern
charset_type = sbcs
}

indexer
{
mem_limit = 32M
}

searchd
{
port = 9312
log = /var/log/searchd.log
query_log = /var/log/query.log
read_timeout = 5
max_children = 30
pid_file = /var/log/searchd.pid
max_matches = 1000
seamless_rotate = 1
preopen_indexes = 0
unlink_old = 1
}

#创建索引

# indexer --all
collected 8 docs, 0.0 MB
collected 15 attr values
sorted 0.0 Mvalues, 100.0% done
sorted 0.0 Mhits, 100.0% done
total 8 docs, 1543 bytes
total 0.033 sec, 46475 bytes/sec, 240.96 docs/sec
total 35 reads, 0.000 sec, 0.0 kb/call avg, 0.0 msec/call avg
total 11 writes, 0.000 sec, 0.6 kb/call avg, 0.0 msec/call avg

#使用sphinx搜索php并按照时间升序排序

# search php --rsort=date
index 'posts': query 'php ': returned 2 matches of 2 total in 0.000 sec
displaying matches:
1. document=2, weight=2678, publish_date=Tue Mar 9 10:31:01 2010, author_id=3, category_id=(1,4)
id=2
title=What is PHP?

2. document=5, weight=2703, publish_date=Mon Apr 19 10:50:11 2010, author_id=2, category_id=(1,4)
id=5
title=Namespaces in PHP 5.3

words:
1. 'php': 2 documents, 5 hits

#使用mysql搜索php

mysql> select * from posts where title like "%php%" or content like "%php%" order by publish_date\G
*************************** 1. row ***************************
id: 2
title: What is PHP?
content: PHP Hypertext Preprocessor (the name is a recursive acronym) is a widely used, general-purpose scripting language that was originally designed for web development to produce dynamic web pages.

author_id: 3
publish_date: 2010-03-09 10:31:01
*************************** 2. row ***************************
id: 5
title: Namespaces in PHP 5.3
content: One of the most significant and welcome features added in PHP 5.3 was that of namespaces. While this has been around in other programming languages, namespaces have finally found their place starting with PHP 5.3.
author_id: 2
publish_date: 2010-04-19 10:50:11
2 rows in set (0.02 sec)

#使用sphinx php api搜索

# vi search.php

<?php
require_once "sphinxapi.php";
$search = new SphinxClient;
$search->setServer("localhost", 9312);
$search->setMatchMode(SPH_MATCH_ANY);
$search->SetArrayResult ( true );
$search->setMaxQueryTime(3);
$search->SetSortMode(SPH_SORT_ATTR_ASC,"publish_date");
print_r($search->query("php"));
?>

# php search.php

Array
	(
	[error] =>
	[warning] =>
	[status] => 0
	[fields] => Array
			(
			[0] => title
			[1] => content
			)
	[attrs] => Array
			(
			[publish_date] => 2
			[author_id] => 1
			[category_id] => 1073741825
			)

	[matches] => Array
			(
			[0] => Array
				(
				[id] => 2
				[weight] => 2
				[attrs] => Array
						(
						[publish_date] => 1268101861
						[author_id] => 3
						[category_id] => Array
										(
										[0] => 1
										[1] => 4
										)
						)
				)

			[1] => Array
				(
				[id] => 5
				[weight] => 2
				[attrs] => Array
						(
						[publish_date] => 1271645411
						[author_id] => 2
						[category_id] => Array
										(
										[0] => 1
										[1] => 4
										)
						)
				)
		)

	[total] => 2
	[total_found] => 2
	[time] => 0.000
	[words] => Array
			(
			[php] => Array
					(
					
=> 2 [hits] => 5 ) ) )

#搜索包含php且作者是'Rita Chouhan'

# search --rsort=date --filter author_id 3 php

index 'posts': query 'php ': returned 1 matches of 1 total in 0.000 sec
displaying matches:
1. document=2, weight=2678, publish_date=Tue Mar 9 10:31:01 2010, author_id=3, category_id=(1,4)
id=2
title=What is PHP?
words:
1. 'php': 2 documents, 5 hits

#使用sphinx php api搜索

# vi search.php

<?php
require_once "sphinxapi.php";
$search = new SphinxClient;
$search->setServer("localhost", 9312);
$search->setMatchMode(SPH_MATCH_ANY);
$search->SetArrayResult ( true );
$search->setMaxQueryTime(3);
$search->SetFilter("author_id",array(3));
print_r($search->query("php"));
?>

# php search.php

Array
	(
	[error] =>
	[warning] =>
	[status] => 0
	[fields] => Array
				(
				[0] => title
				[1] => content
				)
	[attrs] => Array
				(
				[publish_date] => 2
				[author_id] => 1
				[category_id] => 1073741825
				)
	[matches] => Array
				(
				[0] => Array
						(
						[id] => 2
						[weight] => 2
						[attrs] => Array
									(
									[publish_date] => 1268101861
									[author_id] => 3
									[category_id] => Array
													(
													[0] => 1
													[1] => 4
													)
									)
						)
				)
	[total] => 1
	[total_found] => 1
	[time] => 0.000
	[words] => Array
				(
				[php] => Array
						(
						
=> 2 [hits] => 5 ) ) )

#搜索包含sphinx且类型属于5的所有内容(这里将使用到MVA)

# search --filter category_id 5 sphinx

index 'posts': query 'sphinx ': returned 1 matches of 1 total in 0.000 sec
displaying matches:
1. document=8, weight=2795, publish_date=Mon Apr 13 13:46:11 2009, author_id=1, category_id=(1,5)
id=8
title=Sphinx search engine
words:
1. 'sphinx': 1 documents, 2 hits

通过mysql数据库查询:

mysql> select p.* from posts as p, posts_categories as pc where p.title like "%sphinx%" and p.id=pc.post_id and pc.category_id=5\G
*************************** 1. row ***************************
id: 8
title: Sphinx search engine
content: Sphinx was created by Andrew Aksyonoff and it can be used along with any programming language.
author_id: 1
publish_date: 2009-04-13 13:46:11
1 row in set (0.01 sec)

如需转载请注明出处:https://www.ttlsa.com/html/1354.html

weinxin
我的微信
微信公众号
扫一扫关注运维生存时间公众号,获取最新技术文章~
默北
  • 本文由 发表于 16/06/2012 16:21:09
  • 转载请务必保留本文链接:https://www.ttlsa.com/sphinx/sphinx-search-engine-2/