Converting Sphinx original indexes to real-time indexes

If you are using the Sphinx server with many indexes and you decide to move to real-time indexes then this article is for you. I will describe how to simply convert a large number of original indexes to real-time using indexer and the new Sphinx command ATTACH INDEX.

Before Sphinx version 2.0.2-beta was available, the only way to update real-time (RT) indexes was to use the SphinxQL commands (INSERT/DELETE/REPLACE) which needed to be executed through the MySQL protocol to update RT indexes. At the same time, the original indexes used the indexer tool, which made updating indexes very simple. However, this kind of tool doesn’t exist for real-time indexes.

The only solution was to write a custom script.
Why isn’t a custom script good?

If we compare a custom script to the indexer: indexer was written in C++, it was very well optimized, and tested by many users on different amounts of data. If you want to write a custom script, do so using the language you know best, probably PHP/Ruby/Python, you will need to take care of:

  • long term stability
  • memory leaks
  • data preparation and escaping
  • performance optimization, nobody would wait a month to index a terabyte of data

Furthermore, this kind of script will load searchd which causes RT indexes to use a massive amount of memory. For example, 300 indexes with 100mb rt_mem_limit could require up to 30GB of free RAM.

Good news! Sphinx version 2.0.2-beta has added a new command which allows you to convert original indexes into real-time indexes. This command looks like: ATTACH INDEX diskindex TO RTINDEX rtindex.

After a successful ATTACH, the data originally stored in the source disk index becomes a part of the target RT index, and the source disk index becomes unavailable (until the next rebuild). ATTACH does not result in any index data changes. Basically, it just renames the files (making the source index a new disk chunk of the target RT index), and updates the metadata. So it is a generally quick operation which might (frequently) complete as fast as under a second.

So, now using the MySQL protocol you can convert the original index into a real-time one. In order to achieve this, you could write custom script which will execute the attach command for each index using MySQL client, but I found a better way.

I suggest solving this task in a more traditional way using the indexer tool.
To keep the example simple, I will show you how to convert just one index, you can use the code below to extend your sphinx.conf.

Some terminology I use in the sphinx.conf:

  • orig – original type of index
  • attach – special empty index which will executive attach command
  • rtinex – real-time type of index

Before you start the real-time index should be empty and of course the original index should contain some data. The sphinx.conf structure should be as follows:

source orig
{
    type            = mysql
    sql_host        = localhost
    sql_user        = root
    sql_pass        =
    sql_db          = test
    sql_port        = 3306  # optional, default is 3306
    sql_query       = \
        SELECT id, group_id, UNIX_TIMESTAMP(date_added) AS date_added, title, content \
        FROM documents
    sql_attr_uint       = group_id
    sql_attr_timestamp  = date_added
    sql_query_info      = SELECT * FROM documents WHERE id=$id
}
index orig
{
    source          = orig
    path            = idx/orig
    docinfo         = extern
    charset_type        = sbcs
}
index rtindex
{
    type            = rt
    rt_mem_limit        = 32M
    path            = idx/rtindex
    charset_type        = utf-8
    rt_field        = title
    rt_field        = content
    rt_attr_uint        = group_id
    rt_attr_timestamp = date_added
}
source attach
{
    type            = mysql
    sql_host        = 127.0.0.1
    sql_user        =
    sql_pass        =
    sql_db          =
    sql_port        = 9306  # optional, default is 3306
    sql_query       = select 1 from testrt
    sql_query_post = ATTACH INDEX orig TO RTINDEX rtindex
}
index attach
{
    source          = attach
    path            = idx/attach
    docinfo         = extern
    charset_type        = sbcs
}

Notice that ‘attach’ index is connected to searchd not to mysql and
using sql_query_post executes the attach index command.

The configuration is ready, lets start converting:

Start searchd:

./bin/searchd  -c ./etc/sphinx.conf

Ignore the warning about the ‘attach’ index being empty.

I don’t have any data at ‘orig’ index, so I indexed it before running the transformation:
Skip this command if you have data in your ‘orig’ index:

./bin/indexer -c ./etc/sphinx.conf orig --rotate

Time to convert:

 ./bin/indexer -c ./etc/sphinx.conf attach

Pretty simple.

Let’s now check RT:

mysql -P9306 -h127.0.0.1
mysql> select * from rtindex;
+------+--------+----------+------------+
| id   | weight | group_id | date_added |
+------+--------+----------+------------+
|    1 |      1 |        1 | 1322419937 |
|    2 |      1 |        1 | 1322419937 |
|    3 |      1 |        2 | 1322419937 |
|    4 |      1 |        2 | 1322419937 |
+------+--------+----------+------------+

Yes, the data is in place, so the index was converted successfully!

Conclusion
The big improvement is that indexing and converting can now be done using the same indexer tool. One thing you need to be sure of is to extend the sphinx.conf generator script with the ‘attach’ and ‘rtindex’ indexes.

Read more: http://www.ivinco.com/blog/converting-sphinx-original-indexes-to-real-time-indexes/