Sphinx Searches Gmail with MySQL and PHP

The Idea

If you’re like me, there doesn’t seem to be an easy way to export all emails from Gmail. They do provide a service that allows you to download your email archive, and here are the FAQs, but maybe (like me) you went to the datatools page and didn’t see any helpful options. No worries! We can just use PHP to insert them into MySQL. And, once they’re stored in the db, we can use Sphinx’s expressive fulltext search syntax to dig through our email archive with ease.

Also, I babbled more on this topic here (my first steps in learning how to do this). A few things have changed since I wrote that, but you may still want to take a look.

Code  (github stefobark/mail)

Go here to get the files I used to do this. Open up form.php in your browswer, which will ask for your username and password and then send you to folders.php where you will be prompted to choose one of your gmail folders. After choosing a folder, gmail.php will grab all the emails and start inserting them into MySQL. This could take a long time (if you have a ton of emails). Exporting your archive from Google would also take a long time (which they mention in the FAQs).

I’ve also put up a sample Sphinx configuration file, gmail.conf. Take a look at it. It expects to index a MySQL table namedemails” with ‘sender’, ‘subject’, ‘date’, and ‘messagefields at 127.0.0.1:9306 with the ‘root’ user and no password.
Index and Search

When you’ve got everything in your db, it’s time to index. The configuration file is named gmail.conf, the index is named ‘emails’, and depending on where you put this project, the indexing command will look something like this:

indexer -c /path/to/project/gmail.conf emails

Indexing will be way faster than inserting into MySQL from Gmail.

Now that these emails are indexed we can search them. Open the command line interface:

mysql -h0 -P9306

I searched for emails that mention MySQL in the ‘subject’ field, like this:

Sphinx defaults to a limit of 20. If you want to exceed this limit, just add ‘limit 1000′ to the end of the query.

Conclusion

So, that’s about it. An easy way to collect and search your g(e)mails with Sphinx. Let me know if you see ways to improve this process

Happy Sphinxing!

Source: sphinxsearch.com