2 ways to your offline wikipedia

… for whatever reason you might need that ;)

For both ways you will need the dumps, which you can get here:
http://dumps.wikimedia.org/dewiki/latest/
The main dump is pages-articles.xml.bz2; if you want the categories as well, you need category.sql.gz and categorylinks.sql.gz too.
The pages file is quite huge and will probably take about 4 hours to download, depending on your connection.
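
If you would rather grab them from the console, something like this should work (the exact filenames in the "latest" directory are an assumption on my part, check the listing first):

 wget http://dumps.wikimedia.org/dewiki/latest/dewiki-latest-pages-articles.xml.bz2
 wget http://dumps.wikimedia.org/dewiki/latest/dewiki-latest-category.sql.gz
 wget http://dumps.wikimedia.org/dewiki/latest/dewiki-latest-categorylinks.sql.gz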

You will also need a fresh and running installation of the MediaWiki software.
Install it and here we go:

1. way: mirror Wikipedia
With this approach you mirror the official Wikipedia site on your machine.
This is the usual way, but it has one caveat: it takes a lot of time to set up!
You will need some extra tools to help you import the huge XML dumps, but apart from that it is the normal LAMP stack, which is good.
The main instructions for this are from here.

The recommended way is to convert the XML to SQL and import that.
For the conversion we will need a fast tool written in C: xml2sql
On my Ubuntu machine I installed it like this:

 git clone https://github.com/Tietew/mediawiki-xml2sql.git
 cd mediawiki-xml2sql/
 ./configure
 make
 make install

And now convert:

bunzip2 -c dewiki-20101013-pages-articles.xml.bz2 | grep -v '    <redirect />' | xml2sql

(The grep is necessary because the converter can't handle that element and would fail otherwise.)
This will take some time.
In the meantime you can install MediaWiki (see the sketch below).
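
Getting MediaWiki onto a LAMP box goes roughly like this. This is only a sketch: the version, URL and web root are assumptions, so grab whatever the current tarball is from mediawiki.org:

 # assumed version and paths -- adjust to your setup
 wget http://download.wikimedia.org/mediawiki/1.16/mediawiki-1.16.0.tar.gz
 tar xzf mediawiki-1.16.0.tar.gz
 sudo mv mediawiki-1.16.0 /var/www/wikipedia
 # then open http://localhost/wikipedia in a browser, run the web installer
 # and point it at a database called "wikipedia" (the name used below)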

We will also need some MySQL tuning to handle this huge mass import:
so switch on InnoDB, if that hasn't happened already, and raise these values in /etc/mysql/my.cnf:

 innodb_log_file_size=512M
 max_allowed_packet=32M
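
If you want to double-check that the server actually picked up the new values after the restart, a quick query like this (assuming a local root account) shows them:

 mysql -u root -p -e "SHOW VARIABLES LIKE 'innodb_log_file_size'; SHOW VARIABLES LIKE 'max_allowed_packet';"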

You had best delete the old InnoDB logfiles as well before you restart MySQL, since InnoDB will otherwise complain about the changed log file size (on a default Ubuntu setup they live in the data directory, not under /var/log):

rm /var/lib/mysql/ib_logfile*
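
Do that while the server is stopped; on Ubuntu the whole sequence would look something like this (assuming the stock init scripts), and MySQL recreates the log files with the new size on the next start:

 sudo service mysql stop
 rm /var/lib/mysql/ib_logfile*
 sudo service mysql start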

Now start the imports:

 mysqlimport -u root -p --local wikipedia `pwd`/{page,revision,text}.txt

And do the same for the category dumps; those are plain SQL, so they go through the mysql client instead of mysqlimport.
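Roughly like this, as a sketch (I assume the same dump date as above and the database name "wikipedia"):

 gunzip -c dewiki-20101013-category.sql.gz | mysql -u root -p wikipedia
 gunzip -c dewiki-20101013-categorylinks.sql.gz | mysql -u root -p wikipedia
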
All in all this will take up to 30 hours (on a normal machine)!
Wow, but after that your Wikipedia should be ready to browse.

2. way: use the offline reader extension
This approach tries to avoid the endless SQL import by reading the contents directly from the dump.
The reader comes as a MediaWiki extension and uses a very smart mix of console tools.
It needs Xapian, bzip2recover and some C++ and Python scripts.

Basically it splits the compressed XML dump into smaller (and therefore faster) chunks, creates an index over the chunks and uses that index as the entry point.
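
Just to illustrate the idea (this is not the extension's actual tooling, only a toy sketch with bzip2recover and grep; titles split across block boundaries would be missed, and the real thing builds a proper Xapian index):

 # split the dump into its individual bzip2 blocks: rec00001..., rec00002..., etc.
 bzip2recover dewiki-20101013-pages-articles.xml.bz2
 # crude index: which recovered block contains which article title
 for f in rec*.bz2; do
   bunzip2 -c "$f" 2>/dev/null | grep -o '<title>[^<]*</title>' | sed "s|^|$f |"
 done > title-index.txt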


Read the whole procedure here.

Well, I have to admit I didn't finish that approach, but it seemed faster and I liked the smart combination of tools.

In conclusion I can say: the sheer amount of data makes handling Wikipedia dumps quite a challenge.

One Reply to “2 ways to your offline wikipedia”

  1. it is not working…

    it says that there was an “unexpected element” … I guess I will just write my own importer. Wikimedia changed something and now xml2sql is not working anymore.
